Music Library Ideas

ning-yu · February 28, 2015, 10:10pm

I’m currently working on the music library functionality, and have been thinking about what kind of data structure to use. This is a bit of a brain dump.

The classic data structure for a music library is probably a tree in with a hierarchy ‘genre’, ‘artist’, ‘album’, and ‘track’, with ‘track’ being the leaves at the bottom of the tree. After working with this a bit, however, there are some possible problems. What do you do when an album is associated with two artists? Or when a track is classified under two different genres?

One solution to this is to create the tree first, then iterate through it to remove such things as duplicate albums. The other solution, which I would like to try, is to turn the tree into a graph instead, where each link is bi-directional, and any node can link to any other node.

There is a convenient graph database system written for Node called LevelGraph. It handles basic manipulation operations such as getting and putting nodes, has a good system for searching through the database, and is supposedly pretty fast. In addition to letting us do simple things such as listing all tracks by a given artist, hopefully it will also be able to quickly give us things like “list all artists who have collaborated in recording song X” or “list all artists who have recorded piece Y by composer Z in the last 20 years”.

Meryn · March 16, 2015, 10:06pm

I have been thinking lately, and have searched a lot around, without much result. Then I thought, heck, why not just ‘hack’ MPD?

I dug into the files MPD saves, like the database and playlists etc, and found that the cache is just a plain file, including the ‘tree’ of your music (not by tag, just by directory). Why not let MPD do all the hard work, and we just use this file ‘read-only’ so we have our database? Might not be the best implementation but it sure saves us a lot of time implementing a scanner/database!

Heres the first few lines, with only 2 songs

info_begin
format: 2
mpd_version: 0.19.1
fs_charset: UTF-8
tag: Artist
tag: ArtistSort
tag: Album
tag: AlbumSort
tag: AlbumArtist
tag: AlbumArtistSort
tag: Title
tag: Track
tag: Name
tag: Genre
tag: Date
tag: Composer
tag: Performer
tag: Disc
tag: MUSICBRAINZ_ARTISTID
tag: MUSICBRAINZ_ALBUMID
tag: MUSICBRAINZ_ALBUMARTISTID
tag: MUSICBRAINZ_TRACKID
tag: MUSICBRAINZ_RELEASETRACKID
info_end
directory: USB
mtime: 1417893228
begin: USB
directory: Deep House
mtime: 1417893246
begin: USB/Deep House
song_begin: Androma - Dyami.mp3
Time: 376.464000
Title: Androma - Dyami
Track: 1
mtime: 1400863040
song_end
song_begin: Androma - Gunjule.mp3
Time: 399.480000
Title: Androma - Gunjule
Track: 1
mtime: 1400863047
song_end
end: USB/Deep House
end: USB
directory: WEBRADIO
mtime: 1378555326
begin: WEBRADIO
playlist_begin: AudiophileClassical.pls
mtime: 1378475429
playlist_end
playlist_begin: VeniceClassicRadioItalia.pls
mtime: 1378475430
playlist_end
end: WEBRADIO

(these 2 files dont have tags, but every Tag a file has, will be included in the cache, like the Genre)

I also looked into how MPD saves a playlist, it is just a .m3u file with the file-paths to the files.
Appearently MPD doesn’t need/uses special tags for track ID’s when storing stuff, and since we play these cached files with MPD only, I doubt we will need them either.

If you need to search songs by Genre, just look for ANY line that includes ‘genre’, and read everything in between the ‘song_begin’ above the line, and ‘song_end’ below the line, probably casting them to JSON.

If I can find some time soon I will create a little scipt that shows you exactly what I mean.

This does require us to have different caches for each service.

Meryn · March 17, 2015, 4:57pm

Forget about that. Had it somewhat working, but is was very slow.
135 files took ~200-600ms, it varied each time, but I think using the MPD command to get the database would be faster than reading its database directly. Parsing the data inside the database to the format you used in the dummy database took only 50ms or so (for 135 files), but opening the database takes very long, and opening it each search request will just result in a slow responses.
BTW, also forgot to convert track uri to id, and implement all the tags, but that part isn’t hard to do.

Back to your original post;
I dont think we should remove duplicates, if 1 file is associated with 2 genres, why dont we just put it in both?
For double albums from 2 different artists, can’t we just split them using the artist tag?

ning-yu · April 5, 2015, 4:55pm

I simulated using levelgraph as a music library database. See this repo for the code to evaluate performance. Clone into a folder and then run ‘npm install’ in order to install dependencies.

Run ‘node populate.js’ to insert new entries into the database called “yourdb”. Right now this is 1000 random dummy tracks distributed among 100 albums, 10 artists, and 3 genres. On my RPi, it takes 7800ms to insert 1000 tracks. You can see in the commented out sections of code that we are only currently using the bare minimum number of relationships to describe what artists, albums, and genres are associated with each track. This decreases the time it takes to populate the database, but increases the time it takes to perform advanced relationship searches.

Run ‘node query.js’. This first searches for all items in the database which are the subject of a “track-in-album” relationship (ie. is a track in an album). This should find all tracks in the database. The next search only finds tracks 899-999, to simulate performing a paged search and looking for the last page. On my RPi, it takes around 2100ms to perform a paged search to find the last 100 tracks, and 2600ms-4000ms to perform a search to find all 1000 tracks.

In short, looks like the levelgraph database runs a bit too slow on a Raspberry Pi to serve as a music library database system. I think it is best suited to running a database on a server or other machine with lots of processing power. For our purposes, however, I think I’ll stick to a simple structure or array for now.

ning-yu · April 5, 2015, 5:13pm

Meryn:

Forget about that. Had it somewhat working, but is was very slow.
135 files took ~200-600ms, it varied each time, but I think using the MPD command to get the database would be faster than reading its database directly. Parsing the data inside the database to the format you used in the dummy database took only 50ms or so (for 135 files), but opening the database takes very long, and opening it each search request will just result in a slow responses.
BTW, also forgot to convert track uri to id, and implement all the tags, but that part isn’t hard to do.
[attachment=0]tmp.png[/attachment]

Back to your original post;
I dont think we should remove duplicates, if 1 file is associated with 2 genres, why dont we just put it in both?
For double albums from 2 different artists, can’t we just split them using the artist tag?

Hey Meryn, good to know about the MPD database file, and thanks for looking into it! Like you mention, I think it’s a good idea to cross list a shared track to different artists. I had just meant that we don’t want that track to show up twice when you list tracks by all artists. I will try some different data structures and get back to you!

ning-yu · April 19, 2015, 3:25am

I came up with a possible data structure for the Volumio2 music library, and wrote some code to test it. Here is a Gist of the test code.

The test code first generates 105,000 “items”, each of which represents a single track on a given music service. These are organized into something which roughly approximates a real music collection - 7 genres, 50 artists per genre, 10 albums per artist, 10 tracks per album, and 3 services which offer each track. Some tracks have multiple artists, albums, or associated genres.

Then, it builds the music library by looking at each item and reading its metadata. On my RPi B+, it takes a little over 7 minutes to build the database with 105,000 items. The music library takes up only about 140mb, so I’m keeping the whole thing in RAM for now.

In the current library structure, the following tables are created:

Genres
Artists
Albums
Tracks
Items

For instance, the Artist Table has one entry for each artist. Within each entry, it lists the artist’s name, keys to the genres the artist is affiliated with, and keys to the albums he/she has collaborated on. This will allow the user to navigate up to the relevant genres, or down to the relevant albums.

<Artist key 1 (hash of artist name)>: { albumkeys: { '<Album key 1 (hash of album title)>': null, '<Album key 2 (hash of album title)>': null, '<Album key 3 (hash of album title)>': null, '<Album key 4 (hash of album title)>': null }, genrekeys: { '<Genre key 1 (hash of genre name)>': null, '<Genre key 2 (hash of genre name)>': null } metadata: { name: 'Artist Name' } }
The other tables share this same style, with similar keys to link up and down the tree. Search functions would also iterate on this tree-like structure.

At the bottom of the tree in the Item Table, each entry contains all the metadata of the item. For example, an entry in the Item Table might look like:

<Item key from service 1 (hash of service + URI)>: { service: 'spop', uri: '<spotify URI>', metadata: { title: 'Test Track', genres: [ 'Genre 1', 'Genre 2' ], artists: [ 'Artist 1', 'Artist 2', 'Artist 3' ], albums: [ 'Album 1', 'Album 2' ] } }, trackkey: { '<Track key 1 (hash of track title)>': null } }

Moving up a level, an entry in the Track Table might look like:

<Track key 1 (hash of track title)>: { itemkeys: { <Item key from service 1 (hash of service + URI)>: null, <Item key from service 2 (hash of service + URI)>: null, <Item key from service 3 (hash of service + URI)>: null }, albumkeys: { '<Album key 1 (hash of track title)>': null, '<Album key 2 (hash of track title)>': null }, metadata: { title: 'Track Title' } }

Moving up another level, an entry in the Album Table might be:

<Album key (hash of album name)>: { trackkeys: { '<Track key 1 (hash of track title)>': null, '<Track key 2 (hash of track title)>': null, '<Track key 3 (hash of track title)>': null }, artistkeys: { <Artist key 1 (hash of artist name)>: null, <Artist key 2 (hash of artist name)>: null } metadata: { title: 'Album Title' } }

The Artist Table is the next level up, an example entry for which is shown earlier. Then, at the top level is the Genre Table:

<Genre key 1 (hash of the genre name)>: { artistkeys: { <Artist key 1 (hash of artist name)>: null, <Artist key 2 (hash of artist name)>: null, <Artist key 3 (hash of artist name)>: null, <Artist key 4 (hash of artist name)>: null }, metadata: { name: 'Genre Name' } }

These tables use a key:value type organization. This may not be as fast as a straight array, but there are some advantages:

Each entry can be easily accessed by knowing its key, which is cross-referenced in other tables. Since the keys are just hashes of the genre names, artist names, etc, one can also access an entry directly without any searching if they know the full name
Duplication of entries is not possible. Attempting to write into a duplicate entry just replaces the contents of the existing entry. This removes the need for running deduplication routines on search results, for example.
The parts of each table which store links to other tables can be kept in memory for fast access, and portions which store larger objects (like album art) can be placed on disk. When complete data is needed, one can just merge the two portions using Object.assign() and present the complete result to the user.

I’m also looking into LevelDB, which can be used as the persistent storage for the music library tables. It is fast, well supported in Node, and can store and retrieve these tables directly as Javascript objects.

volumio · April 19, 2015, 9:41pm

Awesome! Well thought, and well designed.
I like the idea of hashes, to avoid dupes.
One thing: how are the performance of dynamic update? (if file on FS change, if tracks on services are added etc.)?

Testing as we speak

ning-yu · April 19, 2015, 9:46pm

For filesystem change monitoring, I’m looking into watchr. This generates events when files are updated, moved, or deleted. This will let us update the database of local files in a piecewise manner, as files are changed.

For online music services, you would probably have to rescan the whole online library for changes.

Edit: On an unrelated note, given how many unrelated tracks might share the same name, I might have to use track name + album name to generate each track key, instead of just track name.

volumio · April 19, 2015, 10:51pm

Watchr seems ideal. It reminds me a lot of inotify (kernel level native fs changes api), but with lot more configurability.

As for the hash, I would also include filetpye in the hash (or better: a num value to append to hash). This way we’ll have a transparent way to display file type (flac, mp3, dsd) in the frontend, and we’ll avoid marking as dupe the same song, with different encoding (I have the same albums both in mp3 and flac…).

Meryn · April 20, 2015, 3:45pm

Oh wow you beat me to it. Just today I was thinking about doing some work on the library, and calculating average memory usage to decide wether or not to keep it on RAM.

This looks great, going to test it right away!

I came across inotify a lot when I was looking into this, but Watchr looks better and easier to use for our purposes.

This probably depends on the API as well.

ning-yu · April 29, 2015, 5:20am

Sorry for the lack of updates, but I’m very close to getting the music library browser function working!

volumio · May 2, 2015, 2:25pm

NP! In the meanwhile I’m finishing the OS part…

chsims1 · May 30, 2020, 7:24pm