Working with the FreeDB dataset in Voron

time to read 26 min | 5099 words

Reminder, the FreeDB data set is a 3.32 million records. Containing most of the albums that came out in the past few decades. We created the following Voron database to handle it:

   1: _storageEnvironment = new StorageEnvironment(StorageEnvironmentOptions.ForPath("FreeDB"));
   2: using (var tx = _storageEnvironment.NewTransaction(TransactionFlags.ReadWrite))
   3: {
   4:     _storageEnvironment.CreateTree(tx, "albums");
   5:     _storageEnvironment.CreateTree(tx, "ix_diskids");
   6:     _storageEnvironment.CreateTree(tx, "ix_artists");
   7:     _storageEnvironment.CreateTree(tx, "ix_titles");
   8:     tx.Commit();
   9: }

The albums tree contains the actual information about the album, as a json string. And the ix_* trees contains back references to it. They are our indexes. For what it is worth, you might want to note that this is pretty much how most RDBMS implements their indexing. We’re now working at a pretty low level.

Note also that we need to define those trees whenever we start the database.

Now, let us do some queries, shall we?

We are going to start with the simplest option, given a disk id, give me the matching disk. Because disk ids are only nearly unique, we have the possibility of multiple results returning. 

Side note, the DB now is 9.00 GB in size.

Let us see how we query it. It pains me to write it, but I created a “repository like” interface, because Voron is way too low level for us to expose to user code. This is actually one of the few places where a repository like interface is good. Because it hides the extra complexity and the rigidity of the structure is justified.

The interface looks like:


Now, let us see how we actually implement this guy. We’ll start from the simplest thing, doing a search by a disk id, which is a nearly unique value that identify a disk.

   1: public IEnumerable<Disk> FindByDiskId(string diskId)
   2: {
   3:     using (var tx = _storageEnvironment.NewTransaction(TransactionFlags.Read))
   4:     {
   5:         var dix = tx.GetTree("ix_diskids");
   6:         var albums = tx.GetTree("albums");
   8:         using (var it = dix.MultiRead(tx, diskId))
   9:         {
  10:             if (it.Seek(Slice.BeforeAllKeys) == false)
  11:                 yield break;
  12:             do
  13:             {
  14:                 var readResult = albums.Read(tx, it.CurrentKey);
  15:                 using (var stream = readResult.Reader.AsStream())
  16:                 {
  17:                     yield return _serializer.Deserialize<Disk>(new JsonTextReader(new StreamReader(stream)));
  18:                 }
  19:             } while (it.MoveNext());
  20:         }
  22:         tx.Commit();
  23:     }
  24: }

Let us go over this code in detail. We are using a read transaction, because we aren’t doing any writes.

We are using two trees here, the ix_diskids, which is the index on the disks, and the albums tree, which contains the actual data.

On line 8, we do a multi read. This is done because a single disk id value may belong to multiple albums.

Lines 10 and 11 are needed in case there are no results. And the line 14 is where the important thing happen. The result of the MultiRead is an iterator that contains all the ids of the albums with this disk id. We then read it from the albums tree, deserialize it and hand it to the user. Pretty simple, overall.

Now, let us go to the more complex scenario, where we want to do a search by artist or album title.

   1: public IEnumerable<Disk> FindByArtist(string prefix)
   2: {
   3:     return FindByMultiValueIterator(prefix, "ix_artists");
   4: }
   6: public IEnumerable<Disk> FindByAlbumTitle(string prefix)
   7: {
   8:     return FindByMultiValueIterator(prefix, "ix_titles");
   9: }
  11: private IEnumerable<Disk> FindByMultiValueIterator(string prefix, string treeIndexName)
  12: {
  13:     using (var tx = _storageEnvironment.NewTransaction(TransactionFlags.Read))
  14:     {
  15:         var dix = tx.GetTree(treeIndexName);
  16:         var albums = tx.GetTree("albums");
  18:         using (var multiValueIterator = dix.Iterate(tx))
  19:         {
  20:             multiValueIterator.RequiredPrefix = prefix.ToLower();
  21:             if (multiValueIterator.Seek(multiValueIterator.RequiredPrefix) == false)
  22:                 yield break;
  23:             do
  24:             {
  25:                 using (var albumsIterator = multiValueIterator.CreateMutliValueIterator())
  26:                 {
  27:                     if (albumsIterator.Seek(Slice.BeforeAllKeys) == false)
  28:                         continue;
  29:                     do
  30:                     {
  31:                         var readResult = albums.Read(tx, albumsIterator.CurrentKey);
  32:                         using (var stream = readResult.Reader.AsStream())
  33:                         {
  34:                             yield return _serializer.Deserialize<Disk>(new JsonTextReader(new StreamReader(stream)));
  35:                         }
  36:                     } while (albumsIterator.MoveNext());
  37:                 }
  38:             } while (multiValueIterator.MoveNext());
  39:         }
  41:         tx.Commit();
  42:     }
  43: }

You can see that in both cases, we handle it the same, because the actual behavior is the same. We don’t want to do just an exact match. If we wanted that, we could use the exact same logic as we did in FindByDiskId. But we want to do something more, we want to be able to search by prefix, not just by exact match. That means that we have to iterate over the tree. The only difference between FindByAlbumTitle and FindByArtist is the tree index that they use.

We start out as before, iterating over the index (line 18). Note that in line 20, we defined a required prefix, and we use the lower case form of the prefix. We also entered that to the index as lower case. This is how we are able to get case insensitive searches.

Line 21 actually take us to the beginning of all the entries greater or equal to the prefix, and by setting RequiredPrefix we ensure that we can’t go to any entry that is doesn’t have this prefix. This is an interesting example, because we are now iterating over all the index entries that have the same prefix. But the values of the index is also a list. That is why we do in line 25. Get all the values that match a particular entry with that particular prefix.

The value of that is the actual id that we use to go into the albums tree. And there you have in, non trivial search with Voron.