Raw data sections in Voron


When laying out the design work for RavenDB 4.0, it became clear that we needed more from our storage layer. That required us to modify the way we store the information on disk. Voron is inspired by the LMDB project, and that project basically has a single data type, the B+Tree.

Now, LMDB does some really crazy stuff with this: a single LMDB file can contain any number of B+Trees, and you can even have trees whose values are themselves trees, but that is basically the only data structure you get.

This is roughly what this looks like:

[Figure: a B+Tree where small values are stored inline in the leaf pages, while larger values are stored separately and only referenced from the tree]

Note that there are several types of values here. Small items can be stored directly inside the B+Tree, while bigger values require separate storage, and we only store the reference to the data in the tree.

As it turns out, this is actually quite an important aspect of the way you can optimize behavior inside B+Trees. If you ever wondered about the difference between clustered and non-clustered indexes, that is exactly it. In a clustered index, the data is stored directly in the B+Tree. In a non-clustered index, you don’t have the data directly; you need to follow the reference.

Using C#, here are the differences:

// Clustered index: the data is stored directly in the tree entry
var entry = clusteredIndex.Find(key);
return entry.Data;

// Non clustered index: we get a reference and still need to follow it
var nonClusteredEntry = nonClusteredIndex.Find(key);
var entry = clusteredIndex.Find(nonClusteredEntry.Data);
return entry.Data;

I’m skipping on quite a lot here, but that should give you the gist of it.

Now, where is the problem? The problem is that if all you have is trees, this is pretty much all that you can do. Now, let us consider the case of storing documents inside Voron. What kind of questions do we need to answer?

  • Load documents by id
  • Scan documents by id prefix
  • Scan documents by etag

As it stands, doing this using just trees would require us to:

  • Define a tree to hold the data, where the key is the document id. This allows us both to find a document by id and to scan by id prefix.
  • Define an index for the etag, where the key is the etag and the value is the document id (there is a rough sketch of this two-tree layout below).
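
To make that concrete, here is a minimal sketch of that two-tree layout, using SortedDictionary as a stand-in for the B+Trees; the names and the API shape are made up for illustration, not Voron's actual interface:

using System;
using System.Collections.Generic;

// A rough model of the two-tree layout described above.
public class DocumentStoreModel
{
    // key: document id, value: the document itself
    private readonly SortedDictionary<string, byte[]> _documents =
        new SortedDictionary<string, byte[]>(StringComparer.Ordinal);

    // key: etag, value: just the document id (a reference, not the data)
    private readonly SortedDictionary<long, string> _etagsIndex =
        new SortedDictionary<long, string>();

    public void Put(string id, long etag, byte[] data)
    {
        _documents[id] = data;
        _etagsIndex[etag] = id;
    }

    // Load by id: a single tree search.
    public byte[] LoadById(string id) => _documents[id];

    // Load by etag: two searches, etag -> id, then id -> document.
    public byte[] LoadByEtag(long etag) => _documents[_etagsIndex[etag]];
}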

So, let us assume that we have a piece of code that needs to read documents by etag (for indexing, replication, export, etc.):

var enumerator = etagsIndex.ScanFrom(etag); // O(logN)
foreach (var entry in enumerator) // O(~1) for each MoveNext()
{
    var actualDocument = documentsIndex.Find(entry); // O(logN)
    // do something with this
}

So, if I want to read 32,000 documents based on their etags in a database that has a hundred million documents, that is going to require close to 900,000 comparison operations (log2(100M) ≈ 27, and we issue 32K O(logN) queries against the documentsIndex).
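
A quick back-of-the-envelope check of that number:

using System;

// Each Find() on documentsIndex costs roughly log2(N) comparisons,
// and we pay that cost once per document we read by etag.
const long documentsInDatabase = 100_000_000;
const long documentsToRead = 32_000;

var comparisonsPerLookup = Math.Ceiling(Math.Log2(documentsInDatabase)); // 27
var totalComparisons = comparisonsPerLookup * documentsToRead;           // 864,000

Console.WriteLine($"{comparisonsPerLookup} comparisons per lookup, {totalComparisons:N0} total");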

That can get expensive, and this is a relatively simple example. In some cases, a data item might have five covering indexes, and both the access rate and the data size can get very high.

In order to resolve that, we are going to change how we lay out the data on disk. Instead of storing the data directly in the tree, we are going to move the data into separate storage. We call this separate storage raw data sections.

A raw data section can be small (for values that are less than 2KB in size) or large. Small values are grouped together in sections that are 2 – 8 MB in length, while each larger value is placed independently, rounded up to the next page size (4KB, by default). The trees will not store the data itself, but references to the data on disk (effectively, the file position of the data).
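
Here is a toy model of that placement rule. The names are invented and a real section obviously has to track its own boundaries and free space, but the thresholds (2KB values, 2 – 8 MB sections, 4KB pages) are the ones described above:

using System;

// Illustrative only: what a value costs on disk and what the tree stores.
public class RawDataFileModel
{
    private const int SmallValueLimit = 2 * 1024; // < 2KB goes into a shared section
    private const int PageSize = 4 * 1024;        // default page size

    private long _nextFreePosition;

    // The B+Tree would store the returned position, not the value itself.
    public long Write(byte[] value)
    {
        var position = _nextFreePosition;

        if (value.Length < SmallValueLimit)
        {
            // Small values are packed back to back inside a shared
            // 2 - 8 MB section, so they only take their own size.
            _nextFreePosition += value.Length;
        }
        else
        {
            // Larger values get their own allocation, rounded up to
            // whole 4KB pages.
            var pages = (value.Length + PageSize - 1) / PageSize;
            _nextFreePosition += pages * PageSize;
        }

        return position;
    }
}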

Note that this has a few implications:

  • Small data is grouped together, so values that are inserted together end up stored together, which matches the likely access pattern.
  • We won’t need to move the data when it changes (as long as it doesn’t grow too much), so updating a value usually doesn’t require updating all the indexes (because the location on disk didn’t change), unless the value they cover did change.
  • Trees are going to be smaller, allowing more of them to reside in memory, and it allows us to do interesting tricks, like issuing PrefetchVirtualMemory calls on the locations from the index before accessing them, so we only pay the I/O cost (if any) once. There is a rough sketch of that prefetch call after this list.
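
As for that prefetch trick, here is a rough sketch, assuming the raw data file is memory mapped and that we have already collected the (position, size) pairs from the index. PrefetchVirtualMemory itself is a real Win32 call (Windows 8 and later); the surrounding types and names are illustrative only:

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

public static class Prefetcher
{
    [StructLayout(LayoutKind.Sequential)]
    private struct WIN32_MEMORY_RANGE_ENTRY
    {
        public IntPtr VirtualAddress;
        public UIntPtr NumberOfBytes;
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool PrefetchVirtualMemory(
        IntPtr hProcess,
        UIntPtr numberOfEntries,
        WIN32_MEMORY_RANGE_ENTRY[] entries,
        uint flags);

    // mapBase: base address of the memory mapped raw data file.
    // locations: the (file position, size) pairs we got from the index.
    public static void Prefetch(IntPtr mapBase, (long Position, int Size)[] locations)
    {
        var entries = new WIN32_MEMORY_RANGE_ENTRY[locations.Length];
        for (var i = 0; i < locations.Length; i++)
        {
            entries[i].VirtualAddress = new IntPtr(mapBase.ToInt64() + locations[i].Position);
            entries[i].NumberOfBytes = new UIntPtr((ulong)locations[i].Size);
        }

        // This is only a hint; if it fails we simply fall back to paying
        // the page faults as we touch the data.
        PrefetchVirtualMemory(
            Process.GetCurrentProcess().Handle,
            new UIntPtr((ulong)entries.Length),
            entries,
            0);
    }
}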

This is also quite exciting, because it means that we can share indexes on the data very easily. For example, we can have an “etags per collection” index for each collection, which will be much faster to scan because it doesn’t need all the extra lookups.
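
To make the gain concrete, here is what the earlier loop could look like against a hypothetical per-collection etag tree that stores raw data positions directly (the names are made up, not the actual API):

// Hypothetical: the "etags per collection" tree stores the raw data
// position directly, so scanning a collection by etag no longer needs a
// second B+Tree lookup for every document.
var enumerator = usersEtagsIndex.ScanFrom(etag); // O(logN), once
foreach (var position in enumerator)             // O(~1) for each MoveNext()
{
    var actualDocument = rawDataFile.ReadAt(position); // direct read, no tree search
    // do something with this
}

One O(logN) seek to position the scan, and then each document is a direct read from its known position, instead of another O(logN) search in the documents tree.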