Raw data sections in Voron
When laying out the design work for RavenDB 4.0, it became clear that we needed more from our storage layer. That required us to modify the way we store the information on disk. Voron is inspired by the LMDB project, and that project basically has a single data type, the B+Tree.
Now, LMDB does some really crazy stuff with this: a single LMDB file can contain any number of B+Trees, and you can also have trees whose values are themselves trees, but that is basically the only data structure you get.
This is roughly what this looks like:
Note that there are several types of values here. Small items can be stored directly inside the B+Tree, while bigger values require separate storage, and we only store the reference to the data in the tree.
As it turns out, this is actually quite an important aspect of the way you can optimize behavior inside B+Trees. If you ever wondered about the difference between clustered and non clustered indexes, that is exactly it. In a clustered index, the data is directly in the B+Tree. In a non clustered index, you don't have the data directly, but need to follow the reference.
Using C#, here are the differences:
// Clustered index: the data is stored in the tree entry itself
var entry = clusteredIndex.Find(key);
return entry.Data;

// Non clustered index: the tree entry holds a reference, so we need a second lookup
var nonClusteredEntry = nonClusteredIndex.Find(key);
var entry = clusteredIndex.Find(nonClusteredEntry.Data);
return entry.Data;
I’m skipping on quite a lot here, but that should give you the gist of it.
Now, where is the problem? The problem is that if all you have is trees, this is pretty much all that you can do. Now, let us consider the case of storing documents inside Voron. What kind of questions do we need to answer?
- Load documents by id
- Scan documents by id prefix
- Scan documents by etag
As it stands, doing this using just trees would require us to:
- Define a tree to hold the data, where the key is the document id. This allows us to both find a document by id and by key prefix.
- Define an index for the etag, where the key is the etag, and the value is the document id (a sketch of this layout follows the list).
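For example, writing a document under this scheme touches both trees. The sketch below is illustrative only; the tree names and the CreateTree/Add calls are assumptions for the sake of the example, not the actual Voron API:

// Illustrative only - names and API shape are assumptions
var documentsTree = txn.CreateTree("Documents"); // key: document id, value: document data
var etagsIndex   = txn.CreateTree("EtagsIndex"); // key: etag,        value: document id

documentsTree.Add(documentId, documentData); // find by id / by id prefix goes through this tree
etagsIndex.Add(etag, documentId);            // scanning by etag goes through this tree,
                                             // then needs a second lookup in documentsTree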
So, let us assume that we have a piece of code that needs to read documents by etag (for indexing, replication, export, etc.):
var enumerator = etagsIndex.ScanFrom(etag); // O(logN)
foreach (var entry in enumerator) // O(~1) for each MoveNext()
{
    var actualDocument = documentsIndex.Find(entry); // O(logN)
    // do something with this
}
So, if I want to read 32,000 documents based on their etags in a database that has a hundred million documents, that is going to perform close to 900,000 comparison operations. (log2(100 M) = 27, and we issue 32K O(logN) queries on the documentsIndex).
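As a quick back-of-the-envelope check of that figure (using modern .NET's Math.Log2; the numbers are the ones from the example above):

// ~27 comparisons per lookup in a tree with 100 million entries
var comparisonsPerLookup = (int)Math.Ceiling(Math.Log2(100_000_000)); // 27

// 32,000 lookups against the documents index
var totalComparisons = 32_000 * comparisonsPerLookup; // 864,000 - close to 900,000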
That can get expensive, and this is a relatively simple example. In some cases, a data item might have five covering indexes, and both the access rate and the data size can be much higher.
In order to resolve that, we are going to change how we lay out the data on disk. Instead of storing the data directly in the tree, we are going to move the data into separate storage. We call this raw data sections.
A raw data section can be small (for values that are less than 2KB in size) or large. Small values are grouped together in sections that are 2 – 8 MB in length, while each larger value is going to be placed independently, rounded up to the next page size (4KB, by default). The trees will not store the data itself, but references to the data on disk (effectively, the file position of the data).
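As a rough illustration of that sizing policy, here is a sketch of the space a value would reserve on disk. This is my own sketch of the idea, not the actual Voron code; the names are made up and the constants are just the ones mentioned above:

static class RawDataSectionSizing
{
    const int SmallValueLimit = 2 * 1024; // values under 2KB go into a shared section
    const int PageSize = 4 * 1024;        // default page size

    // How much space a value reserves on disk under this layout (illustrative only)
    public static long SizeOnDisk(int valueLength)
    {
        if (valueLength < SmallValueLimit)
            return valueLength; // packed, with others, into a shared 2 - 8 MB raw data section

        // larger values are placed on their own, rounded up to whole pages
        return (valueLength + PageSize - 1L) / PageSize * PageSize;
    }
}

The point of the rounding is simply that standalone values occupy whole pages, while small values avoid that overhead by sharing a section.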
Note that this has a few implications:
- Small data is grouped together, so values that are inserted together will typically end up stored together, which matches the likely access pattern.
- We won't need to move the data when it changes (as long as it doesn't grow too much), so updating a value will usually mean that we don't need to update all the indexes (because the location on disk didn't change), unless the value an index covers did change.
- Trees are going to be smaller, allowing more of them to reside in memory, and it allows us to do interesting tricks with things like PrefetchVirtualMemory calls on the locations from the index before accessing them, so we pay the I/O cost (if any) only once. (A sketch of how that can look follows this list.)
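To illustrate that last point, here is a minimal sketch of the prefetch trick: collect the on-disk locations from the index, map them to addresses in the memory mapped file, and hand them to PrefetchVirtualMemory in a single call. This is my own illustration of the idea, not code from Voron; how file positions become mapped addresses is assumed away here.

using System;
using System.Runtime.InteropServices;

static class Prefetcher
{
    [StructLayout(LayoutKind.Sequential)]
    struct WIN32_MEMORY_RANGE_ENTRY
    {
        public IntPtr VirtualAddress;
        public UIntPtr NumberOfBytes;
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool PrefetchVirtualMemory(
        IntPtr hProcess,
        UIntPtr numberOfEntries,
        WIN32_MEMORY_RANGE_ENTRY[] virtualAddresses,
        uint flags);

    [DllImport("kernel32.dll")]
    static extern IntPtr GetCurrentProcess();

    // locations: addresses into the memory mapped data file, computed from the
    // file positions stored in the index (that mapping is assumed, not shown)
    public static void Prefetch((IntPtr Address, int Size)[] locations)
    {
        var entries = new WIN32_MEMORY_RANGE_ENTRY[locations.Length];
        for (var i = 0; i < locations.Length; i++)
        {
            entries[i].VirtualAddress = locations[i].Address;
            entries[i].NumberOfBytes = (UIntPtr)locations[i].Size;
        }

        // Hint the OS to bring these ranges into memory before we actually read them,
        // so the I/O cost (if any) is paid once, up front.
        PrefetchVirtualMemory(GetCurrentProcess(), (UIntPtr)entries.Length, entries, 0);
    }
}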
This is also quite exciting, because it means that we can share indexes on the data very easily. For example, we can have an "etags per collection" index for each collection, which will be much faster to run because it doesn't need all the extra lookups.
Comments
This idea makes sense to me. A few questions.
Alex,
- It is going to go into the same file, just structured differently. Basically we allocate some pages directly for the data, not through the trees.
- We also have a tree for fixed size values, yes. That would give us much better behavior for certain types of indexes.
- I don't understand what you mean about the value version. Do you mean for optimistic concurrency?
- The raw data section is going to report the move of the data when that happens, and the indexes will update their references.
Holding a direct reference to the data will make your searches faster, but any update or delete will be slower, because you have to update the reference in every non clustered index. The same will happen on a page split. I believe this is what Alex meant.
Igor, the idea is that most of the time we won't have to move the data, so if the value didn't change, we won't have to update the non clustered indexes. It also means that if there is a page split, we don't care; the data doesn't reside in the tree, after all.
Data will only move if the size has changed (or if the page in which it resides has to be defragmented, which should be rare). For the most part, you don't need to touch things; that is the whole point.
Your new approach is like SQL Server's heap, while the old approach was like clustered tables. RavenDB should let the user choose the approach per table, as SQL Server does.
Bruno, I don't think that we'll allow users to do that. Even the SQL Server docs call this out as a very advanced operation, and suggest avoiding it unless you know what you are doing. Since we don't allow users to define the low level storage details, we'll make the determination about which storage layout to use on our own.
Ayende, thanks for the answers. Re "value version number" I was indeed referring to the optimistic concurrency control version number (the 'Version' field of the tree node header). W.r.t. indexes updating their references when a value is relocated, I guess this requires tracking which indexes are defined for a value. Would this be managed at the storage engine level (i.e. as a new concept within the Voron library), or as something specific to its usage in RavenDB?
Igor, no, that was not quite what I meant. I agree that because inserting and deleting will now touch both a B+Tree and a "value" page these actions might be slower, but updates can most likely be done in-place (in the MVCC sense) in the majority of cases. One of the big advantages is that the tree pages become much more compact (more entries per page) and cache friendly (also because updates would only touch the value pages, not the B+Tree index pages).

Alex, yes, that one we'll keep. Each table is going to have a primary key index, which will be used for optimistic concurrency checks. We want to do this directly inside Voron, instead of outside of it. Managing the indexes inside RavenDB has been a major pain point for us, and the cause of a few bugs. If we can get it right once for everything, that would be much better.
I had considered doing this in LMDB too, a couple years ago. Overflow pages for records larger than 1/2 page already work this way, and that is clearly a win. http://symas.com/mdb/hyperdex/
For smaller data, it's much less likely to be a win in LMDB's COW scheme. There's a lot more work involved in insert/delete of small records sitting inside a larger block of pages, and much worse write amplification.
Howard, the problem with overflow is that you can't refer to that page. So if I have a tree of files, and I want to have an index by last modified, the value would be the date, and the key would be the file name. So then reading files by last modified becomes a scan through one tree, and many lookups into the other.