The design of RavenDB 4.0: Voron takes flight
RavenDB has used the low-level Esent storage engine from day one. We toyed with building our own storage engine in Munin, but it was only in 2013 that we started paying serious attention to the idea. A large part of that was the realization that Esent is great, but it isn't without its own issues and bugs that we have to work around (it is relatively easy to get it to reference invalid memory, for example, if you run multiple defrag operations). But two major reasons were behind our decision to invest a lot of time and effort into building Voron.
When a customer has an issue, they don't care that some separate component is causing it; we need to be able to provide them with an answer, fast. Escalating to Microsoft works, but it is cumbersome and in many cases results in a game of telephone. That can be amusing when you watch kids play it, but not so much fun when it is an irate customer with a serious problem on the phone.
The second reason is that Esent was not designed for our workload. It has done great by us, but it isn't something we can take, tweak, and file down all the rough edges for our own usage. To provide the level of performance and flexibility we need, we have to own every critical piece in our stack, and Voron was the result. Voron was released as part of RavenDB 3.0, and we have customers running their biggest databases on it, battle-testing it in production for the past two-plus years.
With RavenDB 4.0, we have made Voron the only storage engine we support and have extended it further. In particular, we added low-level data structures that changed some operations from O(M * logN) to O(logN), pushed more functionality into the storage engine itself, and simplified the concurrency model.
In Voron 3.0, the only data structure we had was the B+Tree. We had multiple of those, and you could nest them, but that was it. Data was stored in the following manner:
- Documents tree (key: document id, value: document json)
- Etag tree (key: etag, value: document key)
We had one B+Tree as the primary, which contained the actual data, and a number of additional index trees, which allowed us to search on additional fields and then find the relevant data by looking it up in the data tree.
This had two issues. The first was that our code needed to manually make sure we were updating all the index trees whenever we updated, deleted, or created any values. The second was that a scan over an index would result in the code first doing an O(logN) search over the index tree, and then for each matching result it would do another lookup into the actual data tree.
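To make that cost concrete, here is a minimal sketch of the old layout. This is illustrative Python, not Voron's actual C# code; plain dicts stand in for the on-disk B+Trees, and the names `documents_tree`, `etags_tree`, `put`, and `scan_by_etag` are made up for the example.

```python
# Conceptual sketch of the Voron 3.0 layout: the index tree points back at
# keys in the data tree, so every index hit pays for a second tree lookup.

documents_tree = {}   # key: document id -> document JSON   (data B+Tree)
etags_tree = {}       # key: etag        -> document id     (index B+Tree)

def put(doc_id, etag, doc_json):
    # Calling code has to remember to update every index tree itself.
    documents_tree[doc_id] = doc_json      # O(log N) in a real B+Tree
    etags_tree[etag] = doc_id              # O(log N) in a real B+Tree

def scan_by_etag(matching_etags):
    # Scanning the index: each match costs another O(log N) lookup into the
    # data tree, so returning M results is O(M * log N) overall.
    for etag in matching_etags:
        doc_id = etags_tree[etag]          # result of the index scan
        yield documents_tree[doc_id]       # extra O(log N) lookup per result
```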
In practice, this only showed up as a problem when you had over 200 million entries, at which point the performance cost was noticeable. But for that purpose, and for a bunch of other reasons (which we'll discuss in the next post), we made the following changes to Voron.
We now support four data structures:
- B+Tree – same as before, with variable sized keys and values.
- Fixed Size B+Tree – int64 keys and fixed size values (the value size is configured at creation time). Same as the one above, but lets us take advantage of various optimizations when we know what the total size is.
- Raw data section – allows us to store data and gives back an opaque id to access the data later.
- Table – a combination of raw data sections with any number of indexes.
The raw data section is interesting because it allows us to just hand it a bunch of data and get an id (int64) back, and it has an O(1) cost for accessing those values using that id.
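A minimal sketch of that contract, again in illustrative Python: the `RawDataSection` class and its `store`/`get` methods are assumptions made up for the example, and in the real Voron the id encodes a location inside a memory mapped file rather than an index into a list.

```python
# Sketch of a raw data section: store() hands back an opaque id, and get()
# uses that id directly, with no tree search involved.

class RawDataSection:
    def __init__(self):
        self._slots = []

    def store(self, data: bytes) -> int:
        self._slots.append(data)
        return len(self._slots) - 1     # opaque id handed back to the caller

    def get(self, raw_id: int) -> bytes:
        return self._slots[raw_id]      # O(1): no O(log N) tree search
```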
We then use this id as the value in the B+Tree, which means that our structure now looks like this:
- Raw data sections – [document json, document json, etc]
- Documents tree (key: document id, value: raw data section id)
- Etags tree (key: etag, value: raw data section id)
So now getting the results from the etags tree is a seek into the tree, O(logN), and then just an O(1) cost for each document that we need to get from there.
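Putting the pieces together, the sketch below (illustrative Python, reusing the hypothetical `RawDataSection` from the earlier sketch) shows why the scan is cheap now: the trees only store raw data section ids, and each document fetch is O(1). This is also roughly the bookkeeping that the Table structure bundles for you, though the real implementation details differ.

```python
raw_data = RawDataSection()
documents_tree = {}   # key: document id -> raw data section id
etags_tree = {}       # key: etag        -> raw data section id

def put(doc_id, etag, doc_json):
    raw_id = raw_data.store(doc_json)
    documents_tree[doc_id] = raw_id        # O(log N) in a real B+Tree
    etags_tree[etag] = raw_id              # O(log N) in a real B+Tree

def scan_by_etag(matching_etags):
    # In the real B+Tree, positioning on the first etag is one O(log N) seek
    # and the scan walks the leaf pages in order; each matching document is
    # then an O(1) fetch from the raw data section.
    for etag in matching_etags:
        yield raw_data.get(etags_tree[etag])
```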
Another very important aspect is the fact that Voron is based on memory mapped files, which allows some really interesting optimization opportunities. But that will be the subject of the next post.
More posts in "The design of RavenDB 4.0" series:
- (26 May 2016) The client side
- (24 May 2016) Replication from server side
- (20 May 2016) Getting RavenDB running on Linux
- (18 May 2016) The cost of Load Document in indexing
- (16 May 2016) You can’t see the map/reduce from all the trees
- (12 May 2016) Separation of indexes and documents
- (10 May 2016) Voron has a one track mind
- (05 May 2016) Physically segregating collections
- (03 May 2016) Making Lucene reliable
- (28 Apr 2016) The implications of the blittable format
- (26 Apr 2016) Voron takes flight
- (22 Apr 2016) Over the wire protocol
- (20 Apr 2016) We already got stuff out there
- (18 Apr 2016) The general idea
Comments
Sounds great. Maybe it's worth a look at how SQL Server manages heap tables. Heaps plus b-tree indexes there appear to be similar to the new Voron design. The typical heap issue is that SQL Server cannot compact a heap well because it requires a row to never move to a new page. I don't see why that has to be the case, though. It's a SQL Server thing.
Mark, we can do compaction of a raw data section, and we explicitly handle the case of moving a record. Roughly, a section is up to about 8MB or so, and we are always writing to a single active section. When it is full, we create a new one. When a section becomes empty enough, we merge it with the active one and free it.
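For readers who want the policy spelled out, here is a rough sketch in illustrative Python. The ~8MB section size comes from the comment above; the 25% "empty enough" threshold and all the names (`Section`, `write`, `maybe_merge`) are assumptions invented for the sketch, not Voron's actual code.

```python
SECTION_SIZE = 8 * 1024 * 1024   # a section is up to about 8MB
MERGE_THRESHOLD = 0.25           # assumed cutoff for "empty enough"

class Section:
    def __init__(self):
        self.used = 0
        self.records = {}        # record id -> bytes

active = Section()
sections = [active]

def write(record_id, data):
    global active
    if active.used + len(data) > SECTION_SIZE:
        active = Section()       # active section is full: start a new one
        sections.append(active)
    active.records[record_id] = data
    active.used += len(data)

def maybe_merge(section):
    # A mostly empty section gets its surviving records moved into the
    # active section (they get new locations, which is the "moving a
    # record" case handled explicitly) and is then freed.
    if section is not active and section.used < SECTION_SIZE * MERGE_THRESHOLD:
        for record_id, data in section.records.items():
            write(record_id, data)
        sections.remove(section)
```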
Enjoying the series and very much looking forward to Raven 4.0.... quick question on this. I assume polymorphic indexes will still be ok? We use those in a number of places.
Ian, yes, we support indexes on multiple collections. We just made them much more efficient.