The design of RavenDB 4.0: The implications of the blittable format
I have written extensively about the blittable format already, so I'll not get into that again. What I want to do in this post is discuss the implications of the intersection of two very important features:
- The blittable format requires no further action to be useful.
- Voron is based on a memory mapped file concept.
Those two, brought together, are quite interesting.
To see why, let us consider the current state of affairs. In RavenDB 3.0, we store the data as JSON directly. Whenever we need to read a document, we need to load it from disk, parse the JSON, load it into .NET objects, and only then do something with it. When we were just getting started with RavenDB, that didn't actually matter to us. Our main concern was I/O, and that dominated all our costs. We spent multiple releases improving on that, and the solution was the prefetcher:
- The prefetcher loads documents from disk and makes them ready to be indexed.
- The prefetcher runs concurrently with indexing, so we can parallelize I/O and CPU work.
That allowed us to reduce most of the I/O wait times, but it still left us with problems. If two indexes are working and each uses its own prefetcher, we pay double the I/O cost, double the parsing cost, double the memory cost, double the GC cost. To avoid that, we group together indexes that are at roughly the same place in their indexing. But that led to a different set of problems: if we have one slow index, it impacts all the other indexes, so we need a way to "abandon" an index while it is indexing, to give the other indexes in the group a chance to run.
There is also another issue. When inserting documents into the database, we want to index them, but it seems stupid to take a document, write it to disk, only to then load it from disk, parse it, etc. So when we insert a new document, we add it to the prefetcher directly, saving us some work in the common case where indexes are caught up and only need to index new things. That, too, has a cost: it means that the lifetime of such objects tends to be much longer, which makes them more likely to be promoted to Gen1 or Gen2. So they will not be collected for a while, and when they are, it will be a more expensive collection run.
Oh, and to top it off, all of the machinery above needs to consider available memory, load on the server, indexing batch times, I/O rates, liveness, and probably a dozen other factors that don't come to mind right now. In short, this is complex.
With RavenDB 4.0, we set out to remove all of this complexity. A large part of the motivation for the blittable format and for using Voron is driven by the reasoning below.
If we can get to a point where we can just access the values, and reading documents won't incur a heavy penalty in CPU/memory, we can radically shift the cost structure. Let us see how. Now the only cost for indexing is pure I/O: paging the documents into memory when we access them. Actually indexing them is done by merely accessing the mapped memory directly, so we don't need to allocate much memory during indexing.
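To make this concrete, here is a minimal sketch of the idea in C, assuming a POSIX system. The file name and the fixed offset are invented for illustration; the real blittable format follows offsets stored inside the document itself rather than a hard-coded one:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* "documents.voron" is a made-up file name for this sketch. */
    int fd = open("documents.voron", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return 1; }

    /* Map the whole file; the OS pages data in lazily on first access. */
    uint8_t *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* Reading a value is just a memory read at an offset - no parsing,
     * no intermediate object graph. memcpy sidesteps unaligned access. */
    int64_t first_value;
    memcpy(&first_value, base + 0, sizeof(first_value));
    printf("first 8 bytes as int64: %lld\n", (long long)first_value);

    munmap(base, st.st_size);
    close(fd);
    return 0;
}
```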
Optimizing the actual I/O is easily done by just asking the operating system: we can do that explicitly using PrefetchVirtualMemory or madvise(MADV_WILLNEED), or simply let the OS handle it based on the actual access pattern. So those are two separate issues that just went away completely. And without needing to spread the cost of loading documents among all the indexes, we no longer have a good reason to group indexes. That is out the window, along with all the complexity required to handle a slow index slowing everyone down.
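A sketch of the explicit hint, under the same assumptions as the previous example; madvise(MADV_WILLNEED) is the POSIX call, and PrefetchVirtualMemory is the Windows analogue:

```c
#include <stddef.h>
#include <sys/mman.h>

/* Ask the kernel to start paging a mapped range in before we touch it.
 * base/length describe a region mapped as in the previous sketch. */
static void hint_will_need(void *base, size_t length)
{
    /* Best effort: if the hint fails, we simply fall back to the OS's
     * own read-ahead, driven by the actual access pattern. */
    (void)madvise(base, length, MADV_WILLNEED);
}
```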
And because newly written documents are likely to be memory resident (they have just been accessed, after all), we can skip the whole "let us remember recently written documents for the indexes" business, because by the time we index them, we expect them to still be in memory.
What is interesting here is that by using the right infrastructure we have been able to remove quite a lot of code. Removing a lot of code is almost always great, but the major change is that all of the code we removed had to deal with a very large number of factors (if new documents are coming in but indexing isn't caught up to them, we need to stop putting the new documents into the prefetcher cache and clear it) that are hard to predict and sometimes interact in funny ways. By reducing all of that to "let us manage which parts of the file are memory resident", we simplify away a lot of that complexity and can even push much of it directly to the operating system.
This has other implications. Because we no longer need to run indexes in groups, and each can run and do its own thing, we can split them so that each index has its own dedicated thread. Which means, in turn, that if we have a very busy index, it is very easy to point out which one is the culprit. It also makes it much easier for us to handle priorities. Because each index is a thread, we can rely on the OS's prioritization. If you have an index that you really care about running as soon as possible, we can bump its priority higher. And by default, we can easily mark the indexing threads as lower priority, so we prioritize answering incoming requests over processing indexes.
Doing it in this manner means that we can ask the OS to handle the problem of starvation, where an index doesn't get to run because it has a lower priority. All of that is already handled by the OS scheduler, so we can lean on it.
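Here is a minimal sketch of that idea with POSIX threads; the index name is hypothetical, the batch loop is elided, and SCHED_IDLE is Linux-specific (on Windows, SetThreadPriority with THREAD_PRIORITY_BELOW_NORMAL plays the same role). Compile with -pthread:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Each index gets its own thread, and the thread lowers its own
 * scheduling priority so that request-serving threads win whenever
 * the CPU is contended. The OS scheduler handles fairness for us. */
static void *index_worker(void *arg)
{
    const char *index_name = arg;

    struct sched_param p = { .sched_priority = 0 };
    pthread_setschedparam(pthread_self(), SCHED_IDLE, &p);

    printf("indexing on a dedicated, low-priority thread: %s\n", index_name);
    /* ... run this index's batch loop until shutdown ... */
    return NULL;
}

int main(void)
{
    pthread_t t;
    /* "Orders/ByCompany" is an invented index name for illustration. */
    pthread_create(&t, NULL, index_worker, "Orders/ByCompany");
    pthread_join(&t, NULL);
    return 0;
}
```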
Probably the hardest part in the design of RavenDB 4.0 is that we are thinking very hard about how to achieve our goals (and in many cases exceed them) not by writing code, but by not writing code: by arranging things so the right thing happens on its own. Architecture and optimization by omission, so to speak.
As a reminder, we have the RavenDB Conference in Texas in a few months, which would be an excellent opportunity to learn about RavenDB 4.0 and the direction in which we are going.
More posts in "The design of RavenDB 4.0" series:
- (26 May 2016) The client side
- (24 May 2016) Replication from server side
- (20 May 2016) Getting RavenDB running on Linux
- (18 May 2016) The cost of Load Document in indexing
- (16 May 2016) You can’t see the map/reduce from all the trees
- (12 May 2016) Separation of indexes and documents
- (10 May 2016) Voron has a one track mind
- (05 May 2016) Physically segregating collections
- (03 May 2016) Making Lucene reliable
- (28 Apr 2016) The implications of the blittable format
- (26 Apr 2016) Voron takes flight
- (22 Apr 2016) Over the wire protocol
- (20 Apr 2016) We already got stuff out there
- (18 Apr 2016) The general idea
Comments
First of all I'd like to say the work you guys have been doing for 4.0 is really amazing.
For the blittable format, do you also have a heap-allocation-free way to iterate over it? If that's the case, I'm assuming the overall GC pressure will be reduced significantly, and I would love to see some comparisons between Raven 3.x and 4.0 with regard to the number of collections during operation.
That explains why you abandoned ESENT.
With a heavier lean on OS infrastructure, it would be interesting to see the performance characteristics of RavenDB across different operating systems.
Pop Catalin, If you mean whether we need to do allocations to iterate over the indexes, then no, we can do that without any allocations (except the iterating object, which is allocated once).
If you mean iterating through the properties of a document, then the answer is that we have to allocate the actual values (typically .NET strings) that are being accessed. But we only have to do that for the things that are actually accessed, and stuff like property names is cached and materialized only once.
Jesus, Yeah, that was a large part of it. We could still try to take advantage of this using Esent, but that would involve a LOT of copying memory around, buffer management, etc. This way the code is stupidly simple and performs really well.
Stan, We try very hard to make sure that we give the OS the right hints and arrange things so it has a lot less to do and more information, but yes, that would have an impact. So far, we are seeing roughly the same performance on Linux and Windows.
Yes, that's what I meant: iterating without creating heap structures like JObject, JProperty or similar.
Pop Catalin, Then yes, we create an object per document, but it just tracks where to access the data, is very small (on the order of a few dozen bytes) and short lived, and doesn't carry the whole cost of parsing.
This is a great design; it seems like v4 is going to be splendid. I like the idea of splitting indexes. I would even consider the option to partition indexes, so there would be many sub-indexes per index (similar to the ELK concept). In this manner, large sets of queries / indexing would be able to run in parallel.
@Uri except for the ingestion mechanism, finding ways to improve Lucene reliability, and some internal indexing mechanisms, we are not going to be able to tackle a major indexing overhaul at the infrastructure level in the 4.x timeframe (aka Lucene is here to stay, even if we would like to have that piece under very tight control too).
First of all, this is an amazing feat guys. Splendid design. Two questions for you: 1) have you noticed any, even minor, perf regression in any circumstance with respect to Esent? 2) @Federico so are you thinking about eventually rolling out your own Lucene-like substitute? Like Voron for Esent or the other pieces you've replaced? I mean, replacing Lucene seems like madness, but at the same time you guys are you, and Oren is Oren, so, you know :-)
Njy, Esent tends to be slower for the kind of things we now do. In particular, because we can now access the memory directly, Esent is much more expensive to use by comparison, and that blocks certain avenues of optimization.
Njy, Also, note that we didn't replace Lucene; that is something that I dearly want, but we can't for 4.0. What we did was make it reliable by putting it on Voron.
@Oren: about replacing Lucene, yeah that was clear you guys didn't replace Lucene (in the v4 timeframe, I should've been more clear on that). My question was if you thought about replacing it in the future, let's say "post v4". But I think I have my answer, post v4 that may happen. Good luck with that!
@njy, Lucene is one of the greatest pieces of software in modern software development, so I wouldn't rush to replace it, but I would improve it. For instance, porting the latest version (or at least the almost-ready 4.8) and then optimizing it, using better memory allocations, async operations, SIMD instructions and more. In this way, Lucene can actually be much faster than the original Java version.