The design of RavenDB 4.0: The implications of the blittable format
I have written extensively about the blittable format already, so I'll not get into that again. What I want to do in this post is discuss the implications of the intersection of two very important features:
- The blittable format requires no further action to be useful.
- Voron is based on a memory mapped file concept.
Those two, brought together, are quite interesting.
To see why, let us consider the current state of affairs. In RavenDB 3.0, we store the data as JSON directly. Whenever we need to read a document, we need to load it from disk, parse the JSON, load it into .NET objects, and only then do something with it. When we were just getting started with RavenDB, that didn't actually matter to us. Our main concern was I/O, and that dominated all our costs. We spent multiple releases improving on that, and the solution was the prefetcher:
- The prefetcher loads documents from disk and makes them ready to be indexed.
- The prefetcher runs concurrently with indexing, so we can parallelize I/O and CPU work.
That allowed us to reduce most of the I/O wait times, but it still left us with problems. If two indexes are working and each uses its own prefetcher, we pay double the I/O cost, double the parsing cost, double the memory cost, double the GC cost. To avoid that, we group together indexes that are at roughly the same place in their indexing. But that led to a different set of problems: if we have one slow index, it impacts all the other indexes, so we need a way to "abandon" an index while it is indexing, to give the other indexes in the group a chance to run.
There is also another issue. When inserting documents into the database, we want to index them, but it seems stupid to take a document, write it to disk, only to then load it from disk, parse it, etc. So when we insert a new document, we add it to the prefetcher directly, saving us some work in the common case where indexes are caught up and only need to index new things. That, too, has a cost: it means that the lifetime of such objects tends to be much longer, which makes them more likely to be promoted to Gen1 or Gen2. So they will not be collected for a while, and when they are, it will be a more expensive collection run.
Oh, and to top it off, all of the machinery above needs to consider available memory, load on the server, indexing batch times, I/O rates, liveness, and probably a dozen other factors that don't come to mind right now. In short, this is complex.
With RavenDB 4.0, we set out to remove all of this complexity. A large part of the motivation for the blittable format and for using Voron is driven by the reasoning below.
If we can get to a point where we can just access the values, and reading documents won't incur a heavy penalty in CPU/memory, we can radically shift the cost structure. Let us see how. Now the only cost for indexing is pure I/O: paging the documents into memory when we access them. Actually indexing them is done by merely accessing the mapped memory directly, so we don't need to allocate much memory during indexing.
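To make this concrete, here is a minimal sketch of the idea in C, assuming a POSIX system. The file name and the fixed offset are invented for illustration; the real blittable format follows offsets stored inside the document itself rather than a hard-coded one:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* "documents.voron" is a made-up file name for this sketch. */
    int fd = open("documents.voron", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return 1; }

    /* Map the whole file; the OS pages data in lazily on first access. */
    uint8_t *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* Reading a value is just a memory read at an offset - no parsing,
     * no intermediate object graph. memcpy sidesteps unaligned access. */
    int64_t first_value;
    memcpy(&first_value, base + 0, sizeof(first_value));
    printf("first 8 bytes as int64: %lld\n", (long long)first_value);

    munmap(base, st.st_size);
    close(fd);
    return 0;
}
```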
Optimizing the actual I/O is easily done by just asking the operating system: we can do that explicitly using PrefetchVirtualMemory or madvise(MADV_WILLNEED), or simply let the OS handle it based on the actual access pattern. So those are two separate issues that just went away completely. And without needing to spread the cost of loading documents among all the indexes, we no longer have a good reason to group indexes. That is out the window, along with all the complexity required to handle a slow index slowing everyone down.
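A sketch of the explicit hint, under the same assumptions as the previous example; madvise(MADV_WILLNEED) is the POSIX call, and PrefetchVirtualMemory is the Windows analogue:

```c
#include <stddef.h>
#include <sys/mman.h>

/* Ask the kernel to start paging a mapped range in before we touch it.
 * base/length describe a region mapped as in the previous sketch. */
static void hint_will_need(void *base, size_t length)
{
    /* Best effort: if the hint fails, we simply fall back to the OS's
     * own read-ahead, driven by the actual access pattern. */
    (void)madvise(base, length, MADV_WILLNEED);
}
```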
And because newly written documents are likely to be memory resident (they have just been accessed, after all), we can skip the whole "let us remember recently written documents for the indexes" business, because by the time we index them, we expect them to still be in memory.
What is interesting here is that by using the right infrastructure we have been able to remove quite a lot of code. Removing a lot of code is almost always great, but the major change is that all of the code we removed had to deal with a very large number of factors (if new documents are coming in but indexing isn't caught up to them, we need to stop putting the new documents into the prefetcher cache and clear it) that are hard to predict and sometimes interact in funny ways. By reducing all of that to "let us manage which parts of the file are memory resident", we simplify away a lot of that complexity and can even push much of it directly to the operating system.
This has other implications. Because we no longer need to run indexes in groups, and each can run and do its own thing, we can split them so that each index has its own dedicated thread. Which means, in turn, that if we have a very busy index, it is very easy to point out which one is the culprit. It also makes it much easier for us to handle priorities. Because each index is a thread, we can rely on the OS's prioritization. If you have an index that you really care about running as soon as possible, we can bump its priority higher. And by default, we can easily mark the indexing threads as lower priority, so we prioritize answering incoming requests over processing indexes.
Doing it in this manner means that we can ask the OS to handle the problem of starvation, where an index doesn't get to run because it has a lower priority. All of that is already handled by the OS scheduler, so we can lean on it.
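Here is a minimal sketch of that idea with POSIX threads; the index name is hypothetical, the batch loop is elided, and SCHED_IDLE is Linux-specific (on Windows, SetThreadPriority with THREAD_PRIORITY_BELOW_NORMAL plays the same role). Compile with -pthread:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Each index gets its own thread, and the thread lowers its own
 * scheduling priority so that request-serving threads win whenever
 * the CPU is contended. The OS scheduler handles fairness for us. */
static void *index_worker(void *arg)
{
    const char *index_name = arg;

    struct sched_param p = { .sched_priority = 0 };
    pthread_setschedparam(pthread_self(), SCHED_IDLE, &p);

    printf("indexing on a dedicated, low-priority thread: %s\n", index_name);
    /* ... run this index's batch loop until shutdown ... */
    return NULL;
}

int main(void)
{
    pthread_t t;
    /* "Orders/ByCompany" is an invented index name for illustration. */
    pthread_create(&t, NULL, index_worker, "Orders/ByCompany");
    pthread_join(&t, NULL);
    return 0;
}
```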
Probably the hardest part in the design of RavenDB 4.0 is that we are thinking very hard about how to achieve our goals (and in many cases exceed them) not by writing code, but by not writing code: by arranging things so the right thing happens on its own. Architecture and optimization by omission, so to speak.
As a reminder, we have the RavenDB Conference in Texas in a few months, which would be an excellent opportunity to learn about RavenDB 4.0 and the direction in which we are going.
More posts in "The design of RavenDB 4.0" series:
- (26 May 2016) The client side
- (24 May 2016) Replication from server side
- (20 May 2016) Getting RavenDB running on Linux
- (18 May 2016) The cost of Load Document in indexing
- (16 May 2016) You can’t see the map/reduce from all the trees
- (12 May 2016) Separation of indexes and documents
- (10 May 2016) Voron has a one track mind
- (05 May 2016) Physically segregating collections
- (03 May 2016) Making Lucene reliable
- (28 Apr 2016) The implications of the blittable format
- (26 Apr 2016) Voron takes flight
- (22 Apr 2016) Over the wire protocol
- (20 Apr 2016) We already got stuff out there
- (18 Apr 2016) The general idea
Comments
First of all I'd like to say the work you guys have been doing for 4.0 is really amazing.
For the blittable format, do you also have a heap-allocation-free way to iterate over it? If that's the case, I'm assuming the overall GC pressure will be reduced significantly, and I would love to see some comparisons between Raven 3.x and 4.0 with regard to the number of collections during operation.
That explains why you abandoned ESENT.
With a heavier lean on OS infrastructure, it would be interesting to see the performance characteristics of RavenDB across different operating systems.
Pop Catalin, If you mean whether we need to do allocations to iterate over the indexes, then no, we can do that without any allocations (except the iterating object, which is allocated once).
If you mean iterating through the properties of a document, then the answer is that we have to allocate the actual values (typically .NET strings) that are being accessed. But we only have to do that for the things that are actually accessed, and stuff like property names is cached and materialized only once.
Jesus, Yeah, that was a large part of it. We could still try to take advantage of this using Esent, but that would involve a LOT of copying memory around, buffer management, etc. This way the code is stupidly simple and performs really well.
Stan, We try very hard to make sure that we give the OS the right hints and arrange things so it has a lot less to do and more information, but yes, that would have an impact. So far, we are seeing roughly the same performance on Linux and Windows.
Yes, that's what I meant: iterating without creating heap structures like JObject, JProperty or similar.
Pop Catalin, Then yes, we create an object per document, but it just tracks where to access the data, is very small (on the order of a few dozen bytes) and short lived, and doesn't carry the whole cost of parsing.
This is a great design; it seems like v4 is going to be splendid. I like the idea of splitting indexes. I would even consider the option to partition indexes, so there would be many sub-indexes per index (similar to the ELK concept). In this manner, large sets of queries / indexing would be able to run in parallel.
@Uri except for the ingestion mechanism, finding ways to improve Lucene reliability, and some internal indexing mechanisms, we are not going to be able to tackle a major indexing overhaul at the infrastructure level in the 4.x timeframe (aka Lucene is here to stay, even if we would like to have that piece under very tight control too).
First of all, this is an amazing feat guys. Splendid design. Two questions for you: 1) have you noticed any, even minor, perf regression in any circumstance with respect to Esent? 2) @Federico so are you thinking about eventually rolling out your own Lucene-like substitute? Like Voron for Esent or the other pieces you've replaced? I mean, replacing Lucene seems like madness, but at the same time you guys are you, and Oren is Oren, so, you know :-)
Njy, Esent tends to be slower for the kind of things we now do. In particular, because we can now access the memory directly, Esent is much more expensive to use by comparison, and that blocks certain avenues of optimization.
Njy, Also, note that we didn't replace Lucene; that is something that I dearly want, but we can't for 4.0. What we did was make it reliable by putting it on Voron.
@Oren: about replacing Lucene, yeah that was clear you guys didn't replace Lucene (in the v4 timeframe, I should've been more clear on that). My question was if you thought about replacing it in the future, let's say "post v4". But I think I have my answer, post v4 that may happen. Good luck with that!
@njy, Lucene is one of the greatest pieces of software in modern software development, so I wouldn't rush to replace it, but I would improve it. For instance, porting the latest version (or at least the almost-ready 4.8) and then optimizing it, using better memory allocations, async operations, SIMD instructions and more. In this way, Lucene can actually be much faster than the original Java version.