Voron Internals: Reducing the journal
We spend a lot of time trying to reduce our sync I/O cost with Voron, namely, the actual journal write to the disk. This is very expensive, because we have to hit the actual disk, forgoing any buffering.
So anything that can reduce that cost is a really good idea. We spent some time looking at dynamic compression heuristics, to see if they are worth it. Basically, we tried to figure out which compression level to use.
The idea is that based on the speed of the hard disk in use, we could decide whether it is worth spending more time compressing the journal entry before saving it. We tested a system where the I/O duration would be balanced against compression speed and size, adjusting automatically.
It failed, horribly. Basically, even on the fastest drives we could find, it was almost always better to compress at the highest level, because the cost of going to disk is so high.
There is another aspect of this, however. The cost of going to disk isn't linear in the size you are writing. I used the example of putting your groceries in the trunk: the fuel cost of the trip is not really going to be dominated by the weight of the groceries. After writing this statement, I fact checked myself. According to Auto Blog, each 100 pounds (about 45 kg) of added weight will increase fuel consumption by about 1%. What is going to dominate the cost, however, is how far you have to drive.
In the same manner, writing to the disk is impacted by the amount you write, but writing 4KB or 20KB costs roughly the same anyway. Writing 2MB takes longer, but not as much longer as you would expect. Note that all of those numbers assume no buffering all the way to disk, and the use of DMA.
We then tried to see what would happen if we just avoided compressing small writes. Anything smaller than 64KB is going to be compressed to less than 64KB, but the actual cost of writing it to disk isn't going to change, so we can save the compression cost. That actually improved performance a little bit on fast drives, but it hurt us on slow ones.
I had an interesting discussion with Alex on the usage of diff compression in the journal. This can take advantage of the fact that in many cases we don't modify full pages, so we can write just the changes out to disk. He was kind enough to include a few implementations of that for us to look at; those are RLE0 (Zero Run Length Encoding) implementations, and I'll use RLE to refer to them from now on.
Reducing I/O is always good, and this promised to give a substantial boost, but the actual design details that cropped up are really interesting. Diff compression can be simple, like the RLE0 in this link, effectively outputting something like:
... [x bytes unchanged][y bytes changed][byte 1 .. y][z bytes unchanged] ...
Or it can be much more complex, like bsdiff or xdelta. RLE handles the scenario where some bytes change nicely, but it fails badly if there is a single added byte: since it simply checks for equality position by position, an insertion shifts everything after it, so all the following bytes look different. Algorithms like bsdiff or xdelta can handle much more complex differences, but they are drastically more expensive. For my purposes, bsdiff has a runtime complexity of O(2N log N) and memory utilization of 17N. In other words, to get the diff of 4 pages, we'll need 272KB and about 230K operations.
Algorithms like that are usually meant for distribution. In other words, they are meant for cases where you can spend as much time as you want generating the diff, and you benefit from reduced download times. A modern usage of those is the Courgette project, for reducing the size of Chrome updates. It doesn't matter if generating the update takes 3 hours, since it will be downloaded millions of times, and 600KB saved in this manner pays for itself many times over.
But those kinds of costs are not something that we can pay. Analysis of our memory usage patterns also showed that in many cases we are using mostly fixed addressing. In other words, we'll typically change only small parts of the page, and we don't tend to have moving writes. When we do (typically during defrag), we do them on a page boundary, so an RLE implementation should generate good savings.
We have an implementation that we are currently testing, and while you can read the code, what is more interesting are the assumptions that we are making.
We scan the original and modified buffers using longs. We can safely assume that the buffers we scan are always sized in pages, so we don't need to worry about buffers whose size isn't divisible by sizeof(long), which makes the code much simpler. We also don't bother to encode the identical parts; instead, we record the (start, count, raw bytes) of the differences from the original. There is a small optimization there for long runs of zeros (to make it cheaper to delete data), but beyond that, we do very little. I'll have a separate post diving into the actual implementation details and the considerations that drove them, but that is for later.
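To make that more concrete, here is a minimal sketch in C of that kind of scan. It assumes 4KB, properly aligned pages and an illustrative output format of (offset, length, raw modified bytes) runs; the zero-run optimization is left out, and none of the names (diff_page, emit_run, and so on) are the actual Voron code.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Append one difference run to the output: an (offset, length) header
   followed by the raw modified bytes for that range. */
static size_t emit_run(uint8_t *out, size_t pos,
                       uint32_t offset, uint32_t len, const uint8_t *modified)
{
    memcpy(out + pos, &offset, sizeof(offset)); pos += sizeof(offset);
    memcpy(out + pos, &len, sizeof(len));       pos += sizeof(len);
    memcpy(out + pos, modified + offset, len);  pos += len;
    return pos;
}

/* Compare the original and modified page 8 bytes at a time and record only
   the ranges that changed. 'out' must be able to hold a full page plus the
   per-run headers in the worst case. Returns the number of bytes written. */
size_t diff_page(const uint8_t *original, const uint8_t *modified, uint8_t *out)
{
    const uint64_t *o = (const uint64_t *)original;
    const uint64_t *m = (const uint64_t *)modified;
    const size_t longs = PAGE_SIZE / sizeof(uint64_t);
    size_t out_pos = 0, run_start = 0;
    int in_run = 0;

    for (size_t i = 0; i < longs; i++) {
        if (o[i] != m[i]) {
            if (!in_run) { run_start = i; in_run = 1; }
        } else if (in_run) {
            out_pos = emit_run(out, out_pos,
                               (uint32_t)(run_start * sizeof(uint64_t)),
                               (uint32_t)((i - run_start) * sizeof(uint64_t)),
                               modified);
            in_run = 0;
        }
    }
    if (in_run) /* the difference runs to the end of the page */
        out_pos = emit_run(out, out_pos,
                           (uint32_t)(run_start * sizeof(uint64_t)),
                           (uint32_t)((longs - run_start) * sizeof(uint64_t)),
                           modified);
    return out_pos;
}
```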
An important reason why we don't keep track of the unmodified data is that we don't need it, and that we can't actually trust the original data. Consider the case where we actually need to use the journal to recover. We do that by running through all of the transactions and applying the diffs to the data. The problem is that we may fail midway through the recovery process, so the state of the data is not known. When applying a diff, if we used the original data, we might actually see data from a later transaction (one that was applied, but that we don't know about, since we crashed before we could make a note of it). Because of this, we only use the modified data, which is safe to apply multiple times. Note that this assumes that modifying a page cannot corrupt the rest of the page. In other words, if I have a 4KB page and I write a value to the 3rd byte, it isn't going to cause any change to any other byte. Aside from that, we don't require that the bytes we modified will still be there on restart, because we'll keep overwriting them until we are sure that we have properly synced them.
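Continuing the same illustrative sketch (same assumed run format as above), the recovery side shows why applying only the modified data is safe to repeat: it writes the recorded bytes at their recorded offsets and never reads what was previously in the page, so re-applying it after a crash mid-recovery gives the same result.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Apply a diff produced by the sketch above onto a page. The operation is
   idempotent: it only overwrites the recorded ranges with the recorded
   bytes, regardless of what the page currently contains. */
void apply_diff(uint8_t *page, const uint8_t *diff, size_t diff_len)
{
    size_t pos = 0;
    while (pos < diff_len) {
        uint32_t offset, len;
        memcpy(&offset, diff + pos, sizeof(offset)); pos += sizeof(offset);
        memcpy(&len,    diff + pos, sizeof(len));    pos += sizeof(len);
        memcpy(page + offset, diff + pos, len);      pos += len;
    }
}
```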
Another aspect of the diff operation is that we aren't actually all that worried about the data size (which is interesting, since reducing it is the whole point). The reason is that we are going to throw all the diffed data into the compressor anyway. The idea is that even after the diff, we are still likely to find compressible data among the modifications in the transaction.
Currently, the steps to write a transaction to disk are (sketched in the code after this list):
- Get all the modified pages.
- For each of those, compute the difference between it and the previous version of that page.
- Compress all the diffs of all the pages.
- Write the compressed data to disk in a safe manner.
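Tying the steps together, here is a rough sketch of how that pipeline could look, built on the hypothetical diff_page above; compress and write_journal_durably stand in for whatever compressor and unbuffered write path are actually used, and bookkeeping such as recording which page each diff belongs to is omitted.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical page tracking structure, for illustration only. */
typedef struct {
    const uint8_t *previous; /* page content as of the last synced state */
    const uint8_t *current;  /* page content as modified by this transaction */
} modified_page_t;

/* Assumed helpers: diff_page is the sketch from earlier; the other two stand
   in for the real compressor and the unbuffered, synced journal write. */
size_t diff_page(const uint8_t *original, const uint8_t *modified, uint8_t *out);
size_t compress(const uint8_t *in, size_t len, uint8_t *out);
void   write_journal_durably(const uint8_t *data, size_t len);

void write_transaction(const modified_page_t *pages, size_t count,
                       uint8_t *diff_buf, uint8_t *compressed_buf)
{
    size_t pos = 0;

    /* Steps 1 + 2: diff every modified page against its previous version.
       (A real implementation would also record the page number and the size
       of each diff so recovery can route it to the right page.) */
    for (size_t i = 0; i < count; i++)
        pos += diff_page(pages[i].previous, pages[i].current, diff_buf + pos);

    /* Step 3: compress all the diffs together, so patterns that repeat
       across pages still compress well. */
    size_t compressed = compress(diff_buf, pos, compressed_buf);

    /* Step 4: write the compressed record to the journal, unbuffered. */
    write_journal_durably(compressed_buf, compressed);
}
```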
More posts in "Voron Internals" series:
- (13 Sep 2016) The diff is the way
- (07 Sep 2016) Reducing the journal
- (31 Aug 2016) I/O costs analysis
- (30 Aug 2016) The transaction journal & recovery
- (29 Aug 2016) Cleaning up scratch buffers
- (26 Aug 2016) MVCC - All the moving parts
Comments
https://www.infoq.com/news/2016/09/facebook-zstandard-compression
Even though you want to make it small, the tradeoff involved is more CPU vs. smaller I/O. But between a smaller I/O and a slightly bigger I/O the costs are about equal. So in the general sense, going from a 2.1 compression ratio to 2.8 is probably a much smaller gain than avoiding the write altogether (if possible). In general, though, the information written to the journal is becoming more and more compressed over time in its default layout (which makes general purpose algorithms less effective over time); eventually we will look into ZStandard, but it is definitely not something that promises to win us much today, in the current state.
Cool, you've taken up this suggestion.
Since I did not see any follow up after my last posted comment, I thought you were not going to do anything with it.
Now both my comments on journal compression schemes seem to be finding their way into Raven DB https://ayende.com/blog/164738/fruits-of-the-poisonous-tree-voron-performance-refactoring?key=9af8a35bda6346ebbb5490108ebc3f5f#comment7.
I am curious if (and how much) this diff compression will improve Voron journal write performance.
@Alex at the moment we are thinking of a different approach (to keep a few optimizations at recovery time), and therefore one a bit more balanced in the tradeoff. We just prototyped the full CPU vs. I/O version of the tradeoff for testing purposes.
Journal compression was needed because we are investigating a very different internal layout for the BTREE nodes that allows us to go with far bigger fanout (an insane 2500+ keys per node), cache-aware O(w) search per node instead of cache-unfriendly O(log n) --- essentially free --- and a bigger node size (128KB per node), which improves write amplification. With our current simulations we can pack 1.5M keys into a 2-level tree. We are far from deciding to use it in the 4.0 timeframe, but it is a bet we decided we wanted to make, just in case it ends up being a viable direction.
@Federico, do you mean some type of compressed trie as the node index?
@alex it is a cache-conscious trie, yes. The upside is that it is very locality aware, so inserts/upserts/deletes ensure that changes are very localized. Because of the cache-conscious structure, lookups and creation are very fast too. While comparing a single 60-byte key can take between 20ns and 50ns (more like 18-24ns on a cold node), and we have log(n) of those to do on edge nodes, the same structure written in C can do a whole lookup in 50ns on a tree of 600K variable-size keys. For comparison, an in-memory btree written in C++ takes 1080ns to do the same lookup.
Needless to say, we will be in a great position if we only achieve 0.5x the throughput of the original structure (which is totally doable). However, our most problematic issue today is write amplification (when not in memory, of course), so even if the structure is very fast, there is no guarantee that we can achieve better speeds. That's the reason we are treating this as highly experimental stuff and it is not in the roadmap for the time being. As you might guess, we will never add the intrinsic complexity such a structure brings with it if it is not offset by a very measurable gain.
@alex btw we invested in prototyping other types of TRIEs before, but there is always a hidden cost somewhere. We investigated (we actually built a prototype of it) a very interesting TRIE with (and I am not joking) O(1) high-probability expected cost for lookups. The caveat: we had to write the key at the data level (which is very cache unfriendly). The key setup cost was awful and the write amplification worse, so inserts were 3x slower than the naive BTREE approach, and queries not that much faster.
Did we learn from it? Yes, a lot! Was it worth the effort to prototype? Absolutely; this work spawned the discussion on tables, which are pervasive now throughout Voron. Was it complex? Insanely. Was it useful for Voron? Not at all. But you have to prototype and understand how it behaves to know that first hand :)
Hi Federico, I think this is amazing. Why not offer this as a configurable option? There are some databases that are 99% read and only 1% write; in that case, I'm willing to pay the 3x slower writes to gain faster reads. Why not consider this?
Uri, The problem is that this isn't really something that you can configure. It is a very deep factor in the db. This is the kind of thing we make assumptions about in order to be effective. For example, whether the cost of access is O(log N) vs. O(N) vs. O(1) results in very different behaviors, and that isn't just something that you can / want to configure.
@Uri we are slowly moving into bringing in new data structures tuned for their specific uses, but as Oren noted they are very deep, so we approach the problem by measuring. For the typical (and sometimes not so typical) cases, we measure to know which of our limits are the heaviest and move to unblock them; measuring again, you find a new blocker that was at the noise level behind the one you just unblocked. Hash structures come to my mind: we have several places where structures optimized for hash storage would be a sure win, but we are so far away from them becoming a measurable blocker that there is no reason to mess with the underlying assumptions.
Just to exemplify the issue: in the 3.x branch the biggest performance blocker is storing the data using JSON. Unblocking that in 4.x opened up so many (smaller) blockers that were just hidden that we could optimize many different subsystems and start thinking about deep changes that wouldn't have brought anything to the table before. Who knows, maybe at 5.x or even 6.x this kind of control could actually bring something to the table; today, measurements show that it is just not big enough.
Thanks for the clarification, Awesome work!