Excerpts from the RavenDB Performance team report: Voron vs. Esent


Jan 16 2015


Another thing that turned up in the performance work was the Esent vs. Voron issue. We keep testing everything on both, trying to see which one can outdo the other, fixing a hotspot, then trying again. When we ran the YCSB benchmark, we also compared Esent and Voron as the storage engine for our databases. We found that Voron was very good at read operations, while Esent was slightly better at write operations. During the YCSB tests we found one of the reasons why Voron was a bit slower than Esent for writes: it was consuming 4 times the expected disk space.

The reason for this high disk space consumption was that the benchmark, by default, generates documents of exactly 1KB; with metadata, the actual size was 1.1KB. Voron's internal implementation uses a B+ tree where the leaves are 4KB in size, and 1KB was the threshold above which we decide not to save the data in the leaf itself, but to store it on a separate page and keep only a reference to it in the leaf. We ended up creating a new 4KB page for each 1.1KB document that we saved. The benchmark hit the worst-case scenario for our implementation, and caused us to use 4 times more disk space and write 4 times more data than we needed. Changing this threshold reduced the disk space consumption to the expected size, and gave Voron a nice boost.
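To make the arithmetic concrete, here is a minimal sketch of the space cost under a deliberately simplified model: values above the threshold get whole 4KB pages to themselves, and smaller values cost roughly their own size. The function name, the 1,126-byte document size, and the raised 2,048-byte threshold are assumptions for illustration, not Voron's actual code or exact numbers.

```python
import math

PAGE_SIZE = 4096  # leaf / overflow page size mentioned in the post


def bytes_on_disk(value_size: int, overflow_threshold: int,
                  page_size: int = PAGE_SIZE) -> int:
    """Simplified model (assumption): a value above the threshold is moved to
    its own page(s); a smaller value is stored inline and costs about its size."""
    if value_size > overflow_threshold:
        return math.ceil(value_size / page_size) * page_size
    return value_size


doc_size = 1126  # roughly 1.1KB: a 1KB YCSB document plus metadata

old = bytes_on_disk(doc_size, overflow_threshold=1024)
new = bytes_on_disk(doc_size, overflow_threshold=2048)  # hypothetical raised threshold

print(f"1KB threshold:    {old} bytes per document ({old / doc_size:.1f}x)")
print(f"raised threshold: {new} bytes per document ({new / doc_size:.1f}x)")
```

In this model every 1.1KB document costs a full 4,096-byte page with the 1KB threshold, which is close to the 4x reported above, and drops back to about its own size once the threshold sits above the document size.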

We are also testing our software on a wide variety of systems, and with Voron specifically we ran into an annoying issue. Voron is a write-ahead log system, and we are very careful to write to the log in a very speedy manner. This is one of the ways in which we are getting really awesome speed for Voron. But when running on a slow I/O system, and putting a lot of load on Voron, we started to see very large stalls after a while. Tracing the issue took a while, but eventually we figured out what was going on. Writing to the log is all well and good, but we also need to send the data to the actual data file at some point.

The way Voron does it, it batches a whole bunch of work, writes it to the data file, then syncs the data file to make sure it is actually persisted on disk. Usually, that isn’t really an issue. But on slow I/O, and especially under load, you get results like this:

Start to sync data file (8:59:52 AM). Written but unsynced data size 309 MB
FlushViewOfFile duration 00:00:13.3482163. FlushFileBuffers duration: 00:00:00.2800050.
End of data pager sync (9:00:05 AM). Duration: 00:00:13.7042229

Note that these are random writes, because we may be writing to any part of the file, but that is still way too long. What was worse, and the reason we actually care, is that we were doing that while holding the transaction lock.
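For readers who want to see the two sync steps in isolation, here is a small sketch that dirties a few scattered pages through a memory map and then times the map flush and the file sync separately. It is not Voron's code. The file size, page numbers, and function name are made up, and it relies on CPython's `mmap.flush()` and `os.fsync()`, which on Windows correspond to `FlushViewOfFile` and to `_commit` (which in turn flushes the file buffers).

```python
import mmap
import os
import tempfile
import time

PAGE_SIZE = 4096
PAGE_COUNT = 256  # a 1 MB file, tiny next to the 309 MB in the log above


def write_and_sync(path, page_numbers):
    """Dirty the given pages through a memory map, then time both sync steps."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), PAGE_SIZE * PAGE_COUNT) as view:
            for n in page_numbers:
                view[n * PAGE_SIZE:(n + 1) * PAGE_SIZE] = b"x" * PAGE_SIZE
            start = time.perf_counter()
            view.flush()          # hand the dirty mapped pages to the OS
            flushed = time.perf_counter()
            os.fsync(f.fileno())  # ask the OS to persist the file to disk
            synced = time.perf_counter()
    return flushed - start, synced - flushed


def make_data_file():
    """Create an empty, fixed-size temporary data file and return its path."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    with open(path, "wb") as f:
        f.truncate(PAGE_SIZE * PAGE_COUNT)
    return path


path = make_data_file()
flush_s, fsync_s = write_and_sync(path, [200, 3, 117, 64])  # scattered pages
print(f"map flush: {flush_s:.6f}s, file sync: {fsync_s:.6f}s")
os.remove(path)
```

On a fast disk with a small file both numbers will be tiny; the point is only to show where the time in the log is being spent.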

We were able to redesign that part so that, even under slow I/O, we take the lock for a very short amount of time, update the in-memory data structure, and then release the lock. After that we can spend some quality time gazing at our navel in peace while the I/O proceeds at its own pace, without blocking anyone else.
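The general shape of that change can be sketched as follows. This is an illustration of the pattern only (hold the lock just long enough to swap out the in-memory state, then do the slow write with the lock released). The class, its methods, and the dictionary of pending pages are hypothetical and are not Voron's actual types.

```python
import threading


class DataFileFlusher:
    """Hypothetical illustration of short lock hold times; not Voron's API."""

    def __init__(self, write_to_data_file):
        self._lock = threading.Lock()  # stands in for the transaction lock
        self._pending = {}             # page number -> bytes awaiting flush
        self._write_to_data_file = write_to_data_file

    def record_page(self, page_number, data):
        with self._lock:
            self._pending[page_number] = data

    def flush(self):
        # Hold the lock only long enough to swap out the in-memory structure.
        with self._lock:
            snapshot, self._pending = self._pending, {}
        # The slow I/O happens here, after the lock has been released.
        self._write_to_data_file(snapshot)
        return len(snapshot)

    def lock_is_held(self):
        return self._lock.locked()


written = {}
flusher = DataFileFlusher(written.update)  # a dict stands in for the data file
flusher.record_page(7, b"page seven")
flusher.record_page(2, b"page two")
print(flusher.flush(), "pages flushed")
```

Other threads can keep calling `record_page` while a flush is writing, because the flush works on its own snapshot and the lock is free during the write.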


Comments

16 Jan 2015
13:06 PM

Jesús López

Why are you holding a lock while writing to the data file?

In SQL Server, for example, the Lazy Writer writes dirty pages to the database files asynchronously, without affecting transactions. The transaction is committed when the data pages are written to memory and logged in the transaction log.

16 Jan 2015
17:15 PM

Ayende Rahien

Jesús, This isn't during a transaction. This is what happens when we are actually flushing from the database journal to the data file. We need the lock to ensure that we get a consistent access to the current view of the system. We don't need to hold it for the duration of the write.

16 Jan 2015
18:02 PM

Ryan Heath

Great story!

One thing I don't get, though: so documents between 1KB and 4KB would always have (or had) a page for themselves? So writing two 2KB documents would have used 2 pages instead of one? Why not try to use as much of a page as possible? Why the 1KB threshold?

// Ryan

16 Jan 2015
19:00 PM

Ayende Rahien

Ryan, The actual size now is a bit higher, a bit over 2,000 bytes. The reason for that is that we need to be able to put at least two values inside a page, so if we can't fit two of them, we need to go to an overflow page. It also means that if you are at 2KB or higher, you are using a max of 50% additional space, but that tends to be much nicer than the 400% usage that we saw with 1KB values before this issue was fixed.

17 Jan 2015
08:56 AM

alex

@Ayende, regarding the data sync, I am assuming you are already writing the pages out in sequential order by having a page number sorted list.

One additional thing I looked at is to use a sizable memory buffer that can hold a number of adjacent pages (let's say 32 or 64, allocated at startup) and fill that up to reduce the number of I/O calls. There tend to be quite a few adjacent pages in a number of scenarios, which you can then batch into one I/O.

Wouldn't it be great though if scatter/gather was working for writes to memmapped files as well. Bummer.

17 Jan 2015
09:04 AM

alex

Err ... hold on, I see you are writing to the memmap, because of "FlushViewOfFile". In the scenario I mentioned, the memmap is always opened as read-only, and writes use normal buffered native file I/O, with an fsync at the end.

17 Jan 2015
14:06 PM

Ayende Rahien

Alex, We are writing the data to a mmap file, then flushing it. And you can't use normal i/o and mmap in the same file and get a coherent result.

17 Jan 2015
22:06 PM

alex

I know. The documentation states: "A mapped view of a file is not guaranteed to be coherent with a file that is being accessed by the ReadFile or WriteFile function."

But it appears that if you use regular buffered I/O (not write-through), this still works fine. I believe this is also what LMDB is doing on a page flush. See https://gitorious.org/mdb/mdb/source/985bbbbdd5d64e57f55249ffdeb7c08035b240b2:libraries/liblmdb/mdb.c#L3181

18 Jan 2015
10:08 AM

Ayende Rahien

Alex, The documentation is correct. For 99.99% of the time, you would be able to make it work. For a small percentage of cases, that won't work for us, and we'll see the previous details before they are synced. We have actually managed to reproduce this several times, and even without it, I would feel very uncomfortable about this.

18 Jan 2015
10:12 AM

Ayende Rahien

Alex, You can see this here: http://ayende.com/blog/164577/is-select-broken-memory-mapped-files-with-unbufferred-writes-race-condition?key=edf0a32bd4984be483e7c1d2ee95d177

This is for unbuffered output, sure, but the docs don't make a distinction about that.

18 Jan 2015
10:39 AM

alex

Yes, I know the problem exists for unbuffered i/o. However, it appears to work in the buffered case (as illustrated by the fact that LMDB uses it).

But you are right, the documentation does not make a distinction, so it is entirely possible that, even if we were to assume that it works now on all possible platforms, a breaking change might occur in the future. So I can understand why you would feel uncomfortable using such an approach.

Oren Eini

CEO of RavenDB
