Low level Voron optimizations: The page size bump

Feb 08 2017

Explaining the use of pages seems to be one of those things that is either hit or miss for me. Either people just get it, or they struggle with the concept. I have written extensively on this particular topic, so I’ll refer you to that post for the details on what exactly pages in a database are.

Voron is currently using 4KB pages. That is pretty much the default setting, since everything else also works in units of 4KB. That means that we play nice with requirements for alignment, CPU page sizes, etc. However, 4KB is pretty small, and that leads to trees with greater depth. And the depth of the tree is one of the major concerns for database performance (the deeper the tree, the more I/O we have to do).

We previously tested using different page sizes (8KB, 16KB and 32KB), and we saw that our performance decreased as a result. That was surprising and completely contrary to our expectations. But a short investigation revealed what the problem was. Whenever you modify a value, you dirty the entire page. That means that we would need to write that entire page back to storage (which means making a bigger write to the journal, then applying a bigger write to the data file, etc).

In effect, when increasing the page size to 8KB, we also doubled the amount of I/O that we had to deal with. That was a while ago, and we recently implemented journal diffing as a way to reduce the amount of unnecessary data that we write to disk. A side effect of that is that we no longer have a 1:1 correlation between a dirty page and a full page write to disk. That opened up the path to increasing the page size. There is still an O(PageSize) cost to doing the actual diffing, of course, but that is a memory to memory cost and negligible compared to the saved I/O.
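To make the diffing idea concrete, here is a minimal sketch in Python. The post does not describe Voron's actual journal diff format, so the names `diff_page` and `apply_diff` and the run-based encoding are assumptions made purely for illustration. The point is that the scan is O(PageSize) in memory, while the amount written depends only on what changed.

```python
def diff_page(old, new):
    """Return (offset, changed_bytes) runs between two equal-sized page images.

    Hypothetical encoding: one run per stretch of consecutive differing bytes.
    """
    if len(old) != len(new):
        raise ValueError("page images must have the same size")
    runs = []
    i, n = 0, len(new)
    while i < n:
        if old[i] == new[i]:
            i += 1
            continue
        start = i
        while i < n and old[i] != new[i]:   # extend the run of changed bytes
            i += 1
        runs.append((start, bytes(new[start:i])))
    return runs


def apply_diff(old, runs):
    """Rebuild the new page image from the old image plus the recorded runs."""
    page = bytearray(old)
    for offset, data in runs:
        page[offset:offset + len(data)] = data
    return bytes(page)


# An 8KB page where only five bytes changed.
old_page = bytes(8192)
new_page = bytearray(old_page)
new_page[100:104] = b"abcd"
new_page[5000] = 1
runs = diff_page(old_page, bytes(new_page))
print(sum(len(data) for _, data in runs), "changed bytes out of", len(old_page))
```

In this sketch the payload that would go to the journal is a handful of bytes rather than the full 8KB, which is why the page size and the write size no longer have to move together.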

Actually making the change was both harder and easier than expected. The hard part was that we had to do major refactoring work to split a shared value. Both the journal and the rest of Voron used the notion of Page Size. But while we wanted the page size of Voron to change, we didn’t want the journal write size to change. That led to a lot of frustration, where we had to go over the entire codebase, look at each value, and figure out whether it meant writing to the journal or pages as they are used in the rest of Voron. I’ve got another post scheduled talking about how you can generate intentional compilation errors to make this easy to figure out.
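As a rough picture of what splitting the shared value means, the sketch below keeps two separate constants where there used to be one. The names `PAGE_SIZE` and `JOURNAL_WRITE_UNIT`, the concrete values, and both helper functions are assumptions for illustration, not Voron's actual identifiers.

```python
PAGE_SIZE = 8 * 1024            # assumed data page size after the bump
JOURNAL_WRITE_UNIT = 4 * 1024   # assumed journal write granularity, left unchanged


def page_offset(page_number):
    """Byte offset of a page in the data file: depends on the page size only."""
    return page_number * PAGE_SIZE


def journal_write_size(payload_bytes):
    """Bytes written to the journal: payload rounded up to whole journal units."""
    units = -(-payload_bytes // JOURNAL_WRITE_UNIT)   # ceiling division
    return units * JOURNAL_WRITE_UNIT


print(page_offset(3))             # position of page 3 in the data file
print(journal_write_size(4097))   # a payload just over one unit needs two
```

Once the two meanings have separate names, every call site has to pick one of them explicitly, which is the decision the refactoring forced across the codebase.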

Once we were past the journal issue, the rest was mostly dealing with places that made silent assumptions about the page size. That can be anything from “the max value we allow here is 512 (because we need to fit at least so many entries in)” to tests that wrote 1,000 values and expected the resulting B+Tree to be of a certain depth.

The results are encouraging, and we see them mostly in the system’s behavior with very large data sets. Those used to generate very deep trees, and this change reduced the depth significantly. To give some context, let us assume that we can fit 100 entries per page using 4KB pages.

That means that if we have as little as 2.5 million entries, we’ll have (in the ideal case):

1 root page holding 3 entries
3 branch pages holding 250 entries
250 branch pages holding 25,000 entries
25,000 leaf pages holding the 2.5 million entries

With 8 KB pages (and therefore 200 entries per page), we’ll have:

1 root page holding 63 entries
63 branch pages holding 12,500 entries
12,500 leaf pages holding the 2.5 million entries

That is a reduction of a full level. The nice thing about B+Trees is that in both cases, the branch pages are very few and usually reside in main memory already, so you aren’t directly paying for their I/O.

What we are paying for is the search on them.

The cost of searching the 4KB tree is:

O(log2 of 3) for searching the root page
O(log2 of 100) for searching each of the two branch pages on the path
O(log2 of 100) for searching the leaf page

In other words, about 23 operations. For the 8 KB page, that would be:

O(log2 of 63) for searching the root page
O(log2 of 200) for searching the relevant branch page
O(log2 of 200) for searching the leaf page

It comes to about 22 operations and one less page to visit, which doesn’t seem like a lot, but a lot of our time goes on key comparisons, so anything helps.
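The ideal-case shape and the per-lookup comparison count can be derived mechanically: divide the number of pages by the fan-out until a single root remains, then add up one binary search per page on the path. The following Python sketch does that under the same assumption of 100 entries per 4KB page and 200 per 8KB page. The function names `ideal_levels` and `comparisons_per_lookup` are hypothetical, and the cost model (perfectly packed pages, ceil(log2) comparisons per page) is a simplification.

```python
import math


def ideal_levels(entries, fanout):
    """Entries stored at each level, root first, for perfectly packed pages."""
    levels = [entries]                       # the leaf level holds every entry
    pages = math.ceil(entries / fanout)      # pages needed for that level
    while pages > 1:
        levels.append(pages)                 # one parent entry per child page
        pages = math.ceil(pages / fanout)
    return levels[::-1]


def comparisons_per_lookup(levels):
    """Binary-search steps along one root-to-leaf path (assumed cost model)."""
    total = 0
    pages_on_level = 1                       # the root is a single page
    for entries in levels:
        per_page = math.ceil(entries / pages_on_level)
        total += (per_page - 1).bit_length()     # equals ceil(log2(per_page))
        pages_on_level = entries             # each entry points at one child page
    return total


for fanout in (100, 200):                    # 4KB vs 8KB pages, as assumed above
    shape = ideal_levels(2_500_000, fanout)
    print(fanout, shape, comparisons_per_lookup(shape))
```

The sketch counts every branch level, so it is a convenient way to check how many levels a given entry count and fan-out produce before reasoning about I/O.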

However, note that I said that the situation above was the ideal one. This can only happen if the data was inserted sequentially, which usually isn’t the case. Page splits can cause the tree depth to increase very easily (in fact, that is one of the core reasons why non-sequential keys are so strongly discouraged in pretty much all databases).

But the larger page size allows us to pack many more entries into a single page, and that also reduces the risk of page splits significantly.
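A small simulation can illustrate the effect on splits. The sketch below models only the leaf level and uses a naive policy that splits an overflowing page in half. It does not model any sequential-append optimization, and it is not Voron's actual split logic. The name `count_leaf_splits` and the capacities of 100 and 200 entries are assumptions carried over from the example above.

```python
import bisect
import random


def count_leaf_splits(keys, capacity):
    """Insert distinct keys into a sorted chain of leaf pages.

    Returns (number_of_splits, number_of_leaf_pages).
    """
    pages = [[]]     # each leaf page is a sorted list of keys
    lows = []        # lows[i] is the smallest key stored in pages[i + 1]
    splits = 0
    for key in keys:
        i = bisect.bisect_right(lows, key)    # leaf whose key range covers key
        leaf = pages[i]
        bisect.insort(leaf, key)
        if len(leaf) > capacity:              # overflow: split the page in half
            mid = len(leaf) // 2
            right = leaf[mid:]
            del leaf[mid:]
            pages.insert(i + 1, right)
            lows.insert(i, right[0])
            splits += 1
    return splits, len(pages)


keys = list(range(10_000))
random.Random(1).shuffle(keys)                # non-sequential insert order
for capacity in (100, 200):
    print(capacity, count_leaf_splits(keys, capacity))
```

With this policy every page produced by a split holds at least half its capacity, so doubling the capacity can never increase the number of splits for the same set of keys.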

Comments

08 Feb 2017
12:11 PM

tobi

I can see a use case for large pages such as 64KB or 1MB for sequential scanning workloads and compression. I always wished SQL Server supported optionally larger pages for those reasons.

08 Feb 2017
13:18 PM

Oren Eini

Tobi, Dynamic page size is a lot more complex, and it makes certain optimizations in the hot path harder. We actually do internal compression for pages in certain cases, because it makes a lot of sense, but it is not exposed or generally used.

08 Feb 2017
16:43 PM

alex

Nice to see that the diff compression has some other positive side effects.

A few questions:

Are you going to be using page sizes larger than 64K, and thus updating the Voron size and offset field types to be 32 bits instead of 16? If so, would it make sense to do 4 byte alignment on everything?
Are you going to use the cache-conscious trie design Federico mentioned for variable length keys instead of a B+Tree?
W.r.t. the variable page sizes mentioned by Tobi, could you not make that a per-database on-create decision, i.e. have a fixed configured page size per Voron database?

08 Feb 2017
20:09 PM

Federico Lois

@alex We are still working on the cache-conscious tree because it stresses other Voron components and exposes optimizations we would never have realized existed, because they tend to disguise themselves in the background noise. Believe it or not, we had to rewrite our Voron benchmarks because they were overwhelmed by noise; tight measurements are tricky. About those trees, while they have several impressive key compression characteristics, the change to larger pages makes it harder to justify them because the housekeeping introduced when dealing with multiple transactions is not negligible.

For single/long write transactions they are extremely fast, easily outclassing the current B+Trees by 30%+ at the 10M keys mark and getting better the bigger the database. But when write transactions are involved we devolve easily into the -2.0x territory. Therefore, we will continue investing in the technology, because there are places where they can work best. Examples are insanely huge but fairly static databases, or databases created in one shot, like DNA sequencing or time-series kinds of storage, which work in bursts of inserts at a very high frequency. But for the current workloads it is difficult to justify their use.

Having said that, we still haven't integrated them into table storage (other work took precedence) to see if those -2.0x do carry over with the overhead introduced by the whole database. Those measurements have been done in isolation at the storage engine level without 'external' interference.

09 Feb 2017
08:09 AM

Pop Catalin

Reducing IO is the reason that the basic allocation unit for SQL Server is an extent of 64 KB (8 pages of 8 KB each), not the page. Have you tried allocating contiguous pages instead of increasing the page size? It might hit a sweet spot in the middle, reducing IO for reads (more sequential reads and disk caches) and also reducing IO for writes.

09 Feb 2017
09:01 AM

Oren Eini

Alex, We don't currently plan to go beyond the 64KB range. The main reason for that is that we are using the pages as a diff boundary, so the larger they are, the more work we have to do at transaction commit time to figure out what changed.

09 Feb 2017
09:02 AM

Oren Eini

Pop Catalin, Actually, the reason that this is the case for SQL Server is that the minimum allocation unit granularity on Windows is 64KB. We are actually also using that internally (on both Windows & Linux), but that has to do with read optimizations, not write optimizations.

And yes, we have several optimizations related to trying to make things as compact and local as possible

09 Feb 2017
13:44 PM

Pop Catalin

Ayende, are you referring to "Allocation Unit Size" at File System level? If yes, then the default is 4KB, with the possibility to increase it to 64KB, or are you referring to something else?

09 Feb 2017
13:46 PM

Pop Catalin

Actually the default is based on disk size, here's a nice table: https://support.microsoft.com/en-us/help/140365/default-cluster-size-for-ntfs,-fat,-and-exfat

10 Feb 2017
01:23 AM

Federico Lois

@Pop there are other data structures that are optimized for even larger pages. We are talking about page sizes of multiple megabytes. However, the changes at the storage level to support such a thing are definitely not something we are willing to take on unless we run out of gas with the current setup, which for now is not showing diminishing returns.

10 Feb 2017
13:06 PM

Oren Eini

Pop Catalin, No, see here: https://blogs.msdn.microsoft.com/oldnewthing/20031008-00/?p=42223

10 Feb 2017
14:31 PM

Pop Catalin

Thanks for the link. That's a very nice optimization.

Oren Eini

CEO of RavenDB
