Low level Voron optimizations: The page size bump
Explaining the usage of pages seems to be one of those things that is either hit or miss for me. Either people just get it, or they struggle with the concept. I have written extensively on this particular topic, so I'll refer you to that post for the details on what exactly pages in a database are.
Voron currently uses 4KB pages. That is pretty much the default setting, since everything else also works in units of 4KB, which means we play nicely with alignment requirements, CPU page sizes, etc. However, 4KB is pretty small, and that leads to trees with greater depth. And the depth of the tree is one of the major factors in database performance (the deeper the tree, the more I/O we have to do).
We previously tested different page sizes (8KB, 16KB and 32KB), and we saw that our performance decreased as a result. That was surprising and completely contrary to our expectations. But a short investigation revealed the problem: whenever you modify a value, you dirty up the entire page. That means we would need to write that entire page back to storage (making a bigger write to the journal, then applying a bigger write to the data file, etc.).
In effect, by increasing the page size to 8KB, we also doubled the amount of I/O that we had to deal with. That was a while ago, and we have since implemented journal diffing as a way to reduce the amount of unnecessary data that we write to disk. A side effect of that is that we no longer have a 1:1 correlation between a dirty page and a full page write to disk. That opened up the path to increasing the page size. There is still an O(PageSize) cost to doing the actual diffing, of course, but that is a memory-to-memory cost, negligible compared to the saved I/O.
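The diffing idea can be sketched roughly like this. This is a toy illustration in Python, not the actual Voron implementation (which operates on raw memory in C#); the `granularity` parameter is an assumption for the sketch:

```python
def diff_page(old: bytes, new: bytes, granularity: int = 32):
    """Compare two copies of a page chunk by chunk and yield only the
    (offset, data) runs that actually changed, so the journal write is
    proportional to the modification rather than to the page size."""
    assert len(old) == len(new)
    for off in range(0, len(new), granularity):
        chunk = new[off:off + granularity]
        if old[off:off + granularity] != chunk:
            yield off, chunk

# A single-byte change in an 8KB page produces one small diff entry,
# instead of a full 8KB journal write.
page_before = bytes(8192)
page_after = bytearray(page_before)
page_after[100] = 0xFF
diffs = list(diff_page(page_before, bytes(page_after)))
```

Note that scanning both copies is still O(PageSize) work, which is exactly the memory-to-memory cost mentioned above.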
Actually making the change was both harder and easier than expected. The hard part was a major refactoring effort to split a shared value. Both the journal and the rest of Voron used the notion of Page Size, but while we wanted the page size of Voron to change, we didn't want the journal write size to change. That led to a lot of frustration as we went over the entire codebase, looked at each value and figured out whether it meant writes to the journal, or pages as they are used in the rest of Voron. I've got another post scheduled about how you can generate intentional compilation errors to make this easy to figure out.
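The "intentional compilation errors" trick alluded to here boils down to giving the two sizes distinct types, so they can no longer be mixed silently. A hypothetical sketch of the idea in Python (Voron itself is C#, and these type names are made up for illustration); a static type checker would then flag any call site that passes one size where the other is expected:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JournalWriteSize:
    """Size unit used when writing to the journal (hypothetical name)."""
    value: int

@dataclass(frozen=True)
class DataPageSize:
    """Size unit used by pages in the rest of Voron (hypothetical name)."""
    value: int

def write_journal_block(size: JournalWriteSize) -> int:
    # The journal only ever deals in JournalWriteSize units.
    return size.value

def allocate_page(size: DataPageSize) -> int:
    # Page allocation only ever deals in DataPageSize units.
    return size.value

journal_size = JournalWriteSize(4 * 1024)  # journal keeps writing 4KB units
page_size = DataPageSize(8 * 1024)         # pages grow to 8KB
# write_journal_block(page_size)  # a type checker rejects this mix-up
```

In a statically compiled language like C#, the equivalent wrapper structs turn every accidental mix-up into a compilation error instead of a silent bug.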
Once we were past the journal issue, the rest was mostly dealing with places that made silent assumptions about the page size. That can be anything from "the max value we allow here is 512 (because we need to fit at least that many entries in a page)" to tests that wrote 1,000 values and expected the resulting B+Tree to be of a certain depth.
The results are encouraging, and we can see them mostly in the system's behavior with very large data sets. Those used to generate very deep trees, and this change reduced the depth significantly. To give some context, let us assume that we can fit 100 entries per page using 4KB pages.
That means that with as few as 2.5 million entries, we'll have (in the ideal case):
- 1 root page holding 3 entries
- 3 branch pages holding 250 entries
- 25,000 leaf pages holding the 2.5 million entries
With 8 KB pages, we’ll have:
- 1 root page holding 63 entries
- 12,500 leaf pages holding the 2.5 million entries
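The page counts above can be checked with a small sketch, under the simplifying assumption that branch pages hold the same number of entries as leaf pages (in practice branch entries are smaller, so the real fan-out is higher):

```python
import math

def btree_levels(total_entries: int, entries_per_page: int) -> list[int]:
    """Number of pages at each level of an ideally packed B+Tree,
    listed from the root down to the leaves."""
    levels = [math.ceil(total_entries / entries_per_page)]  # leaf level
    while levels[-1] > 1:
        levels.append(math.ceil(levels[-1] / entries_per_page))
    return levels[::-1]

print(btree_levels(2_500_000, 100))  # 4KB pages -> [1, 3, 250, 25000]
print(btree_levels(2_500_000, 200))  # 8KB pages -> [1, 63, 12500]
```

Doubling the entries per page drops the ideal tree by a full level, which is the reduction discussed below.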
That is a reduction of a full level. The nice thing about B+Trees is that in both cases, the branch pages are very few and usually reside in main memory already, so you aren't directly paying for their I/O.
What we are paying for is the search on them.
The cost of searching the 4KB tree is:
- O(log2 of 3) for searching the root page
- O(log2 of 100) for searching the relevant branch page
- O(log2 of 100) for searching the leaf page
In other words, about 15 operations. For the 8 KB page, that would be:
- O(log2 of 63) for searching the root page
- O(log2 of 200) for searching the leaf page
It comes to about 14 operations, which doesn't seem like a lot, but a lot of our time goes to key comparisons, so anything helps.
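A quick back-of-the-envelope check of those comparison counts, assuming a binary search within each page (which is what the log2 terms imply):

```python
import math

# 4KB pages: root (3 entries) -> branch (100 entries) -> leaf (100 entries)
cost_4kb = math.log2(3) + math.log2(100) + math.log2(100)

# 8KB pages: root (63 entries) -> leaf (200 entries)
cost_8kb = math.log2(63) + math.log2(200)

print(round(cost_4kb, 1))  # ~14.9 comparisons
print(round(cost_8kb, 1))  # ~13.6 comparisons
```

The absolute saving looks small, but each operation here is a key comparison, which is where much of the search time goes.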
However, note that I said the situation above was the ideal one. That can only happen if the data was inserted sequentially, which usually isn't the case. Page splits can cause the tree depth to increase very easily (in fact, that is one of the core reasons why non-sequential keys are so strongly discouraged in pretty much all databases).
But the larger page size allows us to pack many more entries into a single page, and that also reduces the risk of page splits significantly.
More posts in "Low level Voron optimizations" series:
- (02 Mar 2017) Primitives & abstraction levels
- (28 Feb 2017) Transaction lock handoff
- (20 Feb 2017) Recyclers do it over and over again.
- (14 Feb 2017) High data locality
- (08 Feb 2017) The page size bump
I can see a use case for large pages such as 64KB or 1MB for sequential scanning workloads and compression. I always wished SQL Server supported optionally larger pages for those reasons.
Tobi, Dynamic page size is a lot more complex, and it makes certain optimizations in the hot path harder. We actually do internal compression for pages in certain cases, because it makes a lot of sense, but it is not exposed or generally used.
Nice to see that the diff compression has some other positive side effects.
A few questions:
@alex We are still working on cache-conscious trees, because they stress other Voron components and allow optimizations we would never have realized exist, since they tend to disguise themselves in the background noise. Believe it or not, we had to rewrite our Voron benchmarks because they were overwhelmed by noise; tight measurements are tricky to run. About those trees: while they have several impressive key compression characteristics, the change to larger pages makes it harder to justify them, because the house-keeping introduced when dealing with multiple transactions is not negligible.
For single/long write transactions they are extremely fast, easily outclassing the current B+Trees by 30%+ at the 10M keys mark and getting better the bigger the database. But when multiple write transactions are involved, we easily devolve into -2.0x territory. Therefore, we will continue investing in the technology, because there are places where they can work best. An example is insanely huge but mostly static, or one-shot created, databases, such as DNA sequencing or time-series kinds of storage, which work in bursts of inserts at a very high frequency. But for the current workloads it is difficult to justify their use.
Having said that, we still haven't integrated them into table storage (other work took precedence) to see if those -2.0x results carry over with the overhead introduced by the whole database. Those measurements were done in isolation at the storage engine level, without 'external' interference.
Reducing I/O is the reason that for SQL Server the basic allocation unit is an extent of 64 KB, not the page (8 pages of 8 KB each). Have you tried allocating contiguous pages instead of increasing the page size? It might hit a sweet spot in the middle, reducing I/O for reads (more sequential reads and disk caches) and also reducing I/O for writes.
Alex, We don't currently plan to go beyond the 64KB range. The main reason is that we use pages as a diff boundary, so the bigger they are, the more work we have to do at transaction commit time to figure out what changed.
Pop Catalin, Actually, the reason this is the case for SQL Server is that the minimum allocation granularity on Windows is 64KB. We actually use that internally as well (on both Windows & Linux), but that has to do with read optimizations, not write optimizations.
And yes, we have several optimizations related to trying to make things as compact and local as possible.
Ayende, are you referring to "Allocation Unit Size" at File System level? If yes, then the default is 4KB, with the possibility to increase it to 64KB, or are you referring to something else?
Actually the default is based on disk size, here's a nice table: https://support.microsoft.com/en-us/help/140365/default-cluster-size-for-ntfs,-fat,-and-exfat
@Pop There are other data structures that are optimized for even bigger pages; we are talking about page sizes of multiple megabytes. However, the changes at the storage level needed to support such a thing are definitely not something we are willing to take on unless we run out of gas with the current setup, which, for now, is not showing diminishing returns.
Pop Catalin, No, see here: https://blogs.msdn.microsoft.com/oldnewthing/20031008-00/?p=42223
Thanks for the link. That's a very nice optimization.