Low level Voron optimizations: High data locality
After talking about increasing the Voron page size, let us talk about another very important optimization: high data locality. The importance of locality comes up again and again in performance. The cost of getting the next bit of data can be so prohibitively expensive that it dominates everything else, including standard Big O time complexity metrics. A fun discussion of that is here.
Remember that Voron stores all data in pages, which means that it needs some way to allocate new pages. By default, whenever a page is allocated, we use a page from the end of the file. In certain scenarios (pure sequential inserts), that generates a pretty good allocation pattern, but even there it can cause issues. Let us consider what the database file looks like after a while:
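To make the default behavior concrete, here is a minimal sketch in C (Voron itself is written in C#, so this is not its actual code) of what "take the next page from the end of the file" amounts to; the type and function names are hypothetical:

```c
#include <stdint.h>

/* Hypothetical data file state: the only thing the default strategy tracks
 * is where the end of the file currently is. */
typedef struct {
    int64_t next_page;   /* page number just past the current end of the data file */
} DataFile;

/* Default allocation: whoever asks for a page, from whichever tree, gets the
 * next page at the end of the file. Interleaved allocations from different
 * trees therefore end up interleaved on disk as well. */
static int64_t allocate_page(DataFile *file)
{
    return file->next_page++;
}
```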
Imagine that the green sections are all pages that belong to the same B+Tree inside Voron. Traversing the B+Tree now means that we have a very high probability of having to jump around in the file a lot. Since we are memory mapped, we wouldn’t typically feel this, since we aren’t actually hitting the disk that often, but it has several interesting implications:
- Startup time can increase significantly, since we need to issue many I/O requests to different places in the file
- Flush / sync time is also increased, because it needs to touch more of the disk
Trees are typically used for indexes in Voron, and a document collection would typically have a few different storage indexes (lookup by etag, lookup by name, etc.). Because they store different data, they have different growth patterns, so they are going to allocate pages at different rates, which means that the scattering of the pages across the data file is even more severe.
The change we just finished implementing is going to do several important things all at once (a rough sketch of the idea follows the list below):
- Pages for all the storage indexes of a collection are going to be pre-allocated, and when they run out, more will be allocated in batches.
- The indexes will ask the storage to allocate pages near the sibling page, to increase locality even further.
- All indexes will use the same pre-allocation buffer, so they all reside in roughly the same place.
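Here is a rough sketch, again in C with hypothetical names and an illustrative batch size, of how pre-allocating sections and allocating near a sibling page could look; Voron's real allocator is more involved than this:

```c
#include <stdint.h>

#define PAGES_PER_SECTION 256  /* illustrative batch size, not the real value */

/* Hypothetical per-collection pre-allocation section, shared by all of the
 * collection's storage indexes so their pages end up in the same region. */
typedef struct {
    int64_t first_page;   /* start of the reserved, contiguous run of pages */
    int64_t used;         /* how many pages of the section were already handed out */
} Section;

typedef struct {
    int64_t next_file_page;  /* end of the data file, where new sections are carved out */
} DataFile;

/* Reserve a whole contiguous batch of pages at once. */
static Section allocate_section(DataFile *file)
{
    Section s = { .first_page = file->next_file_page, .used = 0 };
    file->next_file_page += PAGES_PER_SECTION;
    return s;
}

/* Allocate a page for an index, trying to stay close to a sibling page
 * (e.g. the previous leaf in the same B+Tree). If the section is exhausted,
 * grab a fresh batch instead of a single page at the end of the file. */
static int64_t allocate_near(DataFile *file, Section *s, int64_t sibling_page)
{
    if (s->used == PAGES_PER_SECTION)
        *s = allocate_section(file);

    /* In this simplified sketch "near the sibling" just means "next free slot
     * in the same section"; a real allocator could scan the section for the
     * free page closest to sibling_page. */
    (void)sibling_page;
    return s->first_page + s->used++;
}
```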
This also gives us some really interesting optimization opportunities. Since indexes are typically orders of magnitude smaller than the data they cover, it is possible to ask the operating system to prefetch the sections that we reserved for indexes for each collection in advance, leading to far less paging in the future and improving the startup time.
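As a hedged illustration, on Linux the prefetch hint for a memory-mapped region could look roughly like this (madvise with MADV_WILLNEED; on Windows the analogous call is PrefetchVirtualMemory). The function and the section bookkeeping here are assumptions for the sketch, not Voron's actual API:

```c
#include <sys/mman.h>
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 8192  /* illustrative page size */

/* Ask the OS to start paging in the reserved index sections ahead of time.
 * `map` is the base of the memory-mapped data file; each (first_page, page_count)
 * pair describes one pre-allocated index section. */
static void prefetch_index_sections(uint8_t *map,
                                    const int64_t *first_page,
                                    const int64_t *page_count,
                                    size_t sections)
{
    for (size_t i = 0; i < sections; i++) {
        void *start = map + first_page[i] * PAGE_SIZE;
        size_t len = (size_t)(page_count[i] * PAGE_SIZE);
        madvise(start, len, MADV_WILLNEED);  /* hint: we will need these pages soon */
    }
}
```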
It also means that the operating system can issue a lot more sequential reads and writes, which is perfectly in line with what we want.
The new allocation strategy ends up looking like this:
In this case, we have enough data to fill the first pre-allocated section, and then we allocate a new one. So instead of 4 operations to load things, we can do this in 2.
Even without explicit prefetching on our end, this is going to be great because the operating system is going to be able to recognize the pattern of access and optimize the access itself.
More posts in "Low level Voron optimizations" series:
- (02 Mar 2017) Primitives & abstraction levels
- (28 Feb 2017) Transaction lock handoff
- (20 Feb 2017) Recyclers do it over and over again.
- (14 Feb 2017) High data locality
- (08 Feb 2017) The page size bump
Comments
I like this. SQL Server and NTFS like to spread big, sequentially written data across any free space hole that they find. You literally get 100% fragmentation for a freshly written item sometimes. My 16GB paging file on drive C was created with 70,000 fragments. It's braindead.
How big is the allocation region size that you are using? My idea always was to base it on the existing size of the index and pre-allocate 1%. That way the waste is tiny and for big indexes (which are the only ones that matter) there ends up being near zero fragmentation.
Tobi, That is mostly dependent on the allocation strategy used and the alloc/dealloc pattern. There is also an issue of how much time you can spend trying to optimize a write vs. how much time it takes you to find a value.
We are using chunks of 2MB in size, which are used to store just index data. And we try to place similar items next to each other, so we get higher locality even there. Note that even on very large databases, our internal storage indexes will rarely exceed 1 - 2GB in size (that depends on the number of items, not item size), so this works very well in practice.
Yeah, a fixed 2MB also makes sense. It is notable, though, that the time it takes to perform a single disk seek burns 1MB of sequentially read data already. So a bigger block size might be worth it. Is that not worth it in your case?
Tobi, Remember that we are typically memory mapped, and we are actually cheating :-) At startup, we'll ask the OS to pre-load all those indexes directly, so most of the time, we will have optimal access pattern and no I/O. That is another reason to want to localize them.
Related to the varnish link and easier to read: https://www.varnish-cache.org/trac/wiki/ArchitectNotes