After talking about increasing the Voron page size, let us talk about another very important optimization. High data locality. The importance of locality comes up again and again in performance.The cost of getting the next bit of data can be so prohibitedly expensive that it dominates everything else, including standard Big O time complexity metrics. A fun discussion of that is here.
Remember that Voron actually stores all data in pages, and that means that it needs some way to allocate new pages. And by default, whenever you allocate a page, we use a page from the end of the file. In certain scenarios (pure sequential inserts), that generates some pretty good allocation pattern, but even there it can cause issues. Let us consider what the database file looks like after a while:
Imagine that the green sections are all pages that belong to the same B+Tree inside Voron. Traversing the B+Tree now means that we have a very high probability of having to jump around in the file a lot. Since we are memory mapped, we wouldn’t typically feel this, since we aren’t actually hitting the disk that often, but it has several interesting implications:
- Startup time can increase rapidly, since we need to issue many I/O requests to different places in the file
- Flush / sync time is also increased, because it need to touch more of the disk
Trees are typically used for indexes in Voron, and a document collection would typically have a few different storage indexes (lookup by etag, lookup by name, etc). Because they store different data, they have different growth pattern, so they are going to allocate pages at different rate, which means that the scattering of the pages across the data file is even more sever.
The change we just finished implementing is going to do several important things all at once:
- Pages for all the storage indexes of a collection are going to be pre-allocated, and when they run out, be allocated again in batches.
- The indexes will ask the storage to allocate pages nearby the sibling page, to increase locality even further.
- All indexes will use the same pre-allocation buffer, so they all reside in roughly the same place.
That also give us some really interesting optimizations opportunities. Since indexes are typically order of magnitude smaller than the data they cover, it is possible to ask the operation system to prefetch the sections that we reserved for indexes for each collection in advance, leading to far less paging in the future and improving the startup time.
It also means that the operation system can issue a lot more continuous reads and writes, which is perfectly in line with what we want.
The new allocation strategy ends up looking like this:
In this case, we have enough data to fill the first pre-allocated section, and then we allocate a new one. So instead of 4 operations to load things, we can do this in 2.
Even without explicit prefetching on our end, this is going to be great because the operating system is going to be able to recognize the pattern of access and optimize the access itself.