In my previous post, we stored keys and values as raw numbers inside the 8KB page. That was simple, but wasteful. For many scenarios, we are never going to need to utilize the full 8 bytes range for a long. Most numbers are far smaller.
In the example I gave in the last post, we are storing the following range of numbers (file offsets, basically). I’m using two test scenarios, one where I’m testing the full range (for correctness) and one where I’m testing files under 512 GB in size. Given that we are trying to compress the space, once we hit the 512GB mark, it is probably less urgent, after all.
Here are the number generations that I’m using:
What this means is:
|Full data set
|Realistic data set
This is meant to verify that we can handle any scenario, in practice, we can usually focus on the first 512 GB, which is far more common.
Using both approaches, I can fit using my previous approach, up to 511 entries per page. That makes sense, we are storing the data raw, so how can we do better? Most of the time, we don’t need anywhere near 8 bytes per value. For that reason, we have variable length encoding, which has many names, such as variable size int, 7 bits integers, etc. I adapted some methods from the .NET codebase to allow me to operate on Spans, like so:
Let’s check what sort of savings we can get using this approach:
- Under 127 bytes– 1 byte
- 128 bytes .. 32 KB – 2 bytes
- 32KB .. 8MB – 3 bytes
- 8MB .. 2GB – 4 bytes
- 2 GB .. 512 GB – 5 bytes
- 512GB .. 128 TB – 6 bytes
- 128TB .. 32 Petabytes – 7 bytes
- 32 Petabytes .. 8 Exabytes – 8 bytes
- Greater than 8 Exabytes – 9 bytes
That is really cool, since for the realistic data set, we can pack a lot more data into the page.
It comes with a serious issue, however. The data is no longer fixed size (well, that is the point, no?). Why is that a problem? Because we want to be able to do a binary search on that, which means that we need to be able to access the data by index. As usual, the solution is to utilize indirection. We’ll dedicate the bottom of the page to an array of fixed-size int (16 bits – sufficient to cover the 8KB range of the page) that will point to the actual location of the entry. Like before, we are going to reserve the first few bytes as a header, in this case we’ll use 4 bytes, divided into two shorts. Those will keep track of the writes to the bottom and the top of the page.
At the bottom, we’ll have the actual offsets that point to the entries, and at the top, we write the actual entries. Here is what this looks like:
Let’s see how our reading from the page will look now. As you can see, it is very similar to what we had before, but instead of going directly to the key by its offset, we have to use the indirection:
The offsets array contains the location of the entry in the page, and that is laid out as the [varint-key][varint-val]. So we read (and discard) the key from the offset we found (we have to do that to discover its size) and then we read and return the actual value.
Let’s look at how we implemented the actual binary search in the page:
This is a bog standard binary search, with the only interesting bit that we are going through the offsets array to find the actual location of the key, which we then read using variable size decoding.
The interesting part of this model happens when we need to set a value. Here is what this looks like, with my notes following the code.
This is quite a lot, I’ll admit. Let’s try to break up into individual pieces what is going on here.
First, we get the header values (bottom, top) and initialize them if empty (note that bottom is set to 4, after the header, while top is set to the end of the buffer). The idea is that the bottom grows up and the top grows down. This is called Slotted Page design and it is a staple of database design.
We then encode the key and the value into a temporary buffer. We need to do that so we’ll know what size the entry will take. Then we need to figure out if we are updating an existing record or creating a new one.
Updating an existing record is complex. This is because the size of the new record may be greater than the size of the old one. So we can’t put it in the same location. I’m handling this by just allocating new space for this entry, ignoring the old space that was allocated to it.
I’m not handling any deletes / space reclamation on this series. That is a separate subject, not complex, but fairly tedious to do properly. So I’m going to focus solely on writes.
Updates to an existing entry that also change its size aren’t in my test dataset, so I’m not worried about it too much here. I mention this to point out that variable length records bring with them considerations that we wouldn’t have run into with the fixed-size model.
And after all of this work? What are the results?
With the fixed-size version, we could fit 511 entries into the page. With the variable size int, however, we can do better.
For the realistic dataset, I can fit 712 entries for the page, and for the full dataset, 710 (there are very few very big elements even there, but we can see that it has an impact).
511 vs. 712 may not sound like much, but that is 40% increase in the number of entries that I can fit. To give some context, using 8KB pages, that is a difference of 5 MB per million entries. That adds up.
The question is, can we do better? More on that in my next post…