In the last post, I talked about how the simdcomp library is actually just doing bit-packing. Given a set of numbers, it will store just the first N bits of each number in a packed format. That is all. On its own, that isn’t very useful; we have to pair it with something else to make it do something more interesting.
Previously, I also talked about delta compression. What if we put the two together?
Given a sorted list of numbers, we’ll compute the delta between the numbers, then we can figure out how many bits we need to store them, and we can pack that tightly.
Analyzing the sample posting list I’m using, we get the following batches (of 128 numbers each)
- 328 batches at 13 bits
- 11 batches at 22 bits
- 7 batches at 23 bits
- 3 batches at 27 bits
- 1 batch at 31 bits
- 1 batch at 28 bits
Doing the math, we can fit all of those using 77KB or so.
The idea is that for each batch, we can use simdpack() to write just the relevant bits. We’ll need some extra metadata to keep track of the number of bits per batch, but that is just one byte for every 128 values. For that matter, we can bit-pack that as well.
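To make this concrete, here is a scalar sketch (my reconstruction, not the library’s code) that computes the deltas and the per-batch bit widths. In the real implementation, each batch of 128 deltas would then be handed to simdpack() with that bit count.

```csharp
// Sketch only: compute per-batch deltas and the one-byte bit-width metadata for each
// batch of 128 values. The actual packing is what simdpack() would then do.
static byte[] ComputeBatchBitWidths(uint[] sortedPostingList)
{
    const int BatchSize = 128;
    var widths = new byte[(sortedPostingList.Length + BatchSize - 1) / BatchSize];
    uint previous = 0;
    for (int batch = 0; batch < widths.Length; batch++)
    {
        uint maxDelta = 0;
        int start = batch * BatchSize;
        int end = Math.Min(start + BatchSize, sortedPostingList.Length);
        for (int i = start; i < end; i++)
        {
            maxDelta |= sortedPostingList[i] - previous; // delta from the previous value
            previous = sortedPostingList[i];
        }
        // bits needed for the largest delta in this batch - the per-batch metadata byte
        widths[batch] = (byte)(32 - System.Numerics.BitOperations.LeadingZeroCount(maxDelta | 1));
    }
    return widths;
}
```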
Note that in this case, we are using the simd packing routines just for bit packing. The actual compression aspect comes not from the packing, but from computing the deltas and then figuring out how densely we can pack those values.
This approach is called Frame Of Reference (FOR), and it has some really nice advantages. By splitting the data into blocks, we can take advantage of the patterns in the data. Here are the first 16 deltas in the posting list:
|      | Value      | Bits |
|------|-----------:|-----:|
| [0]  | 1871143144 | 31 |
| [1]  | 4 | 3 |
| [2]  | 4 | 3 |
| [3]  | 4 | 3 |
| [4]  | 4 | 3 |
| [5]  | 4 | 3 |
| [6]  | 4 | 3 |
| [7]  | 4 | 3 |
| [8]  | 4 | 3 |
| [9]  | 4 | 3 |
| [10] | 4 | 3 |
| [11] | 7984 | 13 |
| [12] | 4 | 3 |
| [13] | 4 | 3 |
| [14] | 4 | 3 |
| [15] | 4 | 3 |
Let’s assume that the batch size is 4 numbers. What would the bit widths be here? Basically, we take the maximum number of bits required for each batch of 4 numbers.
The first block will require us to use 31 bits, the second would be just 3 bits, but the third would require 13 bits and the last is 3 bits again.
The idea is that for each block, we have a separate frame of reference, the number of bits that are required for us to record & retrieve the data.
Of course, if you look at the data itself, we are wasting quite a lot of space here needlessly. There are just two values (at indexes 0 and 11) that require more than 3 bits.
There is an algorithm called Patched Frame Of Reference (PFor) that handles this. Basically, when we compute a batch, we’ll record it using just 3 bits in this case. What about the rest of the values? Those we’ll store outside the batch, as exceptions. On reading, we’ll need to patch the data to make it whole.
That adds a bit of complexity, but it also means that you can compress the data a lot more densely.
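To give a feel for the idea, here is a simplified sketch of the patched approach. This is an illustration only, not the actual PFor wire format: the values that don’t fit in the chosen bit width are kept on the side and patched back in when reading.

```csharp
// Simplified illustration: pack everything at a small bit width, keep whatever doesn't
// fit as an exception, and patch the exceptions back on read.
sealed class PatchedBatch
{
    public byte Bits;                                  // bit width used for the batch
    public uint[] Packed = Array.Empty<uint>();        // values masked to 'Bits' bits
    public List<(int Index, uint Value)> Exceptions = new();

    public static PatchedBatch Encode(uint[] deltas, byte bits)
    {
        var result = new PatchedBatch { Bits = bits, Packed = new uint[deltas.Length] };
        uint mask = (1u << bits) - 1;
        for (int i = 0; i < deltas.Length; i++)
        {
            result.Packed[i] = deltas[i] & mask;       // would be bit-packed for real
            if (deltas[i] > mask)
                result.Exceptions.Add((i, deltas[i])); // stored outside the batch
        }
        return result;
    }

    public uint[] Decode()
    {
        var values = (uint[])Packed.Clone();
        foreach (var (index, value) in Exceptions)
            values[index] = value;                     // patch the data to make it whole
        return values;
    }
}
```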
In the next post, I’m going to be talking about FastPFor, which combines SIMD bit packing, patched frame of reference, and some really clever tricks to get a very interesting integer compression algorithm.
In the previous post, I showed how you can compress integers using variable-size integers. That is a really straightforward approach, but it isn’t ideal. To start with, it means that we take at least one byte for each number, but in many cases, especially if we are using delta encoding, we can pack things a lot more tightly. For reference, here are the first 15 records in the posting list I’m using for testing, delta encoded:
1871143144
4
4
4
4
4
4
4
4
4
4
7984
4
4
4
See the run of 4s, repeating? Each one of them can be represented by 3 bits, but we’ll waste 32 or 64 bits to hold it.
When you search for material on “database stuff”, the building blocks that are used to build databases, you’ll usually find Lemire there. In this case, I would like to talk about the SimdComp library. From the GitHub repository:
A simple C library for compressing lists of integers using binary packing and SIMD instructions. This library can decode at least 4 billion of compressed integers per second on most desktop or laptop processors.
Those are some really nice numbers, but I actually have a pet peeve with the name of the library. It isn’t actually doing compression, what it does is bit packing, which is something quite different.
Bit packing just takes numbers that are 32 bits in length and stores them using fewer bits. For example, let’s consider the following list: 1,2,3,4,5,6,7,8,9,10. If I stored those as 32-bit integers, that would take 40 bytes. But given that the maximum value is 10, I can use 4-bit integers instead, taking only 5 bytes.
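To make that concrete, here is a tiny sketch that squeezes ten 4-bit values into 5 bytes:

```csharp
// Each input value must fit in 4 bits; two values share a single byte.
static byte[] PackNibbles(byte[] values)
{
    var packed = new byte[(values.Length + 1) / 2];
    for (int i = 0; i < values.Length; i++)
        packed[i / 2] |= (byte)(values[i] << ((i & 1) * 4));
    return packed;
}

// PackNibbles(new byte[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 }) returns just 5 bytes.
```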
The core of the library is this set of routines:
- simdpack()
- simdpackwithoutmask()
- simdunpack()
All of them have roughly the same structure, something like this: a switch statement that goes on for all 32 bit widths, each case selecting a different function. This code makes a lot of assumptions. You’ll always give it an input of exactly 128 numbers, and it will pack them and write them to the output. The idea is to be as compact as you possibly can.
Reading this code is… a challenge, because there is so much mental weight behind it, or so it seems. For example, take a look at how we pack a set of numbers into 2-bit integers:
The actual code goes on for quite a while (104 lines of dense SIMD code), so let’s try to break it up a bit. If we translate this from SIMD to scalar, we get code like this:
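The scalar version was shown as an image in the original; here is a rough equivalent of packing into 2-bit integers (my reconstruction, not the library’s code):

```csharp
// Pack 2-bit values: every 16 inputs are OR-ed into a single 32-bit output word.
static void Pack2Bits(ReadOnlySpan<uint> input, Span<uint> output)
{
    for (int i = 0; i < output.Length; i++)
    {
        uint word = 0;
        for (int j = 0; j < 16; j++)
            word |= (input[i * 16 + j] & 0b11) << (j * 2); // each value lands 2 bits higher
        output[i] = word;
    }
}
```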
If you are used to manipulating bits, it’s fairly straightforward. We start by setting the output to the first value, then the second value is shifted by 2 bits and merged in, then the third value is shifted by 4 bits, and so on.
With SIMD, the interesting part is that we aren’t operating on a single value, but on 4 at a time. That also means that there is a difference in how the data actually ends up being packed.
Given a list of values from 1 .. 16, bit packing them will usually result in the data looking like this:
In contrast, the SIMD bit packing model will store the data like so:
This is a result of operating on 4 items at a time, but in practice, the actual placement of the bits does not matter much, just that we are able to store and retrieve them.
Note that for packing, we have two options, one that uses a mask and the other that does not. That will be important later. The only difference between those options is whether a mask is applied before we splice the bits into the output, nothing else.
The reason I said that this isn’t compression is that this library is focused solely on packing the bits, it has no behavior / concept beyond that stage. In the next post, we’ll start looking into how we can make use of this in more interesting ways.
If you are building a database, the need to work with a list of numbers is commonplace. For example, when building an index, we may want to store the ids of all the documents for orders from Europe.
You can just store the raw values, of course, but that can be incredibly wasteful in terms of disk space, memory bandwidth and performance. Consider a database where millions of orders are from Europe. We’ll need multiple megabytes just to store the document ids. We can utilize patterns in the numbers to reduce the total storage size. For example, if you are storing a list of sorted numbers, you can store the delta of the numbers, instead of the whole value. The delta tends to be much smaller, after all. Take a look at the following code:
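The snippet itself isn’t reproduced here, but an in-place delta encoding of a sorted list is essentially this:

```csharp
// Walk backwards so each subtraction still sees the original value of its predecessor.
static void DeltaEncodeInPlace(Span<long> sorted)
{
    for (int i = sorted.Length - 1; i > 0; i--)
        sorted[i] -= sorted[i - 1];
}
```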
This will generate (in place) the deltas for the sorted list. We can then write them to a stream using 7-bit integer encoding (also called variable-size integers). This way, smaller numbers take less space, which is the point of doing the delta encoding first.
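For completeness, here is a sketch of the 7-bit encoding step (recent .NET versions have similar helpers on BinaryWriter, so treat this as an illustration rather than the exact code used):

```csharp
// 7 bits of payload per byte; the high bit flags that another byte follows.
static void Write7BitEncoded(Stream stream, ulong value)
{
    while (value >= 0x80)
    {
        stream.WriteByte((byte)(value | 0x80));
        value >>= 7;
    }
    stream.WriteByte((byte)value);
}
```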
For testing purposes, I have a list of 44,956 items, which looks like this (full list is here):
1871143144
1871143148
1871143152
1871143156
1871143160
1871143164
1871143168
1871143172
1871143176
1871143180
Note that I’m using int64, so if we stored the data as is, it would take ~351KB. Using delta encoding, we can pack that into just ~45KB.
I care a lot more about the decoding of the values, so let’s see how we read it back, shall we?
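A matching sketch of the read side, assuming the Write7BitEncoded() layout above: read each varint, then keep a running sum to undo the delta encoding.

```csharp
static long[] ReadDeltaEncoded(Stream stream, int count)
{
    var values = new long[count];
    long current = 0;
    for (int i = 0; i < count; i++)
    {
        ulong delta = 0;
        int shift = 0;
        int b;
        do
        {
            b = stream.ReadByte();
            delta |= (ulong)(b & 0x7F) << shift;   // accumulate 7 bits at a time
            shift += 7;
        } while ((b & 0x80) != 0);                 // high bit set means more bytes follow

        current += (long)delta;                    // running sum restores the original value
        values[i] = current;
    }
    return values;
}
```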
With very little code, we can gain quite a jump in information density. That is quite good. For reference purposes, taking the same data and compressing it using GZip would cost us ~58KB, so significantly more than this simpler mechanism.
On the other hand, using GZip on the delta encoded list will put us under 8KB. This is because in this context, there is a great deal of repetition in the data, there are many runs that are exactly 4 values apart from one another, so we can get quite a good compression ratio out of this.
That does require patterns in the data, which is by no means a guarantee. There is a whole field of research out there into integer compression, since it is such an important aspect of how database internals work.
In this series, I’m going to focus on detailing how we implemented FastPFor inside of RavenDB, what we learned along the way and what the final results were.
I just completed a major refactoring of a piece of code inside RavenDB that is responsible for how we manage sorted queries. The first two tiers of tests all passed, which is great. Now was the time to test how this change performed. I threw 50M records into RavenDB and indexed them. I did… not like the numbers I got back. That makes sense; since I was heavily refactoring to get to a particular structure, I could think of a few ways to improve performance, but I like doing that based on profiler output.
When running the same scenario under the profiler, the process crashed. That is… quite annoying, as you can imagine. In fact, I discovered a really startling issue.
If I index the data and query on it, I get the results I expect. If I restart the process and run the same query, I get an ExecutionEngineException. Trying to debug those is a PITA. In this case, I’m 100% at fault; we are doing a lot of unsafe things to get better performance, and it appears that I messed up something along the way. But my only reproduction is a 50M records dataset. To give some context, this means 51GB of documents to be indexed and 18 GB of index data. Indexing this in release mode takes about 20 minutes. In debug mode, it takes a lot longer.
Trying to find an error there, especially one that can only happen after you restart the process is going to be a challenging task. But this isn’t my first rodeo. Part of good system design is knowing how to address just these sorts of issues.
The indexing process inside RavenDB is single-threaded per index. That means that we can rule out a huge chunk of issues around race conditions. It also means that we can play certain tricks. Allow me to present you with the nicest tool for debugging that you can imagine: repeatable traces.
Here is what this looks like in terms of code:
In this case, you can see that this is a development only feature, so it is really bare-bones one. What it does is capture the indexing and commit operations on the system and write them to a file. I have another piece of similarly trivial code that reads and applies them, as shown below. Don’t bother to dig into that, the code itself isn’t really that interesting. What is important is that I have captured the behavior of the system and can now replay it at will.
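To give a flavor of what such a development-only trace can look like, here is a trivialized sketch. All the names here are invented for illustration; this is not RavenDB’s actual code.

```csharp
// Bare-bones capture of the operations the indexer performs, so the exact same
// sequence can be replayed later against a fresh database.
sealed class IndexingTraceRecorder : IDisposable
{
    private readonly StreamWriter _trace;

    public IndexingTraceRecorder(string path) => _trace = File.AppendText(path);

    public void RecordIndexedBatch(IEnumerable<string> documentIds)
    {
        foreach (var id in documentIds)
            _trace.WriteLine("index " + id);
    }

    public void RecordCommit() => _trace.WriteLine("commit");

    public void Dispose() => _trace.Dispose();
}
```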
The code itself isn’t much, but it does the job. More importantly, note that we have calls to StopDatabase() and StartDatabase(); I was able to reproduce the crash using this code.
That was a massive win, since it dropped my search area from 50M documents to merely 1.2 million.
The key aspect of this is that I now have a way to play around with things. In particular, instead of using the commit points in the trace, I can force a commit (and start / stop the database) every 10,000 items (by calling FlushIndexAndRenewWriteTransaction). When using that, I can reproduce this far faster. Here is the output when I run this in release mode:
1 With 0
2 With 10000
3 With 10000
4 With 10000
5 With 10000
6 With 10000
7 With 10000
8 With 10000
9 With 10000
10 With 10000
11 With 10000
Fatal error. Internal CLR error. (0x80131506)
So now I dropped the search area to 120,000 items, which is pretty awesome. Even more important, when I run this in debug mode, I get this:
1 With 0
2 With 10000
Process terminated. Assertion failed.
at Voron.Data.Containers.Container.Get(Low...
So now I have a repro in 30,000 items, what is even better, a debug assertion was fired, so I have a really good lead into what is going on.
The key challenge in this bug is that it is probably triggered as a result of a commit and an index of the next batch. There is a bunch of work that we do around batch optimizations that likely cause this sort of behavior. By being able to capture the input to the process and play with the batch size, we were able to reduce the amount of work required to generate a reproduction from 50M records to 30,000 and have a lead into what is going on.
With that, I can now start applying more techniques to narrow down what is going on. But by far the most important aspect as far as I’m concerned is the feedback cycle. I can now hit F5 to run the code and encounter the problem in a few seconds.
It looks like we hit a debug assertion because we keep a reference to an item that was already freed. That is really interesting, and now I can find out which item, and then figure out why this is the case. At each point, I can simply go one step back in the investigation and reproduce the state; all I have to do is hit F5 and wait a bit. This means that I can be far more liberal in how I figure out this bug.
This is triggered by a query on the indexed data, and if I follow up the stack, I have:
This is really interesting, I wonder… what happens if I query before I restart the database? With this structure, this is easy to do.
This is actually a big relief. I had no idea why restarting the database would cause us to expose this bug.
Another thing to note is that when I ran into the problem, I reproduced it with a query that sorted on a single field. In the test code, I’m testing on all fields, so that may help expose this faster.
Right now the problem reproduces on the id field, which is unique. That helps, because it removes a large swath of code that deals with multiple terms for an entry. The current stage is that I can now reproduce this issue without running the queries, and I know exactly where it goes wrong.
And I can put a breakpoint on the exact location where this entry is created:
By the way, note that I’m modifying the code instead of using a conditional breakpoint. This is because of the performance difference. For a conditional breakpoint, the debugger has to stop execution, evaluate the condition and decide what to do. If this is run a lot, it can have a huge impact on performance. Easier to modify the code. The fact that I can do that and hit F5 and get to the same state allows me to have a lot more freedom in the ergonomics of how I work.
This allows me to discover that the entry in question was created during the second transaction. But the failure happens during the third, which is really interesting. More to the point, it means that I can now do this:
With the idea that this will trigger the assert on the exact entry that causes the problem. This is a good idea, and I wish that it had worked, but we are actually doing a non-trivial amount of work during the commit process, so now we have a negative result and another clue: this is happening in the commit phase of the indexing process. It’s not a big loss; I can do the same in the commit process as well. I have done just that, and now I know that I have a problem when indexing the term: “tweets/1212163952102137856”. Which leads to this code:
And at this point, I can now single step through this and figure out what is going on, I hope.
When working on complex data structures, one of the things that you need to do is make it possible to visualize them. Being able to manually inspect the internal structure of your data structures can save you a lot of debugging. As I mentioned, this isn’t my first rodeo. So when I narrowed it down to a specific location, I started looking into exactly what is going on.
Beforehand, I need to explain a couple of terms (pun intended):
- tweets/1212163952102137856 – this is the entry that triggers the error.
- tweets/1212163846623727616 – this is the term that should be returned for 1679560
Here is what the structure looks like at the time of the insert:
You can notice that the value here for the last page is the same as the one that we are checking for 1679560.
To explain what is going on will take us down a pretty complex path that you probably don’t care about, but the situation is that we are keeping track of the id in two locations. Making sure to add and remove it in both locations as appropriate. However, at certain points, we may decide to shuffle things around inside the tree, and we didn’t sync that up properly with the rest of the system, leading to a dangling reference.
Now that I know what is going on, I can figure out how to fix it. But the story of this post was mostly about how I figured it out, not the bug itself.
The key aspect was to get to the point where I can reproduce this easily, so I can repeat it as many times as needed to slowly inch closer to the solution.
I’m doing a pretty major refactoring inside of RavenDB right now. I was able to finish a bunch of work and submitted things to the CI server for testing. RavenDB has several layers of tests, which we run depending on context.
During development, we’ll usually run the FastTests. About 2,300 tests are being run to validate various behaviors for RavenDB, and on my machine, they take just over 3 minutes to complete. The next tier is the SlowTests, which run for about 3 hours on the CI server and run about 26,000 tests. Beyond that we actually have a few more layers, like tests that are being run only on the nightly builds and stress tests, which can take several minutes each to complete.
In short, the usual process is that you write the code and write the relevant tests. You also validate that you didn’t break anything by running the FastTests locally. Then we let CI pick up the rest of the work. At the last count, we had about 9 dedicated machines as CI agents and given our workload, an actual full test run of a PR may complete the next day.
I’m mentioning all of that to explain that when I reviewed the build log for my PR, I found that there were a bunch of tests that failed. That was reasonable, given the scope of my changes. I sat down to grind through them, fixing them one at a time. Some of them were quite important things that I didn’t take into account, after all. For example, one of the tests failed because I didn’t account for sorting on a dynamic numeric field. Sorting on a numeric field worked, and a dynamic text field also worked. But a dynamic numeric field didn’t. It’s the sort of thing that I would never think of, but we have the tests to cover us.
But when I moved to the next test, it didn’t fail. I ran it again, and it still didn’t fail. I ran it in a loop, and it failed on the 5th iteration. That… sucked. Because it meant that I had a race condition in there somewhere. I ran the loop again, and it failed again on the 5th. In fact, in every iteration I tried, it would only fail on the 5th iteration.
When trying to isolate a test failure like that, I usually run it in a loop and hope that with enough iterations, I’ll get it to reproduce. Having it happen constantly on the 5th iteration was… really strange. I tried figuring out what was going on, and I realized that the test was generating 1,000 documents using a Random instance. The fact that I’m using Random is the reason it is non-deterministic, of course, except that this is the code inside my test base class:
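The actual field isn’t reproduced here, but it was something along these lines (the seed value below is invented):

```csharp
// Seeded, so the sequence of values is deterministic... but note the modifier.
protected static readonly Random Random = new Random(357);
```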
So this is properly initialized with a seed, so it will be consistent.
Read the code again, do you see the problem?
That is a static value. So there are two problems here:
- I’m getting the bad values on the fifth run in a consistent manner because that is the set of results that reproduce the error.
- This is a shared instance that may be called from multiple tests at once, leading to the wrong result because Random is not thread safe.
Before fixing this issue so it would run properly, I decided to use an ancient debugging technique. It’s called printf().
In this case, I wrote out all the values that were generated by the test and wrote a new test to replay them. That one failed consistently.
The problem was that it was still too big in scope. I iterated over this approach, trying to end up with a smaller section of the codebase that I could invoke to repeat this issue. That took most of the day. But the end result is a test like this:
As you can see, in terms of the amount of code that it invokes, it is pretty minimal. Which is pretty awesome, since that allowed me to figure out what the problem was:
I’ve been developing software professionally for over two decades at this point. I still get caught up with things like that, sigh.
In this series so far, we reduced the storage cost of key/value lookups by a lot. And in the last post we optimized the process of encoding the keys and values significantly. This is great, but the toughest challenge is ahead of us, because as much as encoding efficiency matters, the absolute cost we have is doing lookups. This is the most basic operation, which we do billions of times a second. Any amount of effort we’ll spend here will be worth it. That said, let’s look at the decoding process we have right now. It was built to be understandable over all else, so it is a good start.
What this code does is accept a buffer and an offset into the buffer. But the offset isn’t just a number; it is composed of two values. The top 12 bits contain the position of the entry in the page, but since we use 2-byte alignment for the entry position, we can simply assume a zero bit at the bottom of it. That is why we compute the actual position in the page by clearing the bottom four bits and then shifting right by three bits. That lets us express the actual position (usually a 13-bit value) using just 12 bits. The bottom four bits of the offset are the indicator for the key and value lengths. There are 15 known combinations, which we computed based on their probabilities, and one value reserved to say: this is a rare key/value length combination, and the actual sizes are stored as the first byte in the entry.
Note that in the code, we handle that scenario by reading the key and value lengths (stored as two nibbles in the first byte) and incrementing the offset in the page. That means that we skip past the header byte in those rare situations.
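Putting that description into a sketch (the table contents, the marker value, and the names here are my assumptions, not the actual RavenDB code):

```csharp
// 15 common (keyLen << 4 | valLen) combinations; index 0 is assumed here to be the
// marker meaning "rare combination, read the sizes from the entry's first byte".
static readonly byte[] KnownLengths = new byte[16];
const int RareMarker = 0;

static (int Position, int KeyLen, int ValLen) UnpackOffset(ReadOnlySpan<byte> page, ushort offset)
{
    int pos = (offset & 0xFFF0) >> 3;   // 12 bits of 2-byte-aligned position in the page
    int nibble = offset & 0xF;          // which length combination this entry uses

    byte lengths = KnownLengths[nibble];
    if (nibble == RareMarker)
    {
        lengths = page[pos];            // two nibbles: key length, value length
        pos++;                          // skip past that header byte
    }
    return (pos, lengths >> 4, lengths & 0xF);
}
```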
The rest of the code is basically copying the key and value bytes to the relevant variables, taking advantage of partial copy and little-endian encoding.
The code in question takes 512 bytes and has 23 branches. In terms of performance, we can probably do much better, but the code is clear in what it is doing, at least.
The first thing I want to try is to replace the switch statement with a lookup table, just like we did before. Here is what the new version looks like:
The size of the function dropped by almost half and we are down to only 7 branches. There are also a couple of calls to the memory copy routines that weren’t inlined. In the encoding phase, we reduced branches due to bound checks by using raw pointers, and we skipped the memory copy routines by copying a fixed-size value at varied offsets to get the data properly aligned. In this case, we can’t really do the same. One thing that we have to be aware of is the following situation:
In other words, we may have an entry that sits at the very end of the page. If we unconditionally read 8 bytes, we may read past the end of the buffer. That is not something that we can do. In the Encode() case, we know that the caller gave us a buffer large enough to accommodate the largest possible size, so that isn’t an issue. That complicates things, sadly, but we can go the other way around.
The Decode() function will always be called on an entry, and that is part of the page. The way we place entries means that we are starting at the top and moving down. The structure of the page means that we can never actually place an entry below the first 8 bytes of the page. That is where the header and the offsets array are going, after all. Given that, we can do an unconditional read backward from the entry. As you can see in the image below, we are reading some data that we don’t care about, but this is fine, we can fix it later, and without any branches.
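A sketch of what such a backward read can look like (my reconstruction, using raw pointers; not the actual ReadBackward):

```csharp
// Load the 8 bytes that end where the wanted bytes end, then shift away the garbage in
// front of them. Safe because no entry can start within the first 8 bytes of the page.
static unsafe ulong ReadBackward(byte* fieldStart, int len) // len is between 1 and 8
{
    ulong value = *(ulong*)(fieldStart + len - 8);   // unconditional, may cover earlier bytes
    return value >> ((8 - len) * 8);                 // keep only the 'len' bytes we care about
}
```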
The end result is that we can have the following changes:
I changed the code to use a raw pointer, avoiding the bound checks that we already reasoned about. Most interesting is the ReadBackward function. This is an inner function that was properly inlined during JIT compilation; it implements the backward reading of the value. Here is what the assembly looks like:
With this in place, we are now at 133 bytes and a single branch operation. That is pretty awesome, but we can do better still. Consider the following code (explanations to follow):
Note that the first element in the table here is different: it is now setting the 4th bit. This is because we are making use of that. The structure of the bytes in the table is two nibbles, but no other value in the table sets the 4th bit. That means that we can operate on that.
Indeed, what we are doing is using the decoder byte to figure out what sort of shift we want. We have the byte from the table and the byte from the buffer, and we use the fact that masking with 8 gives (just for this one value) the value of 8. We can then use that to select the appropriate byte. If we have an offloaded byte, we’ll shift the value by 8, getting the byte from the buffer. For any other value, we’ll get 0 as the shift index, resulting in us getting the value from the table. That gives us a function with zero branches, and 141 bytes.
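In sketch form, the selection looks roughly like this (the table layout and names are assumptions based on the description above, not the actual code):

```csharp
// decoderTable[0] is the only entry with bit 3 set, which makes 'shift' equal to 8
// exactly when the lengths were offloaded into the entry's first byte.
static (int KeyLen, int ValLen) ReadLengths(byte[] decoderTable, int nibble, byte firstEntryByte)
{
    byte tableByte = decoderTable[nibble];
    int shift = tableByte & 8;                                   // 8 for the marker, 0 otherwise
    int lengths = ((firstEntryByte << 8) | tableByte) >> shift;  // picks the buffer byte or the table byte
    return ((lengths >> 4) & 0xF, lengths & 0xF);
}
```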
I spent a lot of time thinking about this, so now that we have those two approaches, let's benchmark them. The results were surprising:
| Method | Mean | Error | StdDev |
|------------------------ |-----------:|---------:|---------:|
| DecodeBranchlessShifts | 2,107.1 ns | 20.69 ns | 18.34 ns |
| DecodeBranchy | 936.2 ns | 1.89 ns | 1.68 ns |
It turns out that the slightly smaller code with the branches is able to beat the branchless code. When looking into what they are doing, I think I can guess why. Branches aren’t a huge problem if they are predictable, and in our case, the whole point of all of this is that the offload process, where we need to go to the entry to get the value, is meant to be a rare event. In the branchless code, on the other hand, you have to do extra work every time to avoid a branch (like shifting the value from the buffer up and maybe shifting it down, etc.).
You can’t really argue with a difference like that. We also tried an AVX version, to see if this would have better performance. It turns out that there is really no way for us to beat the version with the single branch. Everything else was at least twice as slow.
At this point, I believe that we have a winner.
In my previous post, I showed how we use the nibble offload approach to store the size of entries in space that would otherwise be unused. My goal in that post was clarity, so I tried to make sure that the code was as nice to read as possible. In terms of machine code, that isn’t really ideal. Let’s talk about how we can make it better. Here is the starting version:
This code generates a function whose size exceeds 600 bytes and contains 24(!) branches. I already talked before about why this is a problem; there is a good discussion of the details on branches and their effect on performance here. In short, fewer branches are better. And when looking at machine instructions, the smaller the function, the better off we are.
The first thing to do then is to remove the switch statement and move to a table-based approach. Given that this is a lookup of a small set of values, we can precompute all the options and just do a lookup like that. Here is what the code looks like:
This is already a win, since we are now at about half the code size (340 bytes) and there are just 7 branches in the function. Let’s take a look at the code and its assembly:
Code:

    if (inlined == 0)
    {
        writeOffset = 1;
        buffer[0] = (byte)(keyLen << 4 | valLen);
    }

Assembly:

    L0065: test r14d, r14d
    L0068: jne short L0081
    L006a: mov r15d, 1
    L0070: mov ecx, ebx
    L0072: shl ecx, 4
    L0075: or ecx, ebp
    L0077: test edi, edi
    L0079: je L014f
    L007f: mov [rsi], cl
As you can see, in the assembly, we first test the value, and if it isn’t zero, we jump past the if statement. If it is 0, we shift left by 4, or the values together, then we do another check and finally set the value in the buffer.
Where does this check come from? There is no if there.
Well, that is the bound checking that we get from using Span. In fact, most of the checks there are because of Span or because of the safe intrinsics that are used.
Let’s get rid of this. There are actually several interesting iterations in the middle, but let’s jump directly to the final result I have. It’s under 10 lines of code, and it is quite beautiful.
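The code itself was shown as an image; the sketch below reconstructs its shape from the description that follows. The exact signature, the index math, and the table contents are my assumptions, not the actual RavenDB code; EncodingTable is the lookup table discussed below.

```csharp
// EncodingTable holds one byte per (keyLen, valLen) combination: a nibble code for the
// common ones, and 16 for combinations that cannot be offloaded into the offsets array.
static unsafe int Encode(byte* buffer, ulong key, ulong val, out int nibble)
{
    int keyLen = (71 - System.Numerics.BitOperations.LeadingZeroCount(key | 1)) >> 3;
    int valLen = (71 - System.Numerics.BitOperations.LeadingZeroCount(val | 1)) >> 3;

    int entry = EncodingTable[((keyLen - 1) << 3) | (valLen - 1)];
    nibble = entry & 0xF;          // goes into the offsets array
    int writeOffset = entry >> 4;  // 1 only when the sizes must stay in the entry itself

    // written unconditionally; harmless (and overwritten) when the sizes fit in the nibble
    buffer[0] = (byte)((keyLen << 4) | valLen);

    // little-endian: write all 8 bytes of each value; the value write overwrites the
    // uninteresting tail of the key, and anything past the returned length is ignored
    *(ulong*)(buffer + writeOffset) = key;
    *(ulong*)(buffer + writeOffset + keyLen) = val;

    return writeOffset + keyLen + valLen;
}
```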
I’m sure that you’ll look at this versus the previous iterations and go… huh?! I mean, even the reference table is different now.
Let’s analyze what is going on here in detail. The first thing you’ll note is that the method signature changed. Before, we had multiple result types; now we use out parameters. This turns out to generate slightly less code, so I went with that approach instead.
Next, computing the number of bytes we need to copy is the same. Once we have the sizes of the key and the value, we fetch the relevant instruction from the reference table. We do that in a way that skips the bounds checking on the table, since we know that we cannot exceed the length of the array.
Unlike before, we have new values in the table: where before we had 0 for entries that we didn’t care about, now we put 16. That means that we need to clear that bit when we set the nibble parameter. The reason for this is that we use it to compute the writeOffset. For cases where we have an offloaded nibble, the shift we do there will clear all the bits, leaving us with a zero. For the values we cannot offload, we get 16, and shifting that by 4 gives us 1.
The reason we do it this way is that we do not use any conditional in this method. We unconditionally set the first byte of the buffer, because it is cheaper to do the work and discard that than check if it needs to be done.
Finally, we previously used Span's CopyTo() to move the data. That works, but it turns out that it is not ideal for very small writes like the ones we have here. At the same time, we write a variable size each time; how can we manage that?
The answer is that we know that the buffer we are passed is large enough to contain the largest size possible. So we don’t need to worry about bound checking, etc.
We take advantage of the fact that the data is laid out in little-endian format and just write the whole 8 bytes of the key to the buffer at the right location, which may be shifted by the computed writeOffset. We then write the value immediately following the computed key length. The idea is that the second write overwrites part of what we just wrote (because those bytes were found to be uninteresting). Using this approach, we were able to drop the code for this function to 114 bytes(!). Even with the encoding table, that is under three cache lines for the whole thing. That is really small.
There are also no conditionals or branches throughout the process. This is a piece of code that is ready and willing to be inlined anywhere. The amount of effort to understand what is going on here is balanced against how much this function can achieve in its lifetime.
For reference, here is the assembly of the encoding process:
The EncodingTable deserves a better explanation. I generated it using the following code:
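The generating code isn’t reproduced here; a sketch of the approach could look like the following, where the list of the 15 common (keyLen, valLen) pairs comes from measured data and is not shown:

```csharp
// Builds a 64-entry table: common pairs get their nibble code (1..15), everything else
// gets 16, which is what lets the encoder derive writeOffset with a single shift.
static byte[] BuildEncodingTable(IReadOnlyList<(int KeyLen, int ValLen)> commonPairs)
{
    var table = new byte[64];
    for (int keyLen = 1; keyLen <= 8; keyLen++)
    {
        for (int valLen = 1; valLen <= 8; valLen++)
        {
            int code = 16; // assume "cannot offload" until proven otherwise
            for (int i = 0; i < commonPairs.Count; i++)
            {
                if (commonPairs[i] == (keyLen, valLen))
                    code = i + 1;
            }
            table[((keyLen - 1) << 3) | (valLen - 1)] = (byte)code;
        }
    }
    return table;
}
```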
In other words, I basically wrote out all the options and generated the appropriate value for each one of the options. I’m using the value of 16 here to be able to get to the right result using some bit shifts without any conditionals. So instead of doing many conditions, I replaced this with a (small) table lookup.
In my next post, I’m going to try to handle the same process for decoding.
Moving to nibble encoding gave us a measurable improvement in the density of the entries in the page. The problem is that we have pretty much run out of room to continue doing so. We are currently using a byte per entry to hold the size of the entry (as two nibbles, of 4 bits each). You can’t really go lower than that.
Let’s review again what we know about the structure of the data. We have an 8KB page with three sections: a fixed-size header, a variable-size offsets array, and the entries themselves. Here is what this looks like:
This is called a slotted page design. The idea is that the offset array at the bottom of the page is maintaining the sort orders of the entries, and that we can write the entries from the top of the page. When we need to sort the entries, we just need to touch the offsets array (shown in yellow in the image).
Given that we are talking about size and density, we spent a lot of time trying to reduce the size of the entries, but can we do something with the header or the offsets? The header is just 4 bytes right now: two shorts that denote the bottom and top positions in the page. Given that the page is 8KB in size, we have to use a 16-bit integer to cover the range. For the offsets, the situation is the same. We have to be able to point to an entry location anywhere in the page, and that means that we have to reach 8KB. So the offsets are actually 16-bit ints and take two bytes each.
In other words, there is a hidden overhead of 2 bytes per entry that we didn’t even consider. In the case of our latest success, we were able to push 759 entries into the page, which means that we are actually using 18.5% of the page just to hold the offsets of the entries. That is 1.48 KB that is being used.
The problem is that we need to use this. We have to be able to point to an entry anywhere in the page, which means that we have to reach 0 .. 8192. The minimum size we can use is 16 bits or two bytes.
Or do we?
16 bits gives us a range of 0 .. 65,535, after all. That is far in excess of what we need. We could use a 64KB page, but there are other reasons to want to avoid that.
To cover 8KB, we only need 13 bits, after all. For that matter, we can do even better: if we decide that an entry should be aligned on 2 bytes, we can address the entire page in just 12 bits.
That means that we have 4 whole bits free to play with. The first idea is to change the offsets array from 16-bit ints to 12-bit ints. That would save us about 380 bytes at 759 entries per page. That is quite a lot. Unfortunately, working with bits in this manner would be super awkward. We are doing a lot of random access and moves while we are building the page. It is possible to do this using bits, but not fun.
So we can set things up so we have a nibble free to use. We just used nibbles to save on the cost of variable size ints, to great success.
However, we don’t need just a nibble; we need two of them. We need to store the size of the key and the value in bytes. Actually, we don’t quite need two full nibbles. The size of the key and the value each maxes out at 8 bytes, after all, so we can encode each in 3 bits. In other words, we need 6 bits to encode this information.
We only have 4 bits, however. It is a really nice idea, though, and I kept turning it over in my head, trying to come up with all sorts of clever ways to figure out how we could push 64 values into 4 bits. The impact of that would be pretty amazing.
Eventually, I realized that it is fairly easy to prove, using math, that there is no way to do so. Faced with this failure, I realigned my thinking and found a solution. I don’t need to have a perfect answer, I can have a good one.
4 bits give me a range of 16 values (out of the possible 64). If I give up on trying to solve the whole problem, can I solve a meaningful part of it?
And I came up with the following idea. We can do a two-stage approach, we’ll map the most common 15 values of key and value sizes to those 4 bits. The last value will be a marker that you have to go and look elsewhere.
Using just the data in the offset, I’m able to figure out what the location of the entry in the page is as well as the size of the key and value for most cases. For the (hopefully rare) scenarios where that is not the case, we fall back to storing the size information as two nibbles preceding the entry data.
This is a pretty neat idea, even if I say so myself, and it has a good chance to allow us to save about 1 byte per entry in the common case. In fact, I tested that and about 90% of the cases in my test case are covered by the top 15 cases. That is a pretty good indication that I’m on the right track.
All of that said, let’s look at how this looks in code:
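The code was shown as an image; its shape is roughly the following. The 15 real patterns come from measured frequencies, so the pairs below are purely illustrative, and which nibble value serves as the “look elsewhere” marker is my assumption.

```csharp
// Map a (keyLen, valLen) pair to a nibble code; 0 is used here as the marker meaning
// "rare combination, the sizes are stored in the entry itself".
static int GetSizeNibble(int keyLen, int valLen)
{
    return (keyLen, valLen) switch
    {
        (5, 1) => 1,
        (5, 2) => 2,
        (6, 1) => 3,
        (6, 2) => 4,
        // ... 15 common combinations in total ...
        _ => 0,
    };
}
```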
I’m using a switch expression here for readability, so it is clear what is going on. If the key and value sizes match one of the known patterns, we can put that in the nibble we’ll return. If they don’t, we’ll write the sizes to the entry buffer.
The Set method itself had to change in some subtle but crucial ways, let’s look at it first, then I’ll discuss those changes:
As before, we encode the entry into a temporary buffer. Now, in addition to getting the length of the entry, we are also getting the nibble that we’ll need to store.
You can see the changes in how we work with the offsets array following that. When we need to update an existing value, we are using this construction to figure out the actual entry offset:
var actualEntryOffset = ((offsets[idx] & 0xFFF0) >> 3);
What exactly is going on here? Don’t try to figure it out yet, let’s see how we are writing the data:
top = (ushort)((top - reqEntryLen) & ~1); // align on two bytes boundary
offsets[idx] = (ushort)(top << 3 | nibble);
Those two code snippets may look very odd, so let’s go over them in detail.
First, remember that we have an 8KB page to work with, but we need to use 4 bits for the size nibble we got from encoding the entry. To address all 8,192 positions in the page, we need 13 bits. That is… a problem. We solve that by saying that the entry addresses must always be aligned on a two-byte boundary. That is handled by clearing the bottom bit in the new top computation. Since we are growing down, that has the effect of keeping things aligned by two.
Then, we merge the top location and the nibble together. We know that the bottom-most bit of top is cleared, so we can just shift the value left by 3 bits, and we know that we have 4 cleared bits ready for the nibble.
Conversely, when we want to read, we clear the bottom 4 bits and then shift right by three. That has the effect of returning us to the original value.
A little bit confusing, but we managed to squeeze 784 entries into the page using the realistic dataset and 765 using the full one. That is another 3.5% of space savings over the previous nibble attempt, and over a 10% increase in capacity from the variable integer approach.
And at this point, I don’t believe that there is anything more that I can do to reduce the size in a significant manner without having a negative impact elsewhere.
We are not done yet, however. We are done with the size aspect, but we also have much to do in terms of performance and optimizations for runtime.
In the meantime, you can see my full code here. In the next post, we are going to start talking about the actual machine code and how we can optimize it.
In my last post, we implemented variable-sized encoding to be able to pack even more data into the page. We were able to achieve 40% better density because of that. This is pretty awesome, but we would still like to do better. There are two disadvantages to variable size integers:
- They may take more space than the actual raw numbers.
- The number of branches is high, and non-predictable.
Given that we need to encode the key and value together, let’s see if we can do better. We know that both the key and the value are 8 bytes long. On little-endian systems, we can treat a number as a byte array.
Consider this number: 139,713,513,353, which is composed of the following bytes: [137, 7, 147, 135, 32, 0, 0, 0]. This is how it looks in memory. This means that we only need the first 5 bytes, not the last 3 zero ones.
It turns out that there is a very simple way to compute the number of used bytes, like so:
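One simple way to do it (a sketch; the actual code may differ slightly) is a leading-zero count:

```csharp
// The |1 makes sure we never report zero bytes for a value of 0.
static int UsedBytes(ulong value)
{
    int bits = 64 - System.Numerics.BitOperations.LeadingZeroCount(value | 1);
    return (bits + 7) / 8;   // round up to whole bytes
}
```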
This translates into the following assembly:
Which is about as tight as you can want it to be.
Of course, there is a problem. In order to read the value back, we need to store the number of bytes we used somewhere. Variable-sized integers use the top bit of each byte as a continuation flag, but we cannot do that here.
Remember, however, that we encode two numbers here, and the maximum length of a number is 8 bytes. That means that we need 4 bits to encode the length of each number, so if we take one additional byte, we can fit the lengths of both numbers into it.
The length of the key and the value would each fit on a nibble inside that byte. Here is what the encoding step looks like now:
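Here is a sketch of that encoding step, using the UsedBytes() helper above and BinaryPrimitives for the little-endian writes. It assumes the caller’s buffer is big enough for the worst case (1 + 8 + 8 bytes), so we can write full 8-byte values and let the returned length say how much of it is meaningful (and it ignores the bounds checks that Span adds, which a later post deals with).

```csharp
using System.Buffers.Binary;

static int Encode(Span<byte> buffer, ulong key, ulong val)
{
    int keyLen = UsedBytes(key);
    int valLen = UsedBytes(val);

    buffer[0] = (byte)((keyLen << 4) | valLen);   // both lengths in a single byte

    BinaryPrimitives.WriteUInt64LittleEndian(buffer[1..], key);
    // the value overwrites the unused high bytes of the key
    BinaryPrimitives.WriteUInt64LittleEndian(buffer[(1 + keyLen)..], val);

    return 1 + keyLen + valLen;
}
```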
And in assembly:
Note that there are no branches at all here. Which I’m really stoked about. As for decoding, we just have to go the other way around:
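And the matching decode sketch, with the same caveats (the buffer is assumed to have at least 8 readable bytes past each read position):

```csharp
static int Decode(ReadOnlySpan<byte> buffer, out ulong key, out ulong val)
{
    int keyLen = buffer[0] >> 4;
    int valLen = buffer[0] & 0xF;

    // masks that discard the bytes belonging to the next field (or garbage)
    ulong keyMask = ulong.MaxValue >> ((8 - keyLen) * 8);
    ulong valMask = ulong.MaxValue >> ((8 - valLen) * 8);

    key = System.Buffers.Binary.BinaryPrimitives.ReadUInt64LittleEndian(buffer[1..]) & keyMask;
    val = System.Buffers.Binary.BinaryPrimitives.ReadUInt64LittleEndian(buffer[(1 + keyLen)..]) & valMask;

    return 1 + keyLen + valLen;
}
```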
No branches, and really predictable code.
That is all great, but what about the sizes? We are always paying 4 additional bits per number, so it is actually a single additional byte for each entry we encode. With varints, the moment we encode numbers beyond the 2GB range, we are already winning: encoding 3,221,241,856 as a varint, for example, costs 5 bytes (since we limit the range of each byte to 7 bits). The key advantage in our case is that if either the key or the value needs to take an additional byte, we are at parity with the nibble method. If both of them need that, we are winning, since the nibble method will use a single additional byte while the variable size integers will take two (one for each number).
Now that we understand encoding and decoding, the rest of the code is basically the same. We just changed the internal format of the entry, nothing about the rest of the code changes.
And the results?
For the realistic dataset, we can fit 759 entries versus 712 for the variable integer model.
For the full dataset, we can fit 752 entries versus 710 for the variable integer model.
That is a 7% improvement in size, but it also comes with a really important benefit. Fewer branches.
This is the sort of code that runs billions of times a second. Reducing its latency has a profound impact on overall performance. One of the things that we pay attention to in high-performance code is the number of branches, because we are using superscalar CPUs, where multiple instructions may execute in parallel at the chip level. A branch may cause us to stall (we have to wait until the result is known before we can execute the next instruction), so the processor will try to predict what the result of the branch will be. If it is a highly predictable branch (an error path that is almost never taken, for example), there is very little cost to that.
The variable integer code, on the other hand, is nothing but branches, and as far as the CPU is concerned, there is no way to actually predict what the result will be, so it has to wait. Branchless or well-predicted code is a key aspect of high-performance code. And this approach can have a big impact.
As a reminder, we started at 511 items, and we are now at 759. The question is, can we do more?
I’ll talk about it in the next post…