Fight for every byte it takesNibbling at the costs
In my last post we implemented variable-sized encoding to be able to pack even more data into the page. We were able to achieve 40% better density because of that. This is pretty awesome, but we would still like to do better. There are two disadvantages for variable size integers:
- They may take more space than the actual raw numbers.
- The number of branches is high, and non-predictable.
Given that we need to encode the key and value together, let’s see if we can do better. We know that both the key and the value are 8 bytes long. Using little-endian systems, we can consider the number as a byte array.
Consider this number: 139,713,513,353 which is composed of the following bytes: [137, 7, 147, 135, 32, 0, 0, 0]. This is how it looks in memory. This means, that we only need the first 5 bytes, not the last 3 zero ones.
It turns out that there is a very simple way to compute the number of used bytes, like so:
This translates into the following assembly:
Which is about as tight as you can want it to be.
Of course, there is a problem. In order to read the value back, we need to store the number of bytes we used somewhere. For variable-sized integers, they use the top bit until they run out. But we cannot do that here.
Remember however, that we encode two numbers here. And the length of the number is 8 bytes. In binary, that means that we need 4 bits to encode the length of each number. This means that if we’ll take an additional byte, we can fit the length of both numbers into a single byte.
The length of the key and the value would each fit on a nibble inside that byte. Here is what the encoding step looks like now:
And in assembly:
Note that there are no branches at all here. Which I’m really stoked about. As for decoding, we just have to go the other way around:
No branches, and really predictable code.
That is all great, but what about the sizes? We are always taking 4 additional bits per number. So it is actually a single additional byte for each entry we encode. By using varint, the memory we encode numbers that are beyond the 2GB range, we’re already winning. Encoding (3,221,241,856), for example, will cost us 5 bytes (since we limit the range of each byte to 7 bits). The key advantage in our case is that if we have any case where either key or value needs to take an additional byte, we are at parity with the nibble method. If both of them need that, we are winning, since the nibble method will use a single additional byte and the variable size integer will take two (one for each number).
Now that we understand encoding and decoding, the rest of the code is basically the same. We just changed the internal format of the entry, nothing about the rest of the code changes.
And the results?
For the realistic dataset, we can fit 759 entries versus 712 for the variable integer model.
For the full dataset, we can fit 752 entries versus 710 for the variable integer model.
That is a 7% improvement in size, but it also comes with a really important benefit. Fewer branches.
This is the sort of code that runs billions of times a second. Reducing its latency has a profound impact on overall performance. One of the things that we pay attention to in high-performance code is the number of branches, because we are using super scalar CPUs, multiple instructions may execute in parallel at the chip level. A branch may cause us to stall (we have to wait until the result is known before we can execute the next instruction), so the processor will try to predict what the result of the branch would be. If this is a highly predictable branch (an error code that is almost never taken, for example), there is very little cost to that.
The variable integer code, on the other hand, is nothing but branches, and as far as the CPU is concerned, there is no way to actually predict what the result will be, so it has to wait. Branchless or well-predicted code is a key aspect of high-performance code. And this approach can have a big impact.
As a reminder, we started at 511 items, and we are now at 759. The question is, can we do more?
I’ll talk about it in the next post…
More posts in "Fight for every byte it takes" series:
- (01 May 2023) Decoding the entries
- (28 Apr 2023) Optimizing the encoding process
- (27 Apr 2023) Fitting 64 values in 4 bits
- (26 Apr 2023) Nibbling at the costs
- (25 Apr 2023) Variable size data
- (24 Apr 2023) Storing raw numbers
Join the conversation...