Fight for every byte it takes: Optimizing the encoding process
In my previous post, I showed how we use the nibble offload approach to store the size of entries in space that would otherwise go unused. My goal in that post was clarity, so I tried to make the code as readable as possible. In terms of the machine code it produces, however, that isn't ideal. Let's talk about how we can make it better. Here is the starting version:
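Since the full listing is long, here is a sketch of its shape rather than the verbatim code. The BytesNeeded helper and the particular switch arms are illustrative assumptions; the inlined and writeOffset names are the ones that show up in the snippets analyzed below.

```csharp
using System;
using System.Buffers.Binary;
using System.Numerics;

// Sketch of the switch-based starting point. The real switch has an arm for
// every offloadable (keyLen, valLen) combination, which is where the 600+
// bytes and 24 branches come from.
static (int Length, byte Nibble) Encode(long key, long val, Span<byte> buffer)
{
    int keyLen = BytesNeeded(key);
    int valLen = BytesNeeded(val);

    // Map the (keyLen, valLen) pair to a 4-bit code that can be offloaded
    // elsewhere, or 0 when no such code exists for this pair.
    int inlined = (keyLen << 4 | valLen) switch
    {
        0x11 => 1,
        0x12 => 2,
        0x21 => 3,
        // ... one arm per offloadable combination ...
        _ => 0,
    };

    int writeOffset = 0;
    if (inlined == 0)
    {
        // No nibble code available: spend a whole byte on the sizes.
        writeOffset = 1;
        buffer[0] = (byte)(keyLen << 4 | valLen);
    }

    // Copy only the meaningful bytes of the key and the value.
    Span<byte> scratch = stackalloc byte[16];
    BinaryPrimitives.WriteInt64LittleEndian(scratch, key);
    BinaryPrimitives.WriteInt64LittleEndian(scratch[8..], val);
    scratch[..keyLen].CopyTo(buffer[writeOffset..]);
    scratch.Slice(8, valLen).CopyTo(buffer[(writeOffset + keyLen)..]);
    return (writeOffset + keyLen + valLen, (byte)inlined);
}

// Bytes needed to hold the value, 1..8 (illustrative helper).
static int BytesNeeded(long v) =>
    (64 - BitOperations.LeadingZeroCount((ulong)v | 1) + 7) / 8;
```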
This code generates a function whose size exceeds 600 bytes and which contains 24(!) branches. I have already talked about why this is a problem, and there is a good discussion of branches and their effect on performance here. In short, fewer branches are better, and when it comes to machine code, the smaller the function, the better.
The first thing to do, then, is to remove the switch statement and move to a table-based approach. Given that we are looking up a small set of values, we can precompute all the options and do a simple lookup instead. Here is what the code looks like:
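In sketch form, continuing the version above (the indexing scheme and how the table is filled are assumptions; the point is the single lookup replacing the switch):

```csharp
// The 64 switch arms collapse into a precomputed 64-entry table: one byte
// per (keyLen, valLen) pair, holding the 4-bit code, or 0 for pairs that
// cannot be offloaded.
static readonly byte[] Nibbles = new byte[64]; // filled once with the codes

static (int Length, byte Nibble) Encode(long key, long val, Span<byte> buffer)
{
    int keyLen = BytesNeeded(key);
    int valLen = BytesNeeded(val);

    // A single table lookup instead of a 24-branch switch.
    byte inlined = Nibbles[(keyLen - 1) * 8 + (valLen - 1)];

    int writeOffset = 0;
    if (inlined == 0)
    {
        writeOffset = 1;
        buffer[0] = (byte)(keyLen << 4 | valLen);
    }
    // ... copy keyLen bytes of the key and valLen bytes of the value,
    // exactly as in the switch-based version ...
    return (writeOffset + keyLen + valLen, inlined);
}
```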
This is already a win, since we are now at about half the code size (340 bytes) and there are just 7 branches in the function. Let’s take a look at the code and its assembly:
Code:

```csharp
if (inlined == 0)
{
    writeOffset = 1;
    buffer[0] = (byte)(keyLen << 4 | valLen);
}
```

Assembly:

```
L0065: test r14d, r14d
L0068: jne short L0081
L006a: mov r15d, 1
L0070: mov ecx, ebx
L0072: shl ecx, 4
L0075: or ecx, ebp
L0077: test edi, edi
L0079: je L014f
L007f: mov [rsi], cl
```
As you can see in the assembly, we first test the value, and if it isn't zero, we jump past the if statement. If it is zero, we set the write offset, shift the key length left by 4 and or it with the value length, then perform another check, and finally store the byte in the buffer.
Where does that check come from? There is no if statement for it in the code.
Well, that is the bounds checking that comes from using Span. In fact, most of the checks in this function are there because of Span or because of the safe intrinsics that are used.
Let's get rid of this. There were actually several interesting iterations along the way, but let's jump directly to the final result. It's under 10 lines of code, and it is quite beautiful.
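Here is a reconstruction of its shape from the analysis that follows; the table layout and the specific Unsafe / MemoryMarshal calls are assumptions, not the verbatim listing, and the comments inflate the line count:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static int Encode(long key, long val, ref byte buffer, out byte nibble)
{
    int keyLen = BytesNeeded(key);
    int valLen = BytesNeeded(val);

    // (keyLen - 1) * 8 + (valLen - 1) is < 64 by construction, so we can go
    // through a ref (.NET 5+) and skip the bounds check on the table.
    byte entry = Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(EncodingTable),
        (keyLen - 1) * 8 + (valLen - 1));

    nibble = (byte)(entry & 0xF); // clear the 16 marker
    int writeOffset = entry >> 4; // offloadable: 0; otherwise 16 >> 4 == 1

    // Unconditionally write the size byte; when writeOffset == 0, the key
    // below simply overwrites it.
    buffer = (byte)(keyLen << 4 | valLen);

    // Write all 8 bytes of key and value (little-endian machine assumed).
    // The value overwrites the uninteresting tail of the key, and the caller
    // guarantees the buffer is large enough for the worst case.
    Unsafe.WriteUnaligned(ref Unsafe.Add(ref buffer, writeOffset), key);
    Unsafe.WriteUnaligned(ref Unsafe.Add(ref buffer, writeOffset + keyLen), val);
    return writeOffset + keyLen + valLen;
}

// 64 entries: a 4-bit offload code per (keyLen, valLen) pair, 16 otherwise.
// Its generation is shown at the end of the post.
static readonly byte[] EncodingTable = new byte[64];
```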
I’m sure that you’ll look at this versus the previous iterations and go… huh?! I mean, even the reference table is different now.
Let's analyze what is going on here in detail. The first thing you'll notice is the changed method signature. Before, we returned multiple results; now we use out parameters. This turns out to generate slightly less code, so I went with that approach instead.
Next, computing the number of bytes we need to copy is the same as before. Once we have the sizes of the key and the value, we fetch the relevant entry from the reference table. We do that in a way that skips the bounds check on the table, since we know that we cannot exceed the length of the array.
Unlike before, the table holds new values: where we previously had 0 for the combinations we didn't care about, we now put 16. That means we need to clear that bit when we set the nibble parameter. The reason for this is that we use the table entry to compute the writeOffset. For the combinations with an offloaded nibble, the shift clears all the bits, leaving us with zero; for the combinations we cannot offload, we get 16, and shifting right by 4 gives us 1.
The reason we do it this way is to avoid any conditionals in this method. We unconditionally set the first byte of the buffer, because it is cheaper to do the work and discard it than to check whether it needs to be done.
Finally, we previously used Span.CopyTo() to move the data. That works, but it turns out to be less than ideal for very small writes like ours. At the same time, we write a different size each time; how can we manage that?
The answer is that we know the buffer we are passed is large enough to contain the largest possible entry, so we don't need to worry about bounds checking.
We take advantage of the fact that the data is laid out in little-endian format and simply write the whole 8 bytes of the key to the buffer at the right location, shifted by the computed writeOffset. We then write the whole 8 bytes of the value immediately after the computed key length, overwriting the parts of the key we just wrote that turned out to be uninteresting. Using this approach, we were able to drop the code for this function to 114 bytes(!). Even with the encoding table, the whole thing fits in under three cache lines. That is really small.
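To make the caller's side of that contract concrete, here is a hypothetical usage of the sketch above:

```csharp
// Worst case output: 1 size byte + 8 key bytes + 8 value bytes = 17 bytes.
Span<byte> buffer = stackalloc byte[17];
int len = Encode(1337, 42, ref MemoryMarshal.GetReference(buffer), out byte nibble);
// The first 'len' bytes of 'buffer' go into the page; a non-zero 'nibble' is
// offloaded into otherwise unused bits, as covered in the previous posts.
```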
There are also no conditionals or branches throughout the process. This is a piece of code that is ready and willing to be inlined anywhere. The effort required to understand what is going on here is balanced against how much work this function will do over its lifetime.
For reference, here is the assembly of the encoding process:
The EncodingTable deserves a better explanation. I generated it using the following code:
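A sketch of what that generation looks like (which pairs get one of the 15 offloadable codes is an assumption here; the part that matters is the 16 marker for everything else):

```csharp
using System.Collections.Generic;

// Builds the 64-entry table. 'offloadable' lists the (keyLen, valLen) pairs
// that get a 4-bit code, at most 15 of them, chosen by how common they are.
static byte[] BuildEncodingTable(List<(int KeyLen, int ValLen)> offloadable)
{
    var table = new byte[64];
    for (int keyLen = 1; keyLen <= 8; keyLen++)
    {
        for (int valLen = 1; valLen <= 8; valLen++)
        {
            int code = offloadable.IndexOf((keyLen, valLen)); // -1 when absent
            // Offloadable pairs get their code (1..15); everything else gets
            // 16, so that entry >> 4 == 1 (pay a size byte) and
            // entry & 0xF == 0 (nothing to offload).
            table[(keyLen - 1) * 8 + (valLen - 1)] =
                code >= 0 ? (byte)(code + 1) : (byte)16;
        }
    }
    return table;
}
```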
In other words, I basically wrote out all the options and generated the appropriate value for each one. I'm using the value 16 here to be able to get to the right result using bit shifts without any conditionals. Instead of many conditions, we get a (small) table lookup.
In my next post, I’m going to try to handle the same process for decoding.
More posts in "Fight for every byte it takes" series:
- (01 May 2023) Decoding the entries
- (28 Apr 2023) Optimizing the encoding process
- (27 Apr 2023) Fitting 64 values in 4 bits
- (26 Apr 2023) Nibbling at the costs
- (25 Apr 2023) Variable size data
- (24 Apr 2023) Storing raw numbers
Comments
Nice! Did you benchmark the difference in speed?
What about delta encoding on keys? I mean, assuming pages are part of a B-Tree, we know the range of the current page. Keys are sorted and likely to have a small difference between them. For a quick search, probably some sort of keygroup concept is going to be needed.
Joel,
I'm focused on the read speed more than the encoding speed. Reading is generally far more common, so I didn't benchmark the encoding in any depth, I'm afraid.
Andrea,
You are absolutely correct in this regard. In the real implementation, I have a shared baseline at the page level that I can use to reduce the scope of the key. We'll also likely do the same for the values, since they are also clustered.
In this case, I'm testing random values in a single page, so it probably won't benefit from that.