Fight for every byte it takesDecoding the entries
In this series so far, we reduced the storage cost of key/value lookups by a lot. And in the last post we optimized the process of encoding the keys and values significantly. This is great, but the toughest challenge is ahead of us, because as much as encoding efficiency matters, the absolute cost we have is doing lookups. This is the most basic operation, which we do billions of times a second. Any amount of effort we’ll spend here will be worth it. That said, let’s look at the decoding process we have right now. It was built to be understandable over all else, so it is a good start.
What this code does is to accept a buffer and an offset into the buffer. But the offset isn’t just a number, it is composed of two values. The first 12 bits contain the offset in the page, but since we use 2-byte alignment for the entry position, we can just assume a zero bit at the start. That is why we compute the actual offset in the page by clearing the first four bits and then shifting left by three bits. That extracts the actual offset to the file, (usually a 13 bits value) using just 12 bits. The first four bits in the offset are the indicator for the key and value lengths. There are 15 known values, which we computed based on probabilities and one value reserved to say: Rare key/val length combination, the actual sizes are stored as the first byte in the entry.
Note that in the code, we handle that scenario by reading the key and value lengths (stored as two nibbles in the first byte) and incrementing the offset in the page. That means that we skip past the header byte in those rare situations.
The rest of the code is basically copying the key and value bytes to the relevant variables, taking advantage of partial copy and little-endian encoding.
The code in question takes 512 bytes and has 23 branches. In terms of performance, we can probably do much better, but the code is clear in what it is doing, at least.
The first thing I want to try is to replace the switch statement with a lookup table, just like we did before. Here is what the new version looks like:
The size of the function dropped by almost half and we have only 7 more branches involved. There are also a couple of calls to the memory copy routines that weren’t inlined. In the encoding phase, we reduced branches due to bound checks using raw pointers, and we skipped the memory calls routines by copying a fixed size value at varied offsets to be able to get the data properly aligned. In this case, we can’t really do the same. One thing that we have to be aware of is the following situation:
In other words, we may have an entry that is at the end of the page, if we’ll try to read unconditionally 8 bytes, we may read past the end of the buffer. That is not something that we can do. In the Encode() case, we know that the caller gave us a buffer large enough to accommodate the largest possible size, so that isn’t an issue. That complicates things, sadly, but we can go the other way around.
The Decode() function will always be called on an entry, and that is part of the page. The way we place entries means that we are starting at the top and moving down. The structure of the page means that we can never actually place an entry below the first 8 bytes of the page. That is where the header and the offsets array are going, after all. Given that, we can do an unconditional read backward from the entry. As you can see in the image below, we are reading some data that we don’t care about, but this is fine, we can fix it later, and without any branches.
The end result is that we can have the following changes:
I changed the code to use a raw pointer, avoiding bound checks that we already reasoned about. Most interesting is the ReadBackward function. This is an inner function, and was properly inlined during JIT compilation, it implements the backward reading of the value. Here is what the assembly looks like:
With this in place, we are now at 133 bytes and a single branch operation. That is pretty awesome, but we can do better still. Consider the following code (explanations to follow):
Note that the first element in the table here is different, it is now setting the 4th bit. This is because we are making use of that. The structure of the bytes in the table are two nibbles, but no other value in the table sets the 4th bit. That means that we can operate on that.
Indeed, what we are doing is use the decoder byte to figure out what sort of shift we want. We have the byte from the table and the byte from the buffer. And we use the fact that masking this with 8 gives (just for this value) the value of 8. We can then use that to select the appropriate byte. If we have an offloaded byte, then we’ll shift the value by 8, getting the byte from the buffer. For any other value, we’ll get 0 as the shift index, resulting in us getting the value from the table. That gives us a function with zero branches, and 141 bytes.
I spent a lot of time thinking about this, so now that we have those two approaches, let's benchmark them. The results were surprising:
| Method | Mean | Error | StdDev | |------------------------ |-----------:|---------:|---------:| | DecodeBranchlessShifts | 2,107.1 ns | 20.69 ns | 18.34 ns | | DecodeBranchy | 936.2 ns | 1.89 ns | 1.68 ns |
It turns out that the slightly smaller code with the branches is able to beat up the branchless code. When looking into what they are doing, I think that I can guess why. Branches aren’t a huge problem if they are predictable, and in our case, the whole point of all of this is that the offload process where we need to go to the entry to get the value is meant to be a rare event. In branchless code, on the other hand, you have to do something several times to avoid a branch (like shifting the value from the buffer up and maybe shifting it down, etc).
You can’t really argue with a difference like that. We also tried an AVX version, to see if this would have better performance. It turns out that there is really no way for us to beat the version with the single branch. Everything else was at least twice as slow.
At this point, I believe that we have a winner.
More posts in "Fight for every byte it takes" series:
- (01 May 2023) Decoding the entries
- (28 Apr 2023) Optimizing the encoding process
- (27 Apr 2023) Fitting 64 values in 4 bits
- (26 Apr 2023) Nibbling at the costs
- (25 Apr 2023) Variable size data
- (24 Apr 2023) Storing raw numbers
This post: https://johnnysswlab.com/how-branches-influence-the-performance-of-your-code-and-what-can-you-do-about-it/ shows when you should try a branchless Algorithm. The takeaway is if your branches are well predictable branching code will outperform branchless ones.
That was a great read, thanks.
The issue is that in this case, assuming good selection of the biased nibbles, we can get to 80% predicted values. That means that it is cheaper (and easier to understand) the branchy version.
Join the conversation...