In the previous post, I discussed FastPFor encoding; now I’m going to focus on how we handle decoding. Here is the decode struct:
Note that this is a struct for performance reasons. We expect a lot of usage here, so avoiding the allocation pays high dividends. Before we start, I want to pay special attention to the fields on the decoder:
Of particular interest is the _exceptionOffsets array. If you aren’t familiar with it, this is a fixed-size array on the struct.
Here is the constructor for the decoder:
We are being provided with the encoded buffer and its size. What is actually happening here?
We start by allocating memory for the exceptions buffer, then scan the exceptions bitmap and extract the exceptions data into the buffer. We use a single buffer for all of the exceptions, and _exceptionOffsets tracks the current position in that buffer for each bit width.
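To make the layout concrete, here is a minimal sketch of that setup. The names (`ComputeExceptionOffsets`, `countPerWidth`) are mine for illustration, not the actual decoder's fields; the point is just how one shared buffer plus a per-width offset table works:

```csharp
using System;

// Sketch (illustrative names, not the actual decoder fields): given the
// exceptions bitmap and a per-width exception count, compute where each bit
// width's exceptions start inside the single shared buffer.
static int[] ComputeExceptionOffsets(uint exceptionsBitmap, int[] countPerWidth)
{
    var offsets = new int[33];   // one slot per possible bit width (1..32)
    int current = 0;
    for (int width = 1; width <= 32; width++)
    {
        if ((exceptionsBitmap & (1u << (width - 1))) == 0)
            continue;                      // no exceptions of this width
        offsets[width] = current;          // this width's exceptions start here
        current += countPerWidth[width];   // reserve room for them
    }
    return offsets;
}

var counts = new int[33];
counts[2] = 3; counts[5] = 2;                       // 3 two-bit, 2 five-bit exceptions
var offsets = ComputeExceptionOffsets((1u << 1) | (1u << 4), counts);
Console.WriteLine($"{offsets[2]} {offsets[5]}");    // 0 3
```

As decoding consumes exceptions of a given width, the decoder only has to bump the matching offset forward.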
Finally, we set the _prev to be a vector of the baseline. That is the counterpoint to how we dealt with the values during the encoding. Like the encoding process, we have to deal with three states:
- Big deltas (for the next block)
- Variable integer (at the end)
- Packed block of 256 numbers
We’ll start with dealing with the first two options:
If we have a big delta, we record that in the bigDeltaOffsets. Note that we are playing an annoying game here: I’m actually storing the metadata position in the delta offsets. I can cast it to a long and then to a pointer because the data is guaranteed to be fixed (pinned) for the duration of the operation.
For the varint scenario, the last item in the batch, we don’t have to do much: we compute the running total of the values as we read them from the input, and that’s about it.
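A sketch of that tail handling, assuming standard 7-bit varint encoding (the function and variable names here are illustrative, not the decoder's actual code):

```csharp
using System;

// Sketch of the varint tail: each stored value is a delta from the previous
// number, so we keep a running total while reading 7-bit groups.
static int ReadVarIntDeltas(ReadOnlySpan<byte> input, Span<long> output, long baseline)
{
    long running = baseline;
    int count = 0, pos = 0;
    while (pos < input.Length)
    {
        long value = 0;
        int shift = 0;
        byte b;
        do
        {
            b = input[pos++];
            value |= (long)(b & 0x7F) << shift; // low 7 bits carry data
            shift += 7;
        } while ((b & 0x80) != 0);              // high bit set = more bytes follow
        running += value;                       // accumulate the delta
        output[count++] = running;
    }
    return count;
}

// deltas 4, 2, 200 (200 takes two varint bytes: 0xC8, 0x01) on baseline 21
Span<long> values = stackalloc long[3];
ReadVarIntDeltas(new byte[] { 4, 2, 0xC8, 0x01 }, values, 21);
Console.WriteLine(string.Join(",", values.ToArray())); // 25,27,227
```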
Things get a lot more interesting when we start dealing with the actual compressed blocks:
Here we simply decode the values into the buffer, then apply the exceptions. You can see that we saved some space by not storing exceptions of 1 bit (since we know what the value would be). For exceptions of other sizes, note how we consume the next exception via _exceptionOffsets for each affected value. I’m using a ref variable here, so the offset++ operation increments the value in the array itself. That means we have to keep very little state in the decoding process as it moves forward.

Remember that we require the output buffer for the decoded numbers to hold at least 256 values, so a full block always fits. That doesn’t mean there is enough space for the entire data set, so we may be called multiple times and need to resume from where we left off.
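The exception-patching step and the ref-offset trick can be sketched like this (names are mine, not the actual decoder's):

```csharp
using System;

// Sketch of applying exceptions after bit-unpacking. Each value was packed
// with `bits` low bits; the exception supplies the missing high bits. The
// `ref offset` mirrors the decoder's trick: incrementing it writes straight
// back into the per-width slot of the offsets array, so no extra state is kept.
static void ApplyExceptions(Span<uint> block, int bits,
                            ReadOnlySpan<uint> exceptions, ref int offset,
                            ReadOnlySpan<byte> exceptionIndexes)
{
    foreach (byte index in exceptionIndexes)
    {
        block[index] |= exceptions[offset++] << bits; // merge the high bits back in
    }
}

Span<uint> block = stackalloc uint[] { 3, 1, 2, 0 };  // values packed with 2 bits
int offset = 0;                                       // would live in _exceptionOffsets
ApplyExceptions(block, 2, new uint[] { 5, 9 }, ref offset, new byte[] { 1, 3 });
// block[1] = 1 | (5 << 2) = 21, block[3] = 0 | (9 << 2) = 36
Console.WriteLine($"{block[1]} {block[3]} {offset}"); // 21 36 2
```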
Finally, we set the expectedBufferIndex if we have a big delta offset. We’ll shortly see what we do with this.
Remember that at this point, the buffer holds int32 deltas, not the actual final numbers. In order to get the final values, we need to compute the running sum of the deltas; this is handled here:
What do we do here? We load a vector of 8 integers (32 bits each) and widen it into two vectors of 4 integers (64 bits each).
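That widening step, shown in isolation (`Vector256.Widen` is the real .NET API; the values are just for demonstration):

```csharp
using System;
using System.Runtime.Intrinsics;

// One vector of 8 x int32 becomes two vectors of 4 x int64 each.
Vector256<int> deltas = Vector256.Create(1, 2, 3, 4, 5, 6, 7, 8);
(Vector256<long> low, Vector256<long> high) = Vector256.Widen(deltas);
Console.WriteLine(low);  // <1, 2, 3, 4>
Console.WriteLine(high); // <5, 6, 7, 8>
```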
We then check whether this is the expected buffer index for big delta offsets; if so, we OR the high parts of the numbers back into the vector. This is handled here:
There are quite a few things happening here all at once. We first read the current delta offset from the metadata and load 16 bytes into a Vector128. We then zero-extend that into a Vector256 (the upper lanes are filled with zeros). We set the next index to check, and then we shuffle the numbers so that each one lands in the high part of a long in the vector.
Let’s say that we had the numbers [1,2,3,4] in the metadata, we read them into the Vector128, like so:
Vector128 = [1,2,3,4]
We then extend that to a Vector256, like so:
Vector256 = [1,2,3,4,0,0,0,0]
Finally, we shuffle the values (in this case, index 7 is known to hold zero), so we have:
highBitsDelta = [0,1,0,2,0,3,0,4]
We reinterpret that as a vector of longs (same bit pattern) and then OR it with the values from the packed deltas.
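The whole sequence above can be sketched with the real .NET vector APIs (variable names are mine, and `packed` stands in for the widened deltas):

```csharp
using System;
using System.Runtime.Intrinsics;

// The high-bits merge from the example: [1,2,3,4] must end up in the
// high 32 bits of each long.
Vector128<int> meta = Vector128.Create(1, 2, 3, 4);
// zero-extend: the upper four lanes become zero -> [1,2,3,4,0,0,0,0]
Vector256<int> wide = Vector256.Create(meta, Vector128<int>.Zero);
// shuffle into the odd (high) lanes; lane 7 of `wide` is known to be zero
Vector256<int> highBitsDelta =
    Vector256.Shuffle(wide, Vector256.Create(7, 0, 7, 1, 7, 2, 7, 3));
// reinterpret as longs: each meta value now sits in the high 32 bits
Vector256<long> highs = highBitsDelta.AsInt64();
Vector256<long> packed = Vector256.Create(10L, 20L, 30L, 40L); // pretend deltas
Console.WriteLine(highs | packed); // each long = (meta value << 32) | delta
```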
We have to check the expectedBufferIndex twice on each iteration, but we expect a big delta to be rare, so the branch predictor is going to be really good at predicting this.
Finally, we have the part where we compute the prefix sum, here:
This looks like black magic, but let’s break it apart into individual pieces.
Let’s assume that we started out with [25, 27, 30, 35] – using our approach, we have baseline: 21 and the deltas: [4, 2, 3, 5].
We start with prev being set to the baseline value on all elements in the vector:
prev = [21,21,21,21]
And cur is set to deltas:
cur = [4,2,3,5]
On line 5, we shuffle and BitwiseAnd the value with a mask. Let’s see what we get:

Line 5 – shuffled: [4,4,2,3]

Line 5 – masked: [0,4,2,3]
We add that to cur, giving us:
cur = [4 + 0, 2 + 4, 3 + 2, 5 + 3] = [4,6,5,8]
On line 7, we do another shuffle & mask:
Line 7 – shuffled: [4,4,4,6]
Line 7 – masked: [0,0,4,6]
And adding that to cur will give us:

cur = [4 + 0, 6 + 0, 5 + 4, 8 + 6] = [4,6,9,14]
We now add the prev vector, giving us:
cur = [4 + 21, 6 + 21, 9 + 21, 14 + 21] = [25, 27, 30, 35]
We recovered the original values, hurray!
We then set all elements of prev to the last value of cur:
prev = [35,35,35,35]
And now we are ready for the next element to compute.
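The whole worked example can be written out as a sketch with the real `Vector128` APIs (the actual decoder runs on longs in a Vector256; 4 x int32 keeps the example readable, and the shuffle/mask constants are my reconstruction of the two log-steps above):

```csharp
using System;
using System.Runtime.Intrinsics;

Vector128<int> prev = Vector128.Create(21);          // baseline in every lane
Vector128<int> cur  = Vector128.Create(4, 2, 3, 5);  // the deltas

// step 1: pull in the neighbor one lane back, masking lane 0 to zero
Vector128<int> shifted = Vector128.Shuffle(cur, Vector128.Create(0, 0, 1, 2))
                         & Vector128.Create(0, -1, -1, -1);
cur += shifted;                                      // [4, 6, 5, 8]

// step 2: pull in the value two lanes back, masking lanes 0 and 1 to zero
shifted = Vector128.Shuffle(cur, Vector128.Create(0, 0, 0, 1))
          & Vector128.Create(0, 0, -1, -1);
cur += shifted;                                      // [4, 6, 9, 14]

cur += prev;                                         // [25, 27, 30, 35]
prev = Vector128.Create(cur.GetElement(3));          // [35, 35, 35, 35]
Console.WriteLine(cur);
```

Two add/shuffle rounds are enough for four lanes; the general pattern needs log2(lanes) rounds.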
And… this is pretty much it, to be honest. We keep reading until we run out of space in the buffer or consume all the metadata. We are able to decode the data really fast, and it only took a 10-part (so far) series of blog posts to explain.
In the next post, I’m going to discuss what we can see from actually using this with real numbers.
More posts in "Integer compression" series:
- (21 Jun 2023) FastPFor in C#, results
- (20 Jun 2023) Implementing FastPFor decoding in C#
- (19 Jun 2023) Implementing FastPFor encoding in C#
- (16 Jun 2023) Adapting FastPFor to RavenDB
- (15 Jun 2023) Porting simdcomp to C#
- (14 Jun 2023) The FastPFor code
- (13 Jun 2023) Understanding FastPFor
- (12 Jun 2023) SIMD bit packing and unusual usages
- (08 Jun 2023) Using SIMD bit packing in practice
- (07 Jun 2023) Understanding Simd Compression by Lemire
- (06 Jun 2023) delta encoding + variable size integers