The FastPFor is an integer compression algorithm that was published in 2012 initially. You can read the paper about it here: Decoding billions of integers per second through vectorization.
I’ve run into this algorithm many times in the past. You pretty much can’t be in the database arena and not run into that. It is an interesting paper, and it has a GitHub repository with the code, which is great. Except that I couldn’t figure out what was going on there.
I actually tried stepping through the code a bunch of times, and I always ended up getting lost. The code is in C++ and makes heavy use of templates, containers and non-trivial amounts of magic. To give some context, I gave up when I run into these code snippers:
The paper itself describes the algorithm, but not in a way that actually made sense to me. I tried looking at the code, and I basically noped out. Too complex for me to understand, it seems.
But I kept running into this, and we recently had a strong need for something similar. So I basically took the time to meditate on this for a while.
On a personal level, I realized that I was trying to understand how the whole thing worked by just stepping through the code and reading the paper. The problem was that I was mixing several different layers and wasn’t able to create the proper separation between them.
FastPFor builds upon the previously discussed simdcomp, once you understand that, you can go through this in isolation. For this reason, I first talked about SIMD bit-packing and showed an example usage, before actually diving into how FastPFor itself works.
As a reminder, simdcomp provides routines for quick packing and unpacking of integers, nothing more. FastPFor is where the actual compression happens. Let’s see what is actually going on here. Consider the following list, which we previously discussed:
If we’ll try bit-pack this list, we’ll find out that we aren’t gaining much. The first value is 31 bits in length, after all, and that means that we aren’t able to save much. We talked about Frame of Reference (FOR) as a way to handle that. We can treat every128 block of numbers as a block that has its own reference point and compute the maximum number of bits for that location. That would actually save us a lot of space, but it isn’t ideal. In the case above, most entries in the list can be packed in just 3 bits, except 2. That is where PFor comes into play, which stands for Patched Frame of Reference.
The key aspects of FastPFor are how it figures out what is the best size to use to pack the numbers, the way it detects what should be patched, and the manner in which it stores this.
The code boils down to a single function:
What is going on here? This function accepts an array of integers (whose size is 128) and computes the best number of bits to use to pack it.
How does that work? The first thing this does is compute how many numbers we have for each bit range. This is done using the asmbits() call, which is basically a LeadingZeroCount(). The end result is a frequency map that we can use to estimate how many bits will be required to store the items from this block.
We start from the maximum number of bits that we need to store all the numbers in the block, and we go down, checking what would be the best cost (in terms of space). Since we need to store exceptions somewhere, we compute the cost of that as we go down in the list.
This function gives us:
- bestb – the number of bits needed to pack all the items in the block that would result in the smallest output size
- bestcexception – the count of exceptions (numbers that do not fit into bestb bits
- maxb – the maximum number of bits that we need to store for this block
That is the key function, the one that is responsible for getting the best compression ratio.
What I found that made FastPFor challenging to understand is that it has a lot of state that you need to keep track of. What I described so far is a single block, but FastPFor operates over sets of numbers, and so needs to tie all of them together.
At any given point, FastPFor has:
- The block that is being outputted
- The metadata about the entire process
- 32 arrays of exceptions
The process interleaves all of them together in interesting ways, and I had a hard time figuring out how it all ties together.
Let’s talk about the way FastPFor processes a single block. We process the data in the array in blocks of 128 numbers at a time, like so:
The first thing that happens is the computation of the best bits widths for the current block. Then, we use the bc value to record the metadata about the block.
That is an array of bytes with the following format:
- Packed number of bits
- Number of exceptions
If there are exceptions for this block, we also add:
- Maximum number of bits
- Offset of exception in the block (repeated as the number of exceptions)
The metadata is shared for the entire encoding operation, across blocks.
You can see in the code that if bestcexcept (which holds the number of exceptions) is greater than zero, we find the appropriate exceptions buffer to use. That requires a better explanation.
the getBestBFromData() call gave us two important data points, the best number of bits to pack the numbers, and the maximum number. We are going to pack all the numbers in the block to the output, but what about the exceptions? Where do they go?
It turns out that we need to store the exceptions, but we don’t actually need to store max bits, only the difference between the best number of bits and the maximum. This is what thisexceptioncontainer is holding, the remainding bits for exceptions. It’s important to understand that this is done across blocks. So the thisexceptioncontainer value holds exceptions from many different blocks. That will turn out to be very important in just a little bit. We then scan the block and write the remainding bits to the container, and write to the metadata the offset of the value in the block. Since we are using blocks of 128 numbers, this is ensured to fit inside a byte.
The last step that we do for the block is to call: packblockupsimd(), this ends up calling to simdpack() and packs all the numbers from the block to bestb bits in the output.
It’s really important to understand that we still have two data items that haven’t been written. The first is the metadata for the encoding process (bits in blocks, exceptions offsets, etc). The second is the set of exceptions themselves.
This process repeats itself for each block, until the end of the buffer. It is at this point that we need to write the remaining data to the output. Here is what the code looks like:
What is going on? This is dense, and it takes a while to unpack (pun intended) what is going on here.
First, we write the size of the metadata and then we copy the metadata that we have to the output. That is the description of all the blocks that we just went over. Then, we run over the set of exceptions that we gathered. Remember, the datatobepacked is an array that holds lists of exception data for each bit range that we have to store. We iterate over all the bit widths where we wrote exception data and generate a bitmap. That will help us later during the decoding process.
Finally, we run over the same exceptions and write them out. Note that we do that using the same simd bit-packing approach as the rest of the code.
The end result is that we have the following format:
For decoding, we start from the metadata offset and jump to that. Then we read the exceptions bitmap. That tells us how many exceptions we have and how many bits we have in each one of them. In the image above, you can see that we have 3 exceptions list, for 4 bits, 12 bits and 19 bits.
We use the simd bit-packing to decode those values into memory. We’ll need them shortly. Now, we start iterating over the metadata, each block has an overhead of minimum two metadata bytes (number of bits and number of exceptions). Let’s assume that we don’t have any exceptions in the block. In that case, the process of decoding is simple. Just do the unpacking from bits to numbers and you are done.
If we have exceptions, on the other hand, we have to deal with them. At that point, the next metadata byte would contain the maximum number of bits for the exceptions in this block. Using that and the normal number of bits, we can tell where the extra bits are located. Here is the relevant code:
Note that there is a special case here if the difference in the number of bits between the block bits and the maximum number of bits is one. In that case, we don’t need to store the data, we already know what the value is then (it’s one, after all). So we can compute it once and set it in the output.
For scenarios where we can’t tell from the bit width along, we read the relevant exception array based on the difference between the bits in the block and the exceptions bits. That gives us an array that is shared across blocks. The idea is that the metadata contains the offset in the block, and we read from the relevant array one item at a time. This two-step process will result in us setting the right value and getting ready for the next call, which may be done in a different block.
Note that the exception bits buffers only care about the number of bits, not where they come from. So two blocks, which have (bestb: 4, maxb: 9) and (bestb: 7: maxb: 10) will both go to the 3 bits container.
Okay, this blog post is getting long enough, so I think that would be it. Hopefully, it will make it easier to understand exactly how the FastPFor format is working. I’m going to be talking more about FastPFor and the format implications in the next post, then move on to actually using that in C#.
More posts in "Integer compression" series:
- (21 Jun 2023) FastPFor in C#, results
- (20 Jun 2023) Implementing FastPFor decoding in C#
- (19 Jun 2023) Implementing FastPFor encoding in C#
- (16 Jun 2023) Adapting FastPFor to RavenDB
- (15 Jun 2023) Porting simdcomp to C#
- (14 Jun 2023) The FastPFor code
- (13 Jun 2023) Understanding FastPFor
- (12 Jun 2023) SIMD bit packing and unusual usages
- (08 Jun 2023) Using SIMD bit packing in practice
- (07 Jun 2023) Understanding Simd Compression by Lemire
- (06 Jun 2023) delta encoding + variable size integers