Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 2 min | 356 words

After this saga, I wanted to close the series with some numbers about the impact of this algorithm.

If you’ll recall, I started this whole series discussing variable-sized integers. I was using this list of numbers to compare the values. There are 44,956 items in the list.

Algorithm               Size (bytes)
Raw                     359,648
Varint                  224,780
Delta + Varint           45,970
Delta + Varint + Gzip     5,273
FastPFor                 24,717

You can see some interesting details about the various options. Delta + Varint + Gzip is a really good setup, if you can assume that the output pattern has a high degree of repetition. In some cases, that is possible, but that isn’t a generic property. Aside from that, FastPFor is winning hands down in terms of the output size of the data.
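To make the Delta + Varint row concrete, here is a minimal Python sketch of the idea (an illustration, not RavenDB's code): sorted ids become small gaps, and a LEB128-style varint stores each gap in as few bytes as possible.

```python
def varint_encode(n: int) -> bytes:
    """LEB128-style varint: 7 bits per byte, high bit marks continuation."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def delta_varint(values):
    """Delta-encode a sorted list, then varint-encode each delta."""
    out = bytearray()
    prev = 0
    for v in values:
        out += varint_encode(v - prev)
        prev = v
    return bytes(out)

ids = [1024, 1026, 1031, 1032, 1060]
print(len(ids) * 8, "->", len(delta_varint(ids)))  # 40 raw bytes -> 6 bytes
```

Because the deltas are small, most of them fit in a single byte each, which is where the 8x reduction over the raw representation comes from.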

There is also another important aspect. The size of the output here is too big to fit in a single 8KB buffer, so I’ll need to split it across multiple buffers. That is not something you can really do with Gzip. Here is the cost across multiple buffers:

This particular list would fit into 4 pages:

  • 14,080 entries at 8,048 bytes
  • 14,080 entries at 8,063 bytes
  • 15,616 entries at 8,030 bytes
  • 1,180 entries at 644 bytes

But the compression ratio is only part of the story. Let’s talk about the performance aspect. On my machine, the encoding process runs in 1,115 ticks and decoding in just 462 ticks.

To give some context, that means that you can do the encoding at a rate of ~400,000,000 / second and decoding at a rate of about 1 billion per second.
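A quick sanity check of those rates (assuming the usual .NET tick of 100 nanoseconds, which is my assumption about the measurement):

```python
ITEMS = 44_956      # size of the list from the start of this post
TICK_NS = 100       # assumed: a .NET tick is 100 nanoseconds

def rate_per_second(ticks: int) -> float:
    """Items processed per second given a total tick count."""
    return ITEMS / (ticks * TICK_NS * 1e-9)

print(f"encode: {rate_per_second(1_115):,.0f}/sec")  # ~403 million
print(f"decode: {rate_per_second(462):,.0f}/sec")    # ~973 million
```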

My perf team also asked me to mention that they haven’t gotten the chance to look at the code yet, so things are likely to improve.

The entire premise of FastPFor inside of RavenDB relies on these fast decoding & encoding times. It makes it super cheap to iterate over a vast amount of numbers, and the B+Tree layout we have means that updating a posting list’s page is trivial. Read it, merge, and encode it back. It’s fast enough that there is really no other place for meaningful optimizations / complexity.

time to read 6 min | 1029 words

In the previous post, I discussed FastPFor encoding; now I’m going to focus on how we deal with decoding. Here is the decode struct:


Note that this is a struct for performance reasons. We expect heavy usage here, so saving the allocation ends up paying high dividends. Before we start, I want to pay special attention to the fields on the decoder:

Of particular interest is the _exceptionOffsets array. If you aren’t familiar with it, this is a fixed-size array on the struct.

Here is the constructor for the decoder:

We are being provided with the encoded buffer and its size. What is actually happening here?

We start by allocating memory for the exceptions buffer, then scan the exceptions bitmap and extract the exceptions data to the buffer. We use a single buffer for all of the exceptions, and _exceptionOffsets tracks where the exceptions for each bit width currently stand.

Finally, we set the _prev to be a vector of the baseline. That is the counterpoint to how we dealt with the values during the encoding. Like the encoding process, we have to deal with three states:

  • Big deltas (for the next block)
  • Variable integer (at the end)
  • Packed block of 256 numbers

We’ll start with dealing with the first two options:

If we have a big delta, we record that in bigDeltaOffsets. Note that we are playing an annoying game here: I’m actually storing the metadata position in the delta offsets. I can cast it to a long and then to a pointer because I know the data is fixed (pinned) for the duration of the operation.

For the varint scenario (the last items in the batch), we don’t have to do much: we compute the running total of the values as we read them from the input, and that’s about it.

Things get a lot more interesting when we start dealing with the actual compressed blocks:

Here we simply decode the values to the buffer, then apply the exceptions. You can see that we saved some space by not storing exceptions of 1 bit (since we know what the value would be). For exceptions of other sizes, note how we consume the exception from _exceptionOffsets for each exception used. I’m using a ref variable here, so the offset++ operation increments the value in the array itself. That means we have to keep very little state as the decoding process moves forward. Remember that we require the output buffer for the decoded numbers to hold at least 256 values, so a full block always fits. That doesn’t mean we have enough space to fit everything, so we may be called multiple times and need to resume from where we left off.
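To make the patching step concrete, here is a scalar Python sketch of the idea; the names and the dict layout are illustrative, not the actual RavenDB code:

```python
def apply_exceptions(block, b, exceptions, positions, widths, offsets):
    """Patch a decoded block: each exception re-supplies the high bits
    that did not fit into the b-bit packed representation.
    - block: decoded values holding only the low b bits
    - positions: indices in the block that need patching
    - widths: extra bit width of each exception
    - exceptions: dict width -> list of stored high parts
    - offsets: dict width -> next unread index (mirrors _exceptionOffsets)
    """
    for pos, w in zip(positions, widths):
        if w == 1:
            # 1-bit exceptions are implicit: the high bit is always 1
            block[pos] |= 1 << b
        else:
            high = exceptions[w][offsets[w]]
            offsets[w] += 1   # like the ref-variable offset++ in the decoder
            block[pos] |= high << b
    return block
```

The per-width offsets are the only decoding state that has to survive between calls, which is why the real decoder can resume cheaply mid-stream.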

Finally, we set the expectedBufferIndex if we have a big delta offset. We’ll shortly see what we’ll do with this.

Remember that at this point, the buffer we use has int32 deltas in it, not the actual final numbers. In order to get the final values, we need to compute the running sum of the deltas, which is handled here:

What do we do here? We load a vector of 8 integers (32 bits) and widen that to two vectors of 4 integers (64 bits) each.

We then check whether this is the expected buffer index for big delta offsets; if so, we’ll or the high parts of the numbers back into the vector. This is handled here:

There are quite a few things happening here all at once. We first read the current delta offset from the metadata and read 16 bytes into a Vector128. We then extend that vector to 256 bits, which basically pads the vector with zeros. We set the next index to check and then shuffle the numbers to arrange them in the high parts of the vector, viewed as longs.

Let’s say we had the numbers [1,2,3,4] in the metadata. We read them into a Vector128, like so:

Vector128 = [1,2,3,4]

We then extended that to a Vector256, like so:

Vector256 = [1,2,3,4,0,0,0,0]

Finally, we shuffle the values (in this case, element 7 is known to be zero), so we have:

highBitsDelta = [0,1,0,2,0,3,0,4]

We convert that to a vector of longs (with the same bits pattern) and then or that with the values from the packed deltas.
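In scalar terms, that merge amounts to the following (a Python sketch of the idea; the real code does it with a single vector or):

```python
def merge_high_bits(low32, positions, high_parts):
    """Widen 32-bit deltas to 64 bits, then or the stored high parts
    (read from the big-delta metadata) into the flagged positions."""
    out = [v & 0xFFFFFFFF for v in low32]   # widen: high half starts as zero
    for pos, high in zip(positions, high_parts):
        out[pos] |= high << 32              # or the high part back in
    return out
```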

We have to check expectedBufferIndex twice on each iteration, but we expect big deltas to be rare, so the branch predictor is going to be really good at predicting this.

Finally, we have the part where we compute the prefix sum, here:

This looks like black magic, but let’s break it apart into individual pieces. 

Let’s assume that we started out with [25, 27, 30, 35] – using our approach, we have a baseline of 21 and the deltas [4, 2, 3, 5].

We start with prev being set to the baseline value on all elements in the vector:

prev = [21,21,21,21]

And cur is set to deltas:

cur = [4,2,3,5]

On line 5, we shuffle and BitwiseAnd the value with a mask; let’s see what we get:

Line 5 – shuffled: [4,4,2,3]

Line 5 – masked: [0,4,2,3]

We add that to cur, giving us:

cur = [0 + 4, 4 + 2, 2 + 3, 3 + 5] = [4,6,5,8]

On line 7, we do another shuffle & mask:

Line 7 – shuffled: [4,4,4,6]

Line 7 – masked: [0,0,4,6]

And adding that to cur, will give us:

cur = [4 + 0, 6 + 0, 5 + 4, 8 + 6] = [4,6,9,14]

We now add the prev vector, giving us:

cur = [4 + 21, 6 + 21, 9 + 21, 14 + 21] = [25, 27, 30, 35]

We have computed the original values again, hurray!

We then set all elements of prev to the last value of cur:

prev = [35,35,35,35]

And now we are ready for the next element to compute.
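The whole dance can be emulated in scalar code. This is a Python sketch of the idea, not the vectorized C# (each comprehension reads the pre-round values, mirroring how the vector operations see all lanes at once):

```python
def vector_prefix_sum(cur, prev_last):
    """Scalar emulation of the in-register prefix sum: two shuffle-and-add
    rounds turn 4 deltas into 4 running totals, then the carried-in
    baseline (the last value of the previous vector) is added to all lanes."""
    # round 1: add each lane's left neighbor (the line-5 shuffle + mask)
    cur = [cur[i] + (cur[i - 1] if i >= 1 else 0) for i in range(4)]
    # round 2: add the lane two positions to the left (the line-7 shuffle + mask)
    cur = [cur[i] + (cur[i - 2] if i >= 2 else 0) for i in range(4)]
    # add the baseline carried over from the previous vector
    return [c + prev_last for c in cur]
```

Running it on the example above, deltas [4, 2, 3, 5] with baseline 21 recover [25, 27, 30, 35].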

And… this is pretty much it, to be honest. We keep reading until we run out of space in the buffer or have consumed all the metadata. We are able to decode the data really fast, and it only took a ten-part (so far) series of blog posts to explain.

In the next post, I’m going to discuss what we can see from actually using this with real numbers.

time to read 8 min | 1553 words

In the previous post I outlined the requirements we have for FastPFor in RavenDB. Now I want to dig into the actual implementation. Here is the shape of the class in question:


The process starts when we start encoding the items. Each call to encode is going to process a different posting list, and it is expected that a single instance will process many posting lists during a single indexing run.

Here is the internal state of this class:


Here is how we start the process:

As you can see, there isn’t much here. We initialize the state of exceptions and exceptionsStarts (I’ll discuss these in more detail later in this post) and then we ensure that we have a big enough buffer to encode the data. Note that we retain a pointer to the original entries that we were passed; that will be important later as well.

Here is the code of the encoding process:

I create a couple of vector variables, one to hold the first item in the list and the other to hold uint.MaxValue. Then I start processing the buffers in blocks of 256 items.

A good resource for using vectors and SIMD in C# can be found here.

The classic FastPFor uses vectors of 128 bits (although there are implementations of up to 512 bits). I chose to use Vector256 for the implementation. Today, pretty much any x64 machine is going to have native support for AVX2 (shipping since 2013). On ARM, these operations are implemented using Vector128, so they are still vectorized there as well.

What this means is that we can process 256 numbers as a block. But what are we doing with them? Look at the inner loop, where you can see some interesting details. We are using shuffle instructions as well as masking to shift the current value and add the previous one.

In other words, let’s assume that we have the following input array: 1,2,3,4,5,6,7,8 – using longs, so Vector256 will contain 4 elements.

On line 5, we create the prev vector and assign it to have 4 copies of the first item in the array:

prev = [1,1,1,1]

On line 13, we load the 4 elements from the array into cur, giving us:

cur = [1,2,3,4]

On line 14, we shuffle both the prev and the cur values, here are intermediate states:

curShuffled = [0,1,2,3] – note that we have 0 at the start because of the mask

prevShuffled = [1,0,0,0] – the last three items are zero because of the mask.

We then or those two intermediates together, giving us:

mixed = [1,1,2,3]

We subtract that from cur to get the delta. The idea is that we massively reduce the number of operations that we have to perform to compute the delta of the values.
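A scalar Python sketch of that delta step (the real code does this with vector shuffles and masks) may make it clearer:

```python
def vector_delta(cur, prev_last):
    """Scalar emulation of the shuffle-and-or delta step: build a vector of
    each element's predecessor ([prev_last, cur[0], cur[1], cur[2]],
    i.e. prevShuffled | curShuffled) and subtract it from cur."""
    mixed = [prev_last] + cur[:3]
    return [c - m for c, m in zip(cur, mixed)]
```

With the input array 1..8: the first vector [1,2,3,4] with prev_last 1 yields [0,1,1,1], and the next vector [5,6,7,8] carries prev_last 4 and yields [1,1,1,1].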

On lines 19..22, we check if any of the deltas is greater than uint.MaxValue; we’ll dig into that in a bit.

The last part of the loop is where we deal with deltaInts, and we are doing some interesting things there. We pull the low bits from all four elements in the vector to the start of the vector, then write the deltas to the output entries buffer.

However, we write 8 values but increment the position by only 4, so we’ll actually overwrite the last 4 items in the next loop iteration. The output buffer is sized with enough slack for this, so we are ensured to have enough space, and the double writes give us better performance.

The end result of the inner loop is that we have performed a running delta over the items, converting them from longs to ints. Then we can call the ProcessBlock() method to run the core of the FastPFor algorithm:

I already explained how this works in this post.

It is important to understand that we have multiple data channels that we operate on here:

  • The entries output, where we write the packed bits
  • Metadata, where we describe each block
  • Exceptions, where we store the exception data

The encoding process runs through all of them, but it isn’t actually going to write them anywhere; that happens later.

We process the entries in the array in blocks of 256 numbers each, but what happens when we get to the end? The last block may not be a multiple of 256, after all. For the remainder, we use simple delta encoding plus variable-sized integers. Nothing fancy is needed at this point, since that tail is very small.

Finally, we run over all the exceptions data that we created during the encoding process and compute their size. The final result we have is the total size we would need to store all the encoded data as a single buffer.

What about handling numbers whose delta is greater than uint.MaxValue?

You can see that we expect this to be rare, so we handle that to be on the safe side, but aren’t trying to be too optimal about it. We simply take the high portion of the current delta and write it to the metadata. That is wasteful in terms of space, but it is simple and doesn’t require complex code. We’ll see how that is being used during the decoding process in the next post.

Finally, it is now time to actually write things out to the buffer. We have to be careful here, since we need to support partial writes. Let’s see how that works:

We define a header for the encoded data, which looks like this:


The header contains the initial value of the encoded data, the exceptions bitmap (discussed here) and the exception and metadata offsets. Because we always limit the size of the buffer to a maximum of 8KB, I used ushort here.

The first value we set in the header is the baseline, which needs a bit of explanation. When we encode the data, we start from the very first item in the list and go forward. The problem with doing things in this manner is that when we need to split the encoded data, we need to use as the baseline the previous value that we encoded with. For that reason, we use either the first item in the entries array or the item from the end of the previous block that we consumed.

Next, we need to deal with the exceptions. As a reminder, we have a set of exception arrays that are shared across blocks. When we do a partial write, we need to keep track of how many exceptions we consumed from each of the exception arrays.

The most interesting part comes when we start iterating over the metadata values, which tells us how to deal with the data:

Here we start with the exceptional cases: a metadata value that is greater than the max delta, or the varint batch marker (which occurs at the end).

You can see that there isn’t much there, except that we verify that we have enough space for the written buffers. For fun, I’m also using a goto here; it is needed to break out of the loop, since a break statement inside a switch case would only exit the switch.

The more interesting aspect happens when we process full blocks of compressed 256 numbers:

This looks like a lot of code, but what it is mostly doing is checking that we have enough space in the buffer for the block we are about to write. Of particular interest here is that we may have enough space to write the block itself, but we also need to ensure that we have space for the metadata and exceptions data associated with the current and all previous blocks.

If we have enough space, we continue to write more blocks to the output, or we exit if we are about to run out of room. After the loop, we have the header and compressed integers data in the output buffer, but we still need to write the metadata and exceptions data, which is handled here:

We compute the bitmap of all the exceptions, writing them out as packed segments (basically, 256 blocks of packed bits). Note that we have to account for how many of them we actually write out to the output, since we may have stopped in the middle.

Then we can simply write out the metadata buffer and return the total size we took.

I find it very amusing that we use the same bit packing routines to store the numbers and the exceptions data as well.

Overall, most of the code isn’t particularly interesting, but there are a few highlights that I would like to point out:

  • We compute the deltas and the narrowing from int64 to int32 on the buffer during the Encode()
  • We do the actual bit packing during the Write()

There is some argument to make here that we can do the bit packing during the Encode() as well, but it is unlikely to matter, I believe.

In the next post, I’m going to be diving into how we implement decoding for FastPFor. That is actually of far more interest to me, since that happens a lot more often. It’s going to be interesting…

time to read 6 min | 1155 words

In this series so far, I explored several ways that we can implement integer compression. I focused on the FastPFor algorithm and dove deeply into how it works. In the last post, I showed how we can use the SIMD intrinsics to generate vectorized code that can process a lot of data very quickly.

With this baseline established, we need to understand the use case we have in RavenDB for integer compression. We are actually using this for the classic scenario, posting lists as part of the indexing engine. However, there are quite a few differences in how we want to actually utilize it.

Let’s take Lucene as a good example, since it is one of the classic search engines. From the ground up, Lucene is meant to be used in batching scenarios. That has a lot of impact on the design. In particular, it means that a posting list for Lucene is something that you write once, in a big batch of operations. For RavenDB, however, we explicitly reject this approach. We want to allow on-the-fly indexing, and we cannot afford to pay the cost of merging many small writes into big files over time.

The FastPFor algorithm is meant to get a lot of data, encode that in a compressed form and then allow you to iterate over the results as fast as possible. But RavenDB’s needs are different. At the most basic level, RavenDB thinks about things in 8KB pages. A posting list for RavenDB is composed of the following:

  • Single – just a single value (common when indexing unique values)
  • Small – can fit in under 4KB
  • Large – requires more than 4KB of data, in which case we split it on 8KB boundaries.

What’s the point here? The idea is that when we need to update a posting list (add or remove items), we have one of those three options:

  • Single – we just update the value, duh!
  • Small – FastPFor is fast, right? It is cheap enough to unpack the compressed values, merge with the new items (or the removals) and compress again.
  • Large – here we are talking about a lot of values, potentially tens of millions or higher. How do we handle updates in this case?

The answer is that we are going to be making use of a B+ Tree, with some caveats. The basic idea is that we have two types of pages in the tree. A branch page is the classic B+ Tree concept, which points to leaf pages below it that contain some range of values. However, the leaf pages, on the other hand, are very different.

Each leaf page is just an 8KB buffer that holds the compressed data, nothing more. When we want to update the values in the posting list, we find the relevant page that needs to be updated, then we just unpack, merge & compress the values. It’s actually cheaper to go this route than any other alternative.

There is another important aspect. Lucene (and most other indexing systems) assume that your posting list uses 32-bit integers. That limits the number of items in an index to 2.1 billion (up to 4.2 billion, for some implementations). I don’t like that limit, not at today’s scale. So we internally use 64-bit integers in our posting lists.

That… however, is actually too big. In practice, we use 54 bits for the indexed entries. That gives us a maximum of 18,014,398,509,481,984 items; at a rate of 100,000 items per second, we are good for 5,712 years. The remaining 10 bits allow us to encode additional information about the entry being indexed and reduce the number of external lookups we need to do. Taking 10 bits from a 32-bit entry id, by contrast, would mean hitting the limit at around 2 million entries, which is really low.
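The arithmetic behind those numbers checks out; a quick sanity check in Python (using 365-day years):

```python
BITS = 54
max_entries = 1 << BITS           # 18,014,398,509,481,984 entry ids

RATE = 100_000                    # indexed entries per second
seconds = max_entries / RATE
years = seconds / (365 * 24 * 3600)
print(f"{max_entries:,} entries, ~{years:,.0f} years")
```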

Then there is another problem: if I have a posting list with 100,000 items that I want to compress, I would need to spread it across multiple pages, but figuring out how to split the compressed stream in a way that remains usable is not trivial at all.

In short, we have two major concerns with using FastPFor in RavenDB, int64 and paged writes. There is another aspect that I didn’t mention, however. FastPFor will take an array of numbers and compress them. If you want to also use delta compression, that is something that you need to add on top of that.

Now that we understand the constraints we have to operate around, let’s talk about the kind of API that I would like to build:


What is going on here?  Well, the basic idea is that you’ll create an encoder as part of the indexing process. For each posting list that you want to process, you’ll call the Encode() method, passing it the data to be encoded.

The Encode() method will return the required size for encoding the entire list as a single operation. Note that it does not actually write the data out (indeed, it has nowhere to write it). The calling code is then able to make decisions based on the returned size. If they have a buffer big enough to hold the encoded data, they can pass it to the Write() method.

Things get more interesting when we don’t have a suitable buffer. Remember that we have a limit of 4KB for small posting lists and 8KB per page for large posting lists. The idea with Write() is that it will write as much as it can to the output, allowing you to use multiple buffers to store the compressed posting list.

In terms of work being done, the Encode() phase copies the post list entries to its own buffer, preparing it to be written out. The process of actually writing the data to the output buffer is mostly concerned with ensuring that we don’t exceed the size of the provided buffer.

This approach means that we only have to pay the encoding cost once, which is quite nice. Moreover, the number of checks we have to make to ensure that the data fits into the buffer is also greatly reduced.

What about decoding? That is handled using:


Here we have a very interesting observation: the Encoder is a class, but the Decoder is a struct. We assume that decoding is far more frequent than encoding, so we use a struct to avoid an unnecessary memory allocation.

In order to ensure high performance, the decoder makes assumptions about the provided buffers. We will always accept a buffer that has a minimum size, for example.

With all of this out of the way, I’m going to dedicate the next post to discussing the actual implementation and semantics we have for FastPFor encoding.

time to read 4 min | 715 words

In the code of the simdcomp library there is a 25KLOC file that does evil things to SIMD registers to get bit packing to work. When I looked at the code the first few dozen times, I had a strong desire to run away screaming. Luckily, this isn’t just some pile of complicated code, but a well-thought-out set of functions that are meant to provide optimal code for specific tasks. In particular, the code is specialized for each bit width that you may want to bit pack (0 .. 32). Even better, no one actually sat down to write it out by hand; there is a Python script that generates the code.

The first step was to understand what exactly is going on in the code, and then see how we can translate that to C#. Even just a few years ago, that would have been either an impossible dream or required the use of a native library (with the associated PInvoke overhead). However, .NET today has a very robust intrinsics support, and vectors / SIMD instructions are natively supported.

I actually had a tougher challenge, since I want this code to run on x64 and ARM64 instances. The original code is x64 only, of course. The nice thing about SIMD support for .NET is that most of that can be done in a cross platform manner, with the JIT deciding what instructions will actually be emitted. There is still a very strong correlation between the vectorized code and the instructions that are being emitted, which means that I get both great control of what is going on and the appropriate abstraction. I was actually able to implement the whole thing without dropping to architecture-specific code, which makes me very happy.

Before we get any deeper, let’s take a simple challenge. We want to take an array of 32 integers and pack each one of them in 4 bits into an array of 16 bytes. Here is what the code will look like:

This is a bit dense, but let’s see if I can explain what is going on here.

We load from the array a vector (4 items) at the 0, 4, 8, 12, 16, 20, 24, and 28 intervals. For each one of those, we shift the values by the required offset and or all of them together. Note that this means that the first item’s four bits go in the first nibble, but the second item’s bits go in the fifth nibble, etc.

The idea is that we are operating on 4 items at a time, reducing the total number of operations we have to perform. It may be easier to understand if you see those changes visually:


What is happening here, however, is that we are able to do this transformation in very compact code. That isn’t just an issue of high-level code, let’s take a look at the assembly instructions that this generates:

I’m going to assume that you aren’t well versed in assembly, so let me explain what is going on. This code contains zero branches; it does four reads from memory, masks the elements, shifts them, and ors them together.

The relevant instructions are:

  • vmovupd – read 4 integers to the register
  • vpand – binary and with a value (masking)
  • vpslld – shift to the left
  • vpor – binary or
  • vmovdqu – write 16 bytes to memory

There are no branches, nothing to complicate the code at all. This is about as tight as you can get, at the level of machine instructions.
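The transformation can be sketched in scalar code. This is a plain Python illustration that emulates the four lanes, not the actual C# implementation:

```python
def pack32_4bit(values):
    """Scalar sketch of the vectorized 4-bit pack: 32 values -> 16 bytes.
    Emulates 4 SIMD lanes: block k (values[4k:4k+4]) is shifted left by
    4*k bits and or-ed in, so consecutive values land in different
    output words rather than adjacent nibbles."""
    assert len(values) == 32 and all(0 <= v < 16 for v in values)
    lanes = [0, 0, 0, 0]               # four 32-bit accumulators
    for k in range(8):                 # 8 loads at offsets 0, 4, 8, ..., 28
        for j in range(4):             # 4 lanes per load
            lanes[j] |= values[4 * k + j] << (4 * k)
    return b"".join(lane.to_bytes(4, "little") for lane in lanes)
```

The point of the lane layout is that each or/shift step operates on four values at once; the decoder just has to consume the values in the same interleaved order.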

Of particular interest here is the C# code. The entire thing is basically a couple of lines of code, and I could express the whole thing as a single expression in a readable manner. Let’s look at the C code, to compare what is going on:

Note that both should generate the same machine code, but being able to rely on operator overloading means that I get far more readable code.

From that point, the remaining task was to re-write the Python script so it would generate C# code, not C. In the next post I’m going to be talking more about the constraints we have and what we are actually trying to solve with this approach.

time to read 3 min | 563 words

As I mentioned, I spent quite a lot of time trying to understand the mechanism behind how the FastPFor algorithm works. A large part of that was the fact that I didn’t initially distinguish the separate steps in the process. Finding the bit widths, packing the bits, and managing the metadata and exceptions are all integral to how this works, and I tried to grok it all in one shot. I kept getting lost in 14KLOC files with dense SIMD instructions.

When looking at the code, it is also important to keep in mind that this is not a library that you are meant to pick up & use. This is a research project. As such, there was enough work done to show the results, but all the spit & polish associated with making a library ready for general consumption wasn’t done.

A simple example of that is that the code just assumes that the output buffer provided will be large enough (with no way to estimate upfront) and will overflow the buffer if that happens. Perfectly acceptable for research code, hell no for production usage, naturally.

In the same manner, the repository contains a lot of code. Most of it is used to compare the FastPFor algorithm to various other options. That can make it hard to understand which parts are the core of the algorithm and which are effectively chaff.

To be honest, the hardest part for me, however, was figuring out the memory allocation patterns that are going on in here. There is a high usage of C++ containers, with implicit allocations and hidden memory management. Here is a piece of code that I had to struggle to understand for quite some time, for example:

This does bit-packing using SIMD without a mask (the naming convention is something that you really have to get used to, I don’t think I encountered justlowercase before).

Note the call to fastpackwithoutmask() in there; it assumes that the provided buffer has exactly 32 elements. A lot of the code in the project assumes that you are passed a buffer of a certain size and operates on that directly. That produces great results in terms of efficiency and code density. But when I read this code, it was “obvious” to me that there was an out-of-bounds issue here. If the provided buffer isn’t perfectly aligned on a 32-element boundary, that method will read or write beyond the end of the buffer.

Look at line 9 there, where the source container is being resized to ensure that this is fine, since we increase the container size to be exactly 32 elements aligned. We revert this operation later in line 20.

For some SIMD instructions, the alignment of the buffers we are working on matters. This is handled using a custom allocator on the containers, which is not usually something that I would pay attention to.

In short, from my perspective, there is a lot of hidden behavior in non-trivial C++ code usage there that masks the actual behavior of the algorithm in question.

If you are looking at FastPFor (and if you care enough to look at this series of posts, you probably should), take into consideration that this code shouldn’t be applied as is, but should probably be adapted for your needs.

In the next post, I’m going to start discussing how we did just that for RavenDB.

time to read 9 min | 1766 words

FastPFor is an integer compression algorithm that was first published in 2012. You can read the paper about it here: Decoding billions of integers per second through vectorization.

I’ve run into this algorithm many times in the past. You pretty much can’t be in the database arena and not run into that. It is an interesting paper, and it has a GitHub repository with the code, which is great. Except that I couldn’t figure out what was going on there.

I actually tried stepping through the code a bunch of times, and I always ended up getting lost. The code is in C++ and makes heavy use of templates, containers, and non-trivial amounts of magic. To give some context, I gave up when I ran into these code snippets:

The paper itself describes the algorithm, but not in a way that actually made sense to me. I tried looking at the code, and I basically noped out. Too complex for me to understand, it seems.

But I kept running into this, and we recently had a strong need for something similar. So I basically took the time to meditate on this for a while.

On a personal level, I realized that I was trying to understand how the whole thing worked by just stepping through the code and reading the paper. The problem was that I was mixing several different layers and wasn’t able to create the proper separation between them.

FastPFor builds upon the previously discussed simdcomp, once you understand that, you can go through this in isolation. For this reason, I first talked about SIMD bit-packing and showed an example usage, before actually diving into how FastPFor itself works.

As a reminder, simdcomp provides routines for quick packing and unpacking of integers, nothing more. FastPFor is where the actual compression happens. Let’s see what is actually going on here. Consider the following list, which we previously discussed:
If we try to bit-pack this list, we’ll find that we aren’t gaining much. The first value is 31 bits in length, after all, and that means that we aren’t able to save much. We talked about Frame of Reference (FOR) as a way to handle that. We can treat every 128-number block as having its own reference point and compute the maximum number of bits for that block. That would actually save us a lot of space, but it isn’t ideal. In the case above, most entries in the list can be packed in just 3 bits, except two. That is where PFor comes into play, which stands for Patched Frame of Reference.

The key aspects of FastPFor are how it figures out the best bit width to use to pack the numbers, the way it detects what should be patched, and the manner in which it stores all of this.

The code boils down to a single function:

What is going on here? This function accepts an array of integers (whose size is 128) and computes the best number of bits to use to pack it.

How does that work? The first thing this does is compute how many numbers we have for each bit range. This is done using the asmbits() call, which is basically a LeadingZeroCount(). The end result is a frequency map that we can use to estimate how many bits will be required to store the items from this block.

We start from the maximum number of bits that we need to store all the numbers in the block, and we go down, checking what would be the best cost (in terms of space). Since we need to store exceptions somewhere, we compute the cost of that as we go down in the list.

This function gives us:

  • bestb – the number of bits that yields the smallest total output size for the block
  • bestcexcept – the count of exceptions (numbers that do not fit into bestb bits)
  • maxb – the maximum number of bits that we need to store for this block

That is the key function, the one that is responsible for getting the best compression ratio.
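To make that cost search concrete, here is a scalar Python sketch of the idea. The exact cost constants (the per-exception overhead in particular) are my assumptions, not the library’s:

```python
def best_bits(block, overhead_per_exception=8):
    """Sketch of the best-bit-width search described above: pick the width
    that minimizes packed size plus exception-storage cost for a block of
    128 integers. Cost constants are assumptions, not FastPFor's."""
    assert len(block) == 128
    # Frequency map: how many values need exactly `bits` bits to represent.
    freqs = [0] * 33
    for v in block:
        freqs[v.bit_length()] += 1
    maxb = max(b for b in range(33) if freqs[b] > 0)
    bestb, bestcost, bestcexcept = maxb, 128 * maxb, 0
    cexcept = 0
    # Walk down from maxb, accumulating the values that stop fitting.
    for b in range(maxb - 1, -1, -1):
        cexcept += freqs[b + 1]
        # Pack 128 values at b bits, plus each exception's offset byte
        # and its (maxb - b) leftover high bits.
        cost = 128 * b + cexcept * (overhead_per_exception + (maxb - b))
        if cost < bestcost:
            bestcost, bestb, bestcexcept = cost, b, cexcept
    return bestb, bestcexcept, maxb
```

With 126 small values and 2 large outliers, the search settles on 3 bits with 2 exceptions, exactly the scenario described above.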

What made FastPFor challenging for me to understand is that it has a lot of state you need to keep track of. What I described so far is a single block, but FastPFor operates over sets of numbers, and so it needs to tie all of them together.

At any given point, FastPFor has:

  • The block that is being outputted
  • The metadata about the entire process
  • 32 arrays of exceptions

The process interleaves all of them together in interesting ways, and I had a hard time figuring out how it all ties together.

Let’s talk about the way FastPFor processes a single block. We process the data in the array in blocks of 128 numbers at a time, like so:

The first thing that happens is the computation of the best bit width for the current block. Then, we use the bc value to record the metadata about the block.

That is an array of bytes with the following format:

  1. Packed number of bits
  2. Number of exceptions

If there are exceptions for this block, we also add:

  1. Maximum number of bits
  2. Offset of the exception in the block (repeated once per exception)

The metadata is shared for the entire encoding operation, across blocks.

You can see in the code that if bestcexcept (which holds the number of exceptions) is greater than zero, we find the appropriate exceptions buffer to use. That requires a better explanation.

The getBestBFromData() call gave us two important data points: the best number of bits to pack the numbers, and the maximum number of bits. We are going to pack all the numbers in the block to the output, but what about the exceptions? Where do they go?

It turns out that we need to store the exceptions, but we don’t actually need to store all maxb bits, only the difference between the best number of bits and the maximum. This is what thisexceptioncontainer is holding: the remaining high bits of the exceptions. It’s important to understand that this is done across blocks, so the thisexceptioncontainer value holds exceptions from many different blocks. That will turn out to be very important in just a little bit. We then scan the block, write the remaining bits to the container, and write to the metadata the offset of the value in the block. Since we are using blocks of 128 numbers, the offset is guaranteed to fit inside a byte.

The last step that we do for the block is to call packblockupsimd(), which ends up calling simdpack() and packs all the numbers from the block at bestb bits into the output.
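Putting the per-block steps together, here is a scalar Python sketch of the encode path. All the names and the container structure here are mine, not the library’s, and bestb/maxb are assumed to come from the getBestBFromData() cost search described earlier:

```python
def encode_block(block, bestb, maxb, metadata, exception_bits, packed_out):
    """Sketch of one block's encode step. exception_bits[k] collects the
    leftover high bits for ALL blocks whose exceptions carry k extra bits;
    these containers are shared across the entire encoding operation."""
    exceptions = [i for i, v in enumerate(block) if v.bit_length() > bestb]
    metadata.append(bestb)             # 1. packed number of bits
    metadata.append(len(exceptions))   # 2. number of exceptions
    if exceptions:
        metadata.append(maxb)          # 3. maximum number of bits
        container = exception_bits.setdefault(maxb - bestb, [])
        for i in exceptions:
            container.append(block[i] >> bestb)  # leftover high bits
            metadata.append(i)         # 4. offset in block, fits in a byte
    # The low bestb bits of every value are what gets simd bit-packed.
    packed_out.append([v & ((1 << bestb) - 1) for v in block])
```

Note that the metadata and the exception containers keep growing across blocks; only the packed low bits are written out immediately.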

It’s really important to understand that we still have two data items that haven’t been written. The first is the metadata for the encoding process (bits in blocks, exceptions offsets, etc). The second is the set of exceptions themselves.

This process repeats itself for each block, until the end of the buffer. It is at this point that we need to write the remaining data to the output. Here is what the code looks like:

What is going on? This is dense, and it takes a while to unpack (pun intended) what is going on here.

First, we write the size of the metadata, and then we copy the metadata that we have to the output. That is the description of all the blocks that we just went over. Then we run over the set of exceptions that we gathered. Remember, datatobepacked is an array that holds lists of exception data for each bit width that we have to store. We iterate over all the bit widths where we wrote exception data and generate a bitmap. That will help us later during the decoding process.

Finally, we run over the same exceptions and write them out. Note that we do that using the same simd bit-packing approach as the rest of the code.

The end result is that we have the following format:


For decoding, we start from the metadata offset and jump to it. Then we read the exceptions bitmap. That tells us how many exception lists we have and how many bits are in each one of them. In the image above, you can see that we have 3 exception lists, for 4 bits, 12 bits, and 19 bits.

We use the simd bit-packing to decode those values into memory; we’ll need them shortly. Now we start iterating over the metadata. Each block has a minimum overhead of two metadata bytes (number of bits and number of exceptions). Let’s assume that we don’t have any exceptions in the block. In that case, the process of decoding is simple: just unpack from bits to numbers and you are done.

If we have exceptions, on the other hand, we have to deal with them. At that point, the next metadata byte would contain the maximum number of bits for the exceptions in this block. Using that and the normal number of bits, we can tell where the extra bits are located. Here is the relevant code:

Note that there is a special case here if the difference between the block bits and the maximum number of bits is one. In that case, we don’t need to store the extra data; we already know what the leftover bit is (it’s one, after all). So we can compute it once and set it in the output.

For scenarios where we can’t tell from the bit width alone, we read the relevant exception array based on the difference between the bits in the block and the exception bits. That gives us an array that is shared across blocks. The idea is that the metadata contains the offset in the block, and we read from the relevant array one item at a time. This two-step process results in us setting the right value and getting ready for the next read, which may happen in a different block.

Note that the exception bits buffers only care about the number of bits, not where they come from. So two blocks with (bestb: 4, maxb: 7) and (bestb: 7, maxb: 10) will both go to the 3 bits container.
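The decoding side can be sketched in scalar Python like this. The names and structure are mine; the per-width cursors capture the part that confused me most, namely that each exception container is consumed incrementally across many blocks:

```python
def decode_block(metadata, mpos, exception_bits, cursors, packed):
    """Sketch of decoding one block and patching its exceptions.
    `packed` stands in for the already-unpacked bestb-bit low values;
    cursors[k] tracks how far exception_bits[k] has been consumed,
    since each container is shared across blocks."""
    bestb, nexcept = metadata[mpos], metadata[mpos + 1]
    mpos += 2
    values = list(packed)
    if nexcept:
        maxb = metadata[mpos]
        mpos += 1
        diff = maxb - bestb
        for _ in range(nexcept):
            offset = metadata[mpos]
            mpos += 1
            if diff == 1:
                values[offset] |= 1 << bestb      # implied single high bit
            else:
                high = exception_bits[diff][cursors[diff]]
                cursors[diff] += 1
                values[offset] |= high << bestb   # patch the leftover bits
    return values, mpos
```

Note how a block with exceptions advances only its own width’s cursor, leaving the other containers untouched for later blocks.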

Okay, this blog post is getting long enough, so I think that would be it. Hopefully, it will make it easier to understand exactly how the FastPFor format is working. I’m going to be talking more about FastPFor and the format implications in the next post, then move on to actually using that in C#.

time to read 3 min | 430 words

I talked a bit before about the nature of bit packing and how the simdcomp library isn’t actually doing compression. Why do I care about that, then?

Because the simdcomp library provides a very useful building block. A simple set of functions that allows you to very quickly pack and unpack bits. That is important, since those operations are fast enough that we can treat them as effectively free. Once that exists, we can now start finding interesting usages for those.

Let’s assume that we have the following functions:

  • Pack(Span<uint> input, Stream output, int bit);
  • Unpack(Stream input, Span<uint> output, int bit);

They get a set of numbers and a bit width, and encode or decode the numbers as needed.

Because those methods are fast, it means that we can play with various interesting designs. For example, let’s take dictionary compression. Assume that we have a set of values, such as country names. This looks like this:

[ "United States", "China", "Japan", "India", "United States", "Brazil", "Germany", "India", "France", "China", "Japan", "Brazil", … ]

This array contains thousands or millions of entries. We want to encode that in an efficient manner, how can we do that? Here is a simple way to utilize the speed of bit-packing to do something useful:

What is going on here?

  • We first iterate over the set of countries, finding the unique names and giving an index for each one of them.
  • Along the way, we produce an integer array of the indexes of country names.
  • Then we compute the maximum number of bits that we have to use to store those indexes.
  • To the output, we write the number of unique country names, the names themselves and then the bit-packed set of indexes.

Let’s assume that we have 30 unique values in this list, but the list itself is 5 million items in size. What is the actual cost here?

We’ll assume an average country name of 8 bytes, to make the math easy. That means that to store the list of countries would cost us 38MB, if we did that using the most obvious approach.

Using the code above, assuming 30 unique values, we’ll have 5 million values with 5 bits each, leading to a total cost of about 3MB, instead of 38MB.
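The whole dictionary + bit-packing idea can be sketched like this. A plain scalar packing loop stands in for the simdcomp routines, and the names are mine:

```python
def dict_encode(values):
    """Sketch of dictionary compression: map each unique string to an
    index, then bit-pack the indexes into a byte stream."""
    table, indexes = {}, []
    for v in values:
        indexes.append(table.setdefault(v, len(table)))
    bits = max(1, (len(table) - 1).bit_length())  # bits per index
    out, acc, used = bytearray(), 0, 0
    for idx in indexes:
        acc |= idx << used        # splice the index into the accumulator
        used += bits
        while used >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            used -= 8
    if used:
        out.append(acc & 0xFF)    # flush the partial last byte
    return list(table), bits, bytes(out)
```

With 30 unique names, `bits` comes out to 5, which is where the ~3MB figure for 5 million values comes from.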

The nice thing about this approach is that once you have fast bit-packing routines, you can now start using them in all sorts of interesting ways.

The reason I talked about the topic before starting to discuss FastPFor is that it makes a lot more sense to understand how bit-packing can be applied separately from FastPFor, since that is just making interesting use of the capability.

time to read 3 min | 520 words

In the last post, I talked about how the simdcomp library is actually just doing bit-packing. Given a set of numbers, it will put the first N bits in a packed format. That is all. On its own, that isn’t very useful, we have to pair it with something else to make it do something more interesting.

Previously, I talked about delta compression, what if we put both of those together?

Given a sorted list of numbers, we’ll compute the delta between the numbers, then we can figure out how many bits we need to store them, and we can pack that tightly.

Analyzing the sample posting list I’m using, we get the following batches (of 128 numbers each)

  • 328 batches at 13 bits
  • 11 batches at 22 bits
  • 7 batches at 23 bits
  • 3 batches at 27 bits
  • 1 batch at 31 bits
  • 1 batch at 28 bits

Doing the math, we can fit all of those using 77KB or so.

The idea is that for each batch, we can use simdpack() to write just the relevant bits. We’ll need some extra metadata to keep track of the number of bits per batch, but that is just one byte for every 128 values. For that matter, we can bit-pack that as well.

Note that in this case, we are using the simd packing routines just for bit packing. The actual compression aspect comes not from the packing, but from computing the deltas and then figuring out how densely we can pack those values.
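The delta + FOR scheme described above can be sketched like this. The actual packing is elided; the point is computing the deltas and the per-batch bit widths:

```python
def for_encode(sorted_numbers, batch_size=128):
    """Sketch of delta + Frame Of Reference: delta-encode a sorted list,
    then record, per batch, the bit width needed for its deltas.
    The first delta is the first value itself (previous value starts at 0),
    which is why the first batch tends to need the most bits."""
    deltas, prev = [], 0
    for n in sorted_numbers:
        deltas.append(n - prev)
        prev = n
    batches = []
    for i in range(0, len(deltas), batch_size):
        batch = deltas[i:i + batch_size]
        bits = max(1, max(d.bit_length() for d in batch))
        batches.append((bits, batch))
    return batches
```

Each batch would then go through simdpack() at its own width, with one metadata byte recording that width.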

This approach is called Frame Of Reference (FOR), and it has some really nice advantages. By splitting the data into blocks, we can take advantage of the patterns in the data. Here are the first 16 deltas in the posting list:

Let’s assume that the batch size is 4 numbers, what would be the numbers here? Basically, we take the max number of bits for each batch of 4 numbers.

The first block will require us to use 31 bits, the second would be just 3 bits, but the third would require 13 bits and the last is 3 bits again.

The idea is that for each block, we have a separate frame of reference, the number of bits that are required for us to record & retrieve the data.

Of course, if you look at the data itself, we are wasting quite a lot of space here needlessly. There are just two numbers (at offsets 0 and 11) that require more than 3 bits.

There is an algorithm called Patched Frame Of Reference (PFor) that handles this. Basically, when we compute a batch, we’ll record it using 3 bits only in this case. What about the rest of the values? Those we’ll store outside the batch, as an exception. On reading, we’ll need to patch the data to make it whole.

That adds a bit of complexity, but it also means that you can compress the data a lot more densely.

In the next post, I’m going to be talking about FastPFor, which combines simd bit packing, patched frame of references and some really clever tricks to get a very interesting integer compression algorithm.

time to read 4 min | 728 words

In the previous post, I showed how you can use integer compression using variable-size integers. That is a really straightforward approach for integer compression, but it isn’t ideal. To start with, it means that we take at least one byte for each number, but in many cases, especially if we are using delta encoding, we can pack things a lot more tightly. For reference, here are the first 15 records we have in the posting list I’m using for testing, delta encoded:


Note the repeating deltas of 4. Each of them can be represented in just 3 bits, but we’d waste 32–64 bits to hold each one.

When you are searching for data on “database stuff”, the building blocks that are being used to build databases, you’ll usually find Lemire there. In this case, I would like to talk about the SimdComp library. From the GitHub repository:

A simple C library for compressing lists of integers using binary packing and SIMD instructions. This library can decode at least 4 billion of compressed integers per second on most desktop or laptop processors.

Those are some really nice numbers, but I actually have a pet peeve with the name of the library. It isn’t actually doing compression; what it does is bit packing, which is something quite different.

Bit packing just takes numbers that are 32 bits in length and stores them using fewer bits. For example, let’s consider the following list: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. If I stored those as 32-bit integers, that would take 40 bytes. But given that the maximum value is 10, I can use 4-bit integers instead, taking only 5 bytes.

The core of the library is this set of routines:

  • simdpack()
  • simdpackwithoutmask()
  • simdunpack()

All of them have roughly the same structure, something like this:


This switch statement goes on for 32 bits, each time selecting a different function. This code makes a lot of assumptions: you always give it an input of exactly 128 numbers, and it will pack them and write them to the output. The idea is to be as compact as you possibly can.

Reading this code is… a challenge, because there is so much mental weight behind it, or so it seems. For example, take a look at how we pack a set of numbers into 2-bit integers:

The actual code goes on for quite a while (104 lines of dense SIMD code), so let’s try to break it up a bit. If we translate this from SIMD to scalar, we’ll have code like this:

If you are used to manipulating bits, it’s fairly straightforward. We start by setting the output to the first value; then the second value is shifted by 2 bits and spliced in, then the third value is shifted by 4 bits, and so on.
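That scalar loop looks roughly like this (a sketch, not the library’s actual code):

```python
def pack2(values):
    """Scalar version of the 2-bit packing described above: each 32-bit
    output word holds 16 two-bit values, spliced in by shifting each
    value 2 bits further than the previous one."""
    out = []
    for i in range(0, len(values), 16):
        word = 0
        for j, v in enumerate(values[i:i + 16]):
            word |= (v & 0b11) << (2 * j)
        out.append(word)
    return out
```

Packing 1, 2, 3 this way gives the word 0b110001 plus the third value at bit 4, i.e. exactly the shift-and-or sequence the text describes.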

With SIMD, the interesting part is that we aren’t operating on a single value but on 4 at a time. That also means there is a difference in how the data actually ends up being packed.

Given a list of values from 1 .. 16, bit packing them will usually result in the data looking like this:

1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16

In contrast, the SIMD bit packing model will store the data like so:

1	5	9	13	2	6	10	14	3	7	11	15	4	8	12	16

This is a result of operating on 4 items at a time, but in practice, the actual placement of the bits does not matter much, just that we are able to store and retrieve them.
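A small sketch that reproduces that interleaved ordering (my own illustration, not the library’s code):

```python
def simd_lane_order(values, lanes=4):
    """Illustrates why SIMD packing stores values interleaved: packing
    operates on 4 lanes at once, so lane k receives values k, k+4, k+8, ...
    and they come back out lane by lane. Assumes len(values) is a
    multiple of `lanes`."""
    rows = [values[i:i + lanes] for i in range(0, len(values), lanes)]
    # Reading lane k top to bottom gives every 4th value.
    return [row[k] for k in range(lanes) for row in rows]
```

Feeding it 1 through 16 yields exactly the 1, 5, 9, 13, 2, 6, … layout shown above.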

Note that for packing, we have two options, one that uses a mask and the other that does not. That will be important later. The only difference between those options is whether a mask is applied before we splice the bits into the output, nothing else.

The reason I said that this isn’t compression is that this library is focused solely on packing the bits, it has no behavior / concept beyond that stage. In the next post, we’ll start looking into how we can make use of this in more interesting ways.

