In this series so far, I explored several ways that we can implement integer compression. I focused on the FastPFor algorithm and dove deeply into how it works. In the last post, I showed how we can use the SIMD intrinsics to generate vectorized code that can process a lot of data very quickly.
With this baseline established, we need to understand the use case we have in RavenDB for integer compression. We are actually using this for the classic scenario, posting lists as part of the indexing engine. However, there are quite a few differences in how we want to actually utilize it.
Let’s take Lucene as a good example, since that is one of the classic examples for a search engine. From the ground up, Lucene is meant to be used in batching scenarios. That has a lot of impact on the design. In particular, it means that a posting list for Lucene is something that you write once in a big batch of operations. For RavenDB, however, we explicitly reject his approach. We want to allow to do on-the-fly indexing and we cannot afford to pay the cost of merging many small writes to big files over time.
The FastPFor algorithm is meant to get a lot of data, encode that in a compressed form and then allow you to iterate over the results as fast as possible. But RavenDB’s needs are different. At the most basic level, RavenDB thinks about things in 8KB pages. A posting list for RavenDB is composed of the following:
- Single – just a single value (common when indexing unique values)
- Small – can fit in under 4KB
- Large – requires more than 4KB of data, in which case we split it on 8KB boundaries.
What’s the point here? The idea is that when we need to update a posting list (add or remove items), we have one of those three options:
- Single – we just update the value, duh!
- Small – FastPFor is fast, right? It is cheap enough to unpack the compressed values, merge with the new items (or the removals) and compress again.
- Large – here we are talking about a lot of values, potentially tens of millions or higher. How do we handle updates in this case?
The answer is that we are going to be making use of a B+ Tree, with some caveats. The basic idea is that we have two types of pages in the tree. A branch page is the classic B+ Tree concept, which points to leaf pages below it that contain some range of values. However, the leaf pages, on the other hand, are very different.
Each leaf page is just an 8KB buffer that holds the compressed data, nothing more. When we want to update the values in the posting list, we find the relevant page that needs to be updated, then we just unpack, merge & compress the values. It’s actually cheaper to go this route than any other alternative.
There is another important aspect. Lucene (and most other indexing systems) assume that your posting list is using 32 bits integers. That limits the number of items in an index to 2.1 billion (up to 4.2 billion, for some implementations). I don’t like that limit, not at today’s scale. So we internally use 64bits integers in our posting lists.
That… however, is actually too big. In practice, we use 54 bits for the indexed entries. That gives us a maximum number of 18,014,398,509,481,984 items and at a range of 100,000 items per second, we are good for 5,712 years. Those 10 bits allow us to encode additional information about the actual entry that is being indexed, and reduce the number of external lookups we need to do. Taking 10 bits from the entry id means that we are going to hit the 32 bits limit at around 2 million entries, which is really low.
Then there is another problem, if I have a posting list I want to compress that has 100,000 items in it, I would need to compress that into multiple pages, but figuring out how to split the compressed stream in a way that makes it usable to work with is not trivial at all.
In short, we have two major concerns with using FastPFor in RavenDB, int64 and paged writes. There is another aspect that I didn’t mention, however. FastPFor will take an array of numbers and compress them. If you want to also use delta compression, that is something that you need to add on top of that.
Now that we understand the constraints we have to operate around, let’s talk about the kind of API that I would like to build:
What is going on here? Well, the basic idea is that you’ll create an encoder as part of the indexing process. For each posting list that you want to process, you’ll call the Encode() method, passing it the data to be encoded.
The Encode() method will return the required size for encoding the entire list as a single operation. Note that it does not actually write the data out (indeed, it has nowhere to write it). The calling code is then able to make decisions based on the returned size. If they have a buffer big enough to hold the encoded data, they can pass it to the Write() method.
Things get more interesting when we don’t have a suitable buffer. Remember that we have a limit of 4KB for small posting lists and 8KB per page for large posting lists. The idea with Write() is that it will write as much as it can to the output, allowing you to use multiple buffers to store the compressed posting list.
In terms of work being done, the Encode() phase copies the post list entries to its own buffer, preparing it to be written out. The process of actually writing the data to the output buffer is mostly concerned with ensuring that we don’t exceed the size of the provided buffer.
This approach means that we only have to pay the encoding code once, which is quite nice. Moreover, the number of checks that we have to make to ensure that we can fit the data into the buffer is also greatly reduced.
What about decoding? That is handled using:
Here we have a very interesting observation, the Encoder is a class, but the Decoder is a struct. We assume that decoding is far more frequent, so we use a struct to avoid an unnecessary memory allocation.
In order to ensure high performance, the decoder makes assumptions about the provided buffers. We will always accept a buffer that has a minimum size, for example.
With all of this out of the way, I’m going to dedicate the next post to discussing the actual implementation and semantics we have for FastPFor encoding.
More posts in "Integer compression" series:
- (21 Jun 2023) FastPFor in C#, results
- (20 Jun 2023) Implementing FastPFor decoding in C#
- (19 Jun 2023) Implementing FastPFor encoding in C#
- (16 Jun 2023) Adapting FastPFor to RavenDB
- (15 Jun 2023) Porting simdcomp to C#
- (14 Jun 2023) The FastPFor code
- (13 Jun 2023) Understanding FastPFor
- (12 Jun 2023) SIMD bit packing and unusual usages
- (08 Jun 2023) Using SIMD bit packing in practice
- (07 Jun 2023) Understanding Simd Compression by Lemire
- (06 Jun 2023) delta encoding + variable size integers