Sorted integer compression
In databases, and in information retrieval in general, there is a very common scenario: I have a list of (sorted) integers that I want to store, and I want to do that as efficiently as possible. There are dozens of methods to do this, and it is a hot research topic. It is so useful because there are many places where you can operate on a sorted integer list and gain massive benefits. Unlike generic compression routines, we can usually take advantage of the fact that we understand the data we are working with and get better results.
The reason I need to compress integers (actually, int64 values) is that I’m trying to keep track of matches for some data, so the integers that I’m tracking are actually file offsets for user data inside of Voron. That led to a few different scenarios that I have to deal with:
- There is a single result
- There is a reasonable number of results
- There is a boatload of results
I’m trying to figure out the best way to store the latter two options as efficiently as possible.
The first stop was Daniel Lemire’s blog, naturally, since he has written about this extensively. I looked at the following schemes: FastPFor and StreamVByte. They have somewhat different purposes, but basically, FastPFor uses a bit stream while StreamVByte is byte oriented. Theoretically speaking, you can get a better compression rate from FastPFor, but StreamVByte is supposed to be faster. Another integer compression scheme comes from the Gorilla paper from Facebook. That is a bigger scheme, which includes compression of time series values. However, part of that scheme talks about how you can compress integers (they use that to store the ticks of a particular operation). We are actually using that for the time series support inside of RavenDB.
I’m not going to cover that in depth. Here is the paper on Gorilla compression; the relevant description is in section 4.1.1. Suffice to say that they are using a bit stream and a delta of deltas computation. Basically, if you keep getting values that are the same distance apart, you don’t need to record every value, you can compute each one from the previous ones. In the best case scenario, Gorilla compression needs a single bit per value, assuming the results are evenly spaced.
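To make the delta of deltas idea concrete, here is a minimal sketch in Python. It only computes the numbers that would be encoded; it does not write a bit stream, and the function names are my own, not anything from the Gorilla paper or RavenDB.

```python
def delta_of_deltas(values):
    """Split a sorted list into (first value, first delta, delta-of-deltas)."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    first = values[0]
    first_delta = values[1] - values[0]
    dods = []
    prev_delta = first_delta
    for prev, cur in zip(values[1:], values[2:]):
        delta = cur - prev
        # Zero whenever the spacing did not change from the previous step.
        dods.append(delta - prev_delta)
        prev_delta = delta
    return first, first_delta, dods


def restore(first, first_delta, dods):
    """Rebuild the original list from the output of delta_of_deltas."""
    values = [first, first + first_delta]
    delta = first_delta
    for dod in dods:
        delta += dod
        values.append(values[-1] + delta)
    return values


offsets = [100, 164, 228, 292, 400]
first, first_delta, dods = delta_of_deltas(offsets)
print(dods)  # [0, 0, 44]
```

Every zero in the output is a value that sits at the same distance from its predecessor as the one before it, which is the case that Gorilla can store in a single bit.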
For my purposes, I want to get as high a compression rate as possible, and I need to store just the list of integers. The problem with Gorilla compression is that if we aren’t getting numbers that are the same distance apart, we need to record how much they differ. That means we’ll need at least 9 bits per value. That adds up quickly, sadly.
PFor, on the other hand, uses a different system. It computes the maximum number of bits required for a batch of integers, and then records each value using just that many bits. I found the Binary Packing section (2.6) in this paper to be the clearest explanation of how exactly that works. The problem with PFor, however, is that if you have a single large outlier, you’ll waste a lot of bits unnecessarily.
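The outlier problem is easy to see with a little arithmetic. The sketch below is a simplified model of binary packing, assuming a batch of 128 deltas and ignoring any header bytes; `packed_bits` is a hypothetical helper, not part of FastPFor.

```python
def packed_bits(deltas):
    """Bits needed when every delta is stored at the width of the largest one."""
    width = max(deltas).bit_length()
    return width * len(deltas)


uniform = [10] * 128                     # every delta fits in 4 bits
with_outlier = [10] * 127 + [1_000_000]  # one delta needs 20 bits

print(packed_bits(uniform))       # 512
print(packed_bits(with_outlier))  # 2560
```

A single large delta makes the whole batch five times bigger in this example, even though the other 127 values did not change.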
I decided to see if I can do something about that and created an encoder that works on batches of 128 integers at a time. This encoder will:
- Check the maximum number of bits required to record the deltas of the integers. That alone already saves us a lot.
- Then we check the top and bottom halves of the batch, to see if we’ll get a benefit from recording them separately. A single large value (or a group of them) that is localized to a part of the batch will be recorded independently in this case.
- Finally, instead of only recording the meaningful bit ranges, we’ll also analyze the batch further. The idea is to try to find ranges within the batch that have the same distance from one another. We can encode those as repetitions instead of recording each value independently. That can end up saving a surprising amount of space.
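The three steps above can be sketched roughly as follows. This is not the actual encoder, just a cost model under stated assumptions: `HEADER_BITS` is a made-up per-segment overhead for recording a bit width, and all function names are hypothetical.

```python
HEADER_BITS = 8  # assumed cost of recording one bit width (hypothetical)


def width(deltas):
    """Bits required by the largest delta in the segment."""
    return max(deltas, default=0).bit_length()


def whole_cost(deltas):
    """Step 1: store the segment at a single bit width."""
    return HEADER_BITS + width(deltas) * len(deltas)


def split_cost(deltas):
    """Step 2: store the top and bottom halves at their own bit widths."""
    mid = len(deltas) // 2
    return whole_cost(deltas[:mid]) + whole_cost(deltas[mid:])


def runs(deltas):
    """Step 3: collapse consecutive identical deltas into (delta, count) pairs."""
    out = []
    for d in deltas:
        if out and out[-1][0] == d:
            out[-1] = (d, out[-1][1] + 1)
        else:
            out.append((d, 1))
    return out


batch = [8] * 127 + [1_000_000]  # 128 deltas, one outlier at the end
print(whole_cost(batch))  # 2568
print(split_cost(batch))  # 1552
print(runs(batch))        # [(8, 127), (1000000, 1)]
```

In this example the outlier only inflates the half it lives in, and the run of 127 identical deltas collapses into a single pair, which is where the big savings would come from.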
You can look at the results of my research here. I’ll caution you that this is raw, but the results are promising. I’m able to beat (in terms of compression rate) the standard PFor implementation by a bit, with a lot less code.
I’m also looking at a compression rate of 30% to 40% for realistic data sets. And if the data is set up just right and I’m able to take advantage of the repeated delta optimization, I can pack things real tight.
Current numbers say that I can push upward of 10,000 int64 values into an 8KB buffer without any repeated deltas. It goes to just under 500,000 int64 values in an 8KB buffer if I can take full advantage of the deltas.
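To put those figures in perspective, here is the arithmetic as a small Python snippet. It assumes 8KB means 8,192 bytes and uses the rounded counts from the text, so treat the results as approximations.

```python
BUFFER_BITS = 8 * 1024 * 8  # an 8KB buffer holds 65,536 bits
RAW_BITS = 64               # an uncompressed int64


def bits_per_value(count):
    """Average number of bits each value gets when `count` values fill the buffer."""
    return BUFFER_BITS / count


print(bits_per_value(10_000))   # roughly 6.55 bits per value
print(bits_per_value(500_000))  # roughly 0.13 bits per value
print(10_000 * RAW_BITS // 8)   # 80000 bytes if stored uncompressed
```

So even without repeated deltas, each value costs about a tenth of its raw 64 bits, and with full runs of repeated deltas the cost drops to a fraction of a bit per value.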
The reason I mention the deltas so often is that it is very likely that I’ll record values that are roughly the same size, so we’ll get offsets that are the same distance from one another. In that case, my encoder goes to town and the compression rate is basically crazy.
This is a small piece of a much larger work, but this is the first time in a while that I got to code at Voron’s level. This is fun.
That's a fun topic indeed! When I did similar research for my own sample data, I came to the conclusion that the Simple compressions worked very well for me (and with fast decoding). Most of the time they performed better than the PFor variants. See here: http://www.cs.otago.ac.nz/homepages/andrew/papers/2015-1.pdf Specifically, my final approach was delta & zig-zag encoding, then Simple16 compression (a variant with an RLE option). And since Simple16 supports ints less than 1<<24, I used Gorilla floating-point compression if any delta exceeded 1<<24 (which works pretty well for ints, actually).
Hi Yoav, That is quite an interesting insight. I haven't gotten to the point where I'm benchmarking the encoding / decoding performance yet; right now I'm focused on the real world. For my purposes, being able to detect and optimize runs of identical deltas gives a really awesome perf benefit. I'm focused on ensuring that this is a likely scenario by controlling how I generate the data for that.
If you're interested in the compression ratio, I develop a C library that outperforms most other integer algorithms: VTEnc, which is based on a novel algorithm that I invented. Just a heads-up that the library is not production-ready. Let me know if you think it can be useful for you.
Some discussion around this post over on Hacker News: https://news.ycombinator.com/item?id=27549075