Integer compression: delta encoding + variable size integers

Jun 06 2023

Integer compressiondelta encoding + variable size integers

time to read 3 min | 418 words

If you are building a database, the need to work with a list of numbers is commonplace. For example, when building an index, we may want to store all the ids of documents of orders from Europe.

You can just store the raw values, of course, but that can be incredibly wasteful in terms of disk space, memory bandwidth and performance. Consider a database where millions of orders are from Europe. We’ll need multiple megabytes just to store the document ids. We can utilize patterns in the numbers to reduce the total storage size. For example, if you are storing a list of sorted numbers, you can store the delta of the numbers, instead of the whole value. The delta tends to be much smaller, after all. Take a look at the following code:

This will generate (in place) delta for the sorted list. We can then write it to a stream, using 7-bit integer (also called variable size integers) encoding. This way, smaller numbers take less space, which is the point of first doing delta encoding.

For testing purposes, I have a list of 44,956 items, which looks like this (full list is here):

1871143144
1871143148
1871143152
1871143156
1871143160
1871143164
1871143168
1871143172
1871143176
1871143180

Note that I’m using int64, so if we would store the data as is, it would take ~351KB. Using delta encoding, we can pack that into just ~45Kb.

I care a lot more about the decoding of the values, so let’s see how we read it back, shall we?

With very little code, we can gain quite a jump in information density. That is quite good. For reference purposes, taking the same data and compressing it using GZip would cost us ~58KB, so significantly more than this simpler mechanism.

On the other hand, using GZip on the delta encoded list will put us under 8KB. This is because in this context, there is a great deal of repetition in the data, there are many runs that are exactly 4 values apart from one another, so we can get quite a good compression ratio out of this.

That does require that we’ll have patterns in the data, which is by no means a guarantee. There is a whole field out there of research into integer compression, since it is such an important aspect of how database internals are working.

In this series, I’m going to focus on detailing how we implemented FastPFor inside of RavenDB, what we learned along the way and what the final results were.

Tweet Share Share 0 comments

Tags:

Oren Eini

Oren Eini

CEO of RavenDB