Integer compressiondelta encoding + variable size integers
If you are building a database, the need to work with a list of numbers is commonplace. For example, when building an index, we may want to store all the ids of documents of orders from Europe.
You can just store the raw values, of course, but that can be incredibly wasteful in terms of disk space, memory bandwidth and performance. Consider a database where millions of orders are from Europe. We’ll need multiple megabytes just to store the document ids. We can utilize patterns in the numbers to reduce the total storage size. For example, if you are storing a list of sorted numbers, you can store the delta of the numbers, instead of the whole value. The delta tends to be much smaller, after all. Take a look at the following code:
This will generate (in place) delta for the sorted list. We can then write it to a stream, using 7-bit integer (also called variable size integers) encoding. This way, smaller numbers take less space, which is the point of first doing delta encoding.
For testing purposes, I have a list of 44,956 items, which looks like this (full list is here):
1871143144 |
1871143148 |
1871143152 |
1871143156 |
1871143160 |
1871143164 |
1871143168 |
1871143172 |
1871143176 |
1871143180 |
Note that I’m using int64, so if we would store the data as is, it would take ~351KB. Using delta encoding, we can pack that into just ~45Kb.
I care a lot more about the decoding of the values, so let’s see how we read it back, shall we?
With very little code, we can gain quite a jump in information density. That is quite good. For reference purposes, taking the same data and compressing it using GZip would cost us ~58KB, so significantly more than this simpler mechanism.
On the other hand, using GZip on the delta encoded list will put us under 8KB. This is because in this context, there is a great deal of repetition in the data, there are many runs that are exactly 4 values apart from one another, so we can get quite a good compression ratio out of this.
That does require that we’ll have patterns in the data, which is by no means a guarantee. There is a whole field out there of research into integer compression, since it is such an important aspect of how database internals are working.
In this series, I’m going to focus on detailing how we implemented FastPFor inside of RavenDB, what we learned along the way and what the final results were.
More posts in "Integer compression" series:
- (21 Jun 2023) FastPFor in C#, results
- (20 Jun 2023) Implementing FastPFor decoding in C#
- (19 Jun 2023) Implementing FastPFor encoding in C#
- (16 Jun 2023) Adapting FastPFor to RavenDB
- (15 Jun 2023) Porting simdcomp to C#
- (14 Jun 2023) The FastPFor code
- (13 Jun 2023) Understanding FastPFor
- (12 Jun 2023) SIMD bit packing and unusual usages
- (08 Jun 2023) Using SIMD bit packing in practice
- (07 Jun 2023) Understanding Simd Compression by Lemire
- (06 Jun 2023) delta encoding + variable size integers
Comments
Comment preview