Integer compressionSIMD bit packing and unusual usages
I talked a bit before about the nature of bit packing and how the simdcomp library isn’t actually doing compression. Why do I care about that, then?
Because the simdcomp library provides a very useful building block. A simple set of functions that allows you to very quickly pack and unpack bits. That is important, since those operations are fast enough that we can treat them as effectively free. Once that exists, we can now start finding interesting usages for those.
Let’s assume that we have the following functions:
- Pack(Span<uint> input, Stream output, int bit);
- Unpack(Stream input, Span<uint> output, int bit);
It get a set of numbers and the number of bits and encode or decode them as needed.
Because those methods are fast, it means that we can play with various interesting designs. For example, let’s take dictionary compression. Assume that we have a set of values, such as country names. This looks like this:
[ "United States", "China", "Japan", "India", "United States", "Brazil", "Germany", "India", "France", "China", "Japan", "Brazil", … ]
This array contains thousands or millions of entries. We want to encode that in an efficient manner, how can we do that? Here is a simple way to utilize the speed of bit-packing to do something useful:
What is going on here?
- We first iterate over the set of countries, finding the unique names and giving an index for each one of them.
- Along the way, we produce an integer array of the indexes of country names.
- Then we compute the maximum number of bits that we have to use to store those indexes.
- To the output, we write the number of unique country names, the names themselves and then the bit-packed set of indexes.
Let’s assume that we have 30 unique values in this list, but the list itself is 5 million items in size. What is the actual cost here?
We’ll assume an average country name of 8 bytes, to make the math easy. That means that to store the list of countries would cost us 38MB, if we did that using the most obvious approach.
Using the code above, assuming 30 unique values, we’ll have 5 million values with 5 bits each, leading to a total cost of about 3MB, instead of 38MB.
The nice thing about this approach is that once you have fast bit-packing routines, you can now start using them in all sorts of interesting ways.
The reason I talked about the topic before starting to discuss FastPFor is that it makes a lot more sense to understand how bit-packing can be applied separately from FastPFor, since that is just making interesting use of the capability.
More posts in "Integer compression" series:
- (21 Jun 2023) FastPFor in C#, results
- (20 Jun 2023) Implementing FastPFor decoding in C#
- (19 Jun 2023) Implementing FastPFor encoding in C#
- (16 Jun 2023) Adapting FastPFor to RavenDB
- (15 Jun 2023) Porting simdcomp to C#
- (14 Jun 2023) The FastPFor code
- (13 Jun 2023) Understanding FastPFor
- (12 Jun 2023) SIMD bit packing and unusual usages
- (08 Jun 2023) Using SIMD bit packing in practice
- (07 Jun 2023) Understanding Simd Compression by Lemire
- (06 Jun 2023) delta encoding + variable size integers
Comments
I think you forgot an else in there, you never actually add values to dic
Echo, That happens on line 9. You enter the if if there is no value in the dic already
Could you use the simdcomp library as tool to test optimization (3NF) of a database?
Blaise,
I'm not sure that I follow what you mean here. Can you explain?
I get that it's not production code and not the point, but iterating
Dictionary<K,V>.Keys
like that is incorrect, because the order the keys are returned in is not guaranteed.Svick,
You absolutely can, assuming that there are no deletes, the iteration order of a dictionary in .NET is insertion order.
That's not what the documentation says. And I don't think you should depend on implementation details of the current implementation.
Svick,
I rely on Hyrum's Law and the MS backward compatibility guarantees. More to the point, if this is ever a problem, shifting to something that does guarantee the order is simple
Comment preview