As I mentioned, I spent quite a lot of time trying to understand the mechanism behind how the FastPFor algorithm works. A large part of that was the fact that I didn’t initially distinguish the separate steps in the process. Finding the bits' widths, packing the bits, metadata and exception management are all integral to how this works, and I tried to grok it all in one shot. I kept getting lost in 14KLOC files with dense SIMD instructions.
When looking at the code, it is also important to keep in mind that this is not a library that you are meant to pick up & use. This is a research project. As such, there was enough work done to show the results, but all the spit & polish that are associated with making a library ready for general consumption weren’t done.
A simple example of that is that the code just assumes that the output buffer provided will be large enough (with no way to estimate upfront) and will overflow the buffer if that happens. Perfectly acceptable for research code, hell no for production usage, naturally.
In the same manner, the repository contains a lot of code. Most of that is being used as part of comparing the FastPFor algorithm to various other options. That can make it hard to understand what are the core parts of the algorithm and what is effectively chuff.
To be honest, the hardest part for me, however, was figuring out the memory allocation patterns that are going on in here. There is a high usage of C++ containers, with implicit allocations and hidden memory management. Here is a piece of code that I had to struggle to understand for quite some time, for example:
This does bit-packing using SIMD without a mask (the naming convention is something that you really have to get used to, I don’t think I encountered justlowercase before).
Note the call to fastpackwithoutmask() in there, it assumes that the provided buffer has exactly 32 elements. A lot of the code in the project assumes that you are passed a buffer of a certain size and operate on that directly. That produces great results in terms of efficiency and code density. But when I read this code it was “obvious” to me that there is an out-of-band work here. If the provided buffer isn’t perfectly aligned on 32 boundary, that method will write read or write beyond the end of the buffer.
Look at line 9 there, where the source container is being resized to ensure that this is fine, since we increase the container size to be exactly 32 elements aligned. We revert this operation later in line 20.
For some SIMD instructions, it matters the alignment of the buffers we are working on. This is handled using a custom allocator on the containers, which is not usually something that I would pay attention to.
In short, from my perspective, there is a lot of hidden behavior in non-trivial C++ code usage there that masks the actual behavior of the algorithm in question.
If you are looking at FastPFor (and if you care enough to look at this series of posts, you probably should), take into consideration that this code shouldn’t be applied as is, but should probably be adapted for your needs.
In the next post, I’m going to start discussing how we did just that for RavenDB.
More posts in "Integer compression" series:
- (21 Jun 2023) FastPFor in C#, results
- (20 Jun 2023) Implementing FastPFor decoding in C#
- (19 Jun 2023) Implementing FastPFor encoding in C#
- (16 Jun 2023) Adapting FastPFor to RavenDB
- (15 Jun 2023) Porting simdcomp to C#
- (14 Jun 2023) The FastPFor code
- (13 Jun 2023) Understanding FastPFor
- (12 Jun 2023) SIMD bit packing and unusual usages
- (08 Jun 2023) Using SIMD bit packing in practice
- (07 Jun 2023) Understanding Simd Compression by Lemire
- (06 Jun 2023) delta encoding + variable size integers