Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 1 min | 195 words

We are going to be making some changes to our RavenDB 6.0 docker image, you can already access them using the nightly builds:

docker pull ravendb/ravendb-nightly:6.0-ubuntu-latest

The purpose of those changes is to make it easier to run RavenDB in your environment and to make it a more natural fit to a Linux system.  The primary reason we made this change is that we wanted to enable running RavenDB containers as non-root users.

Most of the other changes are internal, primarily about file paths and how we are actually installing RavenDB on the container instance. We now share a single installation process across all Linux systems, which makes is easier to support and manage.

This does have an impact on users' migration from RavenDB 5.4 docker images, but the initialization process should migrate them seamlessly. Note that if you have an existing 5.4 docker instance and you want to update that to 6.0 and run as non-root, you may need to explicitly grant permissions to the RavenDB data folder for the RavenDB user (uid: 999).

As usual, I would like to invite you to take the system for a spin. We would really love your feedback.

time to read 4 min | 631 words

On its face, we have a simple requirement:

  • Generate sequential numbers
  • Ensure that there can be no gaps
  • Do that in a distributed manner

Generating the next number in the sequence is literally as simple as ++, so surely that is a trivial task, right?

The problem is with the second requirement. The need to ensure that there are no gaps comes often when dealing with things like invoices. The tax authorities are really keen on “show me all your invoices”, and if there are gaps in the numbers, you may have to provide Good Answers.

You may think that the third one, running in a distributed environment, is the tough challenge, but that isn’t actually the case. If we are running in a single location, that is fairly easy. Run the invoice id generation as a transaction, and you are done. But the normal methods of doing that are usually wrong in edge cases.

Let’s assume that we use an Oracle database, which uses the following mechanism to generate the new invoice id:

invoice_seq.NEXTVAL

Or SQL Server with an identity column:

CREATE TABLE invoices ( invoice_id INT IDENTITY(1,1) PRIMARY KEY, ... );

In both cases, we may insert a new value to the invoices table, consuming an invoice id. At some later point in time, we may roll back the transaction. Care to guess what happens then?

You have INVOICE #1000 and INVOICE #1002, but nothing in between. In fact, no way to even tell what happened, usually.

In other words, identity, sequence, serial, or autonumber – regardless of what database platform you use, are not suitable for generating gapless numbers.

The reasoning is quite simple. Assume that you have two concurrent transactions, which generate two new invoices at roughly the same time. You commit the later one before the first one, and roll back the first. You now have:

  • Invoice #1
  • Invoice #2
  • Invoice #1000
  • Invoice #1002

However, you don’t have Invoice #1001, and you cannot roll back the sequence value there, because if you do so, it will re-generate the #1002 on the next call.

Instead, for gapless numbers, we need to create this as a dedicated part of the transaction. So there would be a record in our system that contains the NextInvoiceId, which will be incremented as part of the new invoice creation.

In order to ensure that there are no gaps, you need to ensure that the NextInvoideId record increment is handled as a user operation, not a database operation. In other words, in SQL Server, that is a row in a table, that you manually increment as part of adding a new invoice. Here is what this will look like:

As you can see, we increment the row directly. So it will be rolledback as well.

The downside here is that we can no longer create two invoices concurrently. The second transaction would have to wait on the lock on the row in the next_gapless_ids table.

All of that happens inside a single database server. What happens when we are running in a distributed environment?

The answer in this case, is the exact same thing. You need a transaction, a distributed one, using a consensus algorithm. Here is how you can achieve this using RavenDB’s cluster wide transactions, which use the Raft protocol behind the scenes:

The idea is simple, we have a transaction that modifies the gapless ids document and creates a new invoice at the same time. We have to handle a concurrency exception if two transactions try to create a new invoice at the same time (because they both want to use the same invoice id value), but in essence this is pretty much exactly the same behavior as when you are running on a single node.

In other words, to ensure the right behavior, you need to use a transaction. And if you need a distributed transaction, that is just a flag away with RavenDB.

time to read 3 min | 490 words

A long while ago I had this project, which allows you to define software licenses that you can distribute. The basic idea is pretty simple, we want to be able to define a key (needs to be short and copy/pastable) that we’ll be able to provide to our code that is running in a separate environment. In particular, we have to deal with the issue of not being connected to a central server.

That sounds like a really hard problem, but it turns out that it is a pretty simple solution, if we use public key cryptography. Typically you’ll utilize public key cryptography for encryption, but you can also use that for signing. The idea is that we can use the ability to sign the key using a private key, then validate it (potentially offline) using the public key.

Licensing can be complex, so we are going to punt all of that to someone else. In this post I’m going to just discuss how we can sign a piece of data and then verify it. We’ll start by generating the keys, this is an action that you should only do once:

Here is the output of this code:

You can now embed the public key in your software, while keeping the private key hidden.

With that in place, we can now sign a license. But what is a license? At its most basic form, it is a set of attributes that describe what your license allows. As such, we can use a Dictionary<string,string> to define the capabilities of the license.

With that in place, we can write very simple code to generate the license text:

And here is the result:

The last part to deal with is verifying that the license is valid, we can do that using the following function:

If the license options are the same as the one we signed and the cryptographic signature is a match, we can safely return the license options. Otherwise, we’ll return null.

As I said, this is pretty simple code all around. And doesn’t do much, but it means that you can let the framework carry most of the load.

Features such as enabling specific capabilities for a license, expiration, etc are all possible. Define an ExpirationDate property and have your software react to that, for example.

A few words about this implementation, however. I’m relying heavily on the .NET framework, obviously. But beyond just using the cryptographic primitives, we also have to take into account a bunch of other aspects.

For example, I’m not bothering to normalize the license. Instead, I rely on the fact that the .NET Dictionary will iterate over keys in insertion order. Note that any change in the actual data will result in a verification fails.

This is a pretty simple design, but it ensures that you cannot “crack” the algorithm used to generate the license keys. Of course, users can still always just patch the isValidLicense function, instead Smile.

time to read 2 min | 356 words

After this saga, I wanted to close the series with some numbers about the impact of this algorithm.

If you’ll recall, I started this whole series discussing variable-sized integers. I was using this list of numbers to compare the values. There are 44,956 items in the list.

Algorithm Size
Raw 359.648
Varint 224,780
Delta + Varint    45,970
Delta + Varint + Gzip             5,273
FastPFor      24,717

You can see some interesting details about the various options. Delta + Varint + Gzip is a really good setup, if you can assume that the output pattern has a high degree of repetition. In some cases, that is possible, but that isn’t a generic property. Aside from that, FastPFor is winning hands down in terms of the output size of the data.

There is also another important aspect. The size of the output here is too big to fit in a single 8KB buffer, so I’ll need to use multiple for that. This is not something that you can really do with GZip. Here is the cost across multiple buffers:

This particular list would fit into 4 pages:

  • 14,080 entries at 8,048 bytes
  • 14,080 entries at 8,063 bytes
  • 15,616 entries at 8,030 bytes
  •   1,180 entries at    644 bytes

But the compression ratio is only part of that. Let’s talk about the performance aspect. On my machine, you can run the encoding process in 1,115 ticks and decoding in just 462 ticks.

To give some context, that means that you can do the encoding at a rate of ~400,000,000 / second and decoding at a rate of about 1 billion per second.

My perf team also asked me to mention that they haven’t gotten the chance to look at the code yet, so things are likely to improve.

The entire premise of FastPFor inside of RavenDB relies on these fast decoding & encoding times. It makes it super cheap to iterate over a vast amount of numbers, and the B+Tree layout we have means that updating a posting list’s page is trivial. Read it, merge, and encode it back. It’s fast enough that there is really no other place for meaningful optimizations / complexity.

time to read 6 min | 1029 words

In the previous post, I discussed FastPFor encoding, now I’m going to focus on how we deal with decoding. Here is the decode struct:

image

Note that this is a struct for performance reasons. We expect that we’ll have a lot of usage here, so saving the allocation here ends up paying high dividends.  Before we’ll start, I want to pay special attention to the fields on the decoder:

Of particular interest is the _exceptionOffsets array. If you aren’t familiar with it, this is a fixed-size array on the struct.

Here is the constructor for the decoder:

We are being provided with the encoded buffer and its size. What is actually happening here?

We start by allocating memory for the exceptions buffer, then scan the exceptions bitmap and extract the exceptions data to the buffer. We use a single buffer for all of the exceptions, and the _exceptionsOffsets is an indication of where each bit width exception is currently at.

Finally, we set the _prev to be a vector of the baseline. That is the counterpoint to how we dealt with the values during the encoding. Like the encoding process, we have to deal with three states:

  • Big deltas (for the next block)
  • Variable integer (at the end)
  • Packed block of 256 numbers

We’ll start with dealing with the first two options:

If we have a big delta, we record that in the bigDeltaOffsets. Note that we are playing an annoying game here. I’m actually storing the metadata position in the delta offsets. I can cast it to long and then to pointer because I’m ensured that the data is fixed during the operation.

For the varint scenario, the last item in the batch, we don’t have to do much, we compute the running total of the values as we read them from the input, that’s about it.

Things get a lot more interesting when we start dealing with the actual compressed blocks:

Here we simply decode the values to the buffer, then we need to apply the exceptions. You can see that we saved some space by not storing exceptions of 1 bit (since we know what the value would be). For exceptions of different sizes, see how we consume the exception from _exceptionOffsets for each used exception. I’m using a ref variable here, so the offset++ operation will increment the value in the array. That ensures that we have to keep very little state in the decoding process as it moves forward. Remember that we require that the output buffer for the decoded numbers be at least 256 values, to ensure that we can fit them all in. That doesn’t mean that we have enough space to fit everything. So we may be called multiple times and need to resume from where we left off.

Finally, we set the expectedBufferIndex if we have a big delta offset. We’ll shortly see what we’ll do with this.

Remember that at this point, the buffer we use has int32 deltas in it, not the actual final numbers. In order to get the final values, we need to compute the sum of the deltas, this is handled here:

What do we do here? We load a vector of 8 integers (32 bits) and widen that to two vectors of 4 integers (64 bits) each.

We then check whether this is the expected buffer index for big delta offsets, if so, we’ll or the high parts of the number back to the vector. This is handled here:

There are quite a few things that are happening here all at once. We first read the current delta offset from the metadata, and read 16 bytes into a Vector128. We then translate that vector to a vector of 256 bits. That would basically add zeros to the vector. We set the next index to check and then we shuffle the vector on numbers to arrange it on the high parts of the vector as longs.

Let’s say that we had the numbers [1,2,3,4] in the metadata, we read them into the Vector128, like so:

Vector128 = [1,2,3,4]

We then extended that to a Vector256, like so:

Vector256 = [1,2,3,4,0,0,0,0]

Finally, we shuffle the values, (in this case, 7 is known to be zero) so we have:

highBitsDelta = [0,1,0,2,0,3,0,4]

We convert that to a vector of longs (with the same bits pattern) and then or that with the values from the packed deltas.

We have to do the expectedBufferIndex twice on each iteration, but we expect that to be rare, so the branch predictor is going to be really good in predicting this.

Finally, we have the part where we compute the prefix sum, here:

This looks like black magic, but let’s break it apart into individual pieces. 

Let’s assume that we started out with [25,27,30, 35] – using our approach, we have Baseline: 21 and the deltas: [4,2,3,5].

We start with prev being set to the baseline value on all elements in the vector:

prev = [21,21,21,21]

And cur is set to deltas:

cur = [4,2,3,5]

On line 5, we shuffle and BitwiseAnd the value with a mask, let’s see what we get:

Line 5 – shuffled: [4,4,2,3]

LIne 5- masked: [0,4,2,3]

We add that to cur, giving us:

cur = [4 + 0, 4 + 2, 3 + 2, 5 + 3] = [4,6,5,8]

On line 7, we do another shuffle & mask:

Line 7 – shuffled: [4,4,4,6]

Line 7 – masked: [0,0,4,6]

And adding that to cur, will give us:

cur = [4+0, 6+ 0, 5 + 4, 8+ 6] = [4,6,9,14]

We now add the prev vector, giving us:

cur = [4 + 21, 6 + 21, 9 + 21, 14 + 21] = [25, 27, 30, 35]

We computed the sum again, hurray!

We then set all elements of prev to the last value of cur:

prev = [35,35,35,35]

And now we are ready for the next element to compute.

And… this is pretty much it, to be honest. We keep reading until we are out of space in the buffer or consumed all the metadata. We are able to decode the data really fast, and it only took a 10 parts (so far) series of blog posts to explain.

In the next post, I’m going to discuss what we can see from actually using this with real numbers.

time to read 8 min | 1553 words

In the previous post I outlined the requirements we have for FastPFor in RavenDB. Now I want to dig into the actual implementation. Here is the shape of the class in question:

image

The process starts when we start encoding the items. Each call to encode is going to process a different posting list, and it is expected that a single instance will process many posting lists during a single indexing run.

Here is the internal state of this class:

image

Here is how we start the process:

As you can see, there isn’t much here. We initialize the state of exceptions and exceptionsStarts (will discuss this in more detail later in this post) and then we ensure that we have a big enough buffer to encode the data. Note that we retain a point to the original entries that we were passed, that will be important later as well.

Here is the code of the encoding process:

I create a couple of vector variables, one to hold the first item in the list and the other to hold uint.MaxValue. Then I start processing the buffers in blocks of 256 items.

A good resource for using vectors and SIMD in C# can be found here.

The classic FastPFor is using vectors of 128 bits (although there are implementations of up to 512 bits). I chose to use Vector256 for the implementation. Today, pretty much any x64 machine is going to have native support for AVX2 (shipped since 2011!). On ARM, they are implemented using Vector128, so they are still vectorized there as well.

What this means is that we can process 256 numbers as a block. But what are we doing with them? Look at the inner look, where you can see some interesting details. We are using shuffle instructions as well as masking to shift the current value and add the previous one.

In other words, let’s assume that we have the following input array: 1,2,3,4,5,6,7,8 – using longs, so Vector256 will contain 4 elements.

On line 5, we create the prev vector and assign it to have 4 copies of the first item in the array:

prev = [1,1,1,1]

On line 13, we load the 4 elements from the array into cur, giving us:

cur = [1,2,3,4]

On line 14, we shuffle both the prev and the cur values, here are intermediate states:

curShuffled = [0,1,2,3] – note that we have 0 at the start because of the mask

prevShuffled = [1,0,0,0] – the last three items are zero because of the mask.

We then or those two intermediates together, giving us:

mixed = [1,1,2,3]

We subtract that from cur to get the delta. The idea is that we massively reduce the number of operations that we have to perform to compute the delta of the values.

On lines 19..22 we check if any of the deltas is greater than uint.MaxValue, we’ll dig into that in a bit.

The last part of the loop is when we deal with detlaInts, where we are doing some interesting things. We are pulling the low bits from all four elements in the vector to the start of the vector. Then we write the deltas to the output entries buffer.

However, we write 8 values, but we increment the position by 4, so we’ll actually overwrite the last 4 items in the next loop iteration. The entries buffer is sized based on the entries buffer, so we are ensured to have enough space, and the double writes give us better performance.

The end result of the inner loop is that we are performing running delta over the items, converting them from longs to ints. Then we can call the ProcessBlock() method to actually do the code of the FastPFor algorithm:

I already explained how this works in this post.

It is important to understand that we have multiple data channels that we operate on here:

  • The entries output, where we write the packed bits
  • Metadata, where we describe each block
  • Exceptions, where we store the exception data

The encoding process runs through all of them, but isn’t actually going to be writing them anywhere, that happens later.

We process the entries in the array in blocks of 256 numbers each, but what happens when we get to the end? The last block size may not be a multiple of 256, after all. For the last item, we use simple delta encoding plus variable-sized integers. Nothing really fancy is needed at this point, since that is very small.

Finally, we run over all the exceptions data that we created during the encoding process and compute their size. The final result we have is the total size we would need to store all the encoded data as a single buffer.

What about handling numbers whose delta is greater than uint.MaxValue?

You can see that we expect this to be rare, so we handle that to be on the safe side, but aren’t trying to be too optimal about it. We simply take the high portion of the current delta and write it to the metadata. That is wasteful in terms of space, but it is simple and doesn’t require complex code. We’ll see how that is being used during the decoding process in the next post.

Finally, it is now time to actually write things out to the buffer. We have to be careful here, since we need to support partial writes. Let’s see how that works:

We define a header for the encoded data, which looks like this:

image

The header contains the initial value of the encoded data, the exceptions bitmap (discussed here) and the exception and metadata offsets. Because we always limit the size of the buffer to a maximum of 8KB, I used ushort here.

The first value we set in the header is the baseline, here we need to explain a bit. When we encode the data, we start from the very first item in the list, and go forward. The problem with doing things in this manner is that when we need to split the encoded data, we need to use the baseline that is the previous value that we encoded with. For that reason, we use either the first item in the entries array or the item from the end of the previous block that we consumed.

Next, we need to deal with the exceptions, as a reminder, we have a set of exceptions that are shared across blocks. When we do a partial write, we need to keep track of how many exceptions we consumed from each of the exceptions array.

The most interesting part comes when we start iterating over the metadata values, which tells us how to deal with the data:

Here we start with the exceptional cases, of a metadata value that is greater than the max delta or the varint batch marker (which occurs in the end).

You can see that there isn’t much there, except that we verify that we have enough space for the written buffers. For fun, I’m also using a goto here, this is needed to break out of the loop, since I cannot use a break statement in a switch case.

The more interesting aspect happens when we process full blocks of compressed 256 numbers:

This looks like a lot of code, but what it is actually doing is to mostly checking that we have enough space in the buffer for the block we are about to write. Of particular interest to us here is that we may have enough space to write the block, but we also need to ensure that we have space for the metadata and exceptions data that is associated with the current and all previous blocks.

If we have enough space, we continue to write more blocks to the output, or we exit if we are about to run out of room. After the loop, we have the header and compressed integers data in the output buffer, but we still need to write the metadata and exceptions data, this is handled here:

We compute the bitmap of all the exceptions, writing them out as packed segments (basically, 256 blocks of packed bits). Note that we have to account for how many of them we actually write out to the output, since we may have stopped in the middle.

Then we can simply write out the metadata buffer and return the total size we took.

I find it very amusing that we use the same bit packing routines to store the numbers and the exceptions data as well.

Overall, most of the code isn’t particularly interesting, but there are a few highlights that I would like to point out:

  • We compute the deltas and the narrowing from int64 to int32 on the buffer during the Encode()
  • We do the actual bit packing during the Write()

There is some argument to make here that we can do the bit packing during the Encode() as well, but it is unlikely to matter, I believe.

In the next post, I’m going to be diving into how we implement decoding for FastPFor. That is actually of far more interest to me, since that happens a lot more often. It’s going to be interesting…

time to read 6 min | 1155 words

In this series so far, I explored several ways that we can implement integer compression. I focused on the FastPFor algorithm and dove deeply into how it works. In the last post, I showed how we can use the SIMD intrinsics to generate vectorized code that can process a lot of data very quickly.

With this baseline established, we need to understand the use case we have in RavenDB for integer compression. We are actually using this for the classic scenario, posting lists as part of the indexing engine. However, there are quite a few differences in how we want to actually utilize it.

Let’s take Lucene as a good example, since that is one of the classic examples for a search engine. From the ground up, Lucene is meant to be used in batching scenarios. That has a lot of impact on the design. In particular, it means that a posting list for Lucene is something that you write once in a big batch of operations. For RavenDB, however, we explicitly reject his approach. We want to allow to do on-the-fly indexing and we cannot afford to pay the cost of merging many small writes to big files over time.

The FastPFor algorithm is meant to get a lot of data, encode that in a compressed form and then allow you to iterate over the results as fast as possible. But RavenDB’s needs are different. At the most basic level, RavenDB thinks about things in 8KB pages. A posting list for RavenDB is composed of the following:

  • Single – just a single value (common when indexing unique values)
  • Small – can fit in under 4KB
  • Large – requires more than 4KB of data, in which case we split it on 8KB boundaries.

What’s the point here? The idea is that when we need to update a posting list (add or remove items), we have one of those three options:

  • Single – we just update the value, duh!
  • Small – FastPFor is fast, right? It is cheap enough to unpack the compressed values, merge with the new items (or the removals) and compress again.
  • Large – here we are talking about a lot of values, potentially tens of millions or higher. How do we handle updates in this case?

The answer is that we are going to be making use of a B+ Tree, with some caveats. The basic idea is that we have two types of pages in the tree. A branch page is the classic B+ Tree concept, which points to leaf pages below it that contain some range of values. However, the leaf pages, on the other hand, are very different.

Each leaf page is just an 8KB buffer that holds the compressed data, nothing more. When we want to update the values in the posting list, we find the relevant page that needs to be updated, then we just unpack, merge & compress the values. It’s actually cheaper to go this route than any other alternative.

There is another important aspect. Lucene (and most other indexing systems) assume that your posting list is using 32 bits integers. That limits the number of items in an index to 2.1 billion (up to 4.2 billion, for some implementations). I don’t like that limit, not at today’s scale. So we internally use 64bits integers in our posting lists.

That… however, is actually too big. In practice, we use 54 bits for the indexed entries. That gives us a maximum number of 18,014,398,509,481,984 items and at a range of 100,000 items per second, we are good for 5,712 years.  Those 10 bits allow us to encode additional information about the actual entry that is being indexed, and reduce the number of external lookups we need to do. Taking 10 bits from the entry id means that we are going to hit the 32 bits limit at around 2 million entries, which is really low.

Then there is another problem, if I have a posting list I want to compress that has 100,000 items in it, I would need to compress that into multiple pages, but figuring out how to split the compressed stream in a way that makes it usable to work with is not trivial at all.

In short, we have two major concerns with using FastPFor in RavenDB, int64 and paged writes. There is another aspect that I didn’t mention, however. FastPFor will take an array of numbers and compress them. If you want to also use delta compression, that is something that you need to add on top of that.

Now that we understand the constraints we have to operate around, let’s talk about the kind of API that I would like to build:

image

What is going on here?  Well, the basic idea is that you’ll create an encoder as part of the indexing process. For each posting list that you want to process, you’ll call the Encode() method, passing it the data to be encoded.

The Encode() method will return the required size for encoding the entire list as a single operation. Note that it does not actually write the data out (indeed, it has nowhere to write it). The calling code is then able to make decisions based on the returned size. If they have a buffer big enough to hold the encoded data, they can pass it to the Write() method.

Things get more interesting when we don’t have a suitable buffer. Remember that we have a limit of 4KB for small posting lists and 8KB per page for large posting lists. The idea with Write() is that it will write as much as it can to the output, allowing you to use multiple buffers to store the compressed posting list.

In terms of work being done, the Encode() phase copies the post list entries to its own buffer, preparing it to be written out. The process of actually writing the data to the output buffer is mostly concerned with ensuring that we don’t exceed the size of the provided buffer.

This approach means that we only have to pay the encoding code once, which is quite nice. Moreover, the number of checks that we have to make to ensure that we can fit the data into the buffer is also greatly reduced.

What about decoding? That is handled using:

image

Here we have a very interesting observation, the Encoder is a class, but the Decoder is a struct. We assume that decoding is far more frequent, so we use a struct to avoid an unnecessary memory allocation.

In order to ensure high performance, the decoder makes assumptions about the provided buffers. We will always accept a buffer that has a minimum size, for example.

With all of this out of the way, I’m going to dedicate the next post to discussing the actual implementation and semantics we have for FastPFor encoding.

time to read 4 min | 715 words

In the code of the simdcomp library there is a 25KLOC file that does evil things to SIMD registers to get bit packing to work. When I looked at the code the first few dozen times, I had a strong desire to run away screaming. Luckily, this isn’t just some pile of complicated code, but well-thought-out set of functions that are meant to provide optimal code for specific tasks. In particular, the code is specialized for each bit width that you may want to bit pack (0 .. 32). Even better, no one actually sat down to write it out by hand, there is a Python script that would generate the code.

The first step was to understand what exactly is going on in the code, and then see how we can translate that to C#. Even just a few years ago, that would have been either an impossible dream or required the use of a native library (with the associated PInvoke overhead). However, .NET today has a very robust intrinsics support, and vectors / SIMD instructions are natively supported.

I actually had a tougher challenge, since I want this code to run on x64 and ARM64 instances. The original code is x64 only, of course. The nice thing about SIMD support for .NET is that most of that can be done in a cross platform manner, with the JIT deciding what instructions will actually be emitted. There is still a very strong correlation between the vectorized code and the instructions that are being emitted, which means that I get both great control of what is going on and the appropriate abstraction. I was actually able to implement the whole thing without dropping to architecture-specific code, which makes me very happy.

Before we get any deeper, let’s take a simple challenge. We want to take an array of 32 integers and pack each one of them in 4 bits into an array of 16 bytes. Here is what the code will look like:

This is a bit dense, but let’s see if I can explain what is going on here.

We load from the array a vector (4 items) at the 0, 4, 8, 12, 16, 20, 24, and 28 intervals. For each one of those, we shift the values by the required offset and or all of them together. Note that this means that the first item’s four bits go in the first nibble, but the second item’s bits go in the fifth nibble, etc.

The idea is that we are operating on 4 items at a time, reducing the total number of operations we have to perform. It may be easier to understand if you see those changes visually:

image

What is happening here, however, is that we are able to do this transformation in very compact code. That isn’t just an issue of high-level code, let’s take a look at the assembly instructions that this generates:

I’m going to assume that you aren’t well versed with assembly, so let’s explain what is going on. This code contains zero branches, it does four reads from memory, mask the elements, shift them and or them together.

The relevant instructions are:

  • vmovupd – read 4 integers to the register
  • vpand – binary and with a value (masking)
  • vpslld – shift to the left
  • vpor – binary or
  • vmovdqu – write 16 bytes to memory

There are no branches, nothing to complicate the code at all. This is about as tight as you can get, at the level of machine instructions.

Of particular interest here is the C# code. The entire thing is basically a couple of lines of code, and I could express the whole thing as a single expression in a readable manner. Let’s look at the C code, to compare what is going on:

Note that both should generate the same machine code, but being able to rely on operator overloading means that I can now get a far more readable code.

From that point, the remaining task was to re-write the Python script so it would generate C# code, not C. In the next post I’m going to be talking more about the constraints we have and what we are actually trying to solve with this approach.

time to read 3 min | 563 words

As I mentioned, I spent quite a lot of time trying to understand the mechanism behind how the FastPFor algorithm works. A large part of that was the fact that I didn’t initially distinguish the separate steps in the process. Finding the bits' widths, packing the bits, metadata and exception management are all integral to how this works, and I tried to grok it all in one shot. I kept getting lost in 14KLOC files with dense SIMD instructions.

When looking at the code, it is also important to keep in mind that this is not a library that you are meant to pick up & use. This is a research project. As such, there was enough work done to show the results, but all the spit & polish that are associated with making a library ready for general consumption weren’t done.

A simple example of that is that the code just assumes that the output buffer provided will be large enough (with no way to estimate upfront) and will overflow the buffer if that happens. Perfectly acceptable for research code, hell no for production usage, naturally.

In the same manner, the repository contains a lot of code. Most of that is being used as part of comparing the FastPFor algorithm to various other options. That can make it hard to understand what are the core parts of the algorithm and what is effectively chuff.

To be honest, the hardest part for me, however, was figuring out the memory allocation patterns that are going on in here. There is a high usage of C++ containers, with implicit allocations and hidden memory management. Here is a piece of code that I had to struggle to understand for quite some time, for example:

This does bit-packing using SIMD without a mask (the naming convention is something that you really have to get used to, I don’t think I encountered justlowercase before).

Note the call to fastpackwithoutmask() in there, it assumes that the provided buffer has exactly 32 elements. A lot of the code in the project assumes that you are passed a buffer of a certain size and operate on that directly. That produces great results in terms of efficiency and code density. But when I read this code it was “obvious” to me that there is an out-of-band work here. If the provided buffer isn’t perfectly aligned on 32 boundary, that method will write read or write beyond the end of the buffer.

Look at line 9 there, where the source container is being resized to ensure that this is fine, since we increase the container size to be exactly 32 elements aligned. We revert this operation later in line 20.

For some SIMD instructions, it matters the alignment of the buffers we are working on. This is handled using a custom allocator on the containers, which is not usually something that I would pay attention to.

In short, from my perspective, there is a lot of hidden behavior in non-trivial C++ code usage there that masks the actual behavior of the algorithm in question.

If you are looking at FastPFor (and if you care enough to look at this series of posts, you probably should), take into consideration that this code shouldn’t be applied as is, but should probably be adapted for your needs.

In the next post, I’m going to start discussing how we did just that for RavenDB.

time to read 9 min | 1766 words

The FastPFor is an integer compression algorithm that was published in 2012 initially. You can read the paper about it here: Decoding billions of integers per second through vectorization.

I’ve run into this algorithm many times in the past. You pretty much can’t be in the database arena and not run into that. It is an interesting paper, and it has a GitHub repository with the code, which is great. Except that I couldn’t figure out what was going on there.

I actually tried stepping through the code a bunch of times, and I always ended up getting lost. The code is in C++ and makes heavy use of templates, containers and non-trivial amounts of magic. To give some context, I gave up when I run into these code snippers:

The paper itself describes the algorithm, but not in a way that actually made sense to me. I tried looking at the code, and I basically noped out. Too complex for me to understand, it seems.

But I kept running into this, and we recently had a strong need for something similar. So I basically took the time to meditate on this for a while.

On a personal level, I realized that I was trying to understand how the whole thing worked by just stepping through the code and reading the paper. The problem was that I was mixing several different layers and wasn’t able to create the proper separation between them.

FastPFor builds upon the previously discussed simdcomp, once you understand that, you can go through this in isolation. For this reason, I first talked about SIMD bit-packing and showed an example usage, before actually diving into how FastPFor itself works.

As a reminder, simdcomp provides routines for quick packing and unpacking of integers, nothing more. FastPFor is where the actual compression happens. Let’s see what is actually going on here. Consider the following list, which we previously discussed:

1871143144

4

4

4

4

4

4

4

4

4

4

7984

4

4

4

4

If we’ll try bit-pack this list, we’ll find out that we aren’t gaining much. The first value is 31 bits in length, after all, and that means that we aren’t able to save much. We talked about Frame of Reference (FOR) as a way to handle that. We can treat every128 block of numbers as a block that has its own reference point and compute the maximum number of bits for that location. That would actually save us a lot of space, but it isn’t ideal. In the case above, most entries in the list can be packed in just 3 bits, except 2. That is where PFor comes into play, which stands for Patched Frame of Reference.

The key aspects of FastPFor are how it figures out what is the best size to use to pack the numbers, the way it detects what should be patched, and the manner in which it stores this.

The code boils down to a single function:

What is going on here? This function accepts an array of integers (whose size is 128) and computes the best number of bits to use to pack it.

How does that work? The first thing this does is compute how many numbers we have for each bit range. This is done using the asmbits() call, which is basically a LeadingZeroCount(). The end result is a frequency map that we can use to estimate how many bits will be required to store the items from this block.

We start from the maximum number of bits that we need to store all the numbers in the block, and we go down, checking what would be the best cost (in terms of space). Since we need to store exceptions somewhere, we compute the cost of that as we go down in the list.

This function gives us:

  • bestb – the number of bits needed to pack all the items in the block that would result in the smallest output size
  • bestcexception – the count of exceptions (numbers that do not fit into bestb bits
  • maxb – the maximum number of bits that we need to store for this block

That is the key function, the one that is responsible for getting the best compression ratio.

What I found that made FastPFor challenging to understand is that it has a lot of state that you need to keep track of. What I described so far is a single block, but FastPFor operates over sets of numbers, and so needs to tie all of them together.

At any given point, FastPFor has:

  • The block that is being outputted
  • The metadata about the entire process
  • 32 arrays of exceptions

The process interleaves all of them together in interesting ways, and I had a hard time figuring out how it all ties together.

Let’s talk about the way FastPFor processes a single block. We process the data in the array in blocks of 128 numbers at a time, like so:

The first thing that happens is the computation of the best bits widths for the current block. Then, we use the bc value to record the metadata about the block.

That is an array of bytes with the following format:

  1. Packed number of bits
  2. Number of exceptions

If there are exceptions for this block, we also add:

  1. Maximum number of bits
  2. Offset of exception in the block (repeated as the number of exceptions)

The metadata is shared for the entire encoding operation, across blocks.

You can see in the code that if bestcexcept (which holds the number of exceptions) is greater than zero, we find the appropriate exceptions buffer to use. That requires a better explanation.

the getBestBFromData() call gave us two important data points, the best number of bits to pack the numbers, and the maximum number. We are going to pack all the numbers in the block to the output, but what about the exceptions? Where do they go?

It turns out that we need to store the exceptions, but we don’t actually need to store max bits, only the difference between the best number of bits and the maximum. This is what thisexceptioncontainer is holding, the remainding bits for exceptions. It’s important to understand that this is done across blocks. So the thisexceptioncontainer value holds exceptions from many different blocks. That will turn out to be very important in just a little bit. We then scan the block and write the remainding bits to the container, and write to the metadata the offset of the value in the block. Since we are using blocks of 128 numbers, this is ensured to fit inside a byte.

The last step that we do for the block is to call: packblockupsimd(), this ends up calling to simdpack() and packs all the numbers from the block to bestb bits in the output.

It’s really important to understand that we still have two data items that haven’t been written. The first is the metadata for the encoding process (bits in blocks, exceptions offsets, etc). The second is the set of exceptions themselves.

This process repeats itself for each block, until the end of the buffer. It is at this point that we need to write the remaining data to the output. Here is what the code looks like:

What is going on? This is dense, and it takes a while to unpack (pun intended) what is going on here.

First, we write the size of the metadata and then we copy the metadata that we have to the output. That is the description of all the blocks that we just went over. Then, we run over the set of exceptions that we gathered. Remember, the datatobepacked is an array that holds lists of exception data for each bit range that we have to store. We iterate over all the bit widths where we wrote exception data and generate a bitmap. That will help us later during the decoding process.

Finally, we run over the same exceptions and write them out. Note that we do that using the same simd bit-packing approach as the rest of the code.

The end result is that we have the following format:

image

For decoding, we start from the metadata offset and jump to that. Then we read the exceptions bitmap. That tells us how many exceptions we have and how many bits we have in each one of them. In the image above, you can see that we have 3 exceptions list, for 4 bits, 12 bits and 19 bits.

We use the simd bit-packing to decode those values into memory. We’ll need them shortly.  Now, we start iterating over the metadata, each block has an overhead of minimum two metadata bytes (number of bits and number of exceptions). Let’s assume that we don’t have any exceptions in the block. In that case, the process of decoding is simple. Just do the unpacking from bits to numbers and you are done.

If we have exceptions, on the other hand, we have to deal with them. At that point, the next metadata byte would contain the maximum number of bits for the exceptions in this block. Using that and the normal number of bits, we can tell where the extra bits are located. Here is the relevant code:

Note that there is a special case here if the difference in the number of bits between the block bits and the maximum number of bits is one. In that case, we don’t need to store the data, we already know what the value is then (it’s one, after all). So we can compute it once and set it in the output.

For scenarios where we can’t tell from the bit width along, we read the relevant exception array based on the difference between the bits in the block and the exceptions bits. That gives us an array that is shared across blocks. The idea is that the metadata contains the offset in the block, and we read from the relevant array one item at a time. This two-step process will result in us setting the right value and getting ready for the next call, which may be done in a different block.

Note that the exception bits buffers only care about the number of bits, not where they come from. So two blocks, which have (bestb: 4, maxb: 9) and (bestb: 7: maxb: 10) will both go to the 3 bits container.

Okay, this blog post is getting long enough, so I think that would be it. Hopefully, it will make it easier to understand exactly how the FastPFor format is working. I’m going to be talking more about FastPFor and the format implications in the next post, then move on to actually using that in C#.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}