﻿<?xml version="1.0" encoding="utf-8"?><rss version="2.0"><channel><title>Ayende @ Rahien</title><link>http://ayende.com</link><description>Ayende @ Rahien</description><copyright>Copyright (C) Ayende Rahien  2004 - 2021 (c) 2026</copyright><ttl>60</ttl><item><title>Challenge: Giving file system developer ulcer</title><description>&lt;p style="text-align:left;"&gt;I&amp;rsquo;m trying to reason about the behavior of this code, and I can&amp;rsquo;t decide if this is a stroke of genius or if I&amp;rsquo;m suffering from a stroke. Take a look at the code, and then I&amp;rsquo;ll discuss what I&amp;rsquo;m trying to do below:&lt;/p&gt;&lt;p style="text-align:left;"&gt;&lt;hr/&gt;&lt;pre class='line-numbers language-bash'&gt;&lt;code class='line-numbers language-bash'&gt;HANDLE hFile &lt;span class="token operator"&gt;=&lt;/span&gt; CreateFileA&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token string"&gt;"R:/original_file.bin"&lt;/span&gt;, 
GENERIC_READ | GENERIC_WRITE,
FILE_SHARE_READ | FILE_SHARE_WRITE,
NULL,
OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL,
NULL);
if (hFile == INVALID_HANDLE_VALUE) {
    printf("Error creating file: %d\n", GetLastError());
    exit(__LINE__);
}
HANDLE hMapFile = CreateFileMapping(hFile, NULL,
    PAGE_READWRITE, 0, 0, NULL);
if (hMapFile == NULL) {
    fprintf(stderr, "Could not create file mapping object: %x\n", GetLastError());
    exit(__LINE__);
}

char* lpMapAddress = MapViewOfFile(hMapFile, FILE_MAP_WRITE, 0, 0, 0);
if (lpMapAddress == NULL) {
    fprintf(stderr, "Could not map view of file: %x\n", GetLastError());
    exit(__LINE__);
}

for (size_t i = 2 * MB; i &lt; 4 * MB; i++)
{
    lpMapAddress[i]++;
}

HANDLE hDirect = CreateFileA("R:/original_file.bin",
    GENERIC_READ | GENERIC_WRITE,
    FILE_SHARE_READ | FILE_SHARE_WRITE,
    NULL,
    OPEN_ALWAYS,
    FILE_ATTRIBUTE_NORMAL,
    NULL);

SetFilePointerEx(hDirect, (LARGE_INTEGER){ 6 * MB }, &amp;fileSize, FILE_BEGIN);
for (size_t i = 6; i &lt; 10; i++) {
    if (!WriteFile(hDirect, lpMapAddress + i * MB, MB, &amp;bytesWritten, NULL)) {
        fprintf(stderr, "WriteFile direct failed on iteration %zu: %x\n", i, GetLastError());
        exit(__LINE__);
    }
&lt;span class="token punctuation"&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;hr/&gt;&lt;/p&gt;&lt;p style="text-align:left;"&gt;The idea is pretty simple, I’m opening the same file &lt;em&gt;twice&lt;/em&gt;. Once in buffered mode and mapping that memory for both reads &amp;amp; writes. The problem is that to flush the data to disk, I have to either wait for the OS, or call FlushViewOfFile() and FlushFileBuffers() to actually flush it to disk explicitly.&lt;/p&gt;&lt;p style="text-align:left;"&gt;The problem with this approach is that FlushFileBuffers() has &lt;span style="text-decoration:underline;"&gt;&lt;a style="color:inherit;" href="https://github.com/dotnet/runtime/issues/111201"&gt;undesirable side effects&lt;/a&gt;&lt;/span&gt;. So I’m opening the file &lt;em&gt;again&lt;/em&gt;, this time for &lt;em&gt;unbuffered&lt;/em&gt; I/O. I’m writing to the &lt;em&gt;memory&lt;/em&gt; map and then using the &lt;em&gt;same&lt;/em&gt; mapping to write to the file itself. On Windows, that goes through a separate path (and may lose coherence with the memory map). &lt;/p&gt;&lt;p style="text-align:left;"&gt;The idea here is that since I’m writing from the same location, I &lt;em&gt;can’t&lt;/em&gt; lose coherence. I either get the value from the file or from the memory map, and they are both the same. At least, that is what I &lt;em&gt;hope&lt;/em&gt; will happen.&lt;/p&gt;&lt;p style="text-align:left;"&gt;For the purpose of discussion, I can ensure that there is no one else writing to this file while I’m abusing the system in this manner. What do you &lt;em&gt;think&lt;/em&gt; Windows will do in this case?&lt;/p&gt;&lt;p style="text-align:left;"&gt;I believe that when I’m writing using unbuffered I/O in this manner, I’m forcing the OS to drop the mapping and refresh from the disk. 
That is likely the reason why it may lose coherence, because there may be already reads that aren’t served from main memory, or something like that.&lt;/p&gt;&lt;p style="text-align:left;"&gt;This isn’t an approach that I would actually take for production usage, but it is a &lt;em&gt;damn&lt;/em&gt; interesting thing to speculate on. If you have any idea what will actually happen, I would love to have your input.&lt;/p&gt;&lt;/p&gt;
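&lt;p style="text-align:left;"&gt;For contrast, the conventional explicit-flush path that this experiment is trying to avoid can be sketched in Python, where mmap.flush() plays the role of FlushViewOfFile() and os.fsync() plays the role of FlushFileBuffers(). The file name and sizes below are illustrative only, not taken from the program above:&lt;/p&gt;

```python
import mmap
import os
import tempfile

# Create a small zero-filled file to stand in for "original_file.bin".
path = os.path.join(tempfile.mkdtemp(), "original_file.bin")
with open(path, "wb") as f:
    f.truncate(1024 * 1024)  # 1 MiB

fd = os.open(path, os.O_RDWR)
try:
    view = mmap.mmap(fd, 0)  # map the whole file for read/write
    for i in range(2048, 4096):
        view[i] = (view[i] + 1) % 256  # mutate through the mapping
    view.flush()   # roughly FlushViewOfFile: push dirty pages to the file
    os.fsync(fd)   # roughly FlushFileBuffers: force them to stable storage
    view.close()
finally:
    os.close(fd)

with open(path, "rb") as f:
    data = f.read()
print(data[2048], data[4096])  # prints: 1 0
```

&lt;p style="text-align:left;"&gt;The code above sidesteps this two-call dance entirely by pushing the bytes down through a second, unbuffered handle instead.&lt;/p&gt;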
&lt;link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/9000.0.1/themes/prism.min.css" integrity="sha512-/mZ1FHPkg6EKcxo0fKXF51ak6Cr2ocgDi5ytaTBjsQZIH/RNs6GF6+oId/vPe3eJB836T36nXwVh/WBl/cWT4w==" crossorigin="anonymous" referrerpolicy="no-referrer" /&gt;
</description><link>http://ayende.com/201989-a/challenge-giving-file-system-developer-ulcer?Key=43f24790-760e-4bd6-bd27-e458228ddada</link><guid>http://ayende.com/201989-a/challenge-giving-file-system-developer-ulcer?Key=43f24790-760e-4bd6-bd27-e458228ddada</guid><pubDate>Mon, 03 Feb 2025 12:00:00 GMT</pubDate></item><item><title>Challenge: Efficient snapshotable state</title><description>&lt;p style="text-align:left;"&gt;At the heart of RavenDB, there is a data structure that we call the Page Translation Table. It is one of &lt;em&gt;the&lt;/em&gt;&amp;nbsp;most important pieces inside RavenDB.&lt;/p&gt;&lt;p style="text-align:left;"&gt;The page translation table is basically a Dictionary&amp;lt;long, Page&amp;gt;, mapping between a page number and the actual page. The critical aspect of this data structure is that it is both concurrent and multi-version. That is, at any single point there may be multiple versions of the table, each representing its state at a different point in time.&lt;/p&gt;&lt;p style="text-align:left;"&gt;The way it works, a transaction in RavenDB generates a page translation table as part of its execution and publishes the table on commit. However, each subsequent table builds upon the previous one, so things become more complex. Here is a usage example (in Python pseudo-code):&lt;/p&gt;&lt;p style="text-align:left;"&gt;&lt;hr/&gt;&lt;pre class='line-numbers language-bash'&gt;&lt;code class='line-numbers language-bash'&gt;table &lt;span class="token operator"&gt;=&lt;/span&gt; &lt;span class="token punctuation"&gt;{&lt;/span&gt;&lt;span class="token punctuation"&gt;}&lt;/span&gt;


with wtx1 &lt;span class="token operator"&gt;=&lt;/span&gt; write_tx&lt;span class="token punctuation"&gt;(&lt;/span&gt;table&lt;span class="token punctuation"&gt;)&lt;/span&gt;:
  wtx1.put&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token number"&gt;2&lt;/span&gt;, &lt;span class="token string"&gt;'v1'&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt;
  wtx1.put&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token number"&gt;3&lt;/span&gt;, &lt;span class="token string"&gt;'v1'&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt;
  wtx1.publish&lt;span class="token punctuation"&gt;(&lt;/span&gt;table&lt;span class="token punctuation"&gt;)&lt;/span&gt;


&lt;span class="token comment"&gt;# table has (2 =&gt; v1, 3 =&gt; v1)&lt;/span&gt;


with wtx2 &lt;span class="token operator"&gt;=&lt;/span&gt; write_tx&lt;span class="token punctuation"&gt;(&lt;/span&gt;table&lt;span class="token punctuation"&gt;)&lt;/span&gt;:
  wtx2.put&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token number"&gt;2&lt;/span&gt;, &lt;span class="token string"&gt;'v2'&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt;
  wtx2.put&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token number"&gt;4&lt;/span&gt;, &lt;span class="token string"&gt;'v2'&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt;
  wtx2.publish&lt;span class="token punctuation"&gt;(&lt;/span&gt;table&lt;span class="token punctuation"&gt;)&lt;/span&gt;


&lt;span class="token comment"&gt;# table has (2 =&gt; v2, 3 =&gt; v1, 4 =&gt; v2)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;hr/&gt;&lt;/p&gt;&lt;p style="text-align:left;"&gt;This is pretty easy to follow, I think. The table is a simple hash table at this point in time.&lt;/p&gt;&lt;p style="text-align:left;"&gt;The catch is when we mix read transactions as well, like so:&lt;/p&gt;&lt;p style="text-align:left;"&gt;&lt;hr/&gt;&lt;pre class='line-numbers language-bash'&gt;&lt;code class='line-numbers language-bash'&gt;&lt;span class="token comment"&gt;# table has (2 =&gt; v2, 3 =&gt; v1, 4 =&gt; v2)&lt;/span&gt;


with rtx1 &lt;span class="token operator"&gt;=&lt;/span&gt; read_tx&lt;span class="token punctuation"&gt;(&lt;/span&gt;table&lt;span class="token punctuation"&gt;)&lt;/span&gt;:


        with wtx3 &lt;span class="token operator"&gt;=&lt;/span&gt; write_tx&lt;span class="token punctuation"&gt;(&lt;/span&gt;table&lt;span class="token punctuation"&gt;)&lt;/span&gt;:
                wtx3.put&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token number"&gt;2&lt;/span&gt;, &lt;span class="token string"&gt;'v3'&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt;
                wtx3.put&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token number"&gt;3&lt;/span&gt;, &lt;span class="token string"&gt;'v3'&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt;
                wtx3.put&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token number"&gt;5&lt;/span&gt;, &lt;span class="token string"&gt;'v3'&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt;


                with rtx2 &lt;span class="token operator"&gt;=&lt;/span&gt; read_tx&lt;span class="token punctuation"&gt;(&lt;/span&gt;table&lt;span class="token punctuation"&gt;)&lt;/span&gt;:
                        rtx2.read&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token number"&gt;2&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt; &lt;span class="token comment"&gt;# =&gt; gives, v2&lt;/span&gt;
                        rtx2.read&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token number"&gt;3&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt; &lt;span class="token comment"&gt;# =&gt; gives, v1&lt;/span&gt;
                        rtx2.read&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token number"&gt;5&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt; &lt;span class="token comment"&gt;# =&gt; gives, None&lt;/span&gt;


                wtx3.publish&lt;span class="token punctuation"&gt;(&lt;/span&gt;table&lt;span class="token punctuation"&gt;)&lt;/span&gt;


&lt;span class="token comment"&gt;# table has (2 =&gt; v3, 3 =&gt; v3, 4 =&gt; v2, 5 =&gt; v3)&lt;/span&gt;
&lt;span class="token comment"&gt;# but rtx2 still observe the value as they were when&lt;/span&gt;
&lt;span class="token comment"&gt;# rtx2 was created&lt;/span&gt;


        rtx2.read&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token number"&gt;2&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt; &lt;span class="token comment"&gt;# =&gt; gives, v2&lt;/span&gt;
        rtx2.read&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token number"&gt;3&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt; &lt;span class="token comment"&gt;# =&gt; gives, v1&lt;/span&gt;
        rtx2.read&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token number"&gt;5&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt; &lt;span class="token comment"&gt;# =&gt; gives, None&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;hr/&gt;&lt;/p&gt;&lt;p style="text-align:left;"&gt;In other words, until we publish a transaction, its changes don&amp;rsquo;t take effect. And any &lt;em&gt;read&lt;/em&gt;&amp;nbsp;transaction that was already started isn&amp;rsquo;t impacted. We also need this to be concurrent, so we can use the table from multiple threads (a single write transaction at a time, but potentially &lt;em&gt;many&lt;/em&gt;&amp;nbsp;read transactions). Each transaction may modify hundreds or thousands of pages, and we&amp;rsquo;ll only clear old values from the table once in a while (so growth isn&amp;rsquo;t &lt;em&gt;infinite&lt;/em&gt;, but the table may certainly reach a respectable number of items).&lt;/p&gt;&lt;p style="text-align:left;"&gt;The implementation we have inside RavenDB for this is &lt;em&gt;complex&lt;/em&gt;! I tried drawing it on the whiteboard to explain what was going on, and I needed both the third and fourth dimensions to illustrate the concept.&lt;/p&gt;&lt;p style="text-align:left;"&gt;Given these requirements, how would you implement this sort of data structure?&lt;/p&gt;
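&lt;p style="text-align:left;"&gt;To make the challenge concrete, here is one deliberately naive sketch in Python, matching the pseudo-code above: every published version is an immutable snapshot dictionary, a writer copies the current snapshot, and publishing is an atomic reference swap. This is &lt;em&gt;not&lt;/em&gt; RavenDB’s actual implementation, and the class and method names are made up for illustration:&lt;/p&gt;

```python
import threading

class PageTable:
    """Copy-on-write snapshots: readers share immutable dicts by reference."""

    def __init__(self):
        self._current = {}             # latest published snapshot
        self._lock = threading.Lock()  # enforces a single writer at a time

    def read_tx(self):
        # A read transaction is just a reference to one snapshot; later
        # publishes swap _current and never mutate the dict it points at.
        return self._current

    def write_tx(self):
        return _WriteTx(self)

class _WriteTx:
    def __init__(self, table):
        self._table = table
        table._lock.acquire()                 # held until publish()
        self._pending = dict(table._current)  # private, mutable copy

    def put(self, page_num, page):
        self._pending[page_num] = page

    def publish(self):
        self._table._current = self._pending  # atomic reference swap
        self._table._lock.release()

# Usage, mirroring the pseudo-code from the post:
table = PageTable()
wtx1 = table.write_tx()
wtx1.put(2, 'v1'); wtx1.put(3, 'v1')
wtx1.publish()

rtx = table.read_tx()   # snapshot taken before wtx2 publishes
wtx2 = table.write_tx()
wtx2.put(2, 'v2'); wtx2.put(4, 'v2')
wtx2.publish()

print(rtx.get(2), table.read_tx().get(2))  # prints: v1 v2
```

&lt;p style="text-align:left;"&gt;The obvious cost is the full copy on every write transaction. A serious version would replace the copied dict with a persistent map (layered deltas or a hash trie) so that a publish shares structure with its predecessor instead of duplicating it, which is where the complexity starts to creep in.&lt;/p&gt;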
&lt;link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/9000.0.1/themes/prism.min.css" integrity="sha512-/mZ1FHPkg6EKcxo0fKXF51ak6Cr2ocgDi5ytaTBjsQZIH/RNs6GF6+oId/vPe3eJB836T36nXwVh/WBl/cWT4w==" crossorigin="anonymous" referrerpolicy="no-referrer" /&gt;</description><link>http://ayende.com/201281-b/challenge-efficient-snapshotable-state?Key=352251b0-97cd-4fbd-a04a-3e274ffdc2f0</link><guid>http://ayende.com/201281-b/challenge-efficient-snapshotable-state?Key=352251b0-97cd-4fbd-a04a-3e274ffdc2f0</guid><pubDate>Mon, 01 Jul 2024 12:00:00 GMT</pubDate></item><item><title>Challenge: Spot the bug</title><description>&lt;p&gt;The following bug cost me a bunch of time, can you see what I&amp;rsquo;m doing wrong?&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/cd3d6a7efae64b9725a3a9083625d4c1.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;What makes it so nasty is that it will usually work, by accident.&lt;/p&gt;
</description><link>http://ayende.com/200161-c/challenge-spot-the-bug?Key=277eeb3d-ea55-4969-a179-95bd86ebece9</link><guid>http://ayende.com/200161-c/challenge-spot-the-bug?Key=277eeb3d-ea55-4969-a179-95bd86ebece9</guid><pubDate>Tue, 19 Sep 2023 12:00:00 GMT</pubDate></item><item><title>Filtering negative numbers, fast: Beating memcpy()</title><description>&lt;p&gt;In the &lt;a href="https://ayende.com/blog/200102-C/filtering-negative-numbers-fast-avx"&gt;previous post&lt;/a&gt;, I was able to utilize AVX to get some nice speedups. In general, I was able to save up to 57%(!) of the runtime in processing arrays of 1M items. That is really amazing, if you think about it. But my best effort only gave me a 4% improvement when using 32M items.&lt;/p&gt;
&lt;p&gt;I decided to investigate what is going on in more depth, and I came up with the following benchmark. Given that I want to filter negative numbers, what would happen if the only negative number in the array was the first one?&lt;/p&gt;
&lt;p&gt;In other words, let&amp;rsquo;s see what happens when we could write this algorithm as the following line of code:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;array[1..].CopyTo(array);&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The idea here is that we should measure the speed of raw memory copy and see how that compares to our code.&lt;/p&gt;
&lt;p&gt;Before we dive into the results, I want to make a few things explicit. We are dealing here with arrays of long: when I&amp;rsquo;m talking about an array with 1M items, I&amp;rsquo;m actually talking about an 8MB buffer, and for the 32M items, we are talking about 256MB of memory.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m running these benchmarks on the following machine:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; AMD Ryzen 9 5950X 16-Core Processor&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Base speed:&amp;nbsp;&amp;nbsp;&amp;nbsp; 3.40 GHz&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; L1 cache:&amp;nbsp;&amp;nbsp;&amp;nbsp; 1.0 MB&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; L2 cache:&amp;nbsp;&amp;nbsp;&amp;nbsp; 8.0 MB&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; L3 cache:&amp;nbsp;&amp;nbsp;&amp;nbsp; 64.0 MB&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Utilization&amp;nbsp;&amp;nbsp;&amp;nbsp; 9%&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Speed&amp;nbsp;&amp;nbsp;&amp;nbsp; 4.59 GHz&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In other words, the 1M items (8MB) can fit into L2 (barely), and are certainly backed by the L3 cache. For the 32M items (256MB), we are far beyond what can fit in the cache, so we are probably dealing with memory bandwidth issues.&lt;/p&gt;
&lt;p&gt;I wrote the following functions to test it out:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/ab0deb8f900f3005f439802a10f367cb.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;Let&amp;rsquo;s look at what I&amp;rsquo;m actually testing here.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;CopyTo() &amp;ndash; using the span&amp;rsquo;s native copying, which is the most ergonomic way to do things, I think.&lt;/li&gt;
&lt;li&gt;MemoryCopy() &amp;ndash; uses a built-in unsafe API in the framework. That eventually boils down to a &lt;a href="https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Buffer.cs#L134"&gt;heavily optimized routine&lt;/a&gt;, which&amp;hellip; calls Memmove() if the buffers overlap (as they do in this case).&lt;/li&gt;
&lt;li&gt;MoveMemory() &amp;ndash; uses P/Invoke to call the Windows API to move the memory for us.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here are the results for the 1M case (8MB):&lt;/p&gt;
&lt;table class="table table-striped table-bordered"&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;N&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Mean&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Error&lt;/th&gt;
&lt;th style="text-align: right;"&gt;StdDev&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;441.4 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.78 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.58 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Avx&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;141.1 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2.70 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2.65 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CopyTo&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;872.8 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;11.27 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;10.54 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.98&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MemoryCopy&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;869.7 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;7.29 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;6.46 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.97&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MoveMemory&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;126.9 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.28 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.25 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.29&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;We can see some real surprises here. The baseline is FilterCmp, the very basic implementation that I wrote.&lt;/p&gt;
&lt;p&gt;I cannot explain why CopyTo() and MemoryCopy() are so slow.&lt;/p&gt;
&lt;p&gt;What is really impressive is that FilterCmp_Avx() and MoveMemory() are so close in performance, and &lt;em&gt;so much faster&lt;/em&gt;. To put it another way, we are already within shouting distance of the MoveMemory() performance. That is&amp;hellip; really impressive.&lt;/p&gt;
&lt;p&gt;That said, what happens with 32M items (256MB)?&lt;/p&gt;
&lt;table class="table table-striped table-bordered"&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;N&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Mean&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Error&lt;/th&gt;
&lt;th style="text-align: right;"&gt;StdDev&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;22,763.6 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;157.23 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;147.07 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Avx&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;20,122.3 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;214.10 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;200.27 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.88&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CopyTo&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;27,660.1 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;91.41 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;76.33 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MemoryCopy&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;27,618.4 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;136.16 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;127.36 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.21&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MoveMemory&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;20,152.0 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;166.66 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;155.89 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.89&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Now we are &lt;em&gt;faster&lt;/em&gt; in the FilterCmp_Avx than MoveMemory. That is&amp;hellip; a pretty big wow, and a really nice close for this blog post series, right? Except that we won&amp;rsquo;t be stopping here.&lt;/p&gt;
&lt;p&gt;The way I set up the task, we are actually filtering out just the first item, and then we are basically copying memory. Let&amp;rsquo;s do some math: 256MB in 20.1ms means 12.4 GB/sec!&lt;/p&gt;
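&lt;p&gt;That figure is easy to verify: 32M longs at 8 bytes each, moved in the 20.1ms MoveMemory mean from the table above:&lt;/p&gt;

```python
# Bandwidth check for the numbers quoted above.
bytes_moved = 32 * 1024 * 1024 * 8  # 32M longs, 8 bytes each = 256 MiB
seconds = 20.1e-3                   # measured MoveMemory mean
gib_per_sec = bytes_moved / seconds / (1024 ** 3)
print(round(gib_per_sec, 1))  # prints: 12.4
```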
&lt;p&gt;On this system, I have the following memory setup:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; 64.0 GB&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Speed:&amp;nbsp;&amp;nbsp;&amp;nbsp; 2133 MHz&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Slots used:&amp;nbsp;&amp;nbsp;&amp;nbsp; 4 of 4&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Form factor:&amp;nbsp;&amp;nbsp;&amp;nbsp; DIMM&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Hardware reserved:&amp;nbsp;&amp;nbsp;&amp;nbsp; 55.2 MB&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I&amp;rsquo;m using DDR4 memory, so &lt;a href="https://www.softwareok.com/?seite=faq-This-and-That-or-Other&amp;amp;faq=74"&gt;I can expect a maximum speed of 17GB/sec.&lt;/a&gt; In theory, I might be able to get more if the memory is located on different DIMMs, but for the size in question, that is not likely.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m going to skip the training montage of VTune, understanding memory architecture and figuring out what is actually going on.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s drop everything and look at what we have with just AVX vs. MoveMemory:&lt;/p&gt;
&lt;table class="table table-striped table-bordered"&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;N&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Mean&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Error&lt;/th&gt;
&lt;th style="text-align: right;"&gt;StdDev&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Median&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Avx&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;141.6 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2.28 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2.02 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;141.8 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MoveMemory&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;126.8 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.25 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.19 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;126.8 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Avx&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;21,105.5 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;408.65 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;963.25 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;20,770.4 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.08&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MoveMemory&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;20,142.5 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;245.08 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;204.66 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;20,108.2 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The new baseline is MoveMemory, and in this run, we can see that the AVX code is slightly &lt;em&gt;slower&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s sadly not uncommon to see numbers shift by those ranges when we are testing such micro-optimizations, mostly because we are subject to so many variables that can affect performance. In this case, I dropped all the other benchmarks, which may have changed things.&lt;/p&gt;
&lt;p&gt;At any rate, using those numbers, we have 12.4 GB/sec for MoveMemory() and 11.8 GB/sec for the AVX version. The &lt;em&gt;hardware&lt;/em&gt; maximum speed is 17 GB/sec. So we are quite close to what can be done.&lt;/p&gt;
&lt;p&gt;For that matter, I would like to point out that the trivial code completed the task at 11 GB/sec, which is &lt;em&gt;quite&lt;/em&gt; respectable and shows that the issue here is literally getting the memory to the CPU fast enough.&lt;/p&gt;
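&lt;p&gt;For reference, here is how those throughput figures fall out of the table above (a quick Python sketch; the buffer size is the benchmark&amp;rsquo;s element count times 8 bytes, and GB here means GiB):&lt;/p&gt;

```python
# Convert a benchmark mean (in microseconds) over N int64 elements into GB/sec.
def gb_per_sec(n_elements, mean_us):
    gigabytes = n_elements * 8 / 2 ** 30      # 8 bytes per int64
    seconds = mean_us / 1_000_000
    return gigabytes / seconds

print(round(gb_per_sec(33554455, 20142.5), 1))   # MoveMemory, 256MB run: 12.4
print(round(gb_per_sec(33554455, 21105.5), 1))   # FilterCmp_Avx, 256MB run: 11.8
```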
&lt;p&gt;Can we do something about that? I made a &lt;a href="https://gist.github.com/ayende/ad5ae9c3a4519b116db920f2b9ea10cd"&gt;pretty small change&lt;/a&gt; to the AVX version, like so:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/bdb8612f1c05dd32810a4164cd04f6a5.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;What are we actually doing here? Instead of loading a value and immediately using it, we load the &lt;em&gt;next&lt;/em&gt; value first; each iteration then processes the current value while the load of the next one is already in flight. The idea is to parallelize load and compute at the instruction level.&lt;/p&gt;
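&lt;p&gt;The shape of that rewrite, reduced to scalar Python (a sketch of the loop structure only; the real code does this with 4-wide AVX vectors, and the function name is mine):&lt;/p&gt;

```python
# Software-pipelining shape: load the NEXT item before processing the
# CURRENT one, so the load can overlap the compute.
def filter_pipelined(items):
    out = []
    if not items:
        return out
    cur = items[0]                 # pre-load the first value
    for i in range(1, len(items)):
        nxt = items[i]             # issue the next load early...
        if cur >= 0:               # ...then process the current value
            out.append(cur)
        cur = nxt
    if cur >= 0:                   # don't forget the last pre-loaded value
        out.append(cur)
    return out

print(filter_pipelined([1, -2, 3, -4]))   # [1, 3]
```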
&lt;p&gt;Sadly, that didn&amp;rsquo;t seem to do the trick. I saw a 19% additional cost for that version compared to the vanilla AVX one on the 8MB run and a 2% additional cost on the 256MB run.&lt;/p&gt;
&lt;p&gt;I then realized that there was one really important test that I had to also make, and wrote the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/61cda608199a25705d0139adbcaeb1f1.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;In other words, let's test the speed of moving memory and filling memory as fast as we possibly can. Here are the results:&lt;/p&gt;
&lt;table class="table table-striped table-bordered"&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;N&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Mean&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Error&lt;/th&gt;
&lt;th style="text-align: right;"&gt;StdDev&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Ratio&lt;/th&gt;
&lt;th style="text-align: right;"&gt;RatioSD&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Code Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MoveMemory&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;126.8 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.36 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.33 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.25&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;270 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FillMemory&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;513.5 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;10.05 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;10.32 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;351 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MoveMemory&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;20,022.5 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;395.35 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;500.00 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.26&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.02&lt;/td&gt;
&lt;td style="text-align: right;"&gt;270 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FillMemory&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;15,822.4 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;19.85 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;17.60 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;351 B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This is really interesting: for a small buffer (8MB), MoveMemory is somehow faster than FillMemory. I don&amp;rsquo;t have a way to explain that, but it has been a pretty consistent result in my tests.&lt;/p&gt;
&lt;p&gt;For the large buffer (256MB), we are seeing results that make more sense to me.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;MoveMemory &amp;ndash; 12.5 GB/sec&lt;/li&gt;
&lt;li&gt;FillMemory &amp;ndash; 15.8 GB/sec&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;MoveMemory both reads and writes, so it pays for memory bandwidth in both directions. FillMemory only writes, so it can get better performance (no reads needed).&lt;/p&gt;
&lt;p&gt;In other words, we are talking about hitting the real physical limits of what the hardware can do. There are all sorts of tricks one can pull, but this close to the limit, they are almost always context-sensitive and dependent on many factors.&lt;/p&gt;
&lt;p&gt;To conclude, here are my final results:&lt;/p&gt;
&lt;table class="table table-striped table-bordered"&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;N&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Mean&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Error&lt;/th&gt;
&lt;th style="text-align: right;"&gt;StdDev&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Ratio&lt;/th&gt;
&lt;th style="text-align: right;"&gt;RatioSD&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Code Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Avx&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;307.9 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;6.15 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;12.84 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.99&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.05&lt;/td&gt;
&lt;td style="text-align: right;"&gt;270 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Avx_Next&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;308.4 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;6.07 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;9.26 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.99&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.03&lt;/td&gt;
&lt;td style="text-align: right;"&gt;270 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CopyTo&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1,043.7 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;15.96 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;14.93 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;3.37&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.11&lt;/td&gt;
&lt;td style="text-align: right;"&gt;452 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ArrayCopy&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1,046.7 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;15.92 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;14.89 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;3.38&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.14&lt;/td&gt;
&lt;td style="text-align: right;"&gt;266 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UnsafeCopy&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;309.5 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;6.15 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;8.83 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.04&lt;/td&gt;
&lt;td style="text-align: right;"&gt;133 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MoveMemory&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;310.8 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;6.17 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;9.43 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;270 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Avx&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;24,013.1 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;451.09 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;443.03 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.98&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.02&lt;/td&gt;
&lt;td style="text-align: right;"&gt;270 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Avx_Next&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;24,437.8 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;179.88 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;168.26 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.01&lt;/td&gt;
&lt;td style="text-align: right;"&gt;270 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CopyTo&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;32,931.6 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;416.57 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;389.66 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.35&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.02&lt;/td&gt;
&lt;td style="text-align: right;"&gt;452 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ArrayCopy&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;32,538.0 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;463.00 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;433.09 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.33&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.02&lt;/td&gt;
&lt;td style="text-align: right;"&gt;266 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UnsafeCopy&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;24,386.9 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;209.98 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;196.42 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.01&lt;/td&gt;
&lt;td style="text-align: right;"&gt;133 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MoveMemory&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;24,427.8 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;293.75 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;274.78 us&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;270 B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;As you can see, the plain AVX version is comparable to, or slightly beats, the MoveMemory function.&lt;/p&gt;
&lt;p&gt;I tried things like prefetching memory (the next item, the next cache line, and the next page), using non-temporal loads and stores, and many other things, but this is a pretty tough challenge.&lt;/p&gt;
&lt;p&gt;What is really interesting is that the original, very simple and obvious implementation clocked in at 11 GB/sec. After pulling out pretty much all the stops and tricks, I was able to hit 12.5 GB/sec.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t think anyone could read, update, or understand the resulting code without going through deep meditation first. Still, that is &lt;em&gt;not&lt;/em&gt; a bad result at all, and along the way I learned quite a bit about how the lowest levels of the machine architecture work.&lt;/p&gt;
</description><link>http://ayende.com/200129-a/filtering-negative-numbers-fast-beating-memcpy?Key=150a4bae-c691-4f65-a2b6-cbabae9f8832</link><guid>http://ayende.com/200129-a/filtering-negative-numbers-fast-beating-memcpy?Key=150a4bae-c691-4f65-a2b6-cbabae9f8832</guid><pubDate>Fri, 15 Sep 2023 12:00:00 GMT</pubDate></item><item><title>Filtering negative numbers, fast: AVX</title><description>&lt;p&gt;&lt;a href="https://ayende.com/blog/200101-C/filtering-negative-numbers-fast-unroll?key=617ce0ba58e7446092686f584c90c4cd"&gt;In the previous post&lt;/a&gt; I discussed how we can optimize the filtering of negative numbers by unrolling the loop, looked into branchless code and in general was able to improve performance by up to 15% from the initial version we started with. We pushed as much as we could on what can be done using scalar code. Now it is the time to open a whole new world and see what we can do when we implement this challenge using vector instructions.&lt;/p&gt;
&lt;p&gt;The key problem with such tasks is that SIMD, AVX and their friends were designed by&amp;hellip; an interesting process using a perspective that makes sense if you can see in a couple of additional dimensions. I assume that at least some of that is implementation constraints, but the key issue is that when you start using SIMD, you realize that you don&amp;rsquo;t have general-purpose instructions. Instead, you have a lot of dedicated instructions that are doing one thing, hopefully well, and it is your role to compose them into something that would make sense. Oftentimes, you need to turn the solution on its head in order to successfully solve it using SIMD. The benefit, of course, is that you can get quite an amazing boost in speed when you do this.&lt;/p&gt;
&lt;p&gt;The algorithm we use is basically to scan the list of entries and copy to the start of the list only those items that are positive. How can we do that using SIMD? The whole point here is that we want to be able to operate on multiple data, but this particular task isn&amp;rsquo;t trivial. I&amp;rsquo;m going to show the code first, then discuss what it does in detail:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/3982fadce72c64a99f5c3a5a1ea8d2c2.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
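&lt;p&gt;Before digging in, it helps to keep the scalar version of the task in mind. This is a Python sketch of the contract the vectorized code has to honor (my own model, not the benchmarked C#):&lt;/p&gt;

```python
# Compact the non-negative values to the front of the list, in order,
# and return how many there are; anything past that point is leftover garbage.
def filter_negatives_in_place(items):
    out = 0
    for v in items:
        if v >= 0:
            items[out] = v
            out += 1
    return out

data = [1, -2, 3, -4]
n = filter_negatives_in_place(data)
print(n, data[:n])   # 2 [1, 3]
```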
&lt;p&gt;We start with the usual check (if you&amp;rsquo;ll recall, that ensures that the JIT knows to elide some range checks), then we load the PermuteTable. For now, just assume that this is magic (and it is). The first interesting thing happens when we start iterating over the loop. Unlike before, we now do that in chunks of 4 int64 elements at a time. Inside the loop, we start by loading a vector of int64 and then we do the first odd thing. We call &lt;em&gt;ExtractMostSignificantBits&lt;/em&gt;(), since the sign bit marks whether a number is negative or not. That means that I can use a single instruction to get an integer with a bit set for each of the negative numbers. That is particularly juicy for what we need, since there is no need for comparisons, etc.&lt;/p&gt;
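&lt;p&gt;A scalar model of what that call gives us (plain Python standing in for the .NET intrinsic; the function name is mine): bit i of the result mirrors the sign bit of element i.&lt;/p&gt;

```python
# Model of ExtractMostSignificantBits() over a 4 x int64 vector: the result
# has bit i set exactly when element i is negative (its sign bit is set).
def most_significant_bits(values):
    mask = 0
    for i, v in enumerate(values):
        if 0 > v:                  # element i is negative
            mask += 2 ** i
    return mask

print(most_significant_bits([1, -2, 3, -4]))   # elements 1 and 3 negative: 10
```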
&lt;p&gt;If the mask we got is all zeroes, it means that all the numbers we loaded to the vector are positives, so we can write them as-is to the output and move to the next part. Things get interesting when that isn&amp;rsquo;t the case.&lt;/p&gt;
&lt;p&gt;We load a permute value using some shenanigans (we&amp;rsquo;ll touch on that shortly) and call the &lt;em&gt;PermuteVar8x32&lt;/em&gt;() method. The idea here is that we pack all the non-negative numbers to the start of the vector, then we write the vector to the output. The key here is that when we do that, we increment the output index only by the number of valid values.&amp;nbsp; The rest of this method just handles the remainder that does not fit into a vector.&lt;/p&gt;
&lt;p&gt;The hard part in this implementation was to figure out how to handle the scenario where we loaded some negative numbers. We need a way to filter them, after all. But there is no SIMD instruction that allows us to do so. Luckily, we have the &lt;a href="https://learn.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.avx2.permutevar8x32?view=net-7.0"&gt;Avx2.PermuteVar8x32&lt;/a&gt;() method to help here. To confuse things, we don&amp;rsquo;t actually want to deal with 8x32 values. We want to deal with 4x64 values. There &lt;em&gt;is&lt;/em&gt; &lt;a href="https://learn.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.avx2.permute4x64?view=net-7.0#system-runtime-intrinsics-x86-avx2-permute4x64(system-runtime-intrinsics-vector256((system-int64))-system-byte)"&gt;Avx2.Permute4x64&lt;/a&gt;() method, and it will work quite nicely, with a single caveat. This method assumes that you are going to pass it a constant value. We don&amp;rsquo;t have such a constant, we need to be able to provide that based on whatever the masked bits will give us.&lt;/p&gt;
&lt;p&gt;So how do we deal with this issue of filtering with SIMD? We need to move all the values we care about to the front of the vector. We have a method to do that, &lt;em&gt;PermuteVar8x32&lt;/em&gt;(); we just need to figure out how to actually make use of it. &lt;em&gt;PermuteVar8x32&lt;/em&gt;() accepts an input vector as well as a vector describing the permutation you want to make. In this case, we base the permutation on the top (sign) bits of the 4 int64 elements, so there are a total of 16 options available to us. We have to deal with 32-bit values rather than 64-bit ones, but that isn&amp;rsquo;t much of a problem.&lt;/p&gt;
&lt;p&gt;Here is the permutation table that we&amp;rsquo;ll be using:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/9b1ace0341874c50403342346928299c.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;What you can see here is that when we have a 1 in the bits (shown in comments) we&amp;rsquo;ll not copy that element to the vector. Let&amp;rsquo;s take a look at the entry for &lt;em&gt;0101&lt;/em&gt;, which may be caused by values such as [-1,2,-3,4].&lt;/p&gt;
&lt;p&gt;When we look at the right entry at index #5 in the table: 2,3,6,7,0,0,0,0&lt;/p&gt;
&lt;p&gt;What does this mean? It means that we want to take the 2nd int64 element in the source vector (32-bit words 2 and 3) and move it to the first element of the destination vector, take the 4th element from the source (words 6 and 7) as the second element in the destination, and discard the rest (marked as 0,0,0,0 in the table).&lt;/p&gt;
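&lt;p&gt;The table doesn&amp;rsquo;t have to be written by hand; it follows mechanically from that rule. Here is a Python sketch (the helper name is mine) that rebuilds all 16 entries:&lt;/p&gt;

```python
# For each 4-bit mask (bit i set means int64 element i is negative), keep the
# two 32-bit words of every non-negative element, packed to the front, and pad
# the discarded slots with 0.
def permute_entry(mask):
    words = []
    for i in range(4):
        if (mask >> i) % 2 == 0:              # element i is non-negative: keep it
            words.extend([2 * i, 2 * i + 1])  # its two 32-bit word indices
    return words + [0] * (8 - len(words))

table = [permute_entry(m) for m in range(16)]
print(table[5])   # mask 0101, keep elements 1 and 3: [2, 3, 6, 7, 0, 0, 0, 0]
```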
&lt;p&gt;This is a bit hard to follow because we have to compose the value out of the individual 32-bit words, but it works quite well. Or, at least, it would work, but not as efficiently, because we would need to load the PermuteTableInts into a variable and access it. There is a better way: we can ask the JIT to embed the value directly. The catch is that the pattern the JIT recognizes is limited to ReadOnlySpan&amp;lt;byte&amp;gt;, which means that the already non-trivial int32 table got turned into this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/e45b5ddce0d41879912a7dbed1958345.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is the exact same data as before, but using ReadOnlySpan&amp;lt;byte&amp;gt; means that the JIT can package that inside the data section and treat it as a constant value.&lt;/p&gt;
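&lt;p&gt;The conversion between the two forms is mechanical: each int32 becomes four little-endian bytes (a Python sketch of the layout; little-endian is what x86 uses, and the helper name is mine):&lt;/p&gt;

```python
# Flatten one int32 permutation entry into the byte layout that the
# ReadOnlySpan-of-bytes version of the table carries.
def entry_bytes(entry):
    return b"".join(v.to_bytes(4, "little") for v in entry)

flat = entry_bytes([2, 3, 6, 7, 0, 0, 0, 0])
print(len(flat), list(flat[:8]))   # 32 bytes; first two ints: 2, 3
```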
&lt;p&gt;The code was heavily optimized, to the point where I noticed &lt;a href="https://github.com/dotnet/runtime/issues/91824"&gt;a JIT bug&lt;/a&gt; where these two versions of the code give different assembly output:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/ea4aefe4e0c956692558ad2a4c880d05.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here is what we get out:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/5fe1d8ae986dcfacf758bbaa301d1b27.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;This looks like an unintended consequence of Roslyn and the JIT each doing their (separate) jobs, but not reaching the end goal together. Constant folding is done mostly by Roslyn, and it scans from the left, so it wouldn&amp;rsquo;t convert $A * 4 * 8 to $A * 32: it stops evaluating constants as soon as it finds a variable. When we add parentheses, we isolate the constant expression, and now it understands that it can fold it.&lt;/p&gt;
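&lt;p&gt;The same left-to-right behavior is easy to observe elsewhere. CPython&amp;rsquo;s constant folder, for instance (just an analogy, not the .NET toolchain), folds 4 * 8 only when parentheses isolate it:&lt;/p&gt;

```python
# Without parentheses the expression parses as ((x * 4) * 8): neither
# multiplication has two constant operands, so nothing gets folded.
unfolded = compile("x * 4 * 8", "expr", "eval")
# With parentheses, 4 * 8 is a constant subexpression and folds to 32.
folded = compile("x * (4 * 8)", "expr", "eval")

print(32 in unfolded.co_consts, 32 in folded.co_consts)   # False True
```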
&lt;p&gt;Speaking of assembly, here is the annotated assembly version of the code:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/7e144aacf9385ee7ec4ebd7d1428a5ac.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;And after all of this work, where are we standing?&lt;/p&gt;
&lt;table class="table table-striped table-bordered"&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;N&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Mean&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Error&lt;/th&gt;
&lt;th style="text-align: right;"&gt;StdDev&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Ratio&lt;/th&gt;
&lt;th style="text-align: right;"&gt;RatioSD&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Code Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td style="text-align: right;"&gt;285.7 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;3.84 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;3.59 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;411 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_NoRangeCheck&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td style="text-align: right;"&gt;272.6 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;3.98 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;3.53 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.95&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.01&lt;/td&gt;
&lt;td style="text-align: right;"&gt;397 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll_8&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td style="text-align: right;"&gt;261.4 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.27 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.18 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.91&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.01&lt;/td&gt;
&lt;td style="text-align: right;"&gt;672 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Avx&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td style="text-align: right;"&gt;261.6 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.37 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.28 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.92&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.01&lt;/td&gt;
&lt;td style="text-align: right;"&gt;521 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;1047&lt;/td&gt;
&lt;td style="text-align: right;"&gt;758.7 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.51 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.42 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;411 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_NoRangeCheck&lt;/td&gt;
&lt;td&gt;1047&lt;/td&gt;
&lt;td style="text-align: right;"&gt;756.8 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.83 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.53 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;397 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll_8&lt;/td&gt;
&lt;td&gt;1047&lt;/td&gt;
&lt;td style="text-align: right;"&gt;640.4 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.94 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.82 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.84&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;672 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Avx&lt;/td&gt;
&lt;td&gt;1047&lt;/td&gt;
&lt;td style="text-align: right;"&gt;426.0 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.62 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.52 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.56&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;521 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;502,681.4 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;3,732.37 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;3,491.26 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;411 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_NoRangeCheck&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;499,472.7 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;6,082.44 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;5,689.52 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.99&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.01&lt;/td&gt;
&lt;td style="text-align: right;"&gt;397 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll_8&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;425,800.3 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;352.45 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;312.44 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.85&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.01&lt;/td&gt;
&lt;td style="text-align: right;"&gt;672 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Avx&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;218,075.1 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;212.40 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;188.29 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.43&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;521 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;29,820,978.8 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;73,461.68 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;61,343.83 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;411 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_NoRangeCheck&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;29,471,229.2 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;73,805.56 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;69,037.77 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.99&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;397 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll_8&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;29,234,413.8 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;67,597.45 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;63,230.70 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.98&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;672 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Avx&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;28,498,115.4 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;71,661.94 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;67,032.62 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.96&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;521 B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;So it seems that the idea of using SIMD instructions has a lot of merit. Moving from the original code to the final version, we see that we can complete the same task in up to half the time.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m not quite sure why we aren&amp;rsquo;t seeing the same sort of gain on the 32M case, but I suspect that this is because we far exceed the CPU cache and have to fetch everything from main memory, so that is as fast as it &lt;em&gt;can&lt;/em&gt; go.&lt;/p&gt;
&lt;p&gt;If you are interested in learning more, &lt;a href="https://lemire.me/blog/2022/06/23/filtering-numbers-quickly-with-sve-on-amazon-graviton-3-processors/"&gt;Lemire solves the same problem in SVE (SIMD for ARM)&lt;/a&gt; and &lt;a href="https://quickwit.io/blog/simd-range"&gt;Paul has a similar approach in Rust&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you can think of further optimizations, I would love to hear your ideas.&lt;/p&gt;
</description><link>http://ayende.com/200102-c/filtering-negative-numbers-fast-avx?Key=ecacbd27-fe9d-4a7c-8824-411229bbec91</link><guid>http://ayende.com/200102-c/filtering-negative-numbers-fast-avx?Key=ecacbd27-fe9d-4a7c-8824-411229bbec91</guid><pubDate>Wed, 13 Sep 2023 12:00:00 GMT</pubDate></item><item><title>Filtering negative numbers, fast: Unroll</title><description>&lt;p&gt;&lt;a href="https://ayende.com/blog/200100-C/filtering-negative-numbers-fast-scalar?key=457d45e9014e4352ac3dab2aa2bd4e04"&gt;In the previous post,&lt;/a&gt; we looked into what it would take to reduce the cost of filtering negative numbers. We got into the assembly and analyzed exactly what was going on. In terms of this directly, I don&amp;rsquo;t think that even hand-optimized assembly would take us further. Let&amp;rsquo;s see if there are other options that are available for us to get better speed.&lt;/p&gt;
&lt;p&gt;The first thing that comes to mind here is loop unrolling. After all, we have a very tight loop; if we can do more work per iteration, we might get better performance, no? Here is my first version:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/ce56c4517491bc8c08961c9d147213f1.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
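&lt;p&gt;(For readers who can&amp;rsquo;t see the embedded gist: the original code is C#, but the shape of a 4-wide unrolled filter can be sketched in C roughly like this. The function name and the exact tail handling are my own illustration, not the code from the gist.)&lt;/p&gt;

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch (not the blog's C# code): compact the
   non-negative values to the front of the array, processing
   four elements per loop iteration. */
size_t filter_cmp_unroll(int64_t *items, size_t len)
{
    size_t out = 0, i = 0;
    /* Main loop: four items per iteration. */
    for (; i + 4 <= len; i += 4) {
        if (items[i + 0] >= 0) items[out++] = items[i + 0];
        if (items[i + 1] >= 0) items[out++] = items[i + 1];
        if (items[i + 2] >= 0) items[out++] = items[i + 2];
        if (items[i + 3] >= 0) items[out++] = items[i + 3];
    }
    /* Tail: the remaining 0..3 items. */
    for (; i < len; i++) {
        if (items[i] >= 0) items[out++] = items[i];
    }
    return out;
}
```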
&lt;p&gt;And here are the benchmark results:&lt;/p&gt;
&lt;table class="table table-striped table-bordered"&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;N&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Mean&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Error&lt;/th&gt;
&lt;th style="text-align: right;"&gt;StdDev&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Ratio&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Code Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td style="text-align: right;"&gt;274.6 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.40 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.35 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;411 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td style="text-align: right;"&gt;257.5 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.94 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.83 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.94&lt;/td&gt;
&lt;td style="text-align: right;"&gt;606 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;1047&lt;/td&gt;
&lt;td style="text-align: right;"&gt;748.1 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2.91 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2.58 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;411 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll&lt;/td&gt;
&lt;td&gt;1047&lt;/td&gt;
&lt;td style="text-align: right;"&gt;702.5 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;5.23 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;4.89 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.94&lt;/td&gt;
&lt;td style="text-align: right;"&gt;606 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;501,545.2 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;4,985.42 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;4,419.45 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;411 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;446,311.1 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;3,131.42 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2,929.14 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.89&lt;/td&gt;
&lt;td style="text-align: right;"&gt;606 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;29,637,052.2 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;184,796.17 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;163,817.00 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;411 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;29,275,060.6 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;145,756.53 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;121,713.31 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.99&lt;/td&gt;
&lt;td style="text-align: right;"&gt;606 B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That is quite a jump; 6% &amp;ndash; 11% savings is no joke. Let&amp;rsquo;s look at what is actually going on at the assembly level and see if we can optimize this further.&lt;/p&gt;
&lt;p&gt;As expected, the code size is bigger, 264 bytes versus the 55 we previously got. But more importantly, we got the range checks back, and a lot of them:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/d2265156014143ece0ac3901a430428e.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;The JIT isn&amp;rsquo;t able to reason about our first for loop and see that all our accesses are within bounds, which leads to a lot of range checks and likely slows us down. Even with that, we are still showing significant improvements here.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s see what we can do with this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/48719ff7e902113847be504a56473a06.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;With that, we expect to have no range checks and still be able to benefit from the unrolling.&lt;/p&gt;
&lt;table class="table table-striped table-bordered"&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;N&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Mean&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Error&lt;/th&gt;
&lt;th style="text-align: right;"&gt;StdDev&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Ratio&lt;/th&gt;
&lt;th style="text-align: right;"&gt;RatioSD&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Code Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td style="text-align: right;"&gt;275.4 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2.31 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2.05 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;411 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td style="text-align: right;"&gt;253.6 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2.59 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2.42 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.92&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.01&lt;/td&gt;
&lt;td style="text-align: right;"&gt;563 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;1047&lt;/td&gt;
&lt;td style="text-align: right;"&gt;741.6 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;5.95 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;5.28 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;411 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll&lt;/td&gt;
&lt;td&gt;1047&lt;/td&gt;
&lt;td style="text-align: right;"&gt;665.5 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2.38 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2.22 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.90&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.01&lt;/td&gt;
&lt;td style="text-align: right;"&gt;563 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;497,624.9 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;3,904.39 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;3,652.17 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;411 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;444,489.0 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2,524.45 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2,361.38 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.89&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.01&lt;/td&gt;
&lt;td style="text-align: right;"&gt;563 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;29,781,164.3 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;361,625.63 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;320,571.70 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;411 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;29,954,093.9 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;588,614.32 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;916,401.59 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.01&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.04&lt;/td&gt;
&lt;td style="text-align: right;"&gt;563 B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That helped, by quite a lot it seems, for most cases. The 32M items case, however, was slightly slower, which is quite a surprise.&lt;/p&gt;
&lt;p&gt;Looking at the assembly, I can see that we still have branches, like so:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/5ffa2b6ad85b5039c9145996843e1608.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;And here is why this is the case:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/77b7f29167e7369f80454ee31a26def9.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;Now, can we do better here? It turns out that we can, by using a branchless version of the operation. Here is another way to write the same thing:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/f2547983bed2c383991d0cb468c53f65.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;What happens here is that we unconditionally set the value in the array, but only increment the output index if the value is greater than or equal to zero. That saves us the branches and will likely result in less code. In fact, let&amp;rsquo;s see what sort of assembly the JIT will output:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/205e96a945978d829b6abb936760edb1.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
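&lt;p&gt;(A sketch of the branchless idea in C, since the actual code lives in a C# gist; the function name is mine. Every iteration stores the value unconditionally, and the output index advances by the result of the comparison, so the loop body contains no data-dependent jump.)&lt;/p&gt;

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative branchless variant: always write the value at the
   output position (a stale write is simply overwritten later), and
   advance the output index only when the value is non-negative.
   The comparison yields 0 or 1, so there is no conditional jump. */
size_t filter_cmp_branchless(int64_t *items, size_t len)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        int64_t v = items[i];
        items[out] = v;       /* unconditional store */
        out += (v >= 0);      /* 0 or 1, no branch */
    }
    return out;
}
```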
&lt;p&gt;What about the performance? I decided to pit the two versions (normal and branchless) head to head and see what this will give us:&lt;/p&gt;
&lt;table class="table table-striped table-bordered"&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;N&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Mean&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Error&lt;/th&gt;
&lt;th style="text-align: right;"&gt;StdDev&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Ratio&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Code Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td style="text-align: right;"&gt;276.3 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;4.13 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;3.86 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;411 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll_Branchleses&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td style="text-align: right;"&gt;263.6 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.95 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.84 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.96&lt;/td&gt;
&lt;td style="text-align: right;"&gt;547 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll&lt;/td&gt;
&lt;td&gt;1047&lt;/td&gt;
&lt;td style="text-align: right;"&gt;743.7 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;9.41 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;8.80 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;411 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll_Branchleses&lt;/td&gt;
&lt;td&gt;1047&lt;/td&gt;
&lt;td style="text-align: right;"&gt;733.3 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;3.54 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;3.31 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.99&lt;/td&gt;
&lt;td style="text-align: right;"&gt;547 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;502,631.1 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;3,641.47 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;3,406.23 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;411 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll_Branchleses&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;495,590.9 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;335.33 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;297.26 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.99&lt;/td&gt;
&lt;td style="text-align: right;"&gt;547 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;29,356,331.7 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;207,133.86 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;172,966.15 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;411 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll_Branchleses&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;29,709,835.1 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;86,129.58 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;71,922.10 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.01&lt;/td&gt;
&lt;td style="text-align: right;"&gt;547 B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Surprisingly enough, it looks like the branchless version is very slightly &lt;em&gt;slower&lt;/em&gt;. I would have expected reducing the branches to be more efficient.&lt;/p&gt;
&lt;p&gt;Looking at the assembly of those two, the branchless version is slightly bigger (10 bytes, not that meaningful). I think that the key here is that there is a 0.5% chance of actually hitting the branch, which is pretty low. That means that the branch predictor can likely do a really good job and we aren&amp;rsquo;t going to see any big benefits from the branchless version.&lt;/p&gt;
&lt;p&gt;That said&amp;hellip; what would happen if we tested that with 5% negatives? That difference in behavior may cause us to see a different result. I tried that, and the results were quite surprising. In the case of the 1K and 32M items, we see a slight &lt;em&gt;cost&lt;/em&gt; for the branchless version (an additional 1% &amp;ndash; 4%), while for the 1M entries there is an 18% &lt;em&gt;reduction&lt;/em&gt; in latency for the branchless version.&lt;/p&gt;
&lt;p&gt;I ran the tests again with a 15% chance of negatives, to see what would happen. In that case, we get:&lt;/p&gt;
&lt;table class="table table-striped table-bordered"&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;N&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Mean&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Error&lt;/th&gt;
&lt;th style="text-align: right;"&gt;StdDev&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Ratio&lt;/th&gt;
&lt;th style="text-align: right;"&gt;RatioSD&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Code Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td style="text-align: right;"&gt;273.5 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;3.66 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;3.42 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;537 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll_Branchleses&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td style="text-align: right;"&gt;280.2 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;4.85 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;4.30 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.03&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.02&lt;/td&gt;
&lt;td style="text-align: right;"&gt;547 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll&lt;/td&gt;
&lt;td&gt;1047&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1,675.7 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;29.55 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;27.64 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;537 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll_Branchleses&lt;/td&gt;
&lt;td&gt;1047&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1,676.3 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;16.97 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;14.17 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.02&lt;/td&gt;
&lt;td style="text-align: right;"&gt;547 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2,206,354.4 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;6,141.19 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;5,444.01 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;537 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll_Branchleses&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1,688,677.3 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;11,584.00 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;10,835.68 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.77&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.01&lt;/td&gt;
&lt;td style="text-align: right;"&gt;547 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;205,320,736.1 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2,757,108.01 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2,152,568.58 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;537 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll_Branchleses&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;199,520,169.4 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2,097,285.87 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1,637,422.86 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.97&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.01&lt;/td&gt;
&lt;td style="text-align: right;"&gt;547 B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;As you can see, we have basically the same cost under 15% negatives for small values, a &lt;em&gt;big&lt;/em&gt; improvement on the 1M scenario and not much improvement on the 32M scenario.&lt;/p&gt;
&lt;p&gt;All in all, that is very interesting information. Digging into the exact why and how of that means pulling out a CPU instruction profiler and starting to look at where we have stalls, which is a bit further than I want to invest in this scenario.&lt;/p&gt;
&lt;p&gt;What if we try to rearrange the code a little bit? The code currently looks like this (load the value and call AddToOutput() immediately):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 0));&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;What if we split it a little bit, so the code will look like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/eda4fa6716cdc2dafebc912b23fa56f7.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;The idea here is that we are trying to get the JIT / CPU to fetch the items before they are actually needed, so there would be more time for the memory to arrive.&lt;/p&gt;
&lt;p&gt;Remember that for the 1M scenario, we are dealing with 8MB of memory, and for the 32M scenario, we have 256MB. Looking at the loop prolog, we can see that it is indeed first fetching all the items from memory, then doing the work:&lt;/p&gt;
&lt;script src="https://gist.github.com/ayende/0260946c003df0ad02ae18cdde2b80dc.js"&gt;&lt;/script&gt;
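&lt;p&gt;(The load-first layout, sketched in C under the same caveats as before &amp;ndash; the blog&amp;rsquo;s actual code is C# and this function name is my own. The point is that all four loads are issued before any compare or store, giving the memory requests time to overlap.)&lt;/p&gt;

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: issue all four loads up front so the memory
   accesses can proceed in parallel, then do the branchless
   compare-and-store work afterwards. Since out <= i always holds,
   the early writes never clobber values we have yet to load. */
size_t filter_cmp_unroll_split(int64_t *items, size_t len)
{
    size_t out = 0, i = 0;
    for (; i + 4 <= len; i += 4) {
        /* loads first... */
        int64_t a = items[i + 0];
        int64_t b = items[i + 1];
        int64_t c = items[i + 2];
        int64_t d = items[i + 3];
        /* ...then the filtering work */
        items[out] = a; out += (a >= 0);
        items[out] = b; out += (b >= 0);
        items[out] = c; out += (c >= 0);
        items[out] = d; out += (d >= 0);
    }
    for (; i < len; i++) {
        int64_t v = items[i];
        items[out] = v; out += (v >= 0);
    }
    return out;
}
```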
&lt;p&gt;In terms of performance, that gives us a small win (1% &amp;ndash; 2% range) for the 1M and 32M entries scenario.&lt;/p&gt;
&lt;p&gt;The last thing that I wanted to test is unrolling the loop even further: what would happen if we processed 8 items per iteration, instead of 4?&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/3416319e177d00eec4f97bfeab6052c8.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;There is &lt;em&gt;some&lt;/em&gt; improvement (4% in the 1K scenario, 1% in the 32M scenario), but also a slowdown (2% in the 1M scenario).&lt;/p&gt;
&lt;p&gt;I think that this is probably roughly the end of the line as far as we can get for scalar code.&lt;/p&gt;
&lt;p&gt;We have already made quite a few strides in parallelizing the work the CPU is doing simply by laying out the code the way we would like it to be. We also tried to control the manner in which it touches memory; in general, those are pretty advanced techniques.&lt;/p&gt;
&lt;p&gt;To close this post, I would like to take a look at the gains we got. I&amp;rsquo;m comparing the first version of the code, the last version from the previous post, and the unrolled versions (both branchy and branchless) with 8 operations at once and memory prefetching.&lt;/p&gt;
&lt;table class="table table-striped table-bordered"&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;N&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Mean&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Error&lt;/th&gt;
&lt;th style="text-align: right;"&gt;StdDev&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Ratio&lt;/th&gt;
&lt;th style="text-align: right;"&gt;RatioSD&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Code Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td style="text-align: right;"&gt;277.3 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.69 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.64 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;411 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_NoRangeCheck&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td style="text-align: right;"&gt;270.7 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.42 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.38 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.98&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;397 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll_8&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td style="text-align: right;"&gt;257.6 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.45 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.21 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.93&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;672 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll_8_Branchless&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td style="text-align: right;"&gt;259.9 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.96 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.84 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.94&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.01&lt;/td&gt;
&lt;td style="text-align: right;"&gt;682 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;1047&lt;/td&gt;
&lt;td style="text-align: right;"&gt;754.3 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.38 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.22 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;411 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_NoRangeCheck&lt;/td&gt;
&lt;td&gt;1047&lt;/td&gt;
&lt;td style="text-align: right;"&gt;749.0 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.81 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.69 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.99&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;397 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll_8&lt;/td&gt;
&lt;td&gt;1047&lt;/td&gt;
&lt;td style="text-align: right;"&gt;647.2 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2.23 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2.09 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.86&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;672 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll_8_Branchless&lt;/td&gt;
&lt;td&gt;1047&lt;/td&gt;
&lt;td style="text-align: right;"&gt;721.2 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.23 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.09 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.96&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;682 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;499,675.6 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2,639.97 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2,469.43 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;411 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_NoRangeCheck&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;494,388.4 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;600.46 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;532.29 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.99&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.01&lt;/td&gt;
&lt;td style="text-align: right;"&gt;397 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll_8&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;426,940.7 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1,858.57 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1,551.99 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.85&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.01&lt;/td&gt;
&lt;td style="text-align: right;"&gt;672 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll_8_Branchless&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;483,940.8 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;517.14 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;458.43 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.97&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;682 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;30,282,334.8 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;599,306.15 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;531,269.30 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;411 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_NoRangeCheck&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;29,410,612.5 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;29,583.56 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;24,703.61 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.97&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.02&lt;/td&gt;
&lt;td style="text-align: right;"&gt;397 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll_8&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;29,102,708.3 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;42,824.78 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;40,058.32 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.96&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.02&lt;/td&gt;
&lt;td style="text-align: right;"&gt;672 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_Unroll_8_Branchless&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;29,761,841.1 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;48,108.03 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;42,646.51 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.98&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.02&lt;/td&gt;
&lt;td style="text-align: right;"&gt;682 B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The unrolled 8 version is the winner by far, in this scenario (0.5% negatives). Since that is the scenario we have in the real code, that is what I&amp;rsquo;m focusing on.&lt;/p&gt;
&lt;p&gt;Is there anything left to do here?&lt;/p&gt;
&lt;p&gt;My next step is to explore whether&amp;nbsp;using vector instructions will be a good option for us.&lt;/p&gt;
</description><link>http://ayende.com/200101-c/filtering-negative-numbers-fast-unroll?Key=617ce0ba-58e7-4460-9268-6f584c90c4cd</link><guid>http://ayende.com/200101-c/filtering-negative-numbers-fast-unroll?Key=617ce0ba-58e7-4460-9268-6f584c90c4cd</guid><pubDate>Tue, 12 Sep 2023 12:00:00 GMT</pubDate></item><item><title>Filtering negative numbers, fast: Scalar</title><description>&lt;p&gt;While working deep on the guts of RavenDB, I found myself with a seemingly simple task. Given a list of longs, I need to filter out all negative numbers as quickly as possible.&lt;/p&gt;
&lt;p&gt;The actual scenario is that we run a speculative algorithm, given a potentially large list of items, we check if we can fulfill the request in an optimal fashion. However, if that isn&amp;rsquo;t possible, we need to switch to a slower code path that does more work.&lt;/p&gt;
&lt;p&gt;Conceptually, this looks something like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/9e0e2ceaa091b0039eb70bf4d79c4f70.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;That is the setup for this story. The problem we have now is that we now need to &lt;em&gt;filter&lt;/em&gt; the results we pass to the &lt;em&gt;RunManually&lt;/em&gt;() method.&lt;/p&gt;
&lt;p&gt;There is a problem here, however. We marked the entries that we already used in the list by negating them. The issue is that &lt;em&gt;RunManually&lt;/em&gt;() does not allow negative values, and its internal implementation is &lt;em&gt;not&lt;/em&gt; friendly to ignoring those values.&lt;/p&gt;
&lt;p&gt;In other words, given a Span&amp;lt;long&amp;gt;, I need to write the code that would filter out all the negative numbers. Everything else about the list of numbers should remain the same (the order of elements, etc).&lt;/p&gt;
&lt;p&gt;From a coding perspective, this is as simple as:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/f24e8a3b056f72f68b7ee9af265c21ed.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;Please note, just &lt;em&gt;looking&lt;/em&gt; at this code makes me cringe a lot. This does the work, but it has an absolutely horrible performance profile. It allocates multiple arrays, uses a lambda, etc.&lt;/p&gt;
&lt;p&gt;We don&amp;rsquo;t actually &lt;em&gt;care&lt;/em&gt; about the &lt;em&gt;entries &lt;/em&gt;here, so we are free to modify them without allocating a new value. As such, let&amp;rsquo;s see what kind of code we can write to do this work in an efficient manner. Here is what I came up with:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/b952dba7f28b538ec8ba4f0513d1bb70.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;The way this works is that we scan through the list, skipping the writes for the negative values, so we effectively &amp;ldquo;move down&amp;rdquo; all the non-negative values on top of the negative ones. This has a cost of O(N) and will modify the entire array; the final output is the number of valid items that remain.&lt;/p&gt;
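&lt;p&gt;To make the mechanics concrete outside of the gist, here is a minimal C sketch of the same in-place compaction (illustrative names, not the benchmarked C# code):&lt;/p&gt;

```c
#include <stddef.h>
#include <stdint.h>

/* Compact all non-negative values to the front of the array,
 * preserving their relative order. Returns the count of valid items. */
static size_t filter_cmp(int64_t *items, size_t len)
{
    size_t output = 0;
    for (size_t i = 0; i < len; i++) {
        int64_t v = items[i];
        if (v >= 0)
            items[output++] = v;  /* only advance the write cursor on keepers */
    }
    return output;
}
```

&lt;p&gt;The tail of the array past the returned count is left with stale values, which is fine here since the caller only reads the valid prefix.&lt;/p&gt;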
&lt;p&gt;In order to test the performance, I wrote the following harness:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/a0006f32dba0e5b26f15ac0021b14b94.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;We compare 1K, 1M and 32M element arrays, each of which has about 0.5% negatives, randomly spread across the range. Because we modify the values directly, we need to sprinkle the negatives across the array before each call. In this case, I&amp;rsquo;m testing two options for this task, one that uses a direct comparison (shown above) and one that uses bitwise or, like so:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/809308fbbf24975197d4133a6771bbfe.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;I&amp;rsquo;m testing the cost of sprinkling negatives as well, since that has to be done before each benchmark call (since we modify the array during the call, we need to &amp;ldquo;reset&amp;rdquo; its state for the next one).&lt;/p&gt;
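&lt;p&gt;The reset pass itself is simple; here is a rough C sketch of sprinkling negatives (a fixed stride stands in for the random spread the real harness uses, and the names are illustrative):&lt;/p&gt;

```c
#include <stddef.h>
#include <stdint.h>

/* Make roughly 0.5% of the values negative, spread across the array.
 * Assumes the input values are non-negative (i.e. a freshly filtered array). */
static void sprinkle_negatives(int64_t *items, size_t len)
{
    for (size_t i = 0; i < len; i += 200)  /* 1 in 200 = 0.5% */
        items[i] = -items[i] - 1;          /* maps 0 to -1, v to -(v+1) */
}
```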
&lt;p&gt;Given the two options, before we discuss the results, what would you expect to be the faster option? How would the size of the array matter here?&lt;/p&gt;
&lt;p&gt;I really like this example, because it is &lt;em&gt;simple, &lt;/em&gt;there isn&amp;rsquo;t any real complexity in what we are trying to do. And there is a very straightforward implementation that we can use as our baseline. That also means that I get to analyze what is going on at a very deep level. You might have noticed the disassembler attribute on the benchmark code, we are going to dive &lt;em&gt;deep&lt;/em&gt; into that. For the same reason, we aren&amp;rsquo;t using &lt;em&gt;exactly&lt;/em&gt; 1K, 1M, or 32M arrays, but slightly higher than that, so we&amp;rsquo;ll have to deal with remainders later on.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s first look at what the JIT actually did here. Here is the annotated assembly for the FilterCmp function:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/3c08fffcd2baa9e91ae844f3114f93b5.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;For the FilterOr, the code is pretty much the same, except that the key part is:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/45dfa8af2e19a5c9b2927dca660af698.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;As you can see, the cmp option is slightly smaller, in terms of code size. In terms of performance, we have:&lt;/p&gt;
&lt;table class="table table-striped table-bordered"&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;N&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Mean&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FilterOr&lt;/td&gt;
&lt;td&gt;1047&lt;/td&gt;
&lt;td style="text-align: right;"&gt;745.6 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;1047&lt;/td&gt;
&lt;td style="text-align: right;"&gt;745.8 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;mdash;&lt;/td&gt;
&lt;td&gt;&amp;ndash;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;ndash;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterOr&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;497,463.6 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;498,784.8 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;mdash;&lt;/td&gt;
&lt;td&gt;&amp;ndash;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;ndash;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterOr&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;31,427,660.7 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;30,024,102.9 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The costs are very close to one another, with Or being very slightly faster on low numbers, and Cmp being slightly faster on the larger sizes. Note that the difference between them is basically noise; they have effectively the same performance.&lt;/p&gt;
&lt;p&gt;The question is, can we do better here?&lt;/p&gt;
&lt;p&gt;Looking at the assembly, there is an extra range check in the main loop that the JIT couldn&amp;rsquo;t elide (the call to &lt;em&gt;items[output++]&lt;/em&gt;). Can &lt;em&gt;we&lt;/em&gt; do something about it, and would it make any difference in performance? Here is how I can remove the range check:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/451cf7bb3594ad7a860058fe51e18e9c.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here I&amp;rsquo;m telling the JIT: &amp;ldquo;I know what I&amp;rsquo;m doing&amp;rdquo;, and it shows.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s look at the assembly changes between those two methods, first the prolog:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/c397a944faddbaeb0be20a3d8b8587b5.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here you can see what is actually going on. Note the last 4 instructions: we have a range check for the items, and then another check for the loop. The first will get you an exception, the second will just skip the loop. In both cases, we test the exact same thing. The JIT had a chance to optimize that, but didn&amp;rsquo;t.&lt;/p&gt;
&lt;p&gt;Here is a funny scenario where &lt;em&gt;adding&lt;/em&gt; code may reduce the amount of code generated. Let&amp;rsquo;s do another version of this method:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/e15931fa4dbdc4258ea2676bb0fcc0e6.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;In this case, I added a check to handle the scenario of items being empty. What can the JIT do with this now? It turns out, quite a lot. We dropped 10 bytes from the method, which is a nice result of our diet.&amp;nbsp; Here is the annotated version of the assembly:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/6c710aceedc59a8160cabe081c25ede5.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;A lot of the space savings in this case come from just not having to do a range check, but you&amp;rsquo;ll note that we still do an extra check there (lines 12..13), even though we already checked that. I think that the JIT knows that the value is not zero at this point, but has to consider that the value may be negative.&lt;/p&gt;
&lt;p&gt;If we change the initial guard clause to: items.Length &amp;lt;= 0, what do you think will happen? At this point, the JIT is smart enough to just elide everything; we are at 55 bytes of code and it is a super clean assembly (not a sentence I ever thought I would use). I&amp;rsquo;ll spare you going through &lt;em&gt;more &lt;/em&gt;assembly listings, but you can find &lt;a href="https://gist.github.com/ayende/ef55effdec3a68e9e0c4ca5b1884bc9a"&gt;the output here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;And after all of that, where are we at?&lt;/p&gt;
&lt;table class="table table-striped table-bordered"&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;N&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Mean&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Error&lt;/th&gt;
&lt;th style="text-align: right;"&gt;StdDev&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Ratio&lt;/th&gt;
&lt;th style="text-align: right;"&gt;RatioSD&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Code Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td style="text-align: right;"&gt;274.5 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.91 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.70 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;411 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_NoRangeCheck&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td style="text-align: right;"&gt;269.7 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.33 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.24 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.98&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.01&lt;/td&gt;
&lt;td style="text-align: right;"&gt;397 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;1047&lt;/td&gt;
&lt;td style="text-align: right;"&gt;744.5 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;4.88 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;4.33 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;411 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_NoRangeCheck&lt;/td&gt;
&lt;td&gt;1047&lt;/td&gt;
&lt;td style="text-align: right;"&gt;745.8 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;3.44 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;3.22 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;397 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;502,608.6 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;3,890.38 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;3,639.06 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;411 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_NoRangeCheck&lt;/td&gt;
&lt;td&gt;1048599&lt;/td&gt;
&lt;td style="text-align: right;"&gt;490,669.1 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1,793.52 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1,589.91 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.98&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.01&lt;/td&gt;
&lt;td style="text-align: right;"&gt;397 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;30,495,286.6 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;602,907.86 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;717,718.92 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.00&lt;/td&gt;
&lt;td style="text-align: right;"&gt;411 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FilterCmp_NoRangeCheck&lt;/td&gt;
&lt;td&gt;33554455&lt;/td&gt;
&lt;td style="text-align: right;"&gt;29,952,221.2 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;442,176.37 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;391,977.84 ns&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.99&lt;/td&gt;
&lt;td style="text-align: right;"&gt;0.02&lt;/td&gt;
&lt;td style="text-align: right;"&gt;397 B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;There is a very slight benefit to the NoRangeCheck, but even when we talk about 32M items, we aren&amp;rsquo;t talking about a lot of time.&lt;/p&gt;
&lt;p&gt;The question is, what can we do better here?&lt;/p&gt;
</description><link>http://ayende.com/200100-c/filtering-negative-numbers-fast-scalar?Key=457d45e9-014e-4352-ac3d-ab2aa2bd4e04</link><guid>http://ayende.com/200100-c/filtering-negative-numbers-fast-scalar?Key=457d45e9-014e-4352-ac3d-ab2aa2bd4e04</guid><pubDate>Mon, 11 Sep 2023 12:00:00 GMT</pubDate></item><item><title>Not all O(1) operations are considered equal</title><description>&lt;p&gt;At some point in any performance optimization sprint, you are going to run into a super annoying problem: The dictionary.&lt;/p&gt;
&lt;p&gt;The reasoning is quite simple. One of the most powerful optimization techniques is to use a cache, which is usually implemented as a dictionary. Today&amp;rsquo;s tale is about a dictionary, but surprisingly enough, not about a cache.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s set up the background, I&amp;rsquo;m looking at optimizing a big indexing batch deep inside RavenDB, and here is my current focus:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://ayende.com/blog/Images/Open-Live-Writer/Not-all-O1-operations-are-considered-equ_9504/image_2.png"&gt;&lt;img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Not-all-O1-operations-are-considered-equ_9504/image_thumb.png" alt="image" width="1110" height="107" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can see that the &lt;em&gt;RecordTermsForEntries&lt;/em&gt; take 4% of the &lt;em&gt;overall&lt;/em&gt; indexing time. That is&amp;hellip; a lot, as you can imagine.&lt;/p&gt;
&lt;p&gt;What is more interesting here is &lt;em&gt;why&lt;/em&gt;. The simplified version of the code looks like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/3f9eccc935a86a391e7dcd88c876b104.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;Basically, we are registering, for each entry, all the terms that belong to it. This is complicated by the fact that we are doing the process in stages:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create the entries&lt;/li&gt;
&lt;li&gt;Process the terms for the entries&lt;/li&gt;
&lt;li&gt;Write the terms to persistent storage (giving them the recorded term id)&lt;/li&gt;
&lt;li&gt;Update the entries to record the &lt;em&gt;term ids&lt;/em&gt; that they belong to&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The part of the code that we are looking at now is the last one, where we already wrote the terms to persistent storage and we need to update the entries. This is needed so when we read them, we&amp;rsquo;ll be able to find the relevant terms.&lt;/p&gt;
&lt;p&gt;At any rate, you can see that this method cost is absolutely dominated by the dictionary call. In fact, we are actually using an optimized method here to avoid doing a TryGetValue() and then Add() in case the value is not already in the dictionary.&lt;/p&gt;
&lt;p&gt;If we actually look at the metrics, this is actually kind of awesome. We are calling the dictionary almost 400 &lt;em&gt;million&lt;/em&gt;&amp;nbsp;times and it is able to do the work in under 200 &lt;em&gt;nanoseconds&lt;/em&gt; per call.&lt;/p&gt;
&lt;p&gt;That is pretty awesome, but that still means that we have over 2% of our &lt;em&gt;total indexing time&lt;/em&gt; spent doing lookups. Can we do better?&lt;/p&gt;
&lt;p&gt;In this case, absolutely. Here is how this works, instead of doing a dictionary lookup, we are going to store a list. And the entry will record the &lt;em&gt;index&lt;/em&gt; of the item in the list. Here is what this looks like:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/258325003839af3bc2a0c8951ff070d1.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;There isn&amp;rsquo;t much to this process, I admit. I was lucky that in this case, we were able to reorder things in such a way that skipping the dictionary lookup is a viable method.&lt;/p&gt;
&lt;p&gt;In other cases, we would need to record the index at the creation of the entry (effectively reserving the position) and then use that later.&lt;/p&gt;
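&lt;p&gt;The shape of that change can be sketched in C: reserve an index when the entry is created, and the hot-path lookup becomes a plain array access instead of a hash-based one (all names here are illustrative, not RavenDB&amp;rsquo;s actual code):&lt;/p&gt;

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 1024
#define MAX_TERMS   8

/* Terms recorded per entry; the slot is reserved at entry creation. */
typedef struct {
    int64_t term_ids[MAX_TERMS];
    size_t  count;
} entry_terms;

static entry_terms terms_by_index[MAX_ENTRIES];
static size_t next_index = 0;

/* Reserve a slot up front: O(1), no hashing, no probing. */
static size_t create_entry(void)
{
    return next_index++;
}

/* The hot path: a direct array access replaces the dictionary lookup. */
static void record_term(size_t entry_index, int64_t term_id)
{
    entry_terms *e = &terms_by_index[entry_index];
    e->term_ids[e->count++] = term_id;
}
```

&lt;p&gt;Both versions are O(1), but the array access skips the hashing, bucket lookup, and key comparison that a dictionary has to do on every call.&lt;/p&gt;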
&lt;p&gt;And the result is&amp;hellip;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://ayende.com/blog/Images/Open-Live-Writer/Not-all-O1-operations-are-considered-equ_9504/image_4.png"&gt;&lt;img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Not-all-O1-operations-are-considered-equ_9504/image_thumb_1.png" alt="image" width="1088" height="97" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;That is &lt;em&gt;pretty&lt;/em&gt; good, even if I say so myself. The cost went down from 3.6 microseconds per call to 1.3 microseconds. That is almost a threefold improvement.&lt;/p&gt;
</description><link>http://ayende.com/200033-b/not-all-o-1-operations-are-considered-equal?Key=9b63efc3-8651-43b1-a301-6051e87361da</link><guid>http://ayende.com/200033-b/not-all-o-1-operations-are-considered-equal?Key=9b63efc3-8651-43b1-a301-6051e87361da</guid><pubDate>Wed, 30 Aug 2023 12:00:00 GMT</pubDate></item><item><title>A twisted tale of memory optimization</title><description>&lt;p&gt;I was looking into reducing the allocation in a particular part of our code, and I ran into what was basically the following code (boiled down to the essentials):&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/5212d11a369922d2b4408d08890e831c.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;As you can see, this does a &lt;em&gt;lot&lt;/em&gt; of allocations. The actual method in question was a pretty good size, and all those operations happened in different locations and weren&amp;rsquo;t as obvious.&lt;/p&gt;
&lt;p&gt;Take a moment to look at the code, how many allocations can you spot here?&lt;/p&gt;
&lt;p&gt;The first one, obviously, is the string allocation, but there is another one, inside the call to GetBytes(), let&amp;rsquo;s fix that first by allocating the buffer once (I&amp;rsquo;m leaving aside the allocation of the reusable buffer, you can assume it is big enough to cover all our needs):&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/c4e0decb2442767e9bce88da7c5be89e.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;For that matter, we can also easily fix the second problem, by avoiding the string allocation:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/30fa55843832bfff0e630557fa75247e.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;That is a few minutes of work, and we are good to go. This method is called &lt;em&gt;a lot&lt;/em&gt;, so we can expect a huge reduction in the amount of memory that we allocated.&lt;/p&gt;
&lt;p&gt;Except&amp;hellip; that didn&amp;rsquo;t happen. In fact, the amount of memory that we allocate remained pretty much the same. Digging into the details, we allocate roughly the same number of byte arrays (how!) and instead of allocating a lot of strings, we now allocate a lot of character arrays.&lt;/p&gt;
&lt;p&gt;I broke the code apart into multiple lines, which made things a lot clearer. (&lt;a href="https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA+ABATARgLABQGADAAQY4B0AKjAgC4DchJ5VAkgPLMEsDM5LKQDKYGADsAhlACWEUgG9CpFSPqxJAW1IB9AM7qYWnqtLAAnvRgBtALq7YAVz2TgAGxgAhRwDMfMKBNVDAEMFFIAdVkrAAowAAtpOzNffygAGlIZcXpSD3EASkVlU1US0tIAN2kzSxg9UgBeUgBRcUgAE2yAc0oAVWoAMQAOSgBxGHpPOr0Y4FSA60pKfNtMnScXdy8FqAKgiv1DLUoomViNmGdXD28/ReWLKz1bffLSAF9CD6A=="&gt;In fact, I threw that into SharpLab, to be honest&lt;/a&gt;). Take a look:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/942a494f64f10df43d3af3886b475189.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;This code: buffer[..len] is actually translated to:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;char[] charBuffer = RuntimeHelpers.GetSubArray(buffer, Range.EndAt(len));&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That will, of course, allocate. I had to change the code to be very explicit about the types that I wanted to use:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/12049752c22b3574fe5d432039d6f481.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;This will not allocate, but if you note the changes in the code, you can see that the use of &lt;em&gt;var&lt;/em&gt; in this case really tripped me up, because of the number of overloads and the automatic type coercions that &lt;em&gt;didn&amp;rsquo;t&lt;/em&gt; happen.&lt;/p&gt;
&lt;p&gt;For that matter, note that &lt;em&gt;any&lt;/em&gt; slicing on arrays will generate a &lt;em&gt;new&lt;/em&gt; array, including this code:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/339bacdd5907b08534ea76f399ff89d4.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;This makes perfect sense once you realize what is going on, and it can still be a big surprise. I looked at the code &lt;em&gt;a lot&lt;/em&gt; before I figured out what was happening, and that was with a profiler output that pinpointed the fault.&lt;/p&gt;
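&lt;p&gt;The underlying distinction is between a slice that is just a view and a slice that copies. In C terms (a hand-rolled span, purely illustrative), the two look like this:&lt;/p&gt;

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* A view: pointer + length, no allocation. Analogous to slicing a Span<T>. */
typedef struct { char *data; size_t len; } span;

static span slice_view(char *buf, size_t len)
{
    span s = { buf, len };  /* no copy, no allocation */
    return s;
}

/* A copy: allocates a fresh buffer. Analogous to range-slicing an array,
 * which goes through RuntimeHelpers.GetSubArray and allocates. */
static char *slice_copy(const char *buf, size_t len)
{
    char *out = malloc(len);
    if (out != NULL)
        memcpy(out, buf, len);
    return out;
}
```

&lt;p&gt;Same syntax-level idea, very different allocation behavior; that is exactly the trap the &lt;em&gt;var&lt;/em&gt;-driven overload resolution hid.&lt;/p&gt;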
</description><link>http://ayende.com/199969-a/a-twisted-tale-of-memory-optimization?Key=cb2fc684-6549-466c-9c4f-0bae8b5ef566</link><guid>http://ayende.com/199969-a/a-twisted-tale-of-memory-optimization?Key=cb2fc684-6549-466c-9c4f-0bae8b5ef566</guid><pubDate>Thu, 24 Aug 2023 12:00:00 GMT</pubDate></item><item><title>On Moq &amp; SponsorLink: Some thoughts</title><description>&lt;p&gt;Today I ran into &lt;a href="https://www.reddit.com/r/dotnet/comments/15ljdcc/does_moq_in_its_latest_version_extract_and_send/"&gt;this Reddit post&lt;/a&gt;, detailing how Moq is now using &lt;a href="https://www.cazzulino.com/sponsorlink.html"&gt;SponsorLink&lt;/a&gt; to encourage users to sponsor the project.&lt;/p&gt;
&lt;p&gt;The idea is that if you are using the project, you&amp;rsquo;ll sponsor it for some amount, which funds the project. You&amp;rsquo;ll also get something like this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://www.cazzulino.com/img/sponsorlink-thanks.png" alt="lots of thanks from ThisAssembly" /&gt;&lt;/p&gt;
&lt;p&gt;This has been rolled out for some projects for quite some time, it seems. But Moq is a far more popular project and it got quite &lt;a href="https://github.com/moq/moq/issues/1372"&gt;a bit of attention&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It is an interesting scenario, and I gave some thought to what this means.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I&amp;rsquo;m not a user of Moq, just to note.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I absolutely understand the desire to be paid for Open Source work. It takes a &lt;em&gt;lot&lt;/em&gt; of time and effort and looking at the amount of usage people get out of your code compared to the compensation is sometimes ridiculous.&lt;/p&gt;
&lt;p&gt;For myself, I can tell you that &lt;a href="https://ayende.com/blog/4176/ave-duci-novo-similis-duci-seneci"&gt;I made 800 USD&lt;/a&gt; out of Rhino.Mocks directly when it was one of the most popular mocking frameworks in the .NET world. That isn&amp;rsquo;t &lt;em&gt;a &lt;/em&gt;sale, that is the total amount of compensation that I got for it directly.&lt;/p&gt;
&lt;p&gt;I literally cannot total the number of hours that I spent on it. &lt;a href="https://openhub.net/p/rhino_mocks"&gt;But OpenHub estimates it as 245 man-years&lt;/a&gt;. I&amp;hellip; disagree with that estimate, but I certainly put a &lt;em&gt;lot&lt;/em&gt; of time there.&lt;/p&gt;
&lt;p&gt;From a commercial perspective, I think that this direction is a mistake. Primarily because of the economies of software purchases. You can read about the implementation of &lt;a href="https://www.cazzulino.com/sponsorlink.html"&gt;SponsorLink here&lt;/a&gt;. The model basically says that it will check whether the &lt;em&gt;individual user &lt;/em&gt;has sponsored the project.&lt;/p&gt;
&lt;p&gt;That is&amp;hellip; not really how it works. Let&amp;rsquo;s say that a new developer starts working on an existing project that uses a SponsorLink-enabled library. What happens then? Is that new developer now expected to sponsor the project?&lt;/p&gt;
&lt;p&gt;If this is a commercial project, I certainly support the notion that there should be some payment. But it should not be on the individual developer, it should be on the &lt;em&gt;company&lt;/em&gt; that pays for the project.&lt;/p&gt;
&lt;p&gt;That leaves aside all the scenarios where this is being used for an open source project, etc. Let&amp;rsquo;s ignore those for now.&lt;/p&gt;
&lt;p&gt;The problem is that this isn&amp;rsquo;t how you actually get paid for software. If you are targeting commercial usage, you should be targeting companies, not individual users. More to the point, let&amp;rsquo;s say that a developer&amp;nbsp;&lt;em&gt;wants&lt;/em&gt; to pay, and their company will compensate them for that.&lt;/p&gt;
&lt;p&gt;The process for actually doing that is &lt;em&gt;atrocious&lt;/em&gt; beyond belief. There are tax implications (if they sponsor with $5 / month and their employer gives them a $5 raise, that raise would be taxed, for example), so you need to submit receipts for expenses, etc.&lt;/p&gt;
&lt;p&gt;A far better model would be to have a way to get the company to pay for that, maybe on a per-project basis. Then you can detect if the project is sponsored, for example, by looking at the repository URL (and accounting for forks).&lt;/p&gt;
&lt;p&gt;Note that at this point, we are talking about the actual process of getting money, nothing else about this issue.&lt;/p&gt;
&lt;p&gt;Now, let&amp;rsquo;s get to the reason that this caused so much angst for people. The way SponsorLink works is that it fetches your email from the git configuration file and checks whether:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You are registered as a SponsorLink sponsor&lt;/li&gt;
&lt;li&gt;You are sponsoring this particular project&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It does both checks using what appears to be: &lt;em&gt;base62(sha256(email));&lt;/em&gt;&lt;/p&gt;
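&lt;p&gt;Just to make that concrete, here is a rough sketch of that computation. The exact base62 alphabet and byte order SponsorLink uses are not documented here, so both are assumptions:&lt;/p&gt;

```python
import hashlib

# Assumed alphabet: digits, uppercase, lowercase -- the real one may differ.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def base62(data: bytes) -> str:
    # Treat the digest as one big integer and repeatedly divide by 62.
    n = int.from_bytes(data, "big")
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out)) or ALPHABET[0]

def email_token(email: str) -> str:
    # base62(sha256(email)): deterministic, so the same email always
    # produces the same blob path.
    return base62(hashlib.sha256(email.encode("utf-8")).digest())
```

&lt;p&gt;The key property is determinism: anyone who knows (or guesses) your email can compute the same token, which is what makes the privacy discussion below relevant.&lt;/p&gt;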
&lt;p&gt;If you are already a SponsorLink sponsor, you have explicitly agreed to sharing your email, so not a problem there. So the second request is perfectly fine.&lt;/p&gt;
&lt;p&gt;The real problem is the first check, when you check whether you are a SponsorLink sponsor in the first place. Let&amp;rsquo;s assume that you aren&amp;rsquo;t; what happens then?&lt;/p&gt;
&lt;p&gt;Well, there is a request made that looks something like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;HEAD /azure-blob-storage/path/app/3uVutV7zDlwv2rwBwfOmm2RXngIwJLPeTO0qHPZQuxyS&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The server will return a 404 if you are &lt;em&gt;not&lt;/em&gt; a sponsor at this point.&lt;/p&gt;
&lt;p&gt;The email hash above is my own, by the way. As I mentioned, I&amp;rsquo;m not a sponsor, so I assume this will return 404. The question is what sort of information is being provided to the server in this case?&lt;/p&gt;
&lt;p&gt;Well, there is the hashed email, right? Is that a privacy concern?&lt;/p&gt;
&lt;p&gt;It is indeed. While reversing SHA256 in general is not possible, for something like emails, that is pretty trivial. It took me a few minutes to find &lt;a href="https://hashes.com/en/emails/sha256"&gt;an online tool that does just that&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The cost is around 0.00045 USD / email, just to give some context. So the end result is that using SponsorLink will provide the email of the user (without their express or implied consent) to the server. It takes a little bit of extra work, but it most certainly does.&lt;/p&gt;
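&lt;p&gt;To see why a hashed email is barely protection at all, consider this toy dictionary attack (the candidate list here is made up; real services run this against billions of known addresses):&lt;/p&gt;

```python
import hashlib

def sha256_hex(email: str) -> str:
    return hashlib.sha256(email.encode("utf-8")).hexdigest()

def reverse_email_hash(target_hash, candidates):
    # Reversing SHA256 in general is infeasible, but emails are a small,
    # guessable space: hash every candidate and compare.
    for email in candidates:
        if sha256_hex(email) == target_hash:
            return email
    return None

leaked = sha256_hex("bob@example.com")  # what the server observed
found = reverse_email_hash(leaked, ["alice@example.com", "bob@example.com"])
```

&lt;p&gt;There is no secret involved, so the "hashing" here is really just a fixed encoding of the email.&lt;/p&gt;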
&lt;p&gt;Note that practically speaking, this looks like it hits Azure Blob Storage, not a dedicated endpoint. That means that you can probably define logging to check for the requests and then gather the information from there.&amp;nbsp; Not sure what you would &lt;em&gt;do&lt;/em&gt; with this information, but it certainly looks like this falls under the definition of PII in the GDPR.&lt;/p&gt;
&lt;p&gt;There are a few ways to resolve this problem. The first would be to not use email at all, but instead the project repository URL. That may require a bit more work to resolve forks, but would alleviate most of the concerns regarding privacy. A &lt;em&gt;better&lt;/em&gt; option would be to just check for an included file in the repository, to be honest. Something like a .sponsored.projects file.&lt;/p&gt;
&lt;p&gt;That file would include the ids of the projects that are sponsored by this project, and then you can run a check to see that they are actually sponsored. There is no issue with consent here, since adding that file to the repository is explicit consent to the process.&lt;/p&gt;
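&lt;p&gt;A minimal sketch of that idea, assuming a one-id-per-line format with # comments (the file name and format are my invention, not an actual SponsorLink feature):&lt;/p&gt;

```python
from pathlib import Path

def sponsored_project_ids(repo_root: str) -> set:
    # Read a hypothetical .sponsored.projects file: one project id per line,
    # '#' starts a comment. Committing the file is the act of consent.
    path = Path(repo_root) / ".sponsored.projects"
    if not path.exists():
        return set()
    ids = set()
    for line in path.read_text().splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            ids.add(line)
    return ids
```

&lt;p&gt;Since the file is committed to the repository, whoever adds it is making the sponsorship claim for the whole project, not per developer.&lt;/p&gt;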
&lt;p&gt;Assuming that you want / need to use the emails still, the problem is much more complex. You cannot use the same process as &lt;a href="https://www.troyhunt.com/understanding-have-i-been-pwneds-use-of-sha-1-and-k-anonymity/"&gt;k-Anonymity as you can use for passwords&lt;/a&gt;. The problem is that a SHA256 of an email is as good as the email itself.&lt;/p&gt;
&lt;p&gt;I &lt;em&gt;think&lt;/em&gt; that something like this would work, however. Given the SHA256 of the email, you send to the server the following details:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;prefix = SHA256(email)[0 .. 6]&lt;/li&gt;
&lt;li&gt;key = read(&amp;ldquo;/dev/urandom&amp;rdquo;, 32bytes)&lt;/li&gt;
&lt;li&gt;hash = HMAC-SHA256(key, SHA256(email))&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The prefix is the first 6 letters of the SHA256 hash. The key is 32 cryptographically random bytes.&lt;/p&gt;
&lt;p&gt;The hash takes the SHA256 of the email and hashes it again using HMAC with the provided key.&lt;/p&gt;
&lt;p&gt;The idea is that on the server side, you can load all the hashes that you stored that match the provided prefix. Then you compute the keyed HMAC for all of those values and attempt to check if there is a match.&lt;/p&gt;
&lt;p&gt;We are trying to protect against a &lt;em&gt;malicious server&lt;/em&gt; here, remember. So the idea is that if there is a match, we pinged the server with an email that it knows about. If we ping the server with an email that it does &lt;em&gt;not&lt;/em&gt; know about, on the other hand, it cannot tell you anything about the value.&lt;/p&gt;
&lt;p&gt;The first 6 characters of the SHA256 will tell you nothing about the value, after all. And the fact that we use a random key when sending the actual hash to the server means that there is no point trying to figure it out.&amp;nbsp; Unlike trying to guess an email, guessing a &lt;em&gt;hash&lt;/em&gt; of an email is likely far harder, to the point that it is not feasible.&lt;/p&gt;
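&lt;p&gt;Here is a sketch of both sides of that exchange, with hypothetical names. This is an illustration of the idea above, not a vetted protocol:&lt;/p&gt;

```python
import hashlib
import hmac
import os

def client_request(email: str):
    # What the client sends: a 6-character prefix of SHA256(email),
    # a fresh random key, and HMAC-SHA256(key, SHA256(email)).
    digest = hashlib.sha256(email.encode("utf-8")).hexdigest()
    prefix = digest[:6]
    key = os.urandom(32)
    tag = hmac.new(key, digest.encode(), hashlib.sha256).digest()
    return prefix, key, tag

def server_is_sponsor(prefix, key, tag, sponsor_email_hashes):
    # The server only stores SHA256(email) for known sponsors. It loads the
    # stored hashes sharing the prefix and recomputes the keyed HMAC for each.
    for stored in sponsor_email_hashes:
        if stored.startswith(prefix):
            recomputed = hmac.new(key, stored.encode(), hashlib.sha256).digest()
            if hmac.compare_digest(recomputed, tag):
                return True
    return False
```

&lt;p&gt;The server learns only the 6-character prefix (shared by a huge number of emails) and an HMAC it can match solely against its own sponsor list, so a ping with an unknown email reveals essentially nothing.&lt;/p&gt;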
&lt;blockquote&gt;
&lt;p&gt;Note, I&amp;rsquo;m &lt;em&gt;not&lt;/em&gt; a cryptography expert, and I wouldn&amp;rsquo;t actually implement such a thing without consulting with one. I&amp;rsquo;m just writing a blog post with my ideas.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That would at least alleviate the privacy concern. But there are other issues.&lt;/p&gt;
&lt;p&gt;SponsorLink is provided as a closed-source &lt;em&gt;obfuscated&lt;/em&gt; library. People have taken the time to de-obfuscate it, and so far it appears to match the documented behavior. But the mere fact that an obfuscated, closed-source library is included in an open-source project raises a lot of red flags.&lt;/p&gt;
&lt;p&gt;Finally, there is the actual behavior when it detects that you are not sponsoring this project. Here is what the blog post states will happen:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://raw.githubusercontent.com/devlooped/SponsorLink/main/assets/img/VS-SL02.png" alt="A diagnostics warning in VS suggesting you install SponsorLink" /&gt;&lt;/p&gt;
&lt;p&gt;It will delay the build (locally, on your machine, not on CI).&lt;/p&gt;
&lt;p&gt;That&amp;hellip; is really bad. I assume that this happens on every build (not sure, though). If that is the case, it means that the feedback cycle of "write a test, run it, write code, run a test" is going to hit significant slowdowns.&lt;/p&gt;
&lt;p&gt;I would consider this to be a breaking point even excluding everything else.&lt;/p&gt;
&lt;p&gt;As I previously stated, I&amp;rsquo;m all for paying for Open Source software. But this is not the way to do that, there is a whole bunch of friction and not much that can indicate a positive outcome for the project.&lt;/p&gt;
&lt;p&gt;Monetization strategies for Open Source projects are &lt;em&gt;complex&lt;/em&gt;. Open core, for example, likely would not work for this scenario. Nor would you be likely to get support contracts. The critical aspect is that beyond just the technical details, any such strategy requires a whole bunch of &lt;em&gt;infrastructure&lt;/em&gt; around it. Marketing, sales, contract negotiation, etc. There is no easy solution here, I&amp;rsquo;m afraid.&lt;/p&gt;
</description><link>http://ayende.com/199905-c/on-moq-sponsorlink-some-thoughts?Key=c8cdbd73-9a61-4c49-87dd-7352604ec047</link><guid>http://ayende.com/199905-c/on-moq-sponsorlink-some-thoughts?Key=c8cdbd73-9a61-4c49-87dd-7352604ec047</guid><pubDate>Thu, 10 Aug 2023 12:00:00 GMT</pubDate></item><item><title>A performance profile mystery: The cost of Stopwatch</title><description>&lt;p&gt;Measuring the length of time that a particular piece of code takes is a surprisingly challenging task. There are two aspects to this: the first is how you ensure that the cost of getting the start and end times won&amp;rsquo;t interfere with the work you are doing; the second is how to actually get the time (potentially &lt;em&gt;many &lt;/em&gt;times a second) in as efficient a way as possible.&lt;/p&gt;
&lt;p&gt;To give some context, &lt;a href="https://aakinshin.net/posts/stopwatch/"&gt;Andrey Akinshin&lt;/a&gt; does a great overview of how the Stopwatch class works in C#. On Linux, that is basically calling to the &lt;em&gt;clock_gettime&lt;/em&gt; system call, except that this is &lt;em&gt;not&lt;/em&gt; a system call. That is actually a piece of code that the Kernel sticks inside &lt;em&gt;your&lt;/em&gt; process that will then integrate with other aspects of the Kernel to optimize this. The idea is that this system call is &lt;em&gt;so frequent&lt;/em&gt; that you cannot pay the cost of the Kernel mode transition. There is a good coverage &lt;a href="https://berthub.eu/articles/posts/on-linux-vdso-and-clockgettime/"&gt;of this here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In short, that is a very well-known problem and quite a lot of brainpower has been dedicated to solving it. And then we reached this situation:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://ayende.com/blog/Images/Open-Live-Writer/A-performance-profile-mystery_A64B/image_2.png"&gt;&lt;img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/A-performance-profile-mystery_A64B/image_thumb.png" alt="image" width="1201" height="141" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;What you are seeing here is us testing the indexing process of RavenDB under the profiler. This is indexing roughly 100M documents, and according to the profiler, we are spending 15% of our time gathering metrics?&lt;/p&gt;
&lt;p&gt;The StatsScope.Start() method simply calls Stopwatch.Start(), so we are basically looking at a profiler output that says that Stopwatch is accounting for 15% of our runtime?&lt;/p&gt;
&lt;p&gt;Sorry, I don&amp;rsquo;t believe that. I mean, it is &lt;em&gt;possible&lt;/em&gt;, but it seems far-fetched.&lt;/p&gt;
&lt;p&gt;In order to test this, I wrote a &lt;a href="https://gist.github.com/ayende/563e0e4b0dae1bf986a660076f143aa6"&gt;very simple program&lt;/a&gt;, which will generate 100K integers and test whether&amp;nbsp;they are prime or not. I&amp;rsquo;m doing that to test compute-bound work, basically, and testing calling Start() and Stop() either across the whole loop or in each iteration.&lt;/p&gt;
&lt;p&gt;I run that a few times and I&amp;rsquo;m getting:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Windows: 311 ms with Stopwatch per iteration and 312 ms without&lt;/li&gt;
&lt;li&gt;Linux: 450 ms with Stopwatch per iteration and 455 ms without&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;On Linux, there is about a 5ms overhead if we use a per-iteration stopwatch; on Windows, it is either the same cost or slightly cheaper &lt;em&gt;with &lt;/em&gt;the per-iteration stopwatch.&lt;/p&gt;
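&lt;p&gt;The actual C# program is in the gist above; for reference, a rough analogue of the same measurement looks like this (Python instead of C#, with time.perf_counter_ns() standing in for Stopwatch -- the absolute numbers will differ, the point is the per-iteration vs. whole-loop comparison):&lt;/p&gt;

```python
import time

def is_prime(n: int) -> bool:
    # Deliberately naive trial division, to create compute-bound work.
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def bench(per_iteration: bool, count: int = 100_000) -> float:
    # Returns total wall time in seconds. When per_iteration is True, we also
    # take a timestamp pair around every single call, mimicking a Stopwatch
    # that is started and stopped inside the loop.
    start = time.perf_counter_ns()
    for n in range(count):
        if per_iteration:
            t0 = time.perf_counter_ns()
        is_prime(n)
        if per_iteration:
            _ = time.perf_counter_ns() - t0
    return (time.perf_counter_ns() - start) / 1e9
```

&lt;p&gt;Comparing bench(True) to bench(False) gives the per-iteration timing overhead, which should be a small fraction of the totals quoted above.&lt;/p&gt;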
&lt;p&gt;Here is the profiler output on Windows:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://ayende.com/blog/Images/Open-Live-Writer/A-performance-profile-mystery_A64B/image_4.png"&gt;&lt;img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/A-performance-profile-mystery_A64B/image_thumb_1.png" alt="image" width="862" height="123" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;And on Linux:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://ayende.com/blog/Images/Open-Live-Writer/A-performance-profile-mystery_A64B/image_6.png"&gt;&lt;img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/A-performance-profile-mystery_A64B/image_thumb_2.png" alt="image" width="891" height="101" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Now, that is what happens when we are doing a significant amount of work, what happens if the amount of work is negligible? I made the IsPrime() method very cheap, and I got:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://ayende.com/blog/Images/Open-Live-Writer/A-performance-profile-mystery_A64B/image_8.png"&gt;&lt;img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/A-performance-profile-mystery_A64B/image_thumb_3.png" alt="image" width="847" height="108" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;So that is a good indication that this isn&amp;rsquo;t &lt;em&gt;free&lt;/em&gt;, but still&amp;hellip;&lt;/p&gt;
&lt;p&gt;Comparing the costs, it is utterly ridiculous that the profiler says that so much time is spent in those methods.&lt;/p&gt;
&lt;p&gt;Another aspect here may be the issue of the profiler impact itself. There are differences between using Tracing and Sampling methods, for example.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t have &lt;em&gt;an&lt;/em&gt; answer, just a lot of very curious questions.&lt;/p&gt;
</description><link>http://ayende.com/199874-a/a-performance-profile-mystery-the-cost-of-stopwatch?Key=095fd4b2-4515-479e-a926-ffeed604e5f3</link><guid>http://ayende.com/199874-a/a-performance-profile-mystery-the-cost-of-stopwatch?Key=095fd4b2-4515-479e-a926-ffeed604e5f3</guid><pubDate>Wed, 09 Aug 2023 12:00:00 GMT</pubDate></item><item><title>Production postmortem: The dog ate my request</title><description>&lt;p&gt;A customer called us, quite upset, because their RavenDB cluster was failing every few minutes. That was weird, because they were running on our cloud offering, so we had full access to the metrics, and we saw absolutely no problem on our end.&lt;/p&gt;
&lt;p&gt;During the call, it turned out that every now and then, but almost always immediately after a new deployment, RavenDB would fail some requests. On a fairly consistent basis, we could see two failures and a retry that was finally successful.&lt;/p&gt;
&lt;p&gt;Okay, so at least there is no user-visible impact, but this was still super strange to see. On the backend, we couldn&amp;rsquo;t see any reason why we would get those sorts of errors.&lt;/p&gt;
&lt;p&gt;Looking at the failure stack, we narrowed things down to an async operation that was invoked via DataDog. Our suspicions were focused on this being an error in the async machinery customization that DataDog uses for adding non-invasive monitoring.&lt;/p&gt;
&lt;p&gt;We created a custom build for the user that they could test and waited to get the results from their environment. Trying to reproduce this locally using DataDog integration didn&amp;rsquo;t raise any flags.&lt;/p&gt;
&lt;p&gt;The good thing was that we did find a smoking gun, a violation of the natural order and invariant breaking behavior.&lt;/p&gt;
&lt;p&gt;The not so good news was that it was in our own code. At least that meant that we could fix this.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s see if I can explain what is going on. The customer was using a custom configuration: FastestNode. This is used to find the nearest / least loaded node in the cluster and operate from it.&lt;/p&gt;
&lt;p&gt;How does RavenDB &lt;em&gt;know&lt;/em&gt; which is the fastest node? That is kind of hard to answer, after all. It checks.&lt;/p&gt;
&lt;p&gt;Every now and then, RavenDB replicates a read request to all nodes in the cluster. Something like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/692d3ecd0df97d3b7e2df3997632ca19.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;The idea is that we send the request to all the nodes, and wait for the &lt;em&gt;first&lt;/em&gt; one to arrive. Since this is the same request, all servers will do the same amount of work, and we&amp;rsquo;ll find the fastest node from our perspective.&lt;/p&gt;
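&lt;p&gt;Conceptually, the speed test is "race all nodes, take the first answer, cancel the rest". A sketch of that pattern (node names and delays are made up, and this is not the RavenDB client code itself):&lt;/p&gt;

```python
import asyncio

async def fetch(node: str, delay: float) -> str:
    # Stand-in for sending the same read request to one cluster node.
    await asyncio.sleep(delay)
    return node

async def fastest_node(nodes: dict) -> str:
    tasks = [asyncio.create_task(fetch(name, delay)) for name, delay in nodes.items()]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # the aborted requests are what monitoring counts as failures
    await asyncio.gather(*pending, return_exceptions=True)
    return next(iter(done)).result()
```

&lt;p&gt;The cancellations are deliberate and harmless, but to an APM tool they look like two aborted requests followed by a success, which is exactly the failure pattern the customer reported.&lt;/p&gt;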
&lt;p&gt;Did you notice the cancellation token in there? When we return from this function, we cancel the existing requests. Here is what this looks like from the monitoring perspective:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://ayende.com/blog/Images/Open-Live-Writer/Production-post-mortem_1004C/image_2.png"&gt;&lt;img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Production-post-mortem_1004C/image_thumb.png" alt="image" width="1133" height="123" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This looks &lt;em&gt;exactly&lt;/em&gt; like a couple of failures (and a failover) every few minutes, which was quite confusing until we figured out exactly what was going on.&lt;/p&gt;
</description><link>http://ayende.com/199745-a/production-postmortem-the-dog-ate-my-request?Key=80c35c6a-b89c-4ac6-bc7b-10fdefa2754a</link><guid>http://ayende.com/199745-a/production-postmortem-the-dog-ate-my-request?Key=80c35c6a-b89c-4ac6-bc7b-10fdefa2754a</guid><pubDate>Mon, 24 Jul 2023 12:00:00 GMT</pubDate></item><item><title>Solving heap corruption errors in managed applications</title><description>&lt;p&gt;RavenDB is a .NET application, written in C#. It also has a &lt;em&gt;non-trivial&lt;/em&gt; amount of unmanaged memory usage. We absolutely need that to get the proper level of performance that we require.&lt;/p&gt;
&lt;p&gt;When managing memory manually, there is also the possibility that we&amp;rsquo;ll mess it up. We ran into one such case: when running our full test suite (over 10,000 tests), we would get random crashes due to heap corruption. Those issues are &lt;em&gt;nasty&lt;/em&gt;, because there is a big separation between the root cause and the actual problem manifesting.&lt;/p&gt;
&lt;p&gt;I recently learned that you can use &lt;a href="https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/gflags-and-pageheap"&gt;the gflags tool on .NET executables&lt;/a&gt;. We were able to narrow the problem to a single scenario, but we still had no idea where the problem really occurred. So I installed the &lt;a href="https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/debugger-download-tools"&gt;Debugging Tools for Windows&lt;/a&gt; and then executed:&lt;/p&gt;
&lt;blockquote&gt;
&lt;pre&gt; &amp;amp;"C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\gflags.exe" /p /enable C:\Work\ravendb-6.0\test\Tryouts\bin\release\net7.0\Tryouts.exe&lt;/pre&gt;
&lt;/blockquote&gt;
&lt;p&gt;What this does is enable a special debug heap at the executable level, which applies to &lt;em&gt;all&lt;/em&gt; operations (managed and native memory alike).&lt;/p&gt;
&lt;p&gt;With that enabled, I ran the scenario in question:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;PS C:\Work\ravendb-6.0\test\Tryouts&amp;gt;&amp;nbsp; C:\Work\ravendb-6.0\test\Tryouts\bin\release\net7.0\Tryouts.exe&lt;br /&gt; 42896&lt;br /&gt; Starting to run 0&lt;br /&gt; Max number of concurrent tests is: 16&lt;br /&gt; Ignore request for setting processor affinity. Requested cores: 3. Number of cores on the machine: 32.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; To attach debugger to test process (x64), use proc-id: 42896. Url &lt;a href="http://127.0.0.1:51595"&gt;http://127.0.0.1:51595&lt;/a&gt;&lt;br /&gt; Ignore request for setting processor affinity. Requested cores: 3. Number of cores on the machine: 32.&amp;nbsp; License limits: A: 3/32. Total utilized cores: 3. Max licensed cores: 1024&lt;br /&gt; &lt;a href="http://127.0.0.1:51595/studio/index.html#databases/documents?&amp;amp;database=Should_correctly_reduce_after_updating_all_documents_1&amp;amp;withStop=true&amp;amp;disableAnalytics=true"&gt;http://127.0.0.1:51595/studio/index.html#databases/documents?&amp;amp;database=Should_correctly_reduce_after_updating_all_documents_1&amp;amp;withStop=true&amp;amp;disableAnalytics=true&lt;/a&gt;&lt;br /&gt; Fatal error. System.AccessViolationException: Attempted to read or write protected memory. 
This is often an indication that other memory is corrupt.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; at Sparrow.Server.Compression.Encoder3Gram`1[[System.__Canon, System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].Encode(System.ReadOnlySpan`1&amp;lt;Byte&amp;gt;, System.Span`1&amp;lt;Byte&amp;gt;)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; at Sparrow.Server.Compression.HopeEncoder`1[[Sparrow.Server.Compression.Encoder3Gram`1[[System.__Canon, System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]], Sparrow.Server, Version=6.0.0.0, Culture=neutral, PublicKeyToken=37f41c7f99471593]].Encode(System.ReadOnlySpan`1&amp;lt;Byte&amp;gt; ByRef, System.Span`1&amp;lt;Byte&amp;gt; ByRef)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; at Voron.Data.CompactTrees.PersistentDictionary.ReplaceIfBetter[[Raven.Server.Documents.Indexes.Persistence.Corax.CoraxDocumentTrainEnumerator, Raven.Server, Version=6.0.0.0, Culture=neutral, PublicKeyToken=37f41c7f99471593],[Raven.Server.Documents.Indexes.Persistence.Corax.CoraxDocumentTrainEnumerator, Raven.Server, Version=6.0.0.0, Culture=neutral, PublicKeyToken=37f41c7f99471593]](Voron.Impl.LowLevelTransaction, Raven.Server.Documents.Indexes.Persistence.Corax.CoraxDocumentTrainEnumerator, Raven.Server.Documents.Indexes.Persistence.Corax.CoraxDocumentTrainEnumerator, Voron.Data.CompactTrees.PersistentDictionary)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; at Raven.Server.Documents.Indexes.Persistence.Corax.CoraxIndexPersistence.Initialize(Voron.StorageEnvironment)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That pinpointed things, so I knew exactly where we were messing up.&lt;/p&gt;
&lt;p&gt;I was also able to reproduce the behavior on the debugger:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://ayende.com/blog/Images/Open-Live-Writer/Solving-heap-corruption-errors-in-manage_B4AE/image%20(3)_2.png"&gt;&lt;img style="border: 0px currentcolor; display: inline; background-image: none;" title="image (3)" src="https://ayende.com/blog/Images/Open-Live-Writer/Solving-heap-corruption-errors-in-manage_B4AE/image%20(3)_thumb.png" alt="image (3)" width="1394" height="306" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This saved me &lt;em&gt;hours &lt;/em&gt;or &lt;em&gt;days&lt;/em&gt; of trying to figure out where the problem actually is.&lt;/p&gt;
</description><link>http://ayende.com/199682-b/solving-heap-corruption-errors-in-managed-applications?Key=9d558341-07d7-46a6-be46-f94cb752404e</link><guid>http://ayende.com/199682-b/solving-heap-corruption-errors-in-managed-applications?Key=9d558341-07d7-46a6-be46-f94cb752404e</guid><pubDate>Wed, 05 Jul 2023 12:00:00 GMT</pubDate></item><item><title>Bug chasing, the process is more important than the result</title><description>&lt;p&gt;I&amp;rsquo;m doing a pretty major refactoring inside of RavenDB right now. I was able to finish a bunch of work and submitted things to the CI server for testing. RavenDB has several layers of tests, which we run depending on context.&lt;/p&gt;
&lt;p&gt;During development, we&amp;rsquo;ll usually run the FastTests. About 2,300 tests are being run to validate various behaviors for RavenDB, and on my machine, they take just over 3 minutes to complete. The next tier is the SlowTests, which run for about 3 hours on the CI server and run about 26,000 tests. Beyond that we actually have a few more layers, like tests that are being run only on the nightly builds and stress tests, which can take several minutes each to complete.&lt;/p&gt;
&lt;p&gt;In short, the usual process is that you write the code and write the relevant tests. You also validate that you didn&amp;rsquo;t break anything by running the FastTests locally. Then we let CI pick up the rest of the work. At the last count, we had about 9 dedicated machines as CI agents and given our workload, an actual full test run of a PR may complete the next day.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m mentioning all of that to explain that when I reviewed the build log for my PR, I found that there were a bunch of tests that failed. That was reasonable, given the scope of my changes. I sat down to grind through them, fixing them one at a time. Some of them were &lt;em&gt;quite&lt;/em&gt; important things that I didn&amp;rsquo;t take into account, after all. For example, one of the tests failed because I didn&amp;rsquo;t account for sorting on a dynamic numeric field. Sorting on a numeric field worked, and a dynamic text field also worked. But dynamic numeric field didn&amp;rsquo;t. It&amp;rsquo;s the sort of thing that I would never think of, but we got the tests to cover us.&lt;/p&gt;
&lt;p&gt;But when I moved to the next test, it didn&amp;rsquo;t fail. I ran it again, and it still didn&amp;rsquo;t fail. I ran it in a loop, and it failed on the 5th iteration. That&amp;hellip; sucked. Because it meant that I had a race condition in there somewhere. I ran the loop again, and it failed &lt;em&gt;again&lt;/em&gt; on the 5th. In fact, in every iteration I tried, it would only fail on the 5th iteration.&lt;/p&gt;
&lt;p&gt;When trying to isolate a test failure like that, I usually run it in a loop, and hope that with enough iterations, I&amp;rsquo;ll get it to reproduce. Having it happen constantly on the 5th iteration was&amp;hellip; really strange. I tried figuring out what was going on, and I realized that the test was generating 1000 documents using a Random instance. The fact that I&amp;rsquo;m using Random is the reason it is non-deterministic, of course, except that this is the code inside my test base class:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-the-process-is-more-importan_FFE0/image_2.png"&gt;&lt;img style="border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-the-process-is-more-importan_FFE0/image_thumb.png" alt="image" width="450" height="47" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So this is &lt;em&gt;properly&lt;/em&gt; initialized with a seed, so it will be consistent.&lt;/p&gt;
&lt;p&gt;Read the code again, do you see the problem?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-the-process-is-more-importan_FFE0/image_7.png"&gt;&lt;img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-the-process-is-more-importan_FFE0/image_thumb_2.png" alt="image" width="450" height="47" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That is a &lt;em&gt;static&lt;/em&gt; value. So there are two problems here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I&amp;rsquo;m getting the &lt;em&gt;bad&lt;/em&gt; values on the fifth run in a consistent manner because that is the set of results that reproduce the error.&lt;/li&gt;
&lt;li&gt;This is a &lt;em&gt;shared&lt;/em&gt; instance that may be called from multiple tests at once, leading to the wrong result because Random is not thread safe.&lt;/li&gt;
&lt;/ul&gt;
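&lt;p&gt;The same trap is easy to reproduce in any language. A Python sketch of the first problem, a static seeded RNG whose state carries across test iterations (the seed and values here are illustrative):&lt;/p&gt;

```python
import random

# A *static* (module-level) seeded RNG: state carries over between "tests".
shared_rng = random.Random(1337)

def generate_docs(rng, count=5):
    # Stand-in for generating the test documents from random values.
    return [rng.randint(0, 100) for _ in range(count)]

# Five iterations of "the same test" each consume a different slice of the
# shared stream -- but the slices are fixed by the seed, so whichever slice
# triggers the bug will trigger it on the same iteration, every run.
run_a = [generate_docs(shared_rng) for _ in range(5)]

# Re-running the whole process reproduces the exact same five slices.
rerun_rng = random.Random(1337)
run_b = [generate_docs(rerun_rng) for _ in range(5)]

# With a fresh, per-test RNG, every iteration would see identical values:
per_test = [generate_docs(random.Random(1337)) for _ in range(5)]
```

&lt;p&gt;That is exactly the "fails only on the 5th iteration" behavior: deterministic, but only when you account for the position in the shared stream.&lt;/p&gt;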
&lt;p&gt;Before fixing this issue so it would run properly, I decided to use an ancient debugging technique. It&amp;rsquo;s called printf().&lt;/p&gt;
&lt;p&gt;In this case, I wrote out all the values that were generated by the test and wrote a &lt;em&gt;new&lt;/em&gt; test to replay them. &lt;em&gt;That&lt;/em&gt; one failed consistently.&lt;/p&gt;
&lt;p&gt;The problem was that it was still too big in scope. I iterated over this approach, trying to end up with a smaller section of the codebase that I could invoke to repeat this issue. &lt;em&gt;That&lt;/em&gt; took most of the day. But the end result is a test like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/1af4847d0f7dc293f6b075824ca8c3d8.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;As you can see, in terms of the amount of code that it invokes, it is pretty minimal. Which is pretty awesome, since that allowed me to figure out what the problem was:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-the-process-is-more-importan_FFE0/image_9.png"&gt;&lt;img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-the-process-is-more-importan_FFE0/image_thumb_3.png" alt="image" width="247" height="76" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I&amp;rsquo;ve been developing software professionally for over two decades at this point. I still get caught up with things like that, sigh.&lt;/p&gt;
</description><link>http://ayende.com/199425-b/bug-chasing-the-process-is-more-important-than-the-result?Key=25f17f7f-bece-4fee-98d2-91ccbe9df5e8</link><guid>http://ayende.com/199425-b/bug-chasing-the-process-is-more-important-than-the-result?Key=25f17f7f-bece-4fee-98d2-91ccbe9df5e8</guid><pubDate>Thu, 04 May 2023 12:00:00 GMT</pubDate></item><item><title>Fight for every byte it takes: Decoding the entries</title><description>&lt;p&gt;In &lt;a href="https://ayende.com/blog/posts/series/199329-B/fight-for-every-byte-it-takes"&gt;this series&lt;/a&gt; so far, we reduced the storage cost of key/value lookups by a &lt;em&gt;lot&lt;/em&gt;. And in &lt;a href="https://ayende.com/blog/199364-B/fight-for-every-byte-it-takes-optimizing-the-encoding-process?key=6f580f42d30f4d9dbf237ce0a0deb6b4"&gt;the last post&lt;/a&gt; we optimized the process of encoding the keys and values significantly. This is great, but the toughest challenge is ahead of us, because as much as encoding efficiency matters, &lt;em&gt;the &lt;/em&gt;absolute cost we have is doing lookups. This is the most basic operation, which we do billions of times a second. Any amount of effort we&amp;rsquo;ll spend here will be worth it. That said, let&amp;rsquo;s look at the decoding process we have right now. It was built to be &lt;em&gt;understandable&lt;/em&gt; over all else, so it is a good start.&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/a26f6c6bf4d2544545d5f516812a45bf.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p align="left"&gt;What this code does is to accept a buffer and an offset into the buffer. But the offset isn&amp;rsquo;t just a number, it is composed&amp;nbsp; of two values. The first 12 bits contain the offset in the page, but since we use 2-byte alignment for the entry position, we can just assume a zero bit at the start. That is why we compute the actual offset in the page by clearing the first &lt;em&gt;four&lt;/em&gt; bits and then shifting left by &lt;em&gt;three&lt;/em&gt; bits. That extracts the actual offset to the file, (usually a 13 bits value) using just 12 bits. The first four bits in the offset are the indicator for the key and value lengths. There are 15 known values, which we computed based on probabilities and one value reserved to say: Rare key/val length combination, the actual sizes are stored as the first byte in the entry.&lt;/p&gt;
&lt;p align="left"&gt;Note that in the code, we handle that scenario by reading the key and value lengths (stored as two nibbles in the first byte) and &lt;em&gt;incrementing&lt;/em&gt; the offset in the page. That means that we skip past the header byte in those rare situations.&lt;/p&gt;
&lt;p align="left"&gt;The rest of the code is basically copying the key and value bytes to the relevant variables, taking advantage of partial copy and little-endian encoding.&lt;/p&gt;
&lt;p align="left"&gt;The code in question takes 512 bytes and has 23 branches. In terms of performance, we can probably do much better, but the code is clear in what it is doing, at least.&lt;/p&gt;
&lt;p align="left"&gt;The first thing I want to try is to replace&amp;nbsp; the switch statement with a lookup table, just like we did before.&amp;nbsp; Here is what the new version looks like:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/73cf097f99529e4126cf29401ed2a44d.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p align="left"&gt;The size of the function dropped by almost half and we have only 7 more branches involved. There are also a couple of calls to the memory copy routines that weren&amp;rsquo;t inlined. In the encoding phase, we reduced branches due to bound checks using raw pointers, and we skipped the memory calls routines by copying a fixed size value at varied offsets to be able to get the data properly&amp;nbsp; aligned. In this case, we can&amp;rsquo;t really do the same. One thing that we have to be aware of is the following situation:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p align="left"&gt;&lt;a href="https://ayende.com/blog/Images/Open-Live-Writer/Fight-for-every-byte-it-takes-Decoding-t_94FC/image_2.png"&gt;&lt;img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Fight-for-every-byte-it-takes-Decoding-t_94FC/image_thumb.png" alt="image" width="496" height="85" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p align="left"&gt;In other words, we may have an entry that is at the &lt;em&gt;end&lt;/em&gt; of the page, if we&amp;rsquo;ll try to read unconditionally 8 bytes, we may read past the end of the buffer. That is not something that we can do. In the Encode() case, we &lt;em&gt;know&lt;/em&gt; that the caller gave us a buffer large enough to accommodate the largest possible size, so that isn&amp;rsquo;t an issue. That complicates things, sadly, but we can go the other way around.&lt;/p&gt;
&lt;p align="left"&gt;The Decode() function will always be called on an entry, and that is part of the page. The way we place entries means that we are starting at the top and moving down. The structure of the page means that we can never actually place an entry below the first 8 bytes of the page. That is where the header and the offsets array are going, after all. Given that, we can do an unconditional read &lt;em&gt;backward&lt;/em&gt; from the entry. As you can see in the image below, we are reading some data that we don&amp;rsquo;t care about, but this is fine, we can fix it later, and without any branches.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p align="left"&gt;&lt;a href="https://ayende.com/blog/Images/Open-Live-Writer/Fight-for-every-byte-it-takes-Decoding-t_94FC/image_4.png"&gt;&lt;img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Fight-for-every-byte-it-takes-Decoding-t_94FC/image_thumb_1.png" alt="image" width="495" height="121" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
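The backward-read trick can be sketched as follows. This is an illustrative Python version (names are mine, not the post's): instead of reading 8 bytes forward, which could run past the end of the page, we read the 8 bytes *ending* at the position we need and shift away the bytes that belong to earlier data.

```python
# Sketch of an unconditional backward read: always read a full 8 bytes ending
# at `end` (safe, because entries never sit in the first 8 bytes of the page),
# then discard the low bytes we don't care about with a shift -- no branches.

def read_backward(page: bytes, end: int, length: int) -> int:
    """Read `length` bytes (1..8) ending at offset `end`, little-endian."""
    chunk = int.from_bytes(page[end - 8:end], "little")  # never reads past `end`
    return chunk >> ((8 - length) * 8)                   # drop the unwanted bytes
```

The shift amount is the only thing that depends on the length, so the whole fix-up stays branchless.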
&lt;p align="left"&gt;The end result is that we can have the following changes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/e7a2bac1daecd81b507b916670edd268.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p align="left"&gt;I changed the code to use a raw pointer, avoiding bound checks that we already reasoned about. Most interesting is the ReadBackward function. This is an inner function, and was properly inlined during JIT compilation, it implements the backward reading of the value. Here is what the assembly looks like:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/4b9095357ed262ba9fba0b2d0f1cada3.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p align="left"&gt;With this in place, we are now at 133 bytes and a &lt;em&gt;single&lt;/em&gt; branch operation. That is pretty awesome, but we can do better still. Consider the following code (explanations to follow):&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/e293a5f106e53ab35257e9cbdbe47a9f.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p align="left"&gt;Note that the first element in the table here is different, it is now setting the 4th bit. This is because we are making use of that. The structure of the bytes in the table are two nibbles, but no other value in the table sets the 4th bit. That means that we can operate on that.&lt;/p&gt;
&lt;p align="left"&gt;Indeed, what we are doing is use the decoder byte to figure out what sort of shift we want. We have the byte from the table and the byte from the buffer. And we use the fact that masking this with 8 gives (just for this value) the value of 8. We can then use &lt;em&gt;that&lt;/em&gt; to select the appropriate byte. If we have an offloaded byte, then we&amp;rsquo;ll shift the value by 8, getting the byte from the buffer. For any other value, we&amp;rsquo;ll get 0 as the shift index, resulting in us getting the value from the table. That gives us a function with zero branches, and 141 bytes.&lt;/p&gt;
&lt;p align="left"&gt;I spent a&lt;em&gt; lot&lt;/em&gt; of time thinking about this, so now that we have those two approaches, let's benchmark them. The results were surprising:&lt;/p&gt;
&lt;blockquote&gt;
&lt;pre&gt;|                  Method |       Mean |    Error |   StdDev |
|------------------------ |-----------:|---------:|---------:|
|  DecodeBranchlessShifts | 2,107.1 ns | 20.69 ns | 18.34 ns |
|           DecodeBranchy |   936.2 ns |  1.89 ns |  1.68 ns |&lt;/pre&gt;
&lt;/blockquote&gt;
&lt;p align="left"&gt;It turns out that the slightly smaller code with the branches is able to &lt;em&gt;beat up&lt;/em&gt; the branchless code. When looking into what they are doing, I think that I can guess why. Branches aren&amp;rsquo;t a huge problem if they are predictable, and in our case, the whole &lt;em&gt;point&lt;/em&gt; of all of this is that the offload process where we need to go to the entry to get the value is meant to be a rare event. In branchless code, on the other hand, you have to do something several times to avoid a branch (like shifting the value from the buffer up and maybe shifting it down, etc).&lt;/p&gt;
&lt;p align="left"&gt;You can&amp;rsquo;t really argue with a difference like that. We also tried an AVX version, to see if this would have better performance. It turns out that there is really no way for us to beat the version with the single branch. Everything else was at least twice as slow.&lt;/p&gt;
&lt;p align="left"&gt;At this point, I believe that we have a winner.&lt;/p&gt;
</description><link>http://ayende.com/199393-b/fight-for-every-byte-it-takes-decoding-the-entries?Key=f241e460-96cc-4efa-a99d-56e5df3e9e1c</link><guid>http://ayende.com/199393-b/fight-for-every-byte-it-takes-decoding-the-entries?Key=f241e460-96cc-4efa-a99d-56e5df3e9e1c</guid><pubDate>Mon, 01 May 2023 12:00:00 GMT</pubDate></item><item><title>Fight for every byte it takes: Fitting 64 values in 4 bits</title><description>&lt;p&gt;Moving to &lt;a href="https://ayende.com/blog/199362-B/fight-for-every-byte-it-takes-nibbling-at-the-costs?key=9b599e1acc764686bc2c4a6b18a09121"&gt;nibble encoding&lt;/a&gt; gave us a measurable improvement in the density of the entries in the page.&amp;nbsp;&amp;nbsp; The problem is that we pretty much run out of room to do so. We are currently using a byte per entry to hold the size of the entry (as two nibbles, of 4 bits each). You can&amp;rsquo;t really go lower than that.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s review again what we know about the structure of the data: we have an 8KB page with three sections, a fixed-size header and the variable-size offsets and entries arrays. Here is what this looks like:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://ayende.com/blog/Images/Open-Live-Writer/Fight-for-every-byte-it-takes-Fitting_12C2D/image_2.png"&gt;&lt;img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Fight-for-every-byte-it-takes-Fitting_12C2D/image_thumb.png" alt="image" width="451" height="458" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This is called a slotted page design. The idea is that the offset array at the bottom of the page is maintaining the sort orders of the entries, and that we can write the entries from the top of the page. When we need to sort the entries, we just need to touch the offsets array (shown in yellow in the image).&lt;/p&gt;
&lt;p&gt;Given that we are talking about size and density, we spent a lot of time trying to reduce the size of the entries, but can we do something with the header or the offsets? The header is just 4 bytes right now, two shorts that denote the location of the bottom and the top positions in the page. Given that the page is 8KB in size, we have to use a 16-bit integer to cover that range. For the offsets, the situation is the same. We have to be able to point to an entry&amp;rsquo;s location anywhere in the page, and that means covering the full 8KB. So the offsets are also 16-bit ints and take two bytes each.&lt;/p&gt;
&lt;p&gt;In other words, there is a &lt;em&gt;hidden&lt;/em&gt; overhead of 2 bytes per entry that we didn&amp;rsquo;t even consider. In the case of our latest success, we were able to push 759 entries into the page, which means that we are actually using 18.5% of the page just to hold the &lt;em&gt;offsets&lt;/em&gt; of the entries. That is 1.48 KB that is being used.&lt;/p&gt;
&lt;p&gt;The problem is that we &lt;em&gt;need&lt;/em&gt; to use this. We have to be able to point to an entry anywhere in the page, which means that we have to reach 0 .. 8192. The minimum size we can use is 16 bits or two bytes.&lt;/p&gt;
&lt;p&gt;Or do we?&lt;/p&gt;
&lt;p&gt;16 bits gives us a range of 0 .. 65,535, after all. That is far in excess of what we need. We &lt;em&gt;could&lt;/em&gt; use a 64KB page, but there are other reasons to want to avoid that.&lt;/p&gt;
&lt;p&gt;To cover 8KB, we only need 13 bits, after all. For that matter, we can shave off yet another bit. If we decide that an entry should be 2-byte aligned, we can address the entire page in just 12 bits.&lt;/p&gt;
&lt;p&gt;That means that we have &lt;em&gt;4 whole free bits&lt;/em&gt; to play with. The first idea is to change the offsets array from 16-bit ints to 12-bit ints. That would save us 380 bytes at 759 entries per page, which is quite a lot. Unfortunately, working with bits in this manner would be &lt;em&gt;super&lt;/em&gt; awkward. We are doing a lot of random access and moves while we are building the page. It is &lt;em&gt;possible&lt;/em&gt; to do this using bits, but not fun.&lt;/p&gt;
&lt;p&gt;Instead, we can keep the 16-bit offsets and set things up so that each one has a free nibble for us to use. We &lt;em&gt;just&lt;/em&gt; used nibbles to save on the cost of variable-size ints, to great success.&lt;/p&gt;
&lt;p&gt;However, we don&amp;rsquo;t need just &lt;em&gt;a&lt;/em&gt; nibble, we need two of them, to store the sizes of the key and the value in bytes. Actually, we don&amp;rsquo;t need two full nibbles. The sizes of the key and the value each max out at 8 bytes, after all, so we can encode each of them in 3 bits. In other words, we need 6 bits to encode this information.&lt;/p&gt;
&lt;p&gt;We only have 4 bits, however. It is a &lt;em&gt;really&lt;/em&gt; nice idea, though, and I kept turning it over in my head, trying to come up with all sorts of clever ways to figure out how we can push 64 values into 4 bits. The impact of that would be pretty amazing.&lt;/p&gt;
&lt;p&gt;Eventually, I realized that it is fairly easy to &lt;em&gt;prove&lt;/em&gt;, using math, that there is no way to do so. Faced with this failure, I realigned my thinking and found a solution. I don&amp;rsquo;t &lt;em&gt;need&lt;/em&gt; to have a perfect answer, I can have a &lt;em&gt;good &lt;/em&gt;one.&lt;/p&gt;
&lt;p&gt;4 bits give me a range of 16 values (out of the possible 64). If I give up on trying to solve the whole problem, can I solve a meaningful part of it?&lt;/p&gt;
&lt;p&gt;And I came up with the following idea. We can do a two-stage approach: we&amp;rsquo;ll map the most common 15 combinations of key and value sizes to those 4 bits. The last value will be a marker that says you have to go and look elsewhere.&lt;/p&gt;
&lt;p&gt;Using just the data in the offset, I&amp;rsquo;m able to figure out what the location of the entry in the page is as well as the size of the key and value for most cases. For the (hopefully rare) scenarios where that is &lt;em&gt;not&lt;/em&gt; the case, we fall back to storing the size information as two nibbles preceding the entry data.&lt;/p&gt;
&lt;p&gt;This is a pretty neat idea, even if I say so myself, and it has a good chance to allow us to save about 1 byte per entry in the common case. In fact, I tested that and about 90% of the cases in my test case are covered by the top 15 cases. That is a pretty good indication that I&amp;rsquo;m on the right track.&lt;/p&gt;
&lt;p&gt;All of that said, let&amp;rsquo;s look at how this looks in code:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/766084b0333cf3eb90e4fdf66ec6f58c.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;I&amp;rsquo;m using a switch expression here for readability, so it is clear what is going on. If the key and value sizes match one of the known patterns, we can encode that in the nibble we&amp;rsquo;ll return. If not, we&amp;rsquo;ll write the sizes to the entry buffer itself.&lt;/p&gt;
&lt;p&gt;The Set method itself had to change in some subtle but crucial ways, let&amp;rsquo;s look at it first, then I&amp;rsquo;ll discuss those changes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/e23479d9e77e0a81abaadc81704736ea.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;As before, we encode the entry into a temporary buffer. Now, in addition to getting the length of the entry, we are also getting the nibble that we&amp;rsquo;ll need to store.&lt;/p&gt;
&lt;p&gt;You can see the changes in how we work with the offsets array following that. When we need to update an existing value, we are using this construction to figure out the actual entry offset:&lt;/p&gt;
&lt;blockquote&gt;
&lt;pre&gt;&lt;span style="color: #9b00d3;"&gt;var&lt;/span&gt; actualEntryOffset = ((offsets[idx] &amp;amp; &lt;span style="color: #9b00d3;"&gt;0xFFF0&lt;/span&gt;) &amp;gt;&amp;gt; 3);&lt;/pre&gt;
&lt;/blockquote&gt;
&lt;p&gt;What exactly is going on here? Don&amp;rsquo;t try to figure it out yet, let&amp;rsquo;s see how we are &lt;em&gt;writing &lt;/em&gt;the data:&lt;/p&gt;
&lt;blockquote&gt;
&lt;pre&gt;top = (&lt;span style="color: #0000ff;"&gt;ushort&lt;/span&gt;)((top - reqEntryLen) &amp;amp; ~1); &lt;span style="color: #9bbb59;"&gt;// align on two bytes boundary&lt;/span&gt;
offsets[idx] = (&lt;span style="color: #0000ff;"&gt;ushort&lt;/span&gt;)(top &amp;lt;&amp;lt; 3 | nibble);&lt;/pre&gt;
&lt;/blockquote&gt;
&lt;p&gt;Those two code snippets may look very odd, so let&amp;rsquo;s go over them in detail.&lt;/p&gt;
&lt;p&gt;First, remember that we have an 8KB page to work with, but we need to use 4 bits for the size nibble we got from encoding the entry. To address the full 8,192 values in the page, we&amp;rsquo;ll need to reserve 13 bits. That is&amp;hellip; a problem. We solve that by saying that the entry addresses must always be aligned on two bytes boundary. That is handled by clearing the first bit in the new top computation. Since we are growing down, that has the effect of ensuring aligned-by-two.&lt;/p&gt;
&lt;p&gt;Then, we merge the top location and the nibble together. We &lt;em&gt;know&lt;/em&gt; that the bottom-most bit of the top is cleared, so we can just shift the value left by 3 bits and know that we&amp;rsquo;ve got 4 cleared bits ready for the nibble.&lt;/p&gt;
&lt;p&gt;Conversely, when we want to &lt;em&gt;read&lt;/em&gt;, we clear the lowest 4 bits and then shift right by three. That has the effect of returning us to the original value.&lt;/p&gt;
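Those two snippets form a round trip, which a short sketch makes easy to verify. This is an illustrative Python version of the same bit manipulation (function names are mine):

```python
# Pack a 2-byte-aligned entry position (< 8192) together with the 4-bit size
# nibble into one 16-bit offset, and recover the position on the way out.

def pack(top: int, nibble: int) -> int:
    top &= ~1                      # align on a two-byte boundary
    return (top << 3) | nibble    # low bit of top is 0, so the 4 low bits are free

def unpack(offset: int) -> int:
    return (offset & 0xFFF0) >> 3  # the actualEntryOffset computation from above
```

Because the aligned position always has a zero low bit, shifting left by 3 leaves exactly four cleared bits for the nibble, and clearing them before shifting right undoes the whole thing.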
&lt;p&gt;A little bit confusing, but we managed to squeeze 784 entries into the page using the realistic dataset and 765 using the full one. That is another 3.5% of space savings over the previous nibble attempt and over a 10% increase in capacity from the variable-size integer approach.&lt;/p&gt;
&lt;p&gt;And at this point, I don&amp;rsquo;t believe that there is anything more that I &lt;em&gt;can&lt;/em&gt; do to reduce the size in a significant manner without having a negative impact elsewhere.&lt;/p&gt;
&lt;p&gt;We are not done yet, however. We are done with the size aspect, but we also have much to do in terms of performance and optimizations for runtime.&lt;/p&gt;
&lt;p&gt;In the meantime, you can see &lt;a href="https://gist.github.com/ayende/af3c4a7acc8cc2508e0b676bc9a8eee5"&gt;my full code here&lt;/a&gt;. In the next post, we are going to start talking about the actual machine code and how we can optimize it.&lt;/p&gt;
</description><link>http://ayende.com/199363-b/fight-for-every-byte-it-takes-fitting-64-values-in-4-bits?Key=e403cbee-47d0-4370-8a9d-07eb165b8f7c</link><guid>http://ayende.com/199363-b/fight-for-every-byte-it-takes-fitting-64-values-in-4-bits?Key=e403cbee-47d0-4370-8a9d-07eb165b8f7c</guid><pubDate>Thu, 27 Apr 2023 12:00:00 GMT</pubDate></item><item><title>Fight for every byte it takes: Variable size data</title><description>&lt;p&gt;In &lt;a href="https://ayende.com/blog/199329-B/fight-for-every-byte-it-takes-storing-raw-numbers?key=8cb96c2c2e9c49a58ae7a0fd14b35d6b"&gt;my previous post&lt;/a&gt;, we stored keys and values as raw numbers inside the 8KB page. That was simple, but wasteful. For many scenarios, we are never going to need to utilize the full 8 bytes range for a long. Most numbers are far smaller.&lt;/p&gt;
&lt;p&gt;In the example I gave in the last post, we are storing the following range of numbers (file offsets, basically). I&amp;rsquo;m using two test scenarios, one where I&amp;rsquo;m testing the full range (for correctness) and one where I&amp;rsquo;m testing files under 512 GB in size. Given that we are trying to compress the space, once we hit the 512GB mark, it is probably &lt;em&gt;less&lt;/em&gt; urgent, after all.&lt;/p&gt;
&lt;p&gt;Here are the number generations that I&amp;rsquo;m using:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/afc45a55389d8dfe9348051f2b2e378f.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;What this &lt;em&gt;means&lt;/em&gt; is:&lt;/p&gt;
&lt;table border="0" cellspacing="5" cellpadding="5"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td valign="top"&gt;Full data set&lt;/td&gt;
&lt;td valign="top"&gt;Realistic data set&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td valign="top"&gt;
&lt;ul&gt;
&lt;li&gt;&amp;nbsp; 3% in the first 128 bytes&lt;/li&gt;
&lt;li&gt;&amp;nbsp; 7% in the first 64 KB&lt;/li&gt;
&lt;li&gt;25% in the first 8 MB&lt;/li&gt;
&lt;li&gt;35% in the first 2 GB&lt;/li&gt;
&lt;li&gt;15% in the first 512 GB&lt;/li&gt;
&lt;li&gt;5% in the first 128 TB&lt;/li&gt;
&lt;li&gt;3% in the first 32 Petabytes&lt;/li&gt;
&lt;li&gt;2% in the first 4 Exabytes&lt;/li&gt;
&lt;/ul&gt;
&lt;/td&gt;
&lt;td valign="top"&gt;
&lt;ul&gt;
&lt;li&gt;&amp;nbsp; 1% in the first 128 bytes&lt;/li&gt;
&lt;li&gt;&amp;nbsp; 2% in the first 64 KB&lt;/li&gt;
&lt;li&gt;27% in the first 8 MB&lt;/li&gt;
&lt;li&gt;35% in the first 2 GB&lt;/li&gt;
&lt;li&gt;25% in the first 512 GB&lt;/li&gt;
&lt;/ul&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This is meant to verify that we can handle &lt;em&gt;any&lt;/em&gt; scenario, in practice, we can usually focus on the first 512 GB, which is far more common.&lt;/p&gt;
&lt;p&gt;With both data sets, using &lt;a href="https://ayende.com/blog/199329-B/fight-for-every-byte-it-takes-storing-raw-numbers?key=8cb96c2c2e9c49a58ae7a0fd14b35d6b"&gt;my previous approach&lt;/a&gt;, I can fit up to 511 entries per page. That makes sense, we are storing the data raw, so how can we do better? Most of the time, we don&amp;rsquo;t need anywhere near 8 bytes per value. For that reason, we have &lt;a href="https://en.wikipedia.org/wiki/Variable-length_quantity"&gt;variable length encoding&lt;/a&gt;, which goes by many names, such as variable-size int, 7-bit integers, etc. I adapted some methods from the .NET codebase to allow me to operate on Spans, like so:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/593bdcab072b3ddd94646e6c724870e5.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;Let&amp;rsquo;s check what sort of savings we can get using this approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Under 128 bytes &amp;ndash; 1 byte&lt;/li&gt;
&lt;li&gt;128 bytes .. 16 KB &amp;ndash; 2 bytes&lt;/li&gt;
&lt;li&gt;16 KB .. 2 MB &amp;ndash; 3 bytes&lt;/li&gt;
&lt;li&gt;2 MB .. 256 MB &amp;ndash; 4 bytes&lt;/li&gt;
&lt;li&gt;256 MB .. 32 GB &amp;ndash; 5 bytes&lt;/li&gt;
&lt;li&gt;32 GB .. 4 TB &amp;ndash; 6 bytes&lt;/li&gt;
&lt;li&gt;4 TB .. 512 TB &amp;ndash; 7 bytes&lt;/li&gt;
&lt;li&gt;512 TB .. 64 PB &amp;ndash; 8 bytes&lt;/li&gt;
&lt;li&gt;64 PB .. 8 EB &amp;ndash; 9 bytes&lt;/li&gt;
&lt;li&gt;Greater than 8 EB &amp;ndash; 10 bytes&lt;/li&gt;
&lt;/ul&gt;
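The brackets follow directly from storing 7 payload bits per byte plus a continuation bit. A minimal Python encoder in the same style as the .NET 7-bit-encoded-int helpers the post adapts (this sketch is mine, not the post's Span-based code) lets us check the byte counts:

```python
# Minimal 7-bits-per-byte varint encoder: the high bit of each byte is a
# continuation flag, the low 7 bits carry payload.

def encode_varint(value: int) -> bytes:
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # last byte: continuation bit clear
            return bytes(out)
```

A value's encoded length is simply the ceiling of its bit length divided by 7, which is where each ~128x bracket boundary comes from.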
&lt;p&gt;That is really cool, since for the realistic data set, we can pack a lot more data into the page.&lt;/p&gt;
&lt;p&gt;It comes with a serious issue, however. The data is no longer fixed size (well, that &lt;em&gt;is &lt;/em&gt;the point, no?). Why is that a problem? Because we want to be able to do a binary search on that, which means that we need to be able to access the data by index. As usual, the solution is to utilize indirection. We&amp;rsquo;ll dedicate the bottom of the page to an array of fixed-size int (16 bits &amp;ndash; sufficient to cover the 8KB range of the page) that will point to the actual location of the entry. Like before, we are going to reserve the first few bytes as a header, in this case we&amp;rsquo;ll use 4 bytes, divided into two shorts. Those will keep track of the writes to the bottom and the top of the page.&lt;/p&gt;
&lt;p&gt;At the bottom, we&amp;rsquo;ll have the actual offsets that point to the entries, and at the top, we write the actual entries. Here is what this looks like:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/fd693cbc0689ba30da6ea0035044e817.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;Let&amp;rsquo;s see how our reading from the page will look now. As you can see, it is very similar to what we had before, but instead of going directly to the key by its offset, we have to use the indirection:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/63a0469aed1da7724b6b58c06d57d286.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;The offsets array contains the location of the entry in the page, and that is laid out as the [varint-key][varint-val]. So we read (and discard) the key from the offset we found (we have to do that to discover its size) and then we read and return the actual value.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s look at how we implemented the actual binary search in the page:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/f5657de04f85b10f363aaff80096dd4c.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is a bog-standard binary search; the only interesting bit is that we go through the offsets array to find the actual location of the key, which we then read using variable-size decoding.&lt;/p&gt;
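The shape of that search can be sketched as follows. This is a hedged Python illustration of the indirection idea (the real code is the C# in the gist above): the offsets array is what stays sorted, and each probe follows an offset into the page and varint-decodes the key stored there.

```python
# Binary search over the offsets indirection layer: compare against the
# varint-decoded key found at each probed offset.

def decode_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one 7-bits-per-byte varint; return (value, position after it)."""
    value = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, pos
        shift += 7

def search(page: bytes, offsets: list[int], needle: int) -> tuple[int, bool]:
    """Return (index, found); index is the match or the insertion point."""
    lo, hi = 0, len(offsets)
    while lo < hi:
        mid = (lo + hi) // 2
        key, _ = decode_varint(page, offsets[mid])
        if key == needle:
            return mid, True
        if key < needle:
            lo = mid + 1
        else:
            hi = mid
    return lo, False
```

Note that the entries themselves can sit anywhere in the page; only the offsets need to be kept in key order.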
&lt;p&gt;The interesting part of this model happens when we need to &lt;em&gt;set&lt;/em&gt; a value. Here is what this looks like, with my notes following the code.&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/03d11c5a244bce32a52c8579c6233764.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is quite a lot, I&amp;rsquo;ll admit. Let&amp;rsquo;s break down what is going on here into individual pieces.&lt;/p&gt;
&lt;p&gt;First, we get the header values (bottom, top) and initialize them if empty (note that bottom is set to 4, &lt;em&gt;after &lt;/em&gt;the header, while top is set to the &lt;em&gt;end&lt;/em&gt; of the buffer). The idea is that the bottom grows up and the top grows down. This is called &lt;a href="https://siemens.blog/posts/database-page-layout/"&gt;Slotted Page design&lt;/a&gt; and it is a staple of database design.&lt;/p&gt;
&lt;p&gt;We then encode the key and the value into a temporary buffer. We need to do that so we&amp;rsquo;ll know what size the entry will take. Then we need to figure out if we are updating an existing record or creating a new one.&lt;/p&gt;
&lt;p&gt;Updating an existing record is complex. This is because the &lt;em&gt;size&lt;/em&gt; of the new record may be &lt;em&gt;greater&lt;/em&gt; than the size of the old one. So we can&amp;rsquo;t put it in the same location. I&amp;rsquo;m handling this by just allocating new space for this entry, ignoring the old space that was allocated to it.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I&amp;rsquo;m not handling any deletes / space reclamation on this series. That is a separate subject, not complex, but fairly tedious to do properly. So I&amp;rsquo;m going to focus solely on writes.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Updates to an existing entry that also change its size aren&amp;rsquo;t in my test dataset, so I&amp;rsquo;m not worried about it too much here. I mention this to point out that variable length records bring with them considerations that we wouldn&amp;rsquo;t have run into with the fixed-size model.&lt;/p&gt;
&lt;p&gt;And after all of this work? What are the results?&lt;/p&gt;
&lt;p&gt;With the fixed-size version, we could fit 511 entries into the page. With the variable size int, however, we can do better.&lt;/p&gt;
&lt;p&gt;For the realistic dataset, I can fit 712 entries for the page, and for the full dataset, 710 (there are very few very big elements even there, but we can see that it has an impact).&lt;/p&gt;
&lt;p&gt;511 vs. 712 may not sound like much, but that is a 40% increase in the number of entries that I can fit. To give some context, using 8KB pages, that is a difference of about 5 MB per million entries. That &lt;em&gt;adds&lt;/em&gt; up.&lt;/p&gt;
&lt;p&gt;The question is, can we do &lt;em&gt;better&lt;/em&gt;? More on that in my next post&amp;hellip;&lt;/p&gt;
</description><link>http://ayende.com/199361-b/fight-for-every-byte-it-takes-variable-size-data?Key=73f1175b-8518-4bd4-8657-233bf8dd7942</link><guid>http://ayende.com/199361-b/fight-for-every-byte-it-takes-variable-size-data?Key=73f1175b-8518-4bd4-8657-233bf8dd7942</guid><pubDate>Tue, 25 Apr 2023 12:00:00 GMT</pubDate></item><item><title>Fight for every byte it takes: Storing raw numbers</title><description>&lt;p&gt;I write databases for a living, which means that I&amp;rsquo;m thinking a &lt;em&gt;lot&lt;/em&gt; about persistence. Here is a fun challenge that we went through recently. We have the need to store a list of keys and values and then lookup a value by key. Pretty standard stuff. The keys and values are both 64 bits integers. In other words, what I would like to have is:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Dictionary&amp;lt;long,long&amp;gt; lookup;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That would be perfect, except that I have to persist the data, which means that I have to work with raw bytes. It&amp;rsquo;s easiest to think about it if we have some code in front of us. Here is the interface that I need to implement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/3db756efe28b2f9a06a59647af4d26de.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;As you can see, we have a byte buffer (8KB in size) and we want to add or lookup values from the buffer. All the data resides&amp;nbsp;&lt;em&gt;in&lt;/em&gt; the buffer, nothing is external. And we cannot unpack it in memory, because this is used for lookups, so this needs to be really fast.&lt;/p&gt;
&lt;p&gt;The keys we are storing are file offsets, so they correlate quite nicely to the overall size of the file. Meaning that you&amp;rsquo;ll have a lot of small values, but also large ones. Given a key, we need to be able to look up its value quickly, since we may run this lookup billions of times.&lt;/p&gt;
&lt;p&gt;Given that I have 8KB of data, I can do the following: just treat the buffer as a sorted array. That gives me a pretty easy way to search for a particular value and a simple way to actually store things.&lt;/p&gt;
&lt;p&gt;Theoretically, given an 8KB page, and 16 bytes per (key, value) entry, we can store up to 512 entries per page. But it turns out that this is just a theory. We also need to keep track of the &lt;em&gt;number&lt;/em&gt; of items that we have, and that takes some space. Just a couple of bytes, but it means that we don&amp;rsquo;t have those bytes available. A page can now contain up to 511 entries, and even at full capacity, we have 16 bytes left over (2 for the number of entries, and the remaining 14 unused).&lt;/p&gt;
&lt;p&gt;Here is what this looks like in code:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/46510091980b51cfb862456b8c7a9d8f.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;As you can see, we are creating two arrays: the keys are growing from the bottom of the page and the values are growing from the top. The idea is that I can utilize the BinarySearch() method to quickly find the index of a key (or where it &lt;em&gt;ought&lt;/em&gt; to go). From there, I can look at the corresponding values array to get the actual value. The fact that they are growing separately (and toward each other) means that I don&amp;rsquo;t need to move as much memory if I&amp;rsquo;m getting values out of order.&lt;/p&gt;
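The layout just described can be sketched in Python. This is an assumed illustration of the scheme, not the post's C# (the exact slot placement of the values array is my guess): a 2-byte count, keys growing up from the bottom, values growing down from the top, and a binary search over the keys.

```python
import struct
from bisect import bisect_left

# Fixed-size layout: [count:u16][keys: count x i64, from the bottom]...
# ...[values: count x i64, growing down from the top of the 8KB page].
PAGE_SIZE = 8192

def lookup(page: bytes, key: int):
    count = struct.unpack_from("<H", page, 0)[0]
    keys = list(struct.unpack_from(f"<{count}q", page, 2))
    idx = bisect_left(keys, key)          # binary search over the sorted keys
    if idx == count or keys[idx] != key:
        return None
    # value i sits in the i-th 8-byte slot counting down from the page end
    return struct.unpack_from("<q", page, PAGE_SIZE - 8 * (idx + 1))[0]
```

The search itself only ever touches the keys array, so the two regions can grow toward each other independently.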
&lt;p&gt;For now, I want to set up the playground in which we&amp;rsquo;ll operate. The &lt;em&gt;type&lt;/em&gt; of data that you write into such a system is important. I decided to use the following code to generate the test set:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/288e87d6f1209e33285043c9a072e6ed.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;The idea is that we&amp;rsquo;ll generate a random set of numbers, in the given distribution. Most of the values are in the range of 8MB to 512GB, representing a pretty good scenario overall, I think.&lt;/p&gt;
&lt;p&gt;And with that, we just need to figure out what metrics we want to use for this purpose. My goal is to push as many values as I can into the buffer, while maintaining the ability to get a value by its key as fast as possible.&lt;/p&gt;
&lt;p&gt;The current approach, for example, does a binary search on a sorted array plus an extra lookup into the companion values array. You really can&amp;rsquo;t beat this if you allow storing arbitrary keys. Here is my test bench:&lt;/p&gt;
&lt;blockquote&gt;
&lt;script src="https://gist.github.com/ayende/6e53c1d996068c9990878006c4ae76aa.js"&gt;&lt;/script&gt;
&lt;/blockquote&gt;
&lt;p&gt;This will insert key/value pairs into the page until it is full. Note that we allow duplicates (we&amp;rsquo;ll just update the value), so we need to keep track of the number of &lt;em&gt;entries&lt;/em&gt; inserted, not just the number of insertions. We also validate the structure at every step of the way, to ensure that we always get the right behavior.&lt;/p&gt;
&lt;p&gt;This code runs as expected and we can put 511 values into the page before it gives up. This approach works, it is simple to reason about, and it has very few flaws. It is also quite wasteful in terms of information density. I would like to do better than 511 entries per page. &lt;em&gt;Is&lt;/em&gt; it possible to drop below 16 bytes per entry?&lt;/p&gt;
&lt;p&gt;Give it some thought, I&amp;rsquo;m going to present several ways of doing just that in my next post&amp;hellip;&lt;/p&gt;
</description><link>http://ayende.com/199329-b/fight-for-every-byte-it-takes-storing-raw-numbers?Key=8cb96c2c-2e9c-49a5-8ae7-a0fd14b35d6b</link><guid>http://ayende.com/199329-b/fight-for-every-byte-it-takes-storing-raw-numbers?Key=8cb96c2c-2e9c-49a5-8ae7-a0fd14b35d6b</guid><pubDate>Mon, 24 Apr 2023 12:00:00 GMT</pubDate></item><item><title>Looking into Corax’s posting lists: Part III</title><description>&lt;p&gt;We looked into the internal of Corax&amp;rsquo;s posting list and in &lt;a href="https://ayende.com/blog/199267-A/looking-into-coraxs-posting-lists-part-ii?key=dc8bc34f37ca47beb2058b8c0de223e8"&gt;the last post&lt;/a&gt; I mentioned that we have a problem with the Baseline of the page.&lt;/p&gt;
&lt;p&gt;We are by no means the first people to work with posting lists, and there is a &lt;em&gt;wide&lt;/em&gt; body of knowledge on the topic. When it came time to figure out the compression format for our posting list, we used the PFOR format (Patched Frame of Reference). Like pretty much every other integer compression method, it uses 32-bit integers. Corax uses 64-bit integers as document ids, so that was a problem. We solved it by using a Baseline for each page. In other words, each page would only be able to contain values within a range of 2.1 billion of one another. That is a very reasonable range, given that a page is 8KB in size.&lt;/p&gt;
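&lt;p&gt;Conceptually, the Baseline scheme looks like this (a Python sketch, not the actual PFOR implementation): the page records one full 64-bit value, and every entry becomes a 32-bit delta from it; a value outside the window forces a page split:&lt;/p&gt;

```python
RANGE = 2 ** 32   # deltas must fit in 32 bits

def encode_page(values):
    """Store the first (sorted) value as a 64-bit baseline and every
    entry as a delta from it; values outside the 32-bit window would
    force a page split in the real structure."""
    baseline = values[0]
    deltas = []
    for v in values:
        d = v - baseline
        if d >= RANGE:
            raise ValueError("value out of page range: split the page")
        deltas.append(d)
    return baseline, deltas

def decode_page(baseline, deltas):
    """Reverse the encoding: add the baseline back to each delta."""
    return [baseline + d for d in deltas]
```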
&lt;p&gt;There was a problem as we built more features into Corax. The posting list needed to store not just the document id, but also the frequency of the term in the source document. It turns out that we need 8 bits to do so, and we already had a 64-bit range, so&amp;hellip; Instead of creating another location to store the frequencies, we put them directly inside the posting list. But that meant reserving a range of bits. We have 64 bits overall, so not a big problem, right? Except that on a page basis, we have a lot less. Before, a page could contain a range of 2.1 billion, but we reserved 10 bits (frequency and tag, the details of which are not important to our story) and we ended up with a range of 4 &lt;em&gt;million&lt;/em&gt; per page. That little tidbit meant that a page could only hold items that were within 4MB of one another. And &lt;em&gt;that&lt;/em&gt; was a problem. Whenever a posting list had two values more than 4MB apart, we would need to split the page. And since the posting lists and the entries live in the same file, more page splits mean that entries end up further apart.&lt;/p&gt;
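&lt;p&gt;To make the cost concrete, here is an illustrative sketch of the packed entry layout, assuming 8 bits of frequency and 2 bits of tag in the low bits (the actual bit layout in Corax may differ):&lt;/p&gt;

```python
FREQ_BITS = 8
TAG_BITS = 2
RESERVED = FREQ_BITS + TAG_BITS   # 10 reserved bits per entry

def pack(doc_id, freq, tag):
    """Pack frequency and tag into the low bits of the entry."""
    assert freq < (1 << FREQ_BITS) and tag < (1 << TAG_BITS)
    return (doc_id << RESERVED) | (freq << TAG_BITS) | tag

def unpack(entry):
    """Recover the document id, frequency and tag."""
    return (entry >> RESERVED,
            (entry >> TAG_BITS) & ((1 << FREQ_BITS) - 1),
            entry & ((1 << TAG_BITS) - 1))

# The price: the 32-bit per-page window now covers only
# 2**32 >> RESERVED = 4,194,304 distinct ids (~4 million).
```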
&lt;p&gt;Here is an example of what this looks like:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://ayende.com/blog/Images/Open-Live-Writer/Looking-into-Coraxs-posting-lists-Part-I_E657/image_2.png"&gt;&lt;img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Looking-into-Coraxs-posting-lists-Part-I_E657/image_thumb.png" alt="image" width="988" height="101" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The index is taking more space than the source data, and most of that is used to store&amp;hellip; nothing, since we ended up with very wide posting lists containing very few entries. This is one of those cases where two independent issues compound each other &lt;em&gt;very&lt;/em&gt; quickly.&lt;/p&gt;
&lt;p&gt;So we changed things again: instead of limiting ourselves to a 32-bit range per page, we changed the PFOR format to allow for 64-bit integers directly. Once again, that led to a simplification in the codebase and greatly reduced the amount of disk space that we are actually using.&lt;/p&gt;
&lt;p&gt;To give some context, here is the current amount of disk space taken by the same entity that previously took 800+GB:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://ayende.com/blog/Images/Open-Live-Writer/Looking-into-Coraxs-posting-lists-Part-I_E657/image_4.png"&gt;&lt;img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Looking-into-Coraxs-posting-lists-Part-I_E657/image_thumb_1.png" alt="image" width="1076" height="41" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The problem wasn&amp;rsquo;t the number of entries, but that each entry would consume 8KB of disk space on its own. In the image, you are seeing the number of &lt;em&gt;posting lists&lt;/em&gt;, not the number of posting list entries.&lt;/p&gt;
</description><link>http://ayende.com/199268-a/looking-into-corax-s-posting-lists-part-iii?Key=632f3344-32af-489b-b7aa-c3c3f707042c</link><guid>http://ayende.com/199268-a/looking-into-corax-s-posting-lists-part-iii?Key=632f3344-32af-489b-b7aa-c3c3f707042c</guid><pubDate>Mon, 17 Apr 2023 12:00:00 GMT</pubDate></item><item><title>Looking into Corax’s posting lists: Part II</title><description>&lt;p&gt;In a &lt;a href="https://ayende.com/blog/198594-A/looking-into-coraxs-posting-lists-part-i?key=5172525066dd45348327285e42ef0483"&gt;previous post&lt;/a&gt; (which went out a long time ago) I explained that we have the notion of a set of uint64 values that are used for document IDs. We build a B+Tree with different behaviors for branch pages and leaf pages, allowing us to pack a lot of document IDs (thousands or more) per page.&lt;/p&gt;
&lt;p&gt;The problem is that this structure holds the data compressed, so when we add or remove a value, we don&amp;rsquo;t &lt;em&gt;know&lt;/em&gt; whether it already exists. That is a problem, because while we are able to do the relevant fixups to skip duplicates and erase removed values, we end up in a position where the number of entries in the set is not accurate. That is a Problem, with a capital P, since we use that number for query optimizations.&lt;/p&gt;
&lt;p&gt;The solution is to move to a different manner of storing the data in the leaf page. Instead of a model where we add the data directly to the page and compress when the raw values section overflows, we&amp;rsquo;ll use the following format:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://ayende.com/blog/Images/Open-Live-Writer/Looking-into-Coraxs-posting-lists-Part-I_129A4/image_2.png"&gt;&lt;img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Looking-into-Coraxs-posting-lists-Part-I_129A4/image_thumb.png" alt="image" width="504" height="334" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Basically, I removed the raw values section from the design entirely. That means that whenever we want to add a new value, we need to find the relevant compressed segment inside the page and add to it (potentially creating a page split, etc).&lt;/p&gt;
&lt;p&gt;Obviously, that is &lt;em&gt;not&lt;/em&gt; going to perform well for writes, since on each addition we&amp;rsquo;ll need to decompress the segment, add to it, and then compress it again.&lt;/p&gt;
&lt;p&gt;The idea here is that we don&amp;rsquo;t &lt;em&gt;need&lt;/em&gt; to do that. Instead of trying to store the entries in the set immediately, we keep them in memory for the duration of the transaction. Just before we commit the transaction, we have two lists of document IDs to go through: one of added documents and one of removed documents. We can then sort those ids, walk over the lists, find the relevant page for each section, and merge it with the compressed values.&lt;/p&gt;
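&lt;p&gt;The commit-time merge can be sketched like so (Python pseudocode over plain sorted lists; the real code works against compressed segments and pages):&lt;/p&gt;

```python
def commit(segments, added, removed):
    """Apply a transaction's buffered changes in one pass.
    'segments' models the compressed segments as non-empty sorted
    lists of ids; each segment is "decompressed" (touched) at most
    once, no matter how many changes fall into its range."""
    added = sorted(set(added))
    removed = set(removed)
    out, i = [], 0
    for n, segment in enumerate(segments):
        last = n == len(segments) - 1
        hi = segment[-1]
        batch = []
        # Take every added id that belongs to this segment's range
        # (the last segment absorbs everything that remains).
        while i < len(added) and (last or added[i] <= hi):
            batch.append(added[i])
            i += 1
        # Decompress once, merge adds, drop removals, recompress once.
        out.append(sorted((set(segment) | set(batch)) - removed))
    return out
```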
&lt;p&gt;By moving the buffering stage from the per-page model to the per-transaction model, we actually gain quite a lot of performance, since if we have a lot of changes to a single page, we can handle compression of the data only once. It is a very strange manner of working, to be honest, because I&amp;rsquo;m used to doing the operation immediately. By delaying the cost to the end of the transaction, we are able to gain two major benefits. First, we have a big opportunity for batching and optimizing work on large datasets. Second, we have a single code path for this operation. It&amp;rsquo;s always: &amp;ldquo;Get a batch of changes and apply them as a unit&amp;rdquo;. It turns out that this is far easier to understand and work with. And that is for the writes portion of Corax.&lt;/p&gt;
&lt;p&gt;Remember, however, that Corax is a search engine, so we expect a &lt;em&gt;lot&lt;/em&gt; of reads. For reads, we can now stream the results directly from the compressed segments. Given that we can usually pack a &lt;em&gt;lot&lt;/em&gt; of numbers into a segment, and that we don&amp;rsquo;t need to compare to the uncompressed portion, that ended up benefiting us significantly on the read side as well, surprisingly.&lt;/p&gt;
&lt;p&gt;Of course, there is also another issue: notice the Baseline in the Page Header? We&amp;rsquo;ll discuss that in the next post; it turned out that it wasn&amp;rsquo;t such a good idea.&lt;/p&gt;
</description><link>http://ayende.com/199267-a/looking-into-corax-s-posting-lists-part-ii?Key=dc8bc34f-37ca-47be-b205-8b8c0de223e8</link><guid>http://ayende.com/199267-a/looking-into-corax-s-posting-lists-part-ii?Key=dc8bc34f-37ca-47be-b205-8b8c0de223e8</guid><pubDate>Fri, 14 Apr 2023 12:00:00 GMT</pubDate></item></channel></rss>