The following bug cost me a bunch of time, can you see what I’m doing wrong?
For fun, it’s so nasty because usually, it will accidentally work.
In the previous post, I was able to utilize AVX to get some nice speedups. In general, I was able to save up to 57%(!) of the runtime in processing arrays of 1M items. That is really amazing, if you think about it. But my best effort only gave me a 4% improvement when using 32M items.
I decided to investigate what is going on in more depth, and I came up with the following benchmark. Given that I want to filter negative numbers, what would happen if the only negative number in the array was the first one?
In other words, let’s see what happens when we could write this algorithm as the following line of code:
array[1..].CopyTo(array);
The idea here is that we should measure the speed of raw memory copy and see how that compares to our code.
Before we dive into the results, I want to make a few things explicit. We are dealing here with arrays of long, so when I’m talking about an array with 1M items, I’m actually talking about an 8MB buffer, and for the 32M items, we are talking about 256MB of memory.
I’m running these benchmarks on the following machine:
AMD Ryzen 9 5950X 16-Core Processor
Base speed: 3.40 GHz
L1 cache: 1.0 MB
L2 cache: 8.0 MB
L3 cache: 64.0 MB
Speed: 4.59 GHz
In other words, when we look at this, the 1M items (8MB) can fit into the L2 cache (barely), and it is certainly backed by the L3 cache. For the 32M items (256MB), we are far beyond what can fit in the cache, so we are probably dealing with memory bandwidth issues.
I wrote the following functions to test it out:
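The benchmark code itself isn’t reproduced in this excerpt, so here is a minimal sketch of what those functions might look like, assuming the natural reading of the benchmark names (FilterCmp is the scalar filter, CopyTo uses Span.CopyTo, MemoryCopy uses Buffer.MemoryCopy, and MoveMemory P/Invokes RtlMoveMemory from kernel32). The names and class are illustrative, and FilterCmp_Avx (the vectorized version from the previous post) is omitted:

```csharp
using System;
using System.Runtime.InteropServices;

public static unsafe class FilterBenchmarks
{
    // The scalar baseline: compact all non-negative values to the front, return the new length.
    public static int FilterCmp(long[] items)
    {
        int output = 0;
        for (int i = 0; i < items.Length; i++)
        {
            if (items[i] >= 0)
                items[output++] = items[i];
        }
        return output;
    }

    // Raw copy baselines: shift everything one item to the left.
    public static void CopyTo(long[] items)
    {
        items.AsSpan(1).CopyTo(items);
    }

    public static void MemoryCopy(long[] items)
    {
        fixed (long* p = items)
        {
            long bytes = (items.Length - 1) * sizeof(long);
            Buffer.MemoryCopy(p + 1, p, bytes, bytes);
        }
    }

    [DllImport("kernel32.dll", EntryPoint = "RtlMoveMemory")]
    private static extern void MoveMemoryNative(void* dest, void* src, nuint length);

    public static void MoveMemory(long[] items)
    {
        fixed (long* p = items)
        {
            MoveMemoryNative(p, p + 1, (nuint)((items.Length - 1) * sizeof(long)));
        }
    }
}
```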
Let’s look at what I’m actually testing here.
Here are the results for the 1M case (8MB):
Method | N | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|
FilterCmp | 1048599 | 441.4 us | 1.78 us | 1.58 us | 1.00 |
FilterCmp_Avx | 1048599 | 141.1 us | 2.70 us | 2.65 us | 0.32 |
CopyTo | 1048599 | 872.8 us | 11.27 us | 10.54 us | 1.98 |
MemoryCopy | 1048599 | 869.7 us | 7.29 us | 6.46 us | 1.97 |
MoveMemory | 1048599 | 126.9 us | 0.28 us | 0.25 us | 0.29 |
We can see some real surprises here. The baseline is FilterCmp, the very basic scalar implementation that I wrote.
I cannot explain why CopyTo() and MemoryCopy() are so slow.
What is really impressive is that FilterCmp_Avx() and MoveMemory() are so close in performance, and so much faster than everything else. To put it another way, we are already within shouting distance of the MoveMemory() performance. That is… really impressive.
That said, what happens with 32M (256MB) ?
Method | N | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|
FilterCmp | 33554455 | 22,763.6 us | 157.23 us | 147.07 us | 1.00 |
FilterCmp_Avx | 33554455 | 20,122.3 us | 214.10 us | 200.27 us | 0.88 |
CopyTo | 33554455 | 27,660.1 us | 91.41 us | 76.33 us | 1.22 |
MemoryCopy | 33554455 | 27,618.4 us | 136.16 us | 127.36 us | 1.21 |
MoveMemory | 33554455 | 20,152.0 us | 166.66 us | 155.89 us | 0.89 |
Now FilterCmp_Avx is faster than MoveMemory. That is… a pretty big wow, and a really nice close for this blog post series, right? Except that we won’t be stopping here.
The way the task I set out works, we are actually filtering just the first item out, and then we are basically copying the memory. Let’s do some math: 256MB in 20.1ms means 12.4 GB/sec!
On this system, I have the following memory setup:
64.0 GB
Speed: 2133 MHz
Slots used: 4 of 4
Form factor: DIMM
Hardware reserved: 55.2 MB
I’m using DDR4 memory, so I can expect a maximum speed of 17GB/sec. In theory, I might be able to get more if the memory is located on different DIMMs, but for the size in question, that is not likely.
I’m going to skip the training montage of VTune, understanding memory architecture and figuring out what is actually going on.
Let’s drop everything and look at what we have with just AVX vs. MoveMemory:
Method | N | Mean | Error | StdDev | Median | Ratio |
---|---|---|---|---|---|---|
FilterCmp_Avx | 1048599 | 141.6 us | 2.28 us | 2.02 us | 141.8 us | 1.12 |
MoveMemory | 1048599 | 126.8 us | 0.25 us | 0.19 us | 126.8 us | 1.00 |
FilterCmp_Avx | 33554455 | 21,105.5 us | 408.65 us | 963.25 us | 20,770.4 us | 1.08 |
MoveMemory | 33554455 | 20,142.5 us | 245.08 us | 204.66 us | 20,108.2 us | 1.00 |
The new baseline is MoveMemory, and in this run, we can see that the AVX code is slightly slower.
It’s sadly not uncommon to see numbers shift by those ranges when we are testing such micro-optimizations, mostly because we are subject to so many variables that can affect performance. In this case, I dropped all the other benchmarks, which may have changed things.
At any rate, using those numbers, we have 12.4GB/sec for MoveMemory() and 11.8GB/sec for the AVX version. The hardware maximum speed is 17GB/sec. So we are quite close to what can be done.
For that matter, I would like to point out that the trivial code completed the task at 11GB/sec, which is quite respectable and shows that the issue here is literally getting the memory to the CPU fast enough.
Can we do something about that? I made a pretty small change to the AVX version, like so:
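The actual diff isn’t shown in this excerpt, so here is a rough sketch of the shape of that change. The names are illustrative, and the Process step is a scalar stand-in for the AVX compare-and-compress code from the previous post; the point is the load/compute overlap, not the filter itself:

```csharp
using System.Runtime.Intrinsics;

public static unsafe class FilterAvxNext
{
    public static int FilterCmp_Avx_Next(long[] items)
    {
        int output = 0;
        fixed (long* buffer = items)
        {
            int aligned = items.Length - items.Length % Vector256<long>.Count;
            long* ptr = buffer;
            long* end = buffer + aligned;
            if (ptr < end)
            {
                Vector256<long> cur = Vector256.Load(ptr);
                for (ptr += Vector256<long>.Count; ptr < end; ptr += Vector256<long>.Count)
                {
                    // start the next load before touching the current block,
                    // so the load and the compute can overlap
                    Vector256<long> next = Vector256.Load(ptr);
                    output = Process(cur, buffer, output);
                    cur = next;
                }
                output = Process(cur, buffer, output);
            }
            for (int i = aligned; i < items.Length; i++) // scalar tail
            {
                if (buffer[i] >= 0)
                    buffer[output++] = buffer[i];
            }
        }
        return output;
    }

    // Stand-in for the AVX compare + compress step from the previous post.
    private static int Process(Vector256<long> block, long* buffer, int output)
    {
        for (int i = 0; i < Vector256<long>.Count; i++)
        {
            long value = block.GetElement(i);
            if (value >= 0)
                buffer[output++] = value;
        }
        return output;
    }
}
```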
What are we actually doing here? Instead of loading the value and immediately using it, we are loading the next value, then we are executing the loop and when we iterate again, we will start loading the next value and process the current one. The idea is to parallelize load and compute at the instruction level.
Sadly, that didn’t seem to do the trick. I saw a 19% additional cost for that version compared to the vanilla AVX one on the 8MB run and a 2% additional cost on the 256MB run.
I then realized that there was one really important test that I had to also make, and wrote the following:
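The benchmark itself isn’t shown here; a minimal sketch of what it might look like, assuming MoveMemory and FillMemory are thin P/Invoke wrappers around kernel32’s RtlMoveMemory and RtlFillMemory:

```csharp
using System.Runtime.InteropServices;

public static unsafe class MemoryBandwidth
{
    [DllImport("kernel32.dll", EntryPoint = "RtlMoveMemory")]
    private static extern void MoveMemoryNative(void* dest, void* src, nuint length);

    [DllImport("kernel32.dll", EntryPoint = "RtlFillMemory")]
    private static extern void FillMemoryNative(void* dest, nuint length, byte fill);

    // Read + write: shift the whole buffer one item to the left.
    public static void MoveMemory(long[] items)
    {
        fixed (long* p = items)
            MoveMemoryNative(p, p + 1, (nuint)((items.Length - 1) * sizeof(long)));
    }

    // Write only: fill the whole buffer with a constant byte.
    public static void FillMemory(long[] items)
    {
        fixed (long* p = items)
            FillMemoryNative(p, (nuint)(items.Length * sizeof(long)), 0);
    }
}
```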
In other words, let's test the speed of moving memory and filling memory as fast as we possibly can. Here are the results:
Method | N | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
---|---|---|---|---|---|---|---|
MoveMemory | 1048599 | 126.8 us | 0.36 us | 0.33 us | 0.25 | 0.00 | 270 B |
FillMemory | 1048599 | 513.5 us | 10.05 us | 10.32 us | 1.00 | 0.00 | 351 B |
MoveMemory | 33554455 | 20,022.5 us | 395.35 us | 500.00 us | 1.26 | 0.02 | 270 B |
FillMemory | 33554455 | 15,822.4 us | 19.85 us | 17.60 us | 1.00 | 0.00 | 351 B |
This is really interesting, for a small buffer (8MB), MoveMemory is somehow faster. I don’t have a way to explain that, but it has been a pretty consistent result in my tests.
For the large buffer (256MB), we are seeing results that make more sense to me.
In other words, for MoveMemory, we are both reading and writing, so we are paying for memory bandwidth in both directions. For filling the memory, we are only writing, so we can get better performance (no need for reads).
We are talking about hitting the real physical limits of what the hardware can do. There are all sorts of tricks that one can pull, but when we are this close to the limit, they are almost always context-sensitive and dependent on many factors.
To conclude, here are my final results:
Method | N | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
---|---|---|---|---|---|---|---|
FilterCmp_Avx | 1048599 | 307.9 us | 6.15 us | 12.84 us | 0.99 | 0.05 | 270 B |
FilterCmp_Avx_Next | 1048599 | 308.4 us | 6.07 us | 9.26 us | 0.99 | 0.03 | 270 B |
CopyTo | 1048599 | 1,043.7 us | 15.96 us | 14.93 us | 3.37 | 0.11 | 452 B |
ArrayCopy | 1048599 | 1,046.7 us | 15.92 us | 14.89 us | 3.38 | 0.14 | 266 B |
UnsafeCopy | 1048599 | 309.5 us | 6.15 us | 8.83 us | 1.00 | 0.04 | 133 B |
MoveMemory | 1048599 | 310.8 us | 6.17 us | 9.43 us | 1.00 | 0.00 | 270 B |
FilterCmp_Avx | 33554455 | 24,013.1 us | 451.09 us | 443.03 us | 0.98 | 0.02 | 270 B |
FilterCmp_Avx_Next | 33554455 | 24,437.8 us | 179.88 us | 168.26 us | 1.00 | 0.01 | 270 B |
CopyTo | 33554455 | 32,931.6 us | 416.57 us | 389.66 us | 1.35 | 0.02 | 452 B |
ArrayCopy | 33554455 | 32,538.0 us | 463.00 us | 433.09 us | 1.33 | 0.02 | 266 B |
UnsafeCopy | 33554455 | 24,386.9 us | 209.98 us | 196.42 us | 1.00 | 0.01 | 133 B |
MoveMemory | 33554455 | 24,427.8 us | 293.75 us | 274.78 us | 1.00 | 0.00 | 270 B |
As you can see, the AVX version alone is comparable to, or even (slightly) beating, the MoveMemory function.
I tried things like prefetching memory (the next item, the next cache line, and even the next page), using non-temporal loads and stores, and many other tricks, but this is a pretty tough challenge.
What is really interesting is that the original, very simple and obvious implementation clocked in at 11 GB/sec. After pulling out pretty much all the stops and tricks, I was able to hit 12.5 GB/sec.
I don’t think anyone can read, update, or understand the resulting code without going through deep meditation, so the fact that the simple version gets this close is not a bad result at all. And along the way, I learned quite a bit about how the lowest levels of the machine architecture work.
I have been doing Open Source work for just under twenty years at this point. I have been paying my mortgage from Open Source software for about 15. I’m stating that to explain that I have spent quite a lot of time struggling with the inherent tension between having an Open Source project and getting paid.
I wrote about it a few times in the past. It is not a trivial problem, and the core of the issue is not something that you can easily solve with technical means. I ran into this fascinating thread on Twitter over the weekend:
you just described licensing. As you missed 1 important aspect: if an org isn't obligated to pay, they won't. So you need a form of making them pay by giving them a token they paid which then makes them able to use your software. Any other form will fail.
— Frans Bouma (@FransBouma) August 11, 2023
And another part of that is here:
The thing is, businesses spend significant amounts of money on software licenses, whether on-prem or as-a-service.
— Udi Dahan (@UdiDahan) August 11, 2023
They understand and accept this, as do their shareholders and investors. It is a cost of doing business.
Donations? Not so much.
I’m quoting the most relevant pieces, but the idea is pretty simple.
Donations don’t work, period. They don’t work not because companies are evil or developers don’t want to pay for Open Source. They don’t work because it takes a huge amount of effort to actually get paid.
If you are an independent developer, your purchasing process goes something like this:
Did you note step 2? The part about needing to pay?
If you don’t have that step, what will happen? Same scenario, an independent developer:
That is in the best-case scenario where the thought of donating actually crossed your mind. In most likelihood, the process is more:
Now, what happens if you are not an independent developer? Let’s say that you are a contract worker for a company. You need to talk to your contact person, they will need to get purchasing approval. Depending on the amount, that may require escalating upward a few levels, etc.
Let’s say that the amount is under $100, so basically within the budgetary discretion of the first manager you run into. They would still need to know what they are paying for and what they are getting out of it (they need to justify it). If this is a donation, welcome to the beauty of tax codes in multiple jurisdictions and what counts as such. If this is not a donation, what do they get? That means that you now have to set up a meeting, potentially multiple ones, present your case, open a new supplier at the company, etc.
The cost of all of those is high, both in time and money. Or… you can just nuget add-package and move on.
In the case of RavenDB, it is Open Source software (with a license to match; the code is freely available), but we treat it as a commercial project for all intents and purposes. If you want to install RavenDB, you’ll get a popup saying you need a license, directing you to a page where you can see how much we would like to get and what you get in return, etc. That means that from a commercial perspective, we are on familiar ground for companies. They are used to paying for software, and there isn’t an option to just move on to the next task.
There is another really important consideration here. In the ideal Open Source donation model, money just shows up in your account. In the commercial world, there is a huge amount of work required to get things done. That is when you have a model where “the software does not work without a purchase”. To give some context, at Microsoft roughly 22% of the workforce is in Sales & Marketing, and the company spent around $21.8 billion on Sales & Marketing in 2022. That is literally billions being spent to make sales.
If you want to make money, you are going to invest in sales, sales strategy, etc. I’m ignoring marketing here because if you are expected to make money from Open Source, you likely already have a project well-known enough to at least get started.
That means that you need to figure out what you are charging for, how you get customers, etc. In the case of RavenDB, we use the per-core model, which is a good indication of how much use the user is getting from RavenDB. LLBLGen Pro, on the other hand, charges per seat. Particular’s NServiceBus uses a per endpoint / number of messages a day model.
There is no one model that fits all. And you need to be able to tailor your pricing model to how your users think about your software.
So pricing strategy, creating a proper incentive to purchase (hard limit, usually) and some sales organization to actually drive all of that are absolutely required.
Notice what is missing here? GitHub. It simply has no role at all up to this point. So why the title of this post?
There is one really big problem with getting paid that GitHub can solve for Open Source (and in general, I guess).
The whole process of actually getting paid is absolutely atrocious. In the best case, you need to create a supplier at the customer, fill out various forms (no, we don’t use child labor or slaves, indeed), and figure out all sorts of weird rules (the German tax authority requires special dispensation, and let’s not talk about getting paid from India, etc.). Welcome to Anti-Money Laundering rules and GDPR compliance with Know Your Customer and SOC 2 regulations. The last sentence is basically nonsense words, but I understand that if you chant it long enough, you get money in the end.
What GitHub can do is be a payment pipe. Since presumably your organization is already set up with them, you can get them to do the invoicing, collect the payment, etc. And in the end, you get the money.
That sounds exactly like GitHub Sponsorships, right? Except that in this case, this is not a donation. This is a flat-out simple transaction, with GitHub as the medium. The idea is that you have a limit on your usage, which you enforce, and GitHub is how you get paid. The ability to do it in this fashion may make things easier, but I would assume that there are about three books’ worth of regulations and EULAs to go through to make it actually successful.
Yet, as far as I’m concerned, that is really the only important role that we have for GitHub here.
That is not a small thing, mind. But it isn’t a magic bullet.
In my previous post I discussed how we could store the exact same information in several ways, leading to space savings of 66%! That leads to interesting questions with regard to actually making use of this technique in the real world.
The reason I posted about this topic is that we just gained a very significant reduction in memory (and we obviously care about reducing resource usage). The question is whether this is something that you want to do in general.
Let’s look at that in detail. For this technique to be useful, you should be using structs in the first place. That is… not quite true, actually. Let’s take a look at the following declarations:
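The declarations aren’t reproduced in this excerpt; the shape of the example is something like this (field names and types are illustrative, the point is a mix of differently sized fields that the runtime may want to reorder):

```csharp
public class PersonClass
{
    public byte Kind;
    public long Id;
    public int Age;
    public byte Flags;
}

public struct PersonStruct
{
    public byte Kind;
    public long Id;
    public int Age;
    public byte Flags;
}
```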
We define the same shape twice. Once as a class and once as a structure. How does this look in memory?
Here you can find some really interesting differences. The struct is smaller than the class, but the amount of wasted space is much higher in the struct. What is the reason for that?
The class needs to carry 16 bytes of metadata. That is the object header and the pointer to the method table. You can read more about the topic here. So the memory overhead for a class is 16 bytes at a minimum. But look at the rest of it.
You can see that the layout in memory of the fields is different in the class versus the structure. C# is free to re-order the fields to reduce the padding and get better memory utilization for classes, but I would need [StructLayout(LayoutKind.Auto)] to do the same for structures.
The difference between the two options can be quite high, as you can imagine. Note that automatically laying out the fields in this manner means that you’re effectively declaring that the memory layout is an implementation detail. This means that you cannot persist it, send it to native code, etc. Basically, the internal layout may change at any time. Classes in C# are obviously not meant for you to poke into their internals, and LayoutKind.Auto comes with an explicit warning about its behavior.
Interestingly enough, [StructLayout] also works on classes; you can use it to force LayoutKind.Sequential on a class. That is by design, because you may need to pass a part of your class to unmanaged code, so you have the ability to control the memory layout explicitly. (Did I mention that I love C#?)
Going back to the original question, why would you want to go into this trouble? As we just saw, if you are using classes (which you are likely to default to), you already benefit from the automatic layout of fields in memory. If you are using structs, you can enable LayoutKind.Auto to get the same behavior.
This technique is for the 1% of the cases where that is not sufficient, when you can see that your memory usage is high and you can benefit greatly from manually doing something about it.
That leads to the follow-up question, if we go about implementing this, what is the overhead over time? If I want to add a new field to an optimized struct, I need to be able to understand how it is laid out in memory, etc.
Like any optimization, you need to maintain that. Here is a recent example from RavenDB.
In this case, we used to have an optimization that had a meaningful impact. The .NET code changed, and the optimization now no longer makes sense, so we reverted that to get even better perf.
At those levels, you don’t get to rest on your laurels. You have to keep checking your assumptions.
If you got to the point where you are manually optimizing memory layouts for better performance, there are two options:
You can make that easier by adding tests to verify those assumptions, for example, verifying that the amount of padding in the structs matches expectations. A simple test that verifies the size of a struct means that any change to it is explicit: you’ll need to modify the test as well, and presumably that is easier to catch / review / figure out than just adding a field and not noticing the impact.
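For example, a test along these lines makes the size an explicit, reviewable assertion (a sketch using xUnit and Unsafe.SizeOf; the PackedExample struct stands in for whatever struct you hand-optimized):

```csharp
using System.Runtime.CompilerServices;
using Xunit;

public struct PackedExample // stand-in for the hand-optimized struct
{
    public long Id;
    public int Quantity;
    public short ShelfLifeHours;
    public byte Flags;
    public byte Nullability;
}

public class MemoryLayoutTests
{
    [Fact]
    public void Packed_struct_layout_did_not_drift()
    {
        // If someone adds a field and the size changes, this test fails and the
        // layout change becomes explicit and reviewable.
        Assert.Equal(16, Unsafe.SizeOf<PackedExample>());
    }
}
```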
In short, this isn’t a generally applicable pattern. This is a technique that is meant to be applied in case of true need, where you’ll happily accept the additional maintenance overhead for better performance and reduced resource usage.
Consider a warehouse that needs to keep track of items. For the purpose of discussion, we have quite a few fields that we need to keep track of. Here is what this looks like in code:
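The original struct isn’t reproduced here; a hypothetical version built from the fields discussed below gives the flavor (field names such as ItemId or ProductionDate are assumptions, and the real struct has more fields, which is how it reaches the 144 bytes quoted later):

```csharp
using System;

public struct WarehouseItem
{
    public long ItemId;                // assumption - some identifier
    public DateTime? ProductionDate;
    public DateTime? ExpirationDate;   // not everything expires
    public TimeSpan? ShelfLife;
    public float? AlcoholContent;      // only relevant for some items, hence nullable
    public int? RgbColor;
    public bool Fragile;
    public bool IsHazarous;
}
```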
And the actual Warehouse class looks like this:
The idea is that this is simply a wrapper around the list of items. We use a struct to make sure that we have good locality, etc.
The question is, what is the cost of this? Let’s say that we have a million items in the warehouse. That would be over 137MB of memory. In fact, a single struct instance is going to consume a total of 144 bytes.
That is… a big struct, I have to admit. Using ObjectLayoutInspector I was able to get the details on what exactly is going on:
Type layout for 'WarehouseItem'
Size: 144 bytes. Paddings: 62 bytes (%43 of empty space)
As you can see, there is a huge amount of wasted space here, most of it because of nullability. Each nullable field injects an additional byte, and the resulting padding and alignment issues really explode the size of the struct.
Here is an alternative layout, which conveys the same information much more compactly. The idea is that instead of having a full byte for each nullable field (with the impact on padding, etc.), we’ll have a single bitmap for all nullable fields. Here is what this looks like:
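Again, a hypothetical sketch of the idea rather than the original code (the flag names and bit assignments are mine): drop the Nullable&lt;T&gt; wrappers, store the raw values, and track which optional fields are actually set in a single bitmap field:

```csharp
using System;

public struct WarehouseItem
{
    private byte _nullability; // one bit per optional field

    public long ItemId;
    public DateTime ProductionDate;
    public DateTime ExpirationDate;
    public TimeSpan ShelfLife;
    public float AlcoholContent;
    public int RgbColor;
    public bool Fragile;
    public bool IsHazarous;

    private const byte HasExpirationDate = 1 << 0;
    private const byte HasAlcoholContent = 1 << 1;
    // ... one flag per optional field

    public bool TryGetExpirationDate(out DateTime value)
    {
        value = ExpirationDate;
        return (_nullability & HasExpirationDate) != 0;
    }

    public void SetExpirationDate(DateTime value)
    {
        ExpirationDate = value;
        _nullability |= HasExpirationDate;
    }
}
```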
If we look deeper into this, we’ll see that this saved a lot: the struct is now 96 bytes in size. That is a massive space saving, but…
Type layout for 'WarehouseItem'
Size: 96 bytes. Paddings: 24 bytes (%25 of empty space)
We still have a lot of wasted space. This is because we haven’t organized the struct to eliminate padding. Let’s reorganize the struct’s fields and see what we can achieve. The only change is re-arranging the fields, and we have:
And the struct layout is now:
We have no wasted space, and at 72 bytes we are at half of the original size.
We can actually do better. Note that Fragile and IsHazarous are booleans, and we have some free bits in _nullability that we can repurpose.
For that matter, RgbColor only needs 24 bits, not 32. Do we need alcohol content to be a float, or can we use a byte? If that is the case, can we shove both of them together into the same 4 bytes?
For dates, can we use DateOnly instead of DateTime? What about ShelfLife, can we measure that in hours and use an unsigned short for that (giving us a maximum of about 7 years)?
After all of that, we end up with the following structure:
And with the following layout:
In other words, we are now packing everything into 48 bytes, which means that we are at one-third of the initial cost, while still representing the same data. Our previous Warehouse class? It used to take 137MB for a million items; it would now take only 45.7MB.
In RavenDB’s case, we had the following:
That is the backing store of the dictionary, and as you can see, it isn’t a nice one. Using similar techniques we are able to massively reduce the amount of storage that is required to process indexing.
Here is what this same scenario looks like now:
But we aren’t done yet, there is still more that we can do.
RavenDB is a .NET application, written in C#. It also has a non-trivial amount of unmanaged memory usage. We absolutely need that to get the level of performance that we require.
With managing memory manually, there is also the possibility that we’ll mess it up. We ran into one such case: when running our full test suite (over 10,000 tests), we would get random crashes due to heap corruption. Those issues are nasty, because there is a big separation between the root cause and the point where the problem manifests.
I recently learned that you can use the gflags tool on .NET executables. We were able to narrow the problem to a single scenario, but we still had no idea where the problem really occurred. So I installed the Debugging Tools for Windows and then executed:
&"C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\gflags.exe" /p /enable C:\Work\ravendb-6.0\test\Tryouts\bin\release\net7.0\Tryouts.exe
What this does is enable a special debug heap at the executable level, which applies to all operations (managed and native memory alike).
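One usage note: the page heap setting is persistent for that image name (it is stored in the registry), so once the investigation is done, it is worth turning it off again with the matching /disable flag:

&"C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\gflags.exe" /p /disable C:\Work\ravendb-6.0\test\Tryouts\bin\release\net7.0\Tryouts.exe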
With that enabled, I ran the scenario in question:
PS C:\Work\ravendb-6.0\test\Tryouts> C:\Work\ravendb-6.0\test\Tryouts\bin\release\net7.0\Tryouts.exe
42896
Starting to run 0
Max number of concurrent tests is: 16
Ignore request for setting processor affinity. Requested cores: 3. Number of cores on the machine: 32.
To attach debugger to test process (x64), use proc-id: 42896. Url http://127.0.0.1:51595
Ignore request for setting processor affinity. Requested cores: 3. Number of cores on the machine: 32. License limits: A: 3/32. Total utilized cores: 3. Max licensed cores: 1024
http://127.0.0.1:51595/studio/index.html#databases/documents?&database=Should_correctly_reduce_after_updating_all_documents_1&withStop=true&disableAnalytics=true
Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
at Sparrow.Server.Compression.Encoder3Gram`1[[System.__Canon, System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].Encode(System.ReadOnlySpan`1<Byte>, System.Span`1<Byte>)
at Sparrow.Server.Compression.HopeEncoder`1[[Sparrow.Server.Compression.Encoder3Gram`1[[System.__Canon, System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]], Sparrow.Server, Version=6.0.0.0, Culture=neutral, PublicKeyToken=37f41c7f99471593]].Encode(System.ReadOnlySpan`1<Byte> ByRef, System.Span`1<Byte> ByRef)
at Voron.Data.CompactTrees.PersistentDictionary.ReplaceIfBetter[[Raven.Server.Documents.Indexes.Persistence.Corax.CoraxDocumentTrainEnumerator, Raven.Server, Version=6.0.0.0, Culture=neutral, PublicKeyToken=37f41c7f99471593],[Raven.Server.Documents.Indexes.Persistence.Corax.CoraxDocumentTrainEnumerator, Raven.Server, Version=6.0.0.0, Culture=neutral, PublicKeyToken=37f41c7f99471593]](Voron.Impl.LowLevelTransaction, Raven.Server.Documents.Indexes.Persistence.Corax.CoraxDocumentTrainEnumerator, Raven.Server.Documents.Indexes.Persistence.Corax.CoraxDocumentTrainEnumerator, Voron.Data.CompactTrees.PersistentDictionary)
at Raven.Server.Documents.Indexes.Persistence.Corax.CoraxIndexPersistence.Initialize(Voron.StorageEnvironment)
That pinpointed things, so I was able to know exactly where we were messing up.
I was also able to reproduce the behavior on the debugger:
This saved me hours or days of trying to figure out where the problem actually is.
We got a support call from a client, in the early hours of the morning, they were getting out-of-memory errors from their database and were understandably perturbed by that. They are running on a cloud system, so the first inclination of the admin when seeing the problem was deploying the server on a bigger instance, to at least get things running while they investigate. Doubling and then quadrupling the amount of memory that the system has had no impact. A few minutes after the system booted, it would raise an error about running out of memory.
Except that it wasn’t actually running out of memory. A scenario like that, when we give more memory to the system and still have out-of-memory errors can indicate a leak or unbounded process of some kind. That wasn’t the case here. In all system configurations (including the original one), there was plenty of additional memory in the system. Something else was going on.
When our support engineer looked at the actual details of the problem, it was quite puzzling. It looked something like this:
System.OutOfMemoryException: ENOMEM on Failed to munmap at Sparrow.Server.Platform.Posix.Syscall.munmap(IntPtr start, UIntPtr length);
That error made absolutely no sense, as you can imagine. We are trying to release memory, not allocate it. Common sense says that you can’t really fail when you are freeing memory. After all, how can you run out of memory? I’m trying to give you some, damn it!
It turns out that this model is too simplistic. You can actually run out of memory when trying to release it. The issue is that it isn’t you that is running out of memory, but the kernel. Here we are talking specifically about the Linux kernel, and how it works.
Obviously, a very important aspect of the kernel’s job is managing the system memory, and to do that, the kernel itself needs memory. To manage the system memory, the kernel uses something called a VMA (virtual memory area). Each VMA has its own permissions and attributes. In general, you never need to be aware of this detail.
However, there are certain pathological cases, where you need to set up different permissions and behaviors on a lot of memory areas. In the case we ran into, RavenDB was using an encrypted database. When running on an encrypted database, RavenDB ensures that all plain text data is written to memory that is locked (cannot be stored on disk / swapped out).
A side effect of this is that for every piece of memory that we lock, the kernel needs to create its own VMA, since each of them is operated on independently of the others. The kernel uses these VMAs to manage its own map of the memory, and eventually, the number of items in that map exceeds the configured limit.
In this case, the munmap call released a portion of the memory back, which means that the kernel needs to split the VMA into separate pieces. But the number of VMAs is limited, controlled by the vm.max_map_count value.
The default is typically 65530, which is quite conservative; database systems often require a lot more than that.
Adjusting the configuration would alleviate this problem, since that will give us sufficient space to operate normally.
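For reference, on Linux you can check the current limit and raise it with sysctl (the value below is just a common choice; pick one that fits the workload):

sysctl vm.max_map_count
sudo sysctl -w vm.max_map_count=262144

Adding the same setting to /etc/sysctl.conf makes it survive a reboot.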
On its face, we have a simple requirement:
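- Generate invoice numbers from a sequence (each new invoice gets the next number).
- Ensure that there are no gaps in the numbers.
- Support running in a distributed environment.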
Generating the next number in the sequence is literally as simple as ++, so surely that is a trivial task, right?
The problem is with the second requirement. The need to ensure that there are no gaps comes often when dealing with things like invoices. The tax authorities are really keen on “show me all your invoices”, and if there are gaps in the numbers, you may have to provide Good Answers.
You may think that the third one, running in a distributed environment, is the tough challenge, but that isn’t actually the case. If we are running in a single location, that is fairly easy. Run the invoice id generation as a transaction, and you are done. But the normal methods of doing that are usually wrong in edge cases.
Let’s assume that we use an Oracle database, which uses the following mechanism to generate the new invoice id:
invoice_seq.NEXTVAL
Or SQL Server with an identity column:
CREATE TABLE invoices ( invoice_id INT IDENTITY(1,1) PRIMARY KEY, ... );
In both cases, we may insert a new value to the invoices table, consuming an invoice id. At some later point in time, we may roll back the transaction. Care to guess what happens then?
You have INVOICE #1000 and INVOICE #1002, but nothing in between. In fact, no way to even tell what happened, usually.
In other words, identity, sequence, serial, or autonumber – regardless of what database platform you use, are not suitable for generating gapless numbers.
The reasoning is quite simple. Assume that you have two concurrent transactions, which generate two new invoices at roughly the same time. You commit the later one before the first one, and roll back the first. You now have:
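- Invoice #1000 (the last previously created invoice)
- Invoice #1002 (the second transaction, which committed)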
However, you don’t have Invoice #1001, and you cannot roll back the sequence value there, because if you do so, it will re-generate the #1002 on the next call.
Instead, for gapless numbers, we need to create this as a dedicated part of the transaction. So there would be a record in our system that contains the NextInvoiceId, which will be incremented as part of the new invoice creation.
In order to ensure that there are no gaps, you need to ensure that the NextInvoiceId record increment is handled as a user operation, not a database operation. In other words, in SQL Server, that is a row in a table that you manually increment as part of adding a new invoice. Here is what this will look like:
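The original snippet isn’t included in this excerpt; a sketch of the idea using plain ADO.NET against SQL Server gives the shape of it (the next_id column, the scope value, and the invoice columns are assumptions; next_id is assumed to be a bigint):

```csharp
using System;
using Microsoft.Data.SqlClient;

public static class GaplessInvoices
{
    public static long CreateInvoice(string connectionString, decimal amount)
    {
        using var conn = new SqlConnection(connectionString);
        conn.Open();
        using var tx = conn.BeginTransaction();

        // Incrementing the counter is a plain row update, so it takes part in the
        // transaction: if we roll back, the number is rolled back with us.
        using var next = new SqlCommand(@"
            UPDATE next_gapless_ids
               SET next_id = next_id + 1
            OUTPUT inserted.next_id
            WHERE scope = 'invoices';", conn, tx);
        long invoiceId = Convert.ToInt64(next.ExecuteScalar());

        using var insert = new SqlCommand(
            "INSERT INTO invoices (invoice_id, amount) VALUES (@id, @amount);", conn, tx);
        insert.Parameters.AddWithValue("@id", invoiceId);
        insert.Parameters.AddWithValue("@amount", amount);
        insert.ExecuteNonQuery();

        tx.Commit();
        return invoiceId;
    }
}
```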
As you can see, we increment the row directly, so it will be rolled back as well.
The downside here is that we can no longer create two invoices concurrently. The second transaction would have to wait on the lock on the row in the next_gapless_ids table.
All of that happens inside a single database server. What happens when we are running in a distributed environment?
The answer, in this case, is exactly the same thing. You need a transaction, a distributed one, using a consensus algorithm. Here is how you can achieve this using RavenDB’s cluster-wide transactions, which use the Raft protocol behind the scenes:
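The code isn’t included in this excerpt; a minimal sketch of that pattern with the RavenDB client (the document id, the class shapes, and the retry loop are illustrative) looks roughly like this:

```csharp
using Raven.Client.Documents;
using Raven.Client.Documents.Session;
using Raven.Client.Exceptions;

public class GaplessIds
{
    public long NextInvoiceId { get; set; }
}

public class Invoice
{
    public string Id { get; set; }
    public decimal Amount { get; set; }
}

public static class InvoiceNumbers
{
    public static string CreateInvoice(IDocumentStore store, decimal amount)
    {
        while (true)
        {
            try
            {
                using var session = store.OpenSession(new SessionOptions
                {
                    // Raft-backed, cluster-wide transaction
                    TransactionMode = TransactionMode.ClusterWide
                });

                var ids = session.Load<GaplessIds>("config/gapless-ids");
                var invoice = new Invoice
                {
                    Id = "invoices/" + (++ids.NextInvoiceId),
                    Amount = amount
                };
                session.Store(invoice);
                session.SaveChanges(); // the counter and the invoice commit (or fail) together
                return invoice.Id;
            }
            catch (ConcurrencyException)
            {
                // another transaction grabbed the same number - retry and take the next one
            }
        }
    }
}
```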
The idea is simple, we have a transaction that modifies the gapless ids document and creates a new invoice at the same time. We have to handle a concurrency exception if two transactions try to create a new invoice at the same time (because they both want to use the same invoice id value), but in essence this is pretty much exactly the same behavior as when you are running on a single node.
In other words, to ensure the right behavior, you need to use a transaction. And if you need a distributed transaction, that is just a flag away with RavenDB.
A long while ago I had this project, which allows you to define software licenses that you can distribute. The basic idea is pretty simple, we want to be able to define a key (needs to be short and copy/pastable) that we’ll be able to provide to our code that is running in a separate environment. In particular, we have to deal with the issue of not being connected to a central server.
That sounds like a really hard problem, but it turns out that there is a pretty simple solution if we use public key cryptography. Typically you’ll use public key cryptography for encryption, but you can also use it for signing. The idea is that we can sign the key using a private key, then validate it (potentially offline) using the public key.
Licensing can be complex, so we are going to punt all of that to someone else. In this post I’m going to just discuss how we can sign a piece of data and then verify it. We’ll start by generating the keys; this is an action that you should only do once:
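The key generation code isn’t shown in this excerpt; a minimal sketch using the RSA APIs in .NET (the key size and the Base64 encoding are my choices, not necessarily the original’s):

```csharp
using System;
using System.Security.Cryptography;

using var rsa = RSA.Create(2048);

// Keep the private key to yourself; ship the public key inside the application.
string privateKey = Convert.ToBase64String(rsa.ExportRSAPrivateKey());
string publicKey = Convert.ToBase64String(rsa.ExportRSAPublicKey());

Console.WriteLine($"Private key: {privateKey}");
Console.WriteLine($"Public key:  {publicKey}");
```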
Here is the output of this code:
You can now embed the public key in your software, while keeping the private key hidden.
With that in place, we can now sign a license. But what is a license? At its most basic form, it is a set of attributes that describe what your license allows. As such, we can use a Dictionary<string,string> to define the capabilities of the license.
With that in place, we can write very simple code to generate the license text:
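Again, the original code isn’t reproduced, so here is a sketch of the signing side (JSON via System.Text.Json is my choice of serialization, and the license format, a JSON object holding the attributes plus a Base64 signature, is likewise an assumption):

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text.Json;

public static class LicenseGenerator
{
    public static string GenerateLicense(string privateKeyBase64, Dictionary<string, string> attributes)
    {
        using var rsa = RSA.Create();
        rsa.ImportRSAPrivateKey(Convert.FromBase64String(privateKeyBase64), out _);

        // The attributes are signed exactly as serialized - the dictionary's
        // insertion order is what defines the canonical form here.
        byte[] payload = JsonSerializer.SerializeToUtf8Bytes(attributes);
        byte[] signature = rsa.SignData(payload, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);

        return JsonSerializer.Serialize(new
        {
            License = attributes,
            Signature = Convert.ToBase64String(signature)
        }, new JsonSerializerOptions { WriteIndented = true });
    }
}
```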
And here is the result:
The last part to deal with is verifying that the license is valid, we can do that using the following function:
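A matching sketch of the verification side, under the same assumptions about the license format as above:

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text.Json;

public static class LicenseValidator
{
    public static Dictionary<string, string>? ValidateLicense(string publicKeyBase64, string licenseText)
    {
        var envelope = JsonSerializer.Deserialize<Dictionary<string, JsonElement>>(licenseText);
        if (envelope == null)
            return null;

        var attributes = envelope["License"].Deserialize<Dictionary<string, string>>()!;
        byte[] signature = Convert.FromBase64String(envelope["Signature"].GetString()!);

        using var rsa = RSA.Create();
        rsa.ImportRSAPublicKey(Convert.FromBase64String(publicKeyBase64), out _);

        // Re-serialize the attributes the same way they were signed; any change
        // to the data produces different bytes and the signature check fails.
        byte[] payload = JsonSerializer.SerializeToUtf8Bytes(attributes);
        return rsa.VerifyData(payload, signature, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1)
            ? attributes
            : null;
    }
}
```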
If the license options are the same as the one we signed and the cryptographic signature is a match, we can safely return the license options. Otherwise, we’ll return null.
As I said, this is pretty simple code all around. It doesn’t do much, but it means that you can let the framework carry most of the load.
Features such as enabling specific capabilities for a license, expiration, etc. are all possible. Define an ExpirationDate property and have your software react to that, for example.
A few words about this implementation, however. I’m relying heavily on the .NET framework, obviously. But beyond just using the cryptographic primitives, we also have to take into account a bunch of other aspects.
For example, I’m not bothering to normalize the license. Instead, I rely on the fact that the .NET Dictionary will iterate over keys in insertion order. Note that any change in the actual data will result in a verification failure.
This is a pretty simple design, but it ensures that you cannot “crack” the algorithm used to generate the license keys. Of course, users can still always just patch the isValidLicense function instead.
After this saga, I wanted to close the series with some numbers about the impact of this algorithm.
If you’ll recall, I started this whole series discussing variable-sized integers. I was using this list of numbers to compare the values. There are 44,956 items in the list.
Algorithm | Size (bytes) |
---|---|
Raw | 359,648 |
Varint | 224,780 |
Delta + Varint | 45,970 |
Delta + Varint + Gzip | 5,273 |
FastPFor | 24,717 |
You can see some interesting details about the various options. Delta + Varint + Gzip is a really good setup, if you can assume that the output pattern has a high degree of repetition. In some cases, that is possible, but that isn’t a generic property. Aside from that, FastPFor is winning hands down in terms of the output size of the data.
There is also another important aspect. The size of the output here is too big to fit in a single 8KB buffer, so I’ll need to use multiple buffers for that. This is not something that you can really do with GZip. Here is the cost across multiple buffers:
This particular list would fit into 4 pages:
But the compression ratio is only part of that. Let’s talk about the performance aspect. On my machine, you can run the encoding process in 1,115 ticks and decoding in just 462 ticks.
To give some context, that means that you can encode at a rate of roughly 400,000,000 values per second and decode at a rate of about 1 billion values per second (44,956 values in 1,115 ticks, at 100 nanoseconds per tick, is about 111.5 microseconds).
My perf team also asked me to mention that they haven’t gotten the chance to look at the code yet, so things are likely to improve.
The entire premise of FastPFor inside of RavenDB relies on these fast decoding & encoding times. It makes it super cheap to iterate over a vast amount of numbers, and the B+Tree layout we have means that updating a posting list’s page is trivial. Read it, merge, and encode it back. It’s fast enough that there is really no other place for meaningful optimizations / complexity.
No future posts left, oh my!