The previous post has a code sample in it that was figuratively* physically painful for me to write. Beyond the number of syscalls it invokes, the code isn’t all that efficient as I now measure things: it uses way too much managed memory, and it is subject to failures as we increase the amount of data we push through it. For this post, I’m going to rewrite the CodexWriter class as I would for code that is going into RavenDB.
* I literally know what literally used to mean, amazing.
I’m sorry, but there is going to be a big jump in the complexity of the code, because I’m going to try to handle performance, parallelism and resource utilization all at once. The first thing to do is to go into the project’s settings and enable both unsafe code (without which it is nearly impossible to write high performance code) and C# 7.3 features; we’ll need both.
We can divide the task of gathering the inputs into several stages. First, we need to write the data to the file. This is similar to the way we did it before; here is the Add() method:
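The original code sample isn’t reproduced here, so what follows is a minimal sketch of the shape described above. The names (CodexWriter, Segment, SegmentSize, Record) are my assumptions for illustration, not the post’s actual code:

```csharp
using System.Collections.Generic;
using System.IO;

public class Segment
{
    // stub: records where each entry lives in the file; the real segment
    // is responsible for sorting these entries
    private readonly List<(long Position, int Length)> _entries = new List<(long, int)>();
    public void Record(long position, int length) => _entries.Add((position, length));
    public void EnsureSorted() { /* kicks off the background sort */ }
}

public class CodexWriter
{
    private const long SegmentSize = 1_000_000; // new segment every million keys
    private readonly Stream _file;
    private readonly List<Segment> _segments = new List<Segment>();
    private long _count;

    public CodexWriter(Stream file) => _file = file;

    public void Add(byte[] key)
    {
        if (_count % SegmentSize == 0)
        {
            // seal the current segment so it can start sorting in the background
            if (_segments.Count > 0)
                _segments[_segments.Count - 1].EnsureSorted();
            _segments.Add(new Segment());
        }
        _segments[_segments.Count - 1].Record(_file.Position, key.Length);
        _file.Write(key, 0, key.Length);
        _count++;
    }
}
```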
As you can see, there isn’t really much that changed here, but we have this notion of a segment, which is created every million keys. But what is this segment?
It is a way to refer to a specific section of records in the file. In particular, it has just one primary role: it exists to sort the records. Let’s take a look at the code:
There are a few key points. Instead of using file I/O directly, we are using memory mapped files. Why is that? Because, as we have seen, the cost of syscalls is non-trivial in the extreme, and using memory mapped files means that we can access the data natively without paying any price beyond a page fault when the data isn’t already in memory.
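To make the mechanism concrete, here is a small self-contained sketch of reading through a memory-mapped view in .NET. After the view is mapped, each read is a plain memory access (page faults aside), with no read() syscall per access. This requires the “allow unsafe code” project setting mentioned earlier; the file contents here are just for the example:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class MmapSketch
{
    static unsafe void Main()
    {
        var path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[] { 42, 1, 2, 3 });

        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var view = mmf.CreateViewAccessor())
        {
            byte* ptr = null;
            view.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr);
            try
            {
                // a direct memory read through the mapped view, not a syscall
                Console.WriteLine(ptr[0]); // prints 42
            }
            finally
            {
                view.SafeMemoryMappedViewHandle.ReleasePointer();
            }
        }
        File.Delete(path);
    }
}
```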
The EnsureSorted() method is also interesting: it spawns a new task to sort the entries inside the segment in parallel with inserting the data into the main file. The actual ordering is handled in the Compare() method.
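Since the segment code itself isn’t shown here, this is a hypothetical sketch of that pattern: a sealed segment sorts its entry offsets on a background task, and the comparison reads the actual keys via a comparer over the mapped memory. The entry layout and the compareKeysAt delegate are assumptions for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public class Segment
{
    private readonly List<long> _offsets = new List<long>(); // entry positions in the file
    private readonly Func<long, long, int> _compareKeysAt;   // reads keys from the mapped memory
    private Task _sortTask = Task.CompletedTask;

    public Segment(Func<long, long, int> compareKeysAt) => _compareKeysAt = compareKeysAt;

    public void Record(long offset) => _offsets.Add(offset);

    public IReadOnlyList<long> Offsets => _offsets;

    public void EnsureSorted()
    {
        // runs concurrently with inserts into later segments; this one is sealed
        _sortTask = Task.Run(() => _offsets.Sort(Compare));
    }

    public Task Sorted => _sortTask; // the merge phase awaits this

    private int Compare(long x, long y) => _compareKeysAt(x, y);
}
```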
As we write the data into the codex, we sort the data as we run through it, but what happens in the end? In this case, we have about 13 million items that we inserted, so we have 13 segments that are each individually sorted. To get the final sort, we basically merge from all of them. Here is the relevant code:
This uses a SortedSet as a heap, to always get the minimum value from the sorted inner values in the set. Note that we need to wait for the parallel sorts to complete, then merge from all of them into the final result. We can write the result of the sort directly to the end of the file.
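A k-way merge with SortedSet as the heap can be sketched like this. To keep the example self-contained it merges plain int arrays rather than the codex’s file-backed segments; the (value, segment, position) tuple keeps entries unique, which matters because SortedSet rejects duplicates:

```csharp
using System;
using System.Collections.Generic;

public static class MergeExample
{
    public static IEnumerable<int> Merge(int[][] segments)
    {
        // tuples compare element-wise, so the set orders by value first
        var heap = new SortedSet<(int Value, int Seg, int Pos)>();
        for (int i = 0; i < segments.Length; i++)
            if (segments[i].Length > 0)
                heap.Add((segments[i][0], i, 0));

        while (heap.Count > 0)
        {
            var min = heap.Min;      // smallest head among all segments
            heap.Remove(min);
            yield return min.Value;

            int next = min.Pos + 1;  // advance the segment we just consumed from
            if (next < segments[min.Seg].Length)
                heap.Add((segments[min.Seg][next], min.Seg, next));
        }
    }

    public static void Main()
    {
        var merged = Merge(new[] { new[] { 1, 4, 7 }, new[] { 2, 5 }, new[] { 3, 6 } });
        Console.WriteLine(string.Join(",", merged)); // prints 1,2,3,4,5,6,7
    }
}
```

In the real writer, each tuple would carry a segment cursor instead of an array index, and the merged keys would be streamed straight to the end of the file.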
Overall, this process takes 59.01 seconds to complete. Remember that this is when we are pushing unsorted data through. If we pass the data in sorted, we get a significant improvement and take only 35.91 seconds.
To compare, I ran the same sort of test on Voron, and got 59.15 seconds for the unsorted case and 13.85 seconds for the sorted case. And that is with Voron doing full ACID writes, which Codex obviously doesn’t.
I guess that spending four to five years with a whole team doing performance optimization is a better way to get storage performance than a couple of evenings hacking before I go to bed, who knew?