Guys, we still have about 2 available purchases on our end of year sale. This is for a 32% discount for the following products:
This applies to:
You can use coupon code: 0x20-twentysomething
Take a look at the following profiler results. They both showcase pretty much the same codebase, without any major changes in between. However, they were run on different machines, and they result in very different performance optimization paths.
Main machine:
Laptop:
As you can see, when running the code on my main machine, the most expensive thing is actually writing to disk (WriteToJournal). On my laptop, however, writing to disk is very cheap. The likely reason is, again, that my laptop is faking writing to disk in favor of buffering writes, even though we explicitly tell it not to. That means that it behaves as if we had already optimized writing to disk. My main machine behaves more like a server, and we see a lot more I/O costs, which pretty much mask any other costs that we might have.
Using the profiler to guide how you optimize the system would lead you down very different paths on each of those machines. With Voron, we are testing it in a variety of scenarios, to make sure that we aren’t optimizing for one particular machine and hurting another.
So, we found out that the major cost of random writes in our tests was actually writing to disk. Writing 500K sequential items resulted in about 300 MB being written. Writing 500K random items resulted in over 2.3 GB being written.
Note: I would like to point out Alex’s comment, which helped set up this post.
So the obvious thing to do would be to use compression. I decided to try and see what that would give us, and the easiest thing to do is to just enable file compression at the NTFS level. But we can probably do better. The major cost we have is writes to the journal file, and we only ever read from the journal file when we recover the database. We are also usually writing multiple pages at a time for most transactions, and for the scenario we care about, writing 100 random values in a single transaction, we usually write about 100 pages or so anyway. That means that we have a pretty big buffer that we can try to compress all at once.
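To make the idea concrete, here is a rough sketch of compressing all the pages of a transaction as a single buffer before they go to the journal. This is not the actual Voron change; DeflateStream merely stands in for LZ4 / Snappy, and the names are mine.

using System.IO;
using System.IO.Compression;

public static class JournalCompression
{
    // Compress all of the transaction's dirty pages as one buffer, so the
    // journal write becomes a single (much smaller) contiguous chunk.
    public static byte[] Compress(byte[] pages, int length)
    {
        using (var output = new MemoryStream())
        {
            using (var compressor = new DeflateStream(output, CompressionMode.Compress, leaveOpen: true))
            {
                compressor.Write(pages, 0, length);
            }
            return output.ToArray();
        }
    }
}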
Sadly, we don’t really see a meaningful improvement under that scenario. Using NTFS compression slowed us down considerably, while both LZ4 and Snappy greatly reduced the amount of data written to the file, but gave roughly the same performance.
Note that I have done just the minimal amount of work required to test this out. I changed how we are writing to the journal, and disabled reading from it. That was pretty disappointing, to tell you the truth; I fully expected that we would see some interesting improvements there. The LZ4 compression factor is about x10 for our data set. That means that 100 – 121 pages are usually compressed to 10 – 13 pages.
The good thing about this is that running this through the profiler shows that we are running on a high I/O machine, where the amount and speed of I/O don’t impact our performance much.
To test it out, I ran the same test on an Amazon m3.2xlarge machine, and got the following results (writing to the C: drive):
So for now I think that I’ll do the following. We’ll focus on the other costs beyond I/O, and profile heavily on systems with higher I/O costs.
Compression should work; in fact, my first thought was that we would pay for less I/O with more CPU, but that is a guess without looking at the numbers.
Here is the cost of committing using compression (lz4):
And here it is without compression:
So the good news is that we can visibly see that we reduced the I/O cost from 8 seconds to 3.8 seconds. And even with a compression cost of 2.6 seconds, we are still ahead.
However, the cost of the collections used is actually a lot worse from our perspective. So we are back to reducing computation costs first, then looking at compression again at a later point in time.
A while ago we added monitoring capabilities to RavenDB via performance counters. The intent was to give our users the ability to easily see exactly what is going on with RavenDB.
The actual usage, however, was a lot more problematic.
I would have been willing to assume that we are doing something really stupid. Except that SignalR had similar issues.
What is worse, it appears that using the performance counters requires iterating the list of printers on the machine.
This is really ridiculous. And it has gotten to the point where I am willing to just give it up entirely, if only I had something that I could use to replace it with.
Currently I have thrown a lot of try/catch blocks and a background thread at it to hide the problem, but it is ugly and brittle.
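For illustration only, here is a sketch of that kind of workaround; the type and names are mine, not the actual RavenDB code. The counters are created off the hot path, and any failure simply turns monitoring off.

using System;
using System.Diagnostics;
using System.Threading.Tasks;

public class LazyPerformanceCounter
{
    private volatile PerformanceCounter counter; // stays null if creation fails

    public LazyPerformanceCounter(string category, string name, string instance)
    {
        // Creating the counter can be extremely slow (or throw) on some machines,
        // so do it on a background task instead of on the request path.
        Task.Factory.StartNew(() =>
        {
            try
            {
                counter = new PerformanceCounter(category, name, instance, false);
            }
            catch (Exception)
            {
                // Give up on monitoring rather than break the server.
            }
        });
    }

    public void Increment()
    {
        var c = counter;
        if (c != null)
            c.Increment();
    }
}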
After pretty much getting to where we wanted with sequential writes, it is time to try to tackle the cost of random writes. Here is Esent & Voron doing random writes for 1 million entries:
Yes, it is sad that this is the case, because we can do about 40 times better for sequential writes.
However… what is the reason for the slowness? As usual, the answer is to profile, profile & profile. In fact, we want to do profiling on multiple machines, to check that we don’t have a machine bias.
At any rate, as you can see, we are actually pretty close to Esent in random writes, but why are we so much slower than our performance during sequential writes?
Well, we actually covered that already. The major reason for the cost is the fact that we have to deal with very different B-Tree shapes. You can see the difference between sequentially and randomly inserted trees here. The more random you are, the bigger your tree is, and the more pages you have to touch whenever you update. Our scenario calls for making 100 random modifications in every transaction. That means that on every transaction, we have to touch many different parts of the tree.
Here are the results (in pages) for how much it cost us (in this test, we ran 5,000 transactions of 100 items each):
        | Sequential | Random     |
Average | 7.06       | 111.37     |
Max     | 11.00      | 134.00     |
Total   | 35,320.00  | 556,973.00 |
Std Dev | 0.35       | 7.09       |
In other words, this means that on average, a sequential transaction wrote 7 pages, or about 28KB, while the average random transaction wrote about 111 pages, or roughly 450KB!
The total size written for sequential was 137MB, with 319MB allocated. Random writes, however, wrote 2.1 GB out of the 2.3 GB that were allocated. And that explains quite a lot about the whole thing. (Note that for writes we count only the journal writes, not the data writes, because the journal writes are the limiting factor for write performance.)
Esent, by the way, writes about 465 MB for sequential writes and 2.3 GB for random writes. Now we know why we are pretty close: the vast majority of the time is actually spent just writing to the disk, and we are stalled by the sheer amount of data we have.
Interestingly enough, I don’t think that there is a good way to handle that nicely. Certainly not with a B-Tree. We’ll see how we can work with that in a future post.
Guys, we still have about 5 – 7 available purchases on our end of year sale. This is for a 32% discount for the following products:
This applies to:
You can use coupon code: 0x20-twentysomething
The following code is taken from the leveldb codebase:
This code is responsible for merging concurrent transactions. It operates using a mutex and condition variables. And the idea really impressed me when I first saw it.
I implemented that in Voron because I think that this is a great way to gain higher concurrency rates while still having a single writer mode (which is much easier to work with).
However, as it turned out, this isn’t a really good idea in practice. The way condition variables work, you lock the mutex, check your condition, and then wait on the condition variable. Waking up from a condition variable means that you have re-acquired the mutex.
What I want to happen:
However, because of the way condition variables are implemented, what will actually happen is that each of them will wake up one at a time, acquiring the mutex lock and then releasing it. And while they are waking up, they compete with one another, and they also compete with the next write.
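To make the problem concrete, here is a simplified sketch of that pattern in C#, using Monitor in place of the mutex and condition variable; this is not the leveldb or Voron code, just the shape of it.

using System.Collections.Generic;
using System.Threading;

public class MergingWriter
{
    private class Writer
    {
        public object Batch;
        public bool Done;
    }

    private readonly object gate = new object();
    private readonly Queue<Writer> writers = new Queue<Writer>();

    public void Write(object batch)
    {
        var w = new Writer { Batch = batch };
        lock (gate)
        {
            writers.Enqueue(w);

            // Wait until our batch was written for us, or until we reached the front.
            while (!w.Done && writers.Peek() != w)
                Monitor.Wait(gate); // waking up means re-acquiring the lock

            if (w.Done)
                return; // another writer already included our batch in its merge

            // We are the front writer: take everything queued and merge it.
            var merged = new List<object>();
            foreach (var pending in writers)
            {
                merged.Add(pending.Batch);
                pending.Done = true;
            }
            writers.Clear();

            // ... write the merged batch out here ...

            // Wake everyone; each woken writer must re-acquire the lock in turn,
            // competing with the others and with the next incoming write.
            Monitor.PulseAll(gate);
        }
    }
}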
We profiled this approach, and the result was that we could see how we were spending pretty much all of our time in just synchronizing threads.
Instead, we moved to a different approach, in which the first write will actually start off a new thread, dedicated to writing batches. The way it works is that when a thread wants to write a batch, it adds it to a queue, wakes up the background writer thread, and then waits on an event. The background writer will read all the current batches and merge them into a single write transaction.
When it is done, it will wake up all the sleeping writers. We tried to use TaskCompletionSource for that, initially, but we found that the inline nature of tasks made that too expensive to use. Instead, we use ManualResetEventSlim, and we explicitly wait / wake them. We even reuse events, so we don’t have to keep creating and disposing them.
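A minimal sketch of that approach (the names and details are mine, not the actual Voron implementation): writers enqueue their batch together with an event, and a dedicated thread drains the queue, applies everything as one merged write, and then signals all the waiting writers at once.

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public class BackgroundWriter
{
    private class PendingWrite
    {
        public object Batch;
        public readonly ManualResetEventSlim Completed = new ManualResetEventSlim(false);
    }

    private readonly ConcurrentQueue<PendingWrite> queue = new ConcurrentQueue<PendingWrite>();
    private readonly AutoResetEvent hasWork = new AutoResetEvent(false);

    public BackgroundWriter()
    {
        new Thread(WriteLoop) { IsBackground = true }.Start();
    }

    public void Write(object batch)
    {
        var pending = new PendingWrite { Batch = batch };
        queue.Enqueue(pending);
        hasWork.Set();              // wake the writer thread
        pending.Completed.Wait();   // block until our batch has been written
        // (the real code also reuses the events instead of allocating one per write)
    }

    private void WriteLoop()
    {
        while (true)
        {
            hasWork.WaitOne();

            var merged = new List<PendingWrite>();
            PendingWrite item;
            while (queue.TryDequeue(out item))
                merged.Add(item);

            // ... apply all the merged batches in a single write transaction here ...

            foreach (var done in merged)
                done.Completed.Set(); // wake every waiting writer at once
        }
    }
}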
The end result is that we have a pretty sequential process for actually doing concurrent writes, which turns it into a simple producer/consumer problem from a threading perspective, and turns the actual writes into a matter of writing things out as fast as you can process them.
This also gives us the chance to do some re-ordering of operations to get better performance overall.
For the past few months, we have been doing a lot of work on performance for Voron. Performance for a storage engine is a very complex problem, composed of many competing issues.
For example, sequential writes vs. random writes. Or the cost of writes vs. the cost of reads. In this post, I am going to focus specifically on the single scenario we have focused most of our efforts on, sequential writes.
Sequential writes are important because they present the absolute sweet spot for a storage engine. In other words, this is the very first thing that needs to be fast, because if that ain’t fast, nothing else will be. And because this represents the absolute sweet spot, it is perfectly fine and very common to explicitly design your system to ensure that this is what you’ll use.
We have been doing a lot there. In particular, some of our changes included:
There have probably been others, but those have been the major ones. Let us look at the numbers, shall we? I am going to compare us to Esent, since this is our current storage engine.
The test is writing out 10 million items, in 100,000 transactions, with 100 items per transaction. This is done in a purely sequential manner. Each item has a 128 byte value and a 16 byte key.
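For reference, here is a rough sketch of the shape of that benchmark. The writeBatch callback is a stand-in for the storage engine under test (Voron or Esent), and the key encoding is purely illustrative.

using System;

public static class SequentialWriteBenchmark
{
    public static void Run(Action<byte[][], byte[][]> writeBatch)
    {
        const int transactions = 100000;
        const int itemsPerTransaction = 100;
        long counter = 0;

        for (int tx = 0; tx < transactions; tx++)
        {
            var keys = new byte[itemsPerTransaction][];
            var values = new byte[itemsPerTransaction][];
            for (int i = 0; i < itemsPerTransaction; i++)
            {
                keys[i] = new byte[16];                               // 16 byte key
                BitConverter.GetBytes(counter++).CopyTo(keys[i], 0);  // sequential ordering
                values[i] = new byte[128];                            // 128 byte value
            }
            writeBatch(keys, values); // one transaction of 100 items
        }
    }
}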
Here are the original code numbers.
Note that this was run on a W520 Lenovo with an SSD drive. The actual details don’t really matter, what matters is that we are comparing two runs happening on the same machine.
And now, let us compare Esent and Voron as they currently stand:
Yes, you are seeing that correctly.
The big jump in the Esent numbers happened because we also worked on the test suite and optimized things there, but the important thing is that we are now significantly faster than Esent.
In fact, we are fast enough that it took me a while to accept that we are actually doing it. But yes, you are seeing an order of magnitude improvement over what we used to have.
Feel free to take a moment to do a victory dance, that is what I’m doing right now.
However, note the title of this post. These are single threaded sequential inserts. The numbers are a lot worse for random writes, but we’ll deal with that in another post.
Well, tomorrow I’ll be 0x20. Leaving aside the fact that I am just entering my twenties (finally, it feels like I was a 0xTeenager for over a decade), there is the tradition to uphold.
Therefore, we have a 32% discount until the end of the year*.
You can use coupon code: 0x20-twentysomething
This applies to:
* Limited to the first 0x32, and not applicable if you have to ask why you get a 32% discount.
One of the surprising points for improvement in our performance run was the following logic, responsible for copying the data from the user to our own memory:
using (var ums = new UnmanagedMemoryStream(pos, value.Length, value.Length, FileAccess.ReadWrite))
{
    value.CopyTo(ums); // copy the user's data (a Stream) into our unmanaged memory at pos
}
Those three lines of code were responsible for no less than 25% of our performance. It was obvious that something needed to be done. My belief is that the unmanaged memory stream is just not optimized for this scenario, resulting in a lot of copying, allocations and costs.
Here is what we did instead. We create a temporary space that is allocated once, like this:
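Roughly, as a sketch with names of my own choosing rather than the verbatim Voron code:

using System;
using System.Runtime.InteropServices;

public sealed class PinnedTempBuffer : IDisposable
{
    public readonly byte[] Buffer;   // managed view of the memory
    public readonly IntPtr Pointer;  // unmanaged view of the same memory
    private GCHandle handle;

    public PinnedTempBuffer(int size)
    {
        Buffer = new byte[size];
        handle = GCHandle.Alloc(Buffer, GCHandleType.Pinned); // pin once, for the database lifetime
        Pointer = handle.AddrOfPinnedObject();
    }

    public void Dispose()
    {
        if (handle.IsAllocated)
            handle.Free();
    }
}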
You can see that we are doing some interesting stuff there. In particular, we allocate a managed buffer, but also force the GC to pin it. We keep this around for the entire lifetime of the database, too. The idea here is that we want to avoid the cost of pinning & unpinning it all the time, even if it means that we have unmovable memory.
At any rate, the important thing about this is that it gives us access to the same memory from both managed and unmanaged perspectives. And that, in turn, leads to the following code:
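Again as a sketch of the shape of the code (using the PinnedTempBuffer from above), not the verbatim Voron implementation:

using System.IO;

public static unsafe class StreamCopy
{
    public static void CopyToUnmanaged(Stream value, byte* pos, PinnedTempBuffer tempBuffer)
    {
        int read;
        // Read from the caller's stream into the pinned managed buffer...
        while ((read = value.Read(tempBuffer.Buffer, 0, tempBuffer.Buffer.Length)) > 0)
        {
            // ...then do a native copy from the buffer's pinned pointer into our own memory.
            // (The real code uses a memcpy; System.Buffer.MemoryCopy serves here for the sketch.)
            System.Buffer.MemoryCopy((void*)tempBuffer.Pointer, pos, read, read);
            pos += read;
        }
    }
}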
We first read the values from the stream into the managed buffer, then copy them from the buffer’s unmanaged pointer to our own memory.
The idea here is that we accept a Stream abstraction, and that can only work with managed buffers, so we have to go through this route, instead of having to copy the memory directly. The reason we do that is that we don’t want to force the user of our API to materialize the data fully. We want to be able to stream it into the database.
At any rate, this has made a serious improvement to our performance, but I’ll be showing the details in a future post.