Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 6,371 | Comments: 47,290

filter by tags archive

Machine bias in profiler based optimizations

time to read 2 min | 228 words

Take a look at the following profilers results. They both showcase pretty much the same codebase, without any major changes in between. However, they have been both run on different machines, and they result in very different performance optimization paths.

Main machine:

Laptop:

image

As you can see, when running the code on my main machine, the most expensive thing is actually writing to disk (WriteToJournal). However, on my laptop, writing to disk is actually very cheap.  The likely reason is, again, my laptop is faking writing to disk in favor of buffering writes, even though we explicitly tell it not to. That means that it behaves as if it is we already optimized writing to disk. My main machine behaves more like a server, and we see a lot more I/O costs, which pretty much mask any other costs that we might have.

Using the profiler to guide you with regards to how to optimize the system on any one of those machines would lead you in very different paths. With Voron, we are testing it in a variety of scenarios, to make sure that we aren’t optimizing for one particular machine and hurting another.

Reducing the cost of writing to disk

time to read 4 min | 633 words

So, we found out that the major cost of random writes in our tests was actually writing to disk. Writing 500K sequential items resulted in about 300 MB being written. Writing 500K random items resulted in over 2.3 GB being written.

Note: I would like to point out Alex’s comment, which helped setup this post.

So the obvious thing to do would be to use compression. I decided to try and see what that would give us, and the easiest thing to do is to just enable file compression at the NTFS level. But we can probably do better. The major cost we have is writes to the journal file. And we only ever read from the journal file when we recover the database. We are also usually writing multiple pages at a time for most transactions, and for the scenario we care about, writing 100 random values in a single transaction, we are usually write about a 100 pages or so anyway. That means that we have got a pretty big buffer that we can try to compress all at once.

Sadly, we don’t really see a meaningful improvement under that scenario. Using NTFS compression slow us down considerably, while both LZ4 and Snappy resulted in greatly reduce file writes, but roughly the same performance.

image  image

Note that I have done just the minimal amount of work required to test this out. I changed how we are writing to the journal, and disabled reading from it. That was pretty disappointing, to tell you the truth, I fully expected that we’ll see some interesting improvements there. The LZ4 compression factor is about x10 for our data set. That means that 100 – 121 pages usually are compressed to 10 – 13 pages.

The good thing about this is that running this through the profiler shows that we are running on a high I/O machine, where the amount and speed of I/O doesn’t impact our performance much.

To test it out, I run the same test on an Amazon m3.2xlarge machine, and got the following results (writing to the C: drive):

image

So for now I think that I’ll do the following. We’ll focus on the other costs beyond I/O, and profile heavily on systems with higher I/O costs.

Compression should work, in fact, my first thought was that we pay for less I/O with more CPU, but that is a guess without looking at the numbers.

Here is the cost of committing using compression (lz4):

image

And here it is without compression:

image

So the good news is that we can visible see that we reduced the I/O cost from 8 seconds to 3.8 seconds. And even with compression cost of 2.6 seconds, we are still ahead.

However, the cost of the collections used is actually a lot worse from our perspective. So we are back to reducing computation costs first, then looking at compression again at a later point in time.

Performance counters sucks

time to read 2 min | 204 words

A while ago we added monitoring capabilities to RavenDB via performance counters. The intent was to give our users the ability to easily see exactly what is going on with RavenDB.

The actual usage, however, was a lot more problematic.

  • Performance counters API can just hang, effectively killing us (since we try to initialize it as part of setup db routine).
  • They require specific system permissions, and can fail without them.
  • They get corrupted, for mysterious reasons, and then you need to reset them all.
  • Even after you created them, they can still die on you for no apparent reason.

I would have been willing to assume that we are doing something really stupid. Except that SignalR had similar issues.

What is worse, it appears that using the performance counters needs to iterate the list of printers on the machine. 

This is really ridiculous. And it is to the point in which I am willing to just give it up entirely, if only I had something that I could use to replace it with.

Currently I threw a lot of try /catch and a background thread to hide that, but it is ugly and brittle.

The cost of random writes

time to read 3 min | 502 words

After pretty much getting to where we wanted with sequential writes, it is time to try to tackle the cost of random writes. Here is Esent & Voron doing random writes for 1 million entries:

image

Yes, it is sad that this is the case, because we can do about 40 times better for sequential writes.

However… what is the reason for the slowness? As usual, the answer is to profile, profile & profile. In fact, we want to do profiling on multiple machines, to check that we don’t have a machine bias.

At any rate, as you can see, we are actually pretty close to Esent in random writes, but why are we so much slower than our performance during sequential writes?

Well, we actually covered that, actually. The major reason for the cost is actually the fact that we have to deal with very different B-Tree shapes. You can see the difference between sequential and randomly inserted trees here. The more random you are, the bigger your tree is, the more pages you have to touch whenever you update. Our scenario calls for making 100 random modification in every transaction. That means that on every transaction, we have to touch many different parts of the tree.

Here are the results (in pages) for how much it cost us (in this test, we run 5,000 transactions of 100 items each.

 

Sequential

Random

Average

7.06

111.37

Max

11.00

134.00

Total

35,320.00

556,973.00

Std Dev

0.35

7.09

In other words, this means that on average, a sequential transaction wrote 7 pages, or about 28Kb. While the average random transaction wrote about 450Kb!

While the total size written for sequential was 137MB, and allocate 319MB. However, random sizes wrote 2.1 GB for 2.3 GB that were allocated. And that explains quite a lot about the whole thing. (Note that we count for writes only the journal writes, not the data writes, because those are the limiting factor for write performance).

Esent, by the way, writes sequentially about 465 MB and 2.3GB for random writes. Now we know why we are pretty close, because the vast majority of the time is actually spent just writing to the disk, and we are stalled by the sheer amount of data we have.

Interestingly enough, I don’t think that there is a good way to handle that nicely. Certainly not with a B-Tree. We’ll see how we can work with that at a future post.

Transaction merging, locks, background threads and performance

time to read 3 min | 472 words

The following code is taken from the leveldb codebase:

image

This code is responsible for merging concurrent transactions. It operates using a mutex and conditional variables. And the idea really impressed me when I first saw it.

I implemented that in Voron because I think that this is a great way to gain higher concurrency rates while still having a single writer mode (which is much easier to work with).

However, as it turned out, this isn’t a really good idea in practice. The way conditional variables work, you lock the mutex, check your condition, and then wait on the conditional variable. Waking up from a conditional variable means that you have re-acquired the mutex.

What I want to happen:

  • Multiple concurrent threads will try to write at the same time.
  • One of them goes through and start writing, it takes all the pending writes and write them to disk.
  • It then releases all the waiting threads whose work it already completed.
  • They all move on from there without having to wait on one another.

However, because of the way conditional variables are implemented, what will actually happen is that they each will wake up one at a time, acquiring the mutex lock, then releasing it. However, while they are being released, they compete with one another, and they also compete with the next write.

We profiled this approach, and the result was that we could see how we were spending pretty much all of our time in just synchronizing threads.

Instead, we moved to a different approach, in which the first write will actually start off a new thread, dedicated to writing batches. The way it works, when a thread want to write a batch, it will add that to a queue and wake up the background writer thread, then wait on an event. The background writer will read all the current batches and merge them into a single write transaction.

When it is done, it will wake up all the sleeping writers. We tried to use TaskCompletionSource for that, initially, but we found that the inline nature of tasks made that too expensive to use. Instead, we use ManualResetEventSlim, and we explicitly wait / wake them. We even reuse events, so we don’t have to keep creating and disposing them.

The end result is that we have a pretty sequential process for actually doing concurrent writes, which turn it into a simple producers/consumer problem from threading perspective, and the actual writes into a simple write things out as fast as you can process them.

This also gives us the chance to do some re-ordering of operations to get better performance overall.

Voron Performance, the end is nigh (sequentially)

time to read 3 min | 508 words

For the past few months, we have been doing a lot of work on performance for Voron. Performance for a storage engine is a very complex problem, composed of many competing issues.

For example, sequential writes vs. random writes. Or the cost of writes vs. the cost of reads. In this post, I am going to focus specifically on the single scenario we have focused most of our efforts on, sequential writes.

Sequential writes are important because they present the absolute sweet spot for a storage engine. In other words, this is the very first thing that needs to be fast, because if that ain’t fast ,nothing else would be. And because this represent the absolute sweat spot, it is perfectly fine and very common to explicitly design your system to ensure that this is what you’ll use.

We have been doing a lot there. In particular, some of our changes included:

  • Moving from fsync model (very slow) to unbuffered writer through writes.
  • Moving to vectored writes, reducing sys calls and the need for sequential memory space.
  • Writing our own immutable collection classes, meant specifically for what we’re doing.
  • Reduce # of tree searches.
  • Optimizing the cost of copying data from the user’s stream to our own store.

There have probably been others ,but those have been the major ones. Let us look at the numbers, shall we? I am going to compare us to Esent, since this is our current storage engine.

The test is writing out 10 million items, in 100,000 transactions, with 100 items per transaction. This is done in purely sequential manner. Item size is 128 bytes value and 16 bytes key.

Here are the original code numbers.

image

Note that this was run on a W520 Lenovo with an SSD drive. The actual details don’t really matter, what matters is that we are comparing two runs happening on the same machine.

And now, let us compare Esent and Voron as they currently stand:

image

Yes, you see clearly.

The big jump in Esent numbers have happened because we also worked on the test suite, and optimized things there, but the important thing is that we are now significantly faster than Esent.

In fact, we are fast enough that it took me a while to accept that we are actually doing it. But yes, you are seeing an order of magnitude improvement over what we used to have.

Feel free to take a moment to do a victory dance, that is what I’m doing right now.

However, note the title of this post. This is for a single threaded sequential inserts. The numbers are a lot worse for random writes, but we’ll deal with that in another post.

I’m twenty tomorrow, let us celebrate

time to read 1 min | 87 words

Well, tomorrow I’ll be 0x20. Leaving aside the fact that I am just entering my twenties (finally, it feels like I was a 0xTeenager for over a decade), there is the tradition to uphold.

Therefor, we have a 32% discount until the end of the year*.

You can use coupon code: 0x20-twentysomething

This applies to:

* Limited to the first 0x32, and not applicable if you have to ask why you get a 32% discount.

Voron Performance, the single biggest booster

time to read 4 min | 603 words

One of the surprising points for improvement in our performance run was the following logic, responsible for copying the data from the user to our own memory:

   1: using (var ums = new UnmanagedMemoryStream(pos, value.Length, value.Length, FileAccess.ReadWrite))
   2: {
   3:     value.CopyTo(ums);
   4: }

Those three lines of code were responsible for no less than 25% of our performance. It was obvious that something needed to be done. My belief is that the unmanaged memory stream is just not optimized for this scenario, resulting in a lot of copying, allocations and costs.

Here is what we did instead. We create a temporary space that is allocated once, like this:

image

You can see that we are doing some interesting stuff there. In particular, we are allocated a managed buffer, but also force the GC to pin it. We keep this around for the entire lifetime of the database, too. The idea here is that we want to avoid the cost of pinning & unpinning it all the time, even if it means that we have an unmovable memory.

At any rate that important thing about this is that it gives us access to the same memory from managed and unmanaged perspectives. And that, in turn, leads to the following code:

image

We first read the values from the stream into the managed buffer, then copy them from the unmanaged pointer to the buffer to our own memory.

The idea here is that we accept a Stream abstraction, and that can only work with managed buffers, so we have to go through this route, instead of having to copy the memory directly. The reason we do that is that we don’t want to force the user of our API to materialize the data fully. We want to be able to stream it into the database.

At any rate, this has made some serious improvement to our performance, but I’ll be showing the details on a future post.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. RavenDB 4.0 (4):
    19 May 2017 - Managing encrypted databases
  2. De-virtualization in CoreCLR (2):
    01 May 2017 - Part II
  3. re (17):
    26 Apr 2017 - Writing a Time Series Database from Scratch
  4. Performance optimizations (2):
    11 Apr 2017 - One step forward, ten steps back
  5. RavenDB Conference videos (12):
    03 Mar 2017 - Replication changes in 3.5
View all series

RECENT COMMENTS

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats