Ayende @ Rahien



Digging into the CoreCLR: Exceptional costs, Part II

time to read 3 min | 429 words

Note, this post was written by Federico.

In my last post we talked about the cost of throwing and catching exceptions at the call site. Some of those costs are straightforward and easily seen, like the code complexity, but others run a bit deeper, for instance how the code ends up in that shape in the first place (we will talk about that, but not just yet). Today we will focus on what to do when we control the call site and are in a performance sensitive hot-spot.

There are 2 important assumptions here:

  • You own the code
  • You are in a hot-spot

If either of these two is not true, then this is of zero importance or you are screwed. :-)

So let's modify our code a bit (check yesterday's post if you don’t remember the details). In order to achieve the same result we will resort to a very well-known pattern that I like to call TryXXX. Many instances of such optimizations are visible in the .Net Framework, like the famous int.TryParse method. Apparently someone, during the course of using v1.0 (or v1.1) of the Framework, figured out that the cost of exception handling for certain scenarios was a bit too much. We probably won’t ever know who it was, but we can all be glad they fixed it; even though we have had to live with the exception-based implementation (borrowed from the Java style?) as obsolete code ever since.

So let's see how the code would look. 
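Here is a minimal sketch of the TryXXX shape we are talking about (the method name and parameters are illustrative, not the actual Voron code):

```csharp
using System;

public static class Parser
{
    // Instead of throwing on failure, report it through the return value and hand
    // the result back through an out parameter, just like int.TryParse does.
    public static bool TryReadInt32(byte[] buffer, int index, out int value)
    {
        if (buffer == null || index < 0 || index + sizeof(int) > buffer.Length)
        {
            value = 0;
            return false; // the caller checks the bool instead of catching an exception
        }

        value = BitConverter.ToInt32(buffer, index);
        return true;
    }
}
```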


Pretty straightforward I might say. Now the interesting thing is what happens at the assembler level:


Even under a shallow review, we can conclude that this code is definitely faster than the alternative. Now, what did we win over the try-catch version? Essentially, we don't have a prolog and an epilog to execute when we take the exceptional path, and that is faster than having to execute such code. The exceptional case also does not have to deal with the non-local effects caused by unwinding the stack; but we are forced to have a hierarchy of TryXXX methods if the call chain goes deep (the alternative of using exceptions for readability is not great either).

Now in this code we have the first glimpse of a few JIT design choices (and some current restrictions too) that are important performance-wise, and we will discuss them in future posts.

Digging into the CoreCLR: Exceptional costs, Part I

time to read 4 min | 624 words

Note, this post was written by Federico.

One guideline which is commonly known is: "Do not use exceptions for flow control." You can read more about it in many places, but this is a good compendium of the most common arguments. If you are not acquainted with the reasons, give them a read first; I’ll wait.

Many of the reasons focus on the readability of the code, but remember, my work (usually) revolves around writing pretty disgusting albeit efficient code. So even though I care about readability, it is mostly achieved through very lengthy comments in the code explaining why you shouldn't touch something if you cannot prove your change will be faster.

Digressions aside, the question is still open. What is the impact of using exceptions for control flow (or having to deal with someone else throwing exceptions) in your performance sensitive code? Let's examine that in detail.

For that we will use a very simple piece of code to understand what can happen. 
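As an illustration of the kind of code involved (a sketch under assumed names, not the exact code that was measured): a method that throws based on a condition, and a caller that wraps it in a try-catch.

```csharp
using System;

public static class ExceptionCost
{
    public static int CanThrow(int value)
    {
        if (value < 0)
            throw new ArgumentOutOfRangeException(nameof(value)); // the exceptional path
        return value * 2;                                         // the happy path marker
    }

    public static int Call(int value)
    {
        try
        {
            return CanThrow(value);
        }
        catch (ArgumentOutOfRangeException)
        {
            return -1; // the catch statement we hit when CanThrow throws
        }
    }
}
```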


This code is simple enough that the assembler won’t get too convoluted, but at the same time it sports at least some logic we can use as markers.

Let's first inspect the CanThrow method; in there we can see how the throwing of exceptions happens:


As you can see, there are a lot of things to be done just to throw the exception. In the last call we jump to the proper place in the stack and continue in the catch statement that we hit.


So here is the code of our simple method. At the assembler level, our try statement has a very important implication. Each try-catch forces the method to deal with a few control flow issues. First it has to store the exception handler in case anything inside throws, then it has to do the actual work. If there is no exception (the happy path), we move forward and end. But what happens if we have an exception? We first need to remove the handler (we don't want to recheck this handler if we end up throwing inside the catch, right?), then execute the catch and be done.

But now let’s contrast that with the code generated when there is no try-catch statement. The avid reader will realize that the happy path will never be executed because we are throwing, but don’t worry, the code is the same if no inlining happens.


We will talk about why the code ends up like this in a follow-up post, but suffice it to say that all this trouble cannot beat a simple Boolean check if you need to fail (and can do something about it). 

It is also important to remember that this kind of work is only relevant if you are in the hot path. If you are not calling a function at least a few tens of thousands of times per second, don’t even bother; your performance costs are elsewhere. This is micro-optimization land.

Digging into the CoreCLR: JIT Introduction

time to read 3 min | 401 words

Note, this post was written by Federico.

The .Net Just-In-Time compiler (or JIT for short) is one of those marvels you don't even realize is there, unless you have to smooth-talk it into doing your bidding. At the Voron level, most routines are so tight that the assembler emitted becomes important. It is not uncommon to have to reason about how the instructions are going to be decoded by the CPU, whether the microarchitecture is able to execute two instructions independently or not, or whether we are memory-bound (having too many cache misses).

Those with a C/C++ background know the optimizing compiler is actually quite good at its job; and even so, tricks are required to squeeze out the last bits of performance. The JIT is no different, with the addition that, while the C++ compiler has all the time in the world (sort of), the restrictions imposed by compiling while your code is executing are far more severe. However, even on a budget, the CoreCLR does a pretty amazing job at optimizing. Until, of course, you expect the performance feats of the offline C++ compiler from it.

So the question you probably have is: "Why bother? Why don't you just go and write the database in C++?” I won't go into detail on that, since we covered it already. What I can tell you is that when you dig into the JIT world, it is pretty easy to achieve performance on par with C++ if the JIT doesn't get in your way by design with unsupported features. The good thing is that the CoreCLR team is quite open to tackling those issues if presented with a detailed analysis of the cost-effectiveness of your optimization.

In this series we will talk about the tricks we had to teach our code to squeeze those last bits of micro-optimizations. From specific optimizations requested to the development team and workarounds until they are available, to tricks we use to ensure the assembler code generated is of the highest quality possible.

One word of caution: the JIT is ever improving, so it is important to understand that the optimization opportunities presented here may not apply tomorrow, or may simply be done automatically in future versions of CoreCLR.

Reducing allocations and resource usages when using Task.Delay

time to read 2 min | 399 words

In my previous posts, I talked about tri-state waiting, which included the following line:
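An illustrative reconstruction of that kind of line (workArrived is a hypothetical TaskCompletionSource; this is not the original code):

```csharp
// race the "work arrived" signal against a one second Task.Delay used as a heartbeat timeout
var completed = await Task.WhenAny(workArrived.Task, Task.Delay(1000));
```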


And then I did a deep dive into how timers on the CLR are implemented. Taken together, this presents me with somewhat of a problem. What is the cost of calling Task.Delay? Luckily, the code is there, so we can check.

The relevant costs are here. We allocate a new DelayPromise, and a Timer instance. As we previously saw, actually creating a Timer instance will take a global lock, and the cost of actually firing those timers is proportional to the number of timers that we have.

Let’s consider the scenario for the code above. We want to support a very large number of clients, and we typically expect them to be idle, so every second we’ll have the delay fire, we send them a heartbeat, and go on with the waiting. What will happen in such a case for the code above?

All of those async connections are going to be allocating several objects per run, and each of them is going to contend on the same lock. All of them are also going to run a lot more slowly than they should.

In order to resolve this issue, given what we know now about the timer system on the CLR, we can write the following code:
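Here is a minimal sketch of the idea (an assumed shape, not the actual RavenDB implementation): a single static timer completes one shared task every second, and every connection awaits that same task.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class HeartbeatTimer
{
    private static TaskCompletionSource<object> _nextTick =
        new TaskCompletionSource<object>(TaskCreationOptions.RunContinuationsAsynchronously);

    // A single timer for the whole process: no per-call Timer allocation, and no
    // contention on the global timer lock for every connection.
    private static readonly Timer TickTimer = new Timer(_ =>
    {
        // swap in a fresh task first, then complete the old one, waking all waiters at once
        var previous = Interlocked.Exchange(ref _nextTick,
            new TaskCompletionSource<object>(TaskCreationOptions.RunContinuationsAsynchronously));
        previous.TrySetResult(null);
    }, null, 1000, 1000);

    // Every connection awaits the same task, so we pay for a single task allocation
    // per second, regardless of how many connections are waiting.
    public static Task WaitForTickAsync() => _nextTick.Task;
}
```

A connection’s wait loop can then await HeartbeatTimer.WaitForTickAsync() instead of allocating its own Task.Delay.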

In this case, we have a single static timer (so there is no lock contention per call), and we have a single task allocation per second. All of the connections will use the same instance. In other words, if there are 10,000 connections, we just saved 20K allocations per second, as well as 10K lock contentions, not to mention that instead of waking up every single connection one at a time, we have a single task instance that is completed once, resulting in all of the timing-out tasks being woken up.

This code does have the disadvantage of allocating a task every second, even if no one is listening. It is a pretty small price to pay, and fixing it can be left as an exercise for the reader. :-)

The cadence of test speed

time to read 3 min | 402 words

RavenDB has quite a lot of tests. Over five thousand of them, at the last count. They test common things (like saving a document) and esoteric things (like spatial query behavior near the equator). They are important, but they also take quite a lot of time to run. In our current setup, running the full test suite can take 3 – 6 hours, depending on which build agent is running it and what configuration we have.

This means that in terms of actually running the tests, developers can’t really get good feedback. We have taken some steps to address that, by creating a core set of tests that will examine the major and most important features, fast, which developers can run in reasonable time as part of their work, and then we let the build server sort out the pull requests on its own time.

For RavenDB 4.0, we decided to go in a different direction. We split the tests into two separate projects: FastTests and SlowTests. The requirement for the FastTests is that they will run in under 1 minute. In order to do that we use xunit’s parallel test support, which allows us to run tests concurrently (this actually required a lot of changes in the test code, which assumed that we were running in a serial fashion and that each test was the sole owner of the entire system).

Part of making sure that we are running in under 1 minute is knowing when we breached that. Because the tests run concurrently, it is a bit difficult to know what is taking so long. Thankfully, it is easy to get a report out. We use the following script to do that.
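A sketch of that kind of report, assuming the xunit v2 XML results format and a results.xml output file (the file name and the number of tests shown are illustrative):

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class SlowTestReport
{
    static void Main()
    {
        // each <test> element in the xunit results carries a name and a time (in seconds)
        var doc = XDocument.Load("results.xml");
        var slowest = doc.Descendants("test")
            .Select(t => new
            {
                Name = (string)t.Attribute("name"),
                Seconds = (double)t.Attribute("time")
            })
            .OrderByDescending(t => t.Seconds)
            .Take(25);

        foreach (var test in slowest)
            Console.WriteLine($"{test.Seconds,8:F2}s  {test.Name}");
    }
}
```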

We now have the top slowest tests (some of them take tens of seconds to run, but we don’t usually care, as long as other tests run in parallel and the whole suite is fast), and we can figure out whether we can optimize a test or should move it to the SlowTests project.

Another change we made was to move a lot of the test code to the async API; this allows us to run more tests concurrently without putting too much of a burden on the system. Our current test times range between 30 and 55 seconds, which is fast enough to run them pretty much at the drop of a hat.

Tri-state waiting with async TCP streams

time to read 4 min | 714 words

We recently had the need to develop a feature that requires a client to hold a connection to the server and listen to a certain event. Imagine that we are talking about a new document arriving to the database.

This led to a very simple design:

  • Open a TCP connection and let the server know about which IDs you care about.
  • Wait for any of those IDs to change.
  • Let the client know about it.

Effectively, it was:
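Roughly something like this sketch (an assumed shape, not the actual RavenDB code):

```csharp
using System.IO;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class NaiveChangesClient
{
    // One TCP connection per watched document id: register the id, then block until
    // the server reports a change for it.
    public static async Task<string> WaitForDocumentChangeAsync(string host, int port, string docId)
    {
        using (var client = new TcpClient())
        {
            await client.ConnectAsync(host, port);
            using (var stream = client.GetStream())
            using (var writer = new StreamWriter(stream) { AutoFlush = true })
            using (var reader = new StreamReader(stream))
            {
                await writer.WriteLineAsync(docId);  // tell the server what we care about
                return await reader.ReadLineAsync(); // wait for the change notification
            }
        }
    }
}
```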

Unfortunately, this simple design didn’t quite work. As it turns out, having a dedicated TCP connection per id is very expensive, so we would like to be able to use a single TCP connection in order to watch multiple documents. And we don’t know about all of them upfront, so we need to find a way to talk to the server throughout the process. Another issue that we have is the problem of steady state. If none of the documents we care about actually change for a long time, there is nothing going on over the network. This is going to lead the TCP connection to fail with a timeout.

Actually, a TCP connection that passes no packets is perfectly fine in theory, but the problem is that it requires resources that need to be maintained. As long as you have systems that are not busy, it will probably be fine, but the moment it reaches the real world, the proxies / firewalls / network appliances along the way use a very brutal policy, “if I’m not seeing packets, I’m not maintaining the connection”, and it will be dropped, usually without even an RST packet. That makes debugging this sort of thing interesting.

So our new requirements are:

  • Continue to accept IDs to watch throughout the lifetime of the connection.
  • Let the client know of any changes.
  • Make sure that we send the occasional heartbeat to keep the TCP connection alive.

This is much more fun to write, to be honest, and the code we ended up with was pretty; here it is:
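A simplified sketch of that structure (an assumed shape, not the actual RavenDB code; the field and method names are illustrative):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public class ChangesConnection
{
    private readonly StreamReader _reader;
    private readonly StreamWriter _writer;
    private readonly ConcurrentQueue<string> _changedDocs = new ConcurrentQueue<string>();
    private HashSet<string> _idsToWatch = new HashSet<string>();
    private TaskCompletionSource<object> _documentChanged = new TaskCompletionSource<object>();

    public ChangesConnection(Stream stream)
    {
        _reader = new StreamReader(stream);
        _writer = new StreamWriter(stream) { AutoFlush = true };
    }

    // Called by the notification system when a watched document changes:
    // enqueue first, then complete the task, so the loop below never misses a change.
    public void OnDocumentChanged(string docId)
    {
        _changedDocs.Enqueue(docId);
        _documentChanged.TrySetResult(null);
    }

    public async Task RunAsync()
    {
        var readTask = _reader.ReadLineAsync(); // pending read of new ids from the client
        while (true)
        {
            var timeout = Task.Delay(TimeSpan.FromSeconds(5));
            var completed = await Task.WhenAny(_documentChanged.Task, readTask, timeout);

            if (completed == timeout)
            {
                await _writer.WriteLineAsync("heartbeat"); // keep the TCP connection alive
            }
            else if (completed == readTask)
            {
                var id = await readTask;
                if (id == null)
                    return; // the client disconnected
                // copy-on-write: the collection is small and rarely changes
                _idsToWatch = new HashSet<string>(_idsToWatch) { id };
                readTask = _reader.ReadLineAsync(); // start the next read immediately
            }
            else
            {
                // replace the task first, then drain the queue of pending changes
                _documentChanged = new TaskCompletionSource<object>();
                while (_changedDocs.TryDequeue(out var docId))
                    await _writer.WriteLineAsync("changed: " + docId);
            }
        }
    }
}
```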

There are a couple of things to note here. We are starting an async read operation from the TCP stream without waiting for it to complete, and then we go into a loop and wait for one of three options.

  1. A document that we watched has been changed (we are notified about that by the documentChanged task completing), in which case we notify the user. Note that we first replace the documentChanged task and then we drain the queue of all pending document changes for this collection, after which we’ll go back to waiting. On the doc changed event, we first enqueue the document that was changed, and then complete the task. This ensures that we won’t miss anything.
  2. New data is available from the client. In this case we read it and add it to the IDs we are watching, while starting another async read operation (for which we’ll wait on the next loop iteration). I’m creating a new instance of the IDs collection here to avoid threading issues, and also because the number of items is likely to be very small and rarely change. If there were a lot of changes, I would probably go with a concurrent data structure, but I don’t think it is warranted at this time.
  3. Simple timeout.

Then, based on which task has been completed, we select the appropriate behavior (send message to client, accept new doc ID to watch, send heartbeat, etc).

The nice thing about this code is that errors are also handled quite nicely. If the client disconnects, we will get an error from the read, and know that it happened and exit gracefully (well, we might be getting that just when we are writing data to the client, but that is pretty much the same thing in terms of our error handling).

The really nice thing about this code is that for the common cases, where there isn’t actually anything for us to do except maintain the TCP connection, this code is almost never in a runnable state, and we can support a very large number of clients with very few resources.

Fast transaction log: Windows

time to read 5 min | 806 words


In my previous post, I tested journal writing techniques on Linux. In this post, I want to do the same for Windows, and see what the impact of the various options is on system performance.

Windows has slightly different options than Linux. In particular, on Windows the various flags and promises are very clear, and it is quite easy to figure out what it is that you are supposed to do.

We have tested the following scenarios:

  • Doing buffered writes (pretty useless for any journal file, which needs to be reliable, but a good baseline metric).
  • Doing buffered writes and calling FlushFileBuffers after each transaction (a pretty common way to handle committing to disk in databases); this is the equivalent of calling fsync.
  • Using the FILE_FLAG_WRITE_THROUGH flag and asking the kernel to make sure that after every write, everything will be flushed to disk. Note that the disk may or may not buffer things.
  • Using the FILE_FLAG_NO_BUFFERING flag to bypass the kernel’s caching and go directly to disk. This has special memory alignment considerations.
  • Using FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING to ensure that we don’t do any caching, and actually force the disk to do its work. On Windows, this is guaranteed to ask the disk to flush to the persistent medium (but the disk can ignore this request).

Here is the code:
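A sketch of the general shape using the managed API (file name and buffer sizes are illustrative; FileOptions.WriteThrough corresponds to FILE_FLAG_WRITE_THROUGH, while FILE_FLAG_NO_BUFFERING has no FileOptions equivalent and needs the raw flag plus sector-aligned buffers, so it is omitted here):

```csharp
using System;
using System.IO;

class JournalBenchmark
{
    const int TransactionSize = 16 * 1024;
    const int Transactions = 65535;

    static void Main()
    {
        var tx = new byte[TransactionSize];

        // FileOptions.WriteThrough maps to FILE_FLAG_WRITE_THROUGH; swap it for
        // FileOptions.None to measure plain buffered writes.
        using (var journal = new FileStream("journal.buf", FileMode.Create, FileAccess.Write,
                   FileShare.None, 4096, FileOptions.WriteThrough))
        {
            journal.SetLength(1024L * 1024 * 1024); // pre-allocate the 1 GB journal file

            for (int i = 0; i < Transactions; i++)
            {
                journal.Write(tx, 0, tx.Length); // one 16 KB "transaction"
                // for the "Buffered + FlushFileBuffers" scenario, call journal.Flush(true) here
            }
        }
    }
}
```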

We have tested this on an AWS machine (i2.2xlarge – 61 GB, 8 cores, 2x 800 GB SSD drives, 1 GB/sec EBS), which was running Microsoft Windows Server 2012 R2 RTM, 64-bit. The code was compiled for 64 bits with the default release configuration.

What we are doing is writing a 1 GB journal file, simulating 16 KB transactions and 65,535 separate commits to disk. That is a lot of work that needs to be done.

First, again, I ran it on the system drive, to compare how it behaves:

Method                                             Time (ms)   Write cost (ms)
Buffered                                                 396             0.006
Buffered + FlushFileBuffers                          121,403             1.8
FILE_FLAG_WRITE_THROUGH                               58,376             0.89
FILE_FLAG_NO_BUFFERING                                56,162             0.85
FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING      55,432             0.84

Remember, this is us running on the system disk, not on the SSD drive. Here are those numbers, which are much more interesting for us.

Method                                             Time (ms)   Write cost (ms)
Buffered                                                 410             0.006
Buffered + FlushFileBuffers                           21,077             0.321
FILE_FLAG_WRITE_THROUGH                               10,029             0.153
FILE_FLAG_NO_BUFFERING                                 8,491             0.129
FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING       8,378             0.127

And those numbers are very significant. Unlike the system disk, where we basically get whatever spare cycles we have, in both Linux and Windows, the SSD disk provides really good performance. But even on identical machines, running nearly identical code, there are significant performance differences between them.

Let me draw it out for you:

Options                                                                 Windows   Linux   Difference
Buffered                                                                  0.006    0.03    80% Win
Buffered + fsync() / FlushFileBuffers()                                   0.32     0.35     9% Win
O_DSYNC / FILE_FLAG_WRITE_THROUGH                                         0.153    0.29    48% Win
O_DIRECT / FILE_FLAG_NO_BUFFERING                                         0.129    0.14     8% Win
O_DIRECT | O_DSYNC / FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING     0.127    0.31    60% Win

In pretty much all cases, Windows has been able to outperform Linux in this specific scenario, in many cases by a significant margin. In particular, in the scenario that I actually really care about, we see a 60% performance advantage for Windows.

One of the reasons for this blog post, with the detailed code and scenario, is to get external verification of these numbers. I’d love to learn that I missed something that would make Linux’s speed comparable to Windows, because right now this is pretty miserable.

I do have a hunch about those numbers, though. SQL Server is a major business for Microsoft, so they have a lot of pull in the company. And SQL Server uses FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING internally to handle the transaction log it uses. Like quite a few other Win32 APIs (WriteFileGather, for example), it looks tailor-made for database journaling. I’m guessing that this code path has been gone over multiple times over the years, trying to optimize SQL Server by smoothing anything in the way.

As a result, if you know what you are doing, you can get some really impressive numbers on Windows in this scenario. Oh, and just to quiet the nitpickers:


Fast transaction log: Linux

time to read 5 min | 910 words

We were doing some perf testing recently, and we got some odd results when running a particular benchmark on Linux. So we decided to check this on a much deeper level.

We got an AWS machine (i2.2xlarge – 61 GB, 8 cores, 2x 800 GB SSD drives, running Ubuntu 14.04, using kernel version 3.13.0-74-generic, 1 GB/sec EBS drives) and ran the following code on it. This tests the following scenarios on a 1 GB file (pre-allocated), “committing” 65,536 (64K) transactions with 16 KB of data in each. The idea is that we are testing how fast we can write those transactions to the journal file, so we can consider them committed.

We have tested the following scenarios:

  • Doing buffered writes (pretty useless for any journal file, which needs to be reliable, but good baseline metric).
  • Doing buffered writes and calling fsync after each transaction (a pretty common way to handle committing to disk in databases).
  • Doing buffered writes and calling fdatasync (which is supposed to be slightly more efficient than calling fsync in this scenario).
  • Using O_DSYNC flag and asking the kernel to make sure that after every write, everything will be synced to disk.
  • Using O_DIRECT flag to bypass the kernel’s caching and go directly to disk.
  • Using O_DIRECT | O_DSYNC flag to ensure that we don’t do any caching, and actually force the disk to do its work.

The code is written in C, and it is written to be pretty verbose and ugly. I apologize for how it looks, but the idea was to get some useful data out of this, not to generate beautiful code. It is quite probable that I made some mistakes in writing the code, which is partly why I’m talking about this.

Here is the code, and the results of execution are below:

It was compiled using: gcc journal.c -O3 -o run && ./run

The results are quite interesting:

Method                  Time (ms)   Write cost (ms)
Buffered                      525              0.03
Buffered + fsync           72,116              1.10
Buffered + fdatasync       56,227              0.85
O_DSYNC                    48,668              0.74
O_DIRECT                   47,065              0.71
O_DIRECT | O_DSYNC         47,877              0.73

The buffered call is useless for a journal, but important as something to compare against. The rest of the options will ensure that the data resides on disk* after the call to write, and are suitable for actually getting safety from the journal.

* The caveat here is the use of O_DIRECT; the docs (and Linus) seem to be pretty much against it, and there are few details on how this works with regard to instructing the disk to actually bypass all buffers. O_DIRECT | O_DSYNC seems to be the recommended option, but that probably deserves more investigation.

Of course, I had this big long discussion of the numbers planned. And then I discovered that I was actually running this on the boot disk, and not one of the SSD drives. That was a facepalm of epic proportions, and I was only saved from it by the problematic numbers that I was seeing.

Once I realized what was going on and fixed that, we got very different numbers.

Method                  Time (ms)   Write cost (ms)
Buffered                      522              0.03
Buffered + fsync           23,094              0.35
Buffered + fdatasync       23,022              0.35
O_DSYNC                    19,555              0.29
O_DIRECT                    9,371              0.14
O_DIRECT | O_DSYNC         20,595              0.31

There is about 10% difference between fsync and fdatasync when using the HDD, but there is barely any difference as far as the SSD is concerned. This is because the SSD can do random updates (such as updating both the data and the metadata) much faster, since it doesn’t need to move the spindle.

Of more interest to us is that O_DSYNC is significantly faster on both the SSD and the HDD. About 15% can be saved by using it.

But I’m just drooling over the O_DIRECT write profile performance on the SSD. That is so good; unfortunately, it isn’t safe to use on its own. We need both it and O_DSYNC to make sure that we only get confirmation on the write when it hits the physical disk, instead of the drive’s cache (that is why it is actually faster: we are writing directly to the disk’s cache, basically).

The tests were run using ext4. I played with some options (noatime, nodiratime, etc.), but there wasn’t any big difference between them that I could see.

In my next post, I’ll test the same on Windows.

The cost of the async state machine

time to read 3 min | 483 words

In my previous post, I talked about the high performance cost of the async state machine in certain use cases. I decided to test this out using the following benchmark code.
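A sketch of what such a benchmark can look like (an assumed shape, not the original code; the port and chunk size are illustrative):

```csharp
using System;
using System.Diagnostics;
using System.Net.Sockets;
using System.Threading.Tasks;

class SendBenchmark
{
    const int Total = 16 * 1024 * 1024; // 16 MB

    static void SendSync(NetworkStream stream)
    {
        var buffer = new byte[4096];
        for (int sent = 0; sent < Total; sent += buffer.Length)
            stream.Write(buffer, 0, buffer.Length);      // blocking API
    }

    static async Task SendAsync(NetworkStream stream)
    {
        var buffer = new byte[4096];
        for (int sent = 0; sent < Total; sent += buffer.Length)
            await stream.WriteAsync(buffer, 0, buffer.Length); // async API
    }

    static async Task Main()
    {
        using (var client = new TcpClient("127.0.0.1", 9999)) // loopback, as in the post
        using (var stream = client.GetStream())
        {
            var sw = Stopwatch.StartNew();
            SendSync(stream);
            Console.WriteLine($"Sync:  {sw.Elapsed}");

            sw.Restart();
            await SendAsync(stream);
            Console.WriteLine($"Async: {sw.Elapsed}");
        }
    }
}
```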

Here we have two identical pieces of code, sending 16 MB over the network. (Actually the loopback device, which is important, and we’ll look into exactly why later)

One of those is using the synchronous, blocking API, and the second is using the async API. Beside the API differences, they both do the exact same thing: send 16 MB over the network.

  • Sync - 0.055 seconds
  • Async - 0.380 seconds

So the async version is about 7 times worse than the sync version. But when I try this out with a 256 MB test, the results are roughly the same.

  • Sync - 0.821 seconds
  • Async - 5.669 seconds

The reason this is the case is likely related to the fact that we are using the loopback device: in this case, we are almost never blocking in the sync case; we almost always have the data available for us to read. In the async case, we have to pay for the async machinery, but we never actually get to yield the thread.

So I decided to test what would happen when we introduce some additional I/O latency. It was simplest to write the following:
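A sketch of such a latency-adding proxy (the ports and the amount of injected delay are assumptions):

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

class DelayProxy
{
    static async Task Main()
    {
        var listener = new TcpListener(IPAddress.Loopback, 9090);
        listener.Start();

        while (true)
        {
            var client = await listener.AcceptTcpClientAsync();
            var server = new TcpClient("127.0.0.1", 9999); // the real endpoint
            // pump both directions independently
            _ = PumpAsync(client.GetStream(), server.GetStream());
            _ = PumpAsync(server.GetStream(), client.GetStream());
        }
    }

    static async Task PumpAsync(NetworkStream from, NetworkStream to)
    {
        var buffer = new byte[4096];
        int read;
        while ((read = await from.ReadAsync(buffer, 0, buffer.Length)) > 0)
        {
            await Task.Delay(1); // the artificial I/O latency
            await to.WriteAsync(buffer, 0, read);
        }
    }
}
```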

And then I piped both versions through the same proxy, which gave the following results for 16 MB.

  • Sync - 29.849 seconds
  • Async - 29.900 seconds

In this case, we can see that the sync benefit has effectively been eliminated. The cost of the async state machine is swallowed in the cost of the blocking waits. But unlike the sync system, when the async state machine is yielding, something else can run on this thread.

So it is all good, right?

Not quite. As it turns out, there are certain specific scenarios where you actually want to run a single, long, important operation, and you can set things up so the TCP connection will (almost) always have buffered data in it. This is really common when you send a lot of data from one machine to the other. Not a short request, but something where you can get big buffers, and assume that the common setup is a steady state of consuming the data as fast as it comes. In RavenDB’s case, for example, one such very common scenario is the bulk insert mode. The client is able to send the data to the server fast enough in almost all cases that by the time the server processes a batch, the next batch is already buffered and ready.

And in that specific case, I don’t want to be giving up my thread to something else to run on. I want this particular process to be as fast as possible, potentially at the expense of much else.

This is likely to have some impact on our design, once we have the chance to run more tests.

The HashSet in the Hot Spot

time to read 1 min | 145 words

The title of this post reminds me of a Bones episode. This blog post is actually from an internal mail by Federico Lois, as part of bigger perf work.


Here is the calling code; note that this is typically called many times, and _pageStates is a HashSet<PagerState>.


The following change was made:


And here are the results:


That is a 230% improvement! I can identify with the sentiment.
