Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:


+972 52-548-6969

, @ Q c

Posts: 6,195 | Comments: 46,073

filter by tags archive

Digging into the CoreCLRExceptional costs, Part I

time to read 4 min | 624 words

Note, this post was written by Federico.

One guideline which is commonly known is: "Do not use exceptions for flow control." You can read more about it in many places, but this is good compendium of the most common arguments. If you are not acquainted with the reasons, give them a read first; I’ll wait.

Many of the reasons focus on the readability of the code, but remember, my work (usually) revolves around writing pretty disgusting albeit efficient code. So even though I care about readability it is mostly achieved through very lengthy comments on the code on why you shouldn't touch something if you cannot prove something will be faster.

Digression aside the question is still open. What is the impact of using exceptions for control flow (or having to deal with someone else throwing exceptions) in your performance sensitive code? Let's examine that in detail.

For that we will use a very simple code to understand what can happen. 


This is a code that is simple enough so that the assembler won’t get too convoluted, but at the same time sport at least some logic we can use as markers.

Let's first inspect the method CanThrow, in there what we can see is how the throwing of exceptions happen:


As you can see there is a lot of things to be done just to throw the exception. There in the last call we will use jump to the proper place in the stack and continue in the catch statement that we hit.


So here is the code of our simple method. At the assembler level, our try statement has a very important implication. Each try-catch forces the method to deal with a few control flow issues. First it has to store the exception handler in case anything inside would throw, then it has to do the actual work. If there is no exception (the happy path) we move forward and end. But what happen if we have an exception? We first need to remove the handler (we don't want to recheck this handler if we end up throwing inside the catch, right?) Then execute the catch and be done.

But now let’s contrast that to the generated code if no try-catch statement happens. The avid reader will realize that the happy path will never be executed because we are throwing, but don’t worry, the code is the same if there is no inlining happening.


We will talk about why the code ends up like this in a follow up post, but suffice to say that all this trouble cannot beat a check for a Boolean if you needed to fail (and could do something about it). 

It is also important to remember that this kind of work is only relevant if you are in the hot path. If you are not calling a function at least a few tens of thousands a second, don’t even bother, your performance costs are elsewhere. This is micro optimization land.

Digging into the CoreCLRJIT Introduction

time to read 3 min | 401 words

Note, this post was written by Federico.

imageThe .Net Just-In-Time compiler (or JIT for short) is one of those marvels you don't even realize is there, unless you have to smooth talk it to do your bidding. At the Voron level, most routines are so tight that the assembler emitted becomes important. It is not uncommon to have to reason about how the instructions are going to be decoded by the CPU, if the microarchitecture is able to execute two instructions independently or not or we are memory-bound (having too many cache misses).

Those with a C/C++ background know the optimizing compiler is actually quite good at its job; and even so, tricks are required to squeeze the last performance bits. The JIT is no different, with the addition that while the C++ compiler has all the time in the world (sort-of); the restrictions imposed by compiling while your code is executing are far more severe. However, even on a budget, the CoreCLR does a pretty amazing job at optimizing. Until, of course, you expect the performance feats of the offline C++ compiler from it.

So the question you may probably have is: "Why bother? Why don't you just go and write the database in C++?”. I won't go in detail on that, since we covered that already . What I can tell you is, that when you dig into the JIT world, it is pretty easy to achieve performance on par with C++ if the JIT by design doesn't get in your way with unsupported features. Good thing is the CoreCLR team is quite open to tackle those  issues if presented with a detailed analysis on the cost-effectiveness of your optimization.

In this series we will talk about the tricks we had to teach our code to squeeze those last bits of micro-optimizations. From specific optimizations requested to the development team and workarounds until they are available, to tricks we use to ensure the assembler code generated is of the highest quality possible.

One word of caution: The JIT is ever improving so it is important to understand that the optimizations opportunities presented here may not apply tomorrow or just be done automatically in future versions of CoreCLR.

I tell you, that thing is a bona fide ZEBRA, or a tale of being utterly stupid

time to read 3 min | 414 words

We run our test suite in a loop to discover any race conditions, timing issues, errors, etc. When doing so, we got a hard crash from the dotnet.exe, and investigating the issue produced a stack trace inside the GC.

So I took a dump of the process memory, and created an issue about that with the CoreCLR repository, while giving it a very high priority internally, and having someone look at that very closely. We are using unsafe code extensively, so it was either a real GC bug or we messed up somewhere are corrupted our own state.

Very quickly Jan Kotas was able to point out that it was a heap corruption issue as well as the likely avenues for investigation.

After looking at this, we found that the problem was in our tests. In particular, in one specific test. In order to test the memory corruption, we changed it to add markers on where it overwrote the buffer, and the test passed.

This caused us additional concern, because the only thing we could think about was that maybe there is some invariant that is being broken. Our suspicion focused on the fixed statement in C# not working properly. Yes, I know, “hoof beats, horses, not zebras”.

So I went to the issue again and reported my finding, and Andy Ayers was kind enough to find the problem, and point it to me.

Here is the relevant test code:


This is during debugging, so you can see what the problem is. We defined size to be 40, and we defined an input buffer, whose size is 100.

A little bit below, we created an output buffer based on the size variable (40), and then wrote to it with the expected size of input.Length, which is 100. Everything behaved as it should, and we had a buffer overrun in the test, the heap was corrupted, and sometimes the GC died.

Also, I feel very stupid about spouting all sort of nonsense about bugs in the CLR when our code is unable to do simple arithmetic.

The good news, the bug was only in the tests, and the kind of support that you get from Microsoft on the CoreCLR is absolutely phenomenal. Thank you very much guys.

The Guts n’ Glory of Database Internals: Early lock release

time to read 2 min | 376 words

After talking about transaction merging, there is another kind of database trick that you can use to optimize your performance even further. It is called early lock release. With early lock release, we rely on the fact that the client will only know when the transaction has been committed after we told him. And that we don’t have to tell it right away.

That sounds strange, how can not telling the client that its transaction has been committed improve performance. Well, it improve throughput, but it can also improve the performance. But before we get into the details, let us see the code, and then talk about what it means:

The code is nearly identical to the one in the previous post, but unlike the previous post, here we play a little game. Instead of flushing the journal synchronously, we are going to do this in an async manner. And what is more important, we don’t want for it to complete. Why is that important?

It is important because notifying the caller that the transaction has been complete has now moved into the async callback, and we can start processing additional operations are write them to the journal file at the same time that the I/O for the previous operation completes.

As far as the client is concerned, it doesn’t matter how it works, it just need to get the confirmation that it has been successfully committed. But from the point of view of system resources, it means that we can parallelize a key aspect of the code, and we can proceed with the next set of transactions  before the previous one even hit the disk.

There are some issues to consider. Typically you have only a single pending write operation, because if the previous journal write had an error, you need to abort all future transactions that relied on its in memory state (effectively, rollback that transaction and any speculative transactions we executed assuming we can commit that transaction to disk. Another issue is that error handling is more complex, if a single part of the merge transaction failed, it will fail unrelated operations, so you need to unmerge the transaction, run them individually, and report on each operation individually.

The Guts n’ Glory of Database Internals: Merging transactions

time to read 4 min | 790 words

A repeating choice that we have to make in databases is to trade off performance for scale. In other words, in order to process more requests per time frame, we need to increase the time it takes to process a single request. Let’s see how this works with the case of transaction commits.

In the simplest model, a transaction does its work, prepares itself to be committed, writes itself to the journal, and then notifies the client that it has successfully committed. Note that the most important part of the process is writing to the journal, which is how the transaction maintains durability. It also tends to be, by far, the most expensive part of the operation.

This leads to a lot of attempts to try and change the equation. I talked about one such option when we talked about the details of the transaction journal, having a journal actor responsible for writing the transaction changes as they happen, and amortize the cost of writing them over time and many transactions. This is something that quite a few databases do, but that does have certain assumptions.

To start with, it assumes that transactions are relatively long, and spend a lot of their time waiting for network I/O. In other words, this is a common model in the SQL world, where you have to open a connection, make a query, then make another query based on the results of that, etc. The idea is that you parallelize the cost of writing the changes to the journal along with the time it takes to read/write from the network.

But other databases do not use this model. Most NoSQL databases use the concept of a single command (which may aggregate commands), but they don’t have the notion of a long conversation with the client. So there isn’t that much of a chance to spread the cost of writing to the journal on the network.

Instead, a common trick is transaction merging. Transaction merging relies on the observation that I/O costs are no actually linear to the amount of I/O that you use. Oh, sure, writing 1KB is going to be faster than writing 10 MB. But it isn’t going to be two orders of magnitude faster. In fact, it is actually more efficient to make a single 10MB write than 1024 writes on 1 KB. If this make no sense, consider buying groceries and driving back to your house. The weight of the groceries has an impact on the fuel efficiency of the car, but the length of the drive is of much higher importance in terms of how much fuel is consumed. If you buy 200 KG of groceries (you are probably planning a hell of a party) and drive 10 KB home, you are going to waste a lot less fuel then if you make 4 trips with 50 KG in the trunk.

So what is transaction merging? Put simply, instead of calling the journal directly, we queue the operation we want to make, and let a separate thread run through all the concurrent operations. Here is the code:

The secret here is that if we have a load on the system, by the time we read from the queue, there are going to be more items in there. This means that when we write to the journal file, we’ll write not just a single operation (a trip back & forth to the grocery store), but we’ll be able to pack a lot of those operations immediately (one single trip). The idea is that we buffer all of the operations, up to a limit (in the code above, we use time, but typically you would also consider space), and then flush them to the journal as a single unit.

After doing this, we can notify all of the operations that they have been successfully completed, at which point they can go and notify their callers. We traded off the performance of a single operation (because we now need to be more complex and pay more I/O) in favor of being able to scale to a much higher number of operations. If your platform support async, you can also give up the thread (and let some other request run) while you are waiting for the I/O to complete.

The number of I/O requests you are going to make is lower, and the overall throughput is higher. Just to give you an example, in one of our tests, this single change moved us from doing 200 requests / second (roughly the maximum number of fsync()/ sec that the machine could support) to supporting 4,000 requests per second (x20 performance increase)!

RavenDB 3.5 Release Candidate

time to read 2 min | 365 words

I’m very happy to announce that the Release Candidate of RavenDB is now available. We have been working hard for the past few months to incorporate feedback that we got from customers who started using the beta, and we are finally ready.

There have been quite a lot of fixes, improvements and enhancements, but in this post I want to focus on the most exciting three.

Waiting for write assurance

Take a look at the following code:

It asks RavenDB to save a new user, and then wait (on the server) for the newly saved changes to be replicated to at least another 2 servers, up to a maximum of 30 seconds.

This give you the option of choosing how safe you want to be on saves. For high value documents (a million dollar order, for example), we will accept the additional delay.

Waiting for indexes

In the same vein, we have this:

This feature is meant specific for PRG (Post, Redirect, Get) scenarios, where you add a new document and immediately query for it. In this case, you can ask the server to wait until the indexes are caught up with this particular write.

You can also set a timeout and whatever to throw or not. But more interestingly, you can specify specific indexes that you want to wait for. If you don’t specify anything, RavenDB will automatically select just the indexes that are impacted by this write.

Dynamic leader selection

In the RavenDB Conference, we showed how you can define a cluster of RavenDB servers and have them vote on a leader dynamically, then route all writes to the leader. In the presence of failure, we will select a new leader and stick to it as long as it is able.

Some of the feedback we got was related to how we select the leader. We now take into account the state of the databases on the server, and will not select a leader whose databases aren’t up to date with regards to replication. We also reduced the cost of replication in general, by significantly reducing the amount of metadata that we need to track.

The Guts n’ Glory of Database Internals: Log shipping and point in time recovery

time to read 3 min | 519 words


In my previous post I talked about the journal and how it is used to recover the database state in case of errors.

In other words, we write everything that we are going to do into the journal, and the code is written to run through the journal and apply these changes to the database.

Ponder this for a moment, and consider the implications of applying the same concept elsewhere. Log shipping is basically taking the journal file from one server and asking another server to apply it as well. That is all.

Everything else are mere details that aren’t really interesting… Well, they are, but first you need to grasp what is so special in log shipping. If you already have a journal and a recovery from it, you are probably at least 80% or so of the way there to getting log shipping to work.

Now that you understand what log shipping is, let’s talk about its implications. Depending on the way your log is structured, you might be able to accept logs from multiple servers and apply all of them (accepting the fact that they might conflict), but in practice this is almost never done in this manner. Log shipping typically mandates that one server would be designated the master (statically or dynamically), and the other(s) are designated as secondary (typically read only copies).

The process in which the logs are shipped is interesting. You can do that on a time basis (every 5 – 15 minutes), every time that you close a journal file (64MB – 256MB), etc. This is typically called offline log shipping, because there is a large gap between the master and the secondary. There is also online log shipping, in which every write to the journal is also a socket write to the other server, which accepts the new journal entries, write them to its own journal and applies them immediately, resulting in a much narrower delay between the systems.  Note that this has its own issues, because this is now a distributed system with all that it implies (if the secondary isn’t available for an hour, what does it mean, etc.).

But journals also allow us to do other fun stuff. In particular, if the journal file records the timestamp of transactions (and most do), they allow us to do what is called “point in time” recovery. Starting from a particular backup, apply all the committed transactions until a certain point in time, bring up to the state of the database at 9:07 AM (one minute before someone run an UPDATE statement without the WHERE clause).

Note that all of the code that actually implements this has already been written as part of ensuring that the database can recover from errors, and all we need to implement “point in time” recovery is to just stop at a particular point in time. That is a pretty neat thing, in my opinion.

The Guts n’ Glory of Database Internals: What goes inside the transaction journal

time to read 8 min | 1413 words

I talk a lot about journals, and how to make sure they are durable and safe. What I haven’t talked about is what you put in the journal. Carsten reminded me of this lack, and I want to thank him for that.

The first thing we need to understand is what the journal is supposed to do. The whole point of the journal is to make sure that you if you had some sort of error (including power loss), you can use the information on the journal to recover up to the same point you were at before the failure. More formally, in order to maintain ACID, we need to ensure that any transactions we committed (so clients rely on that) are there, and any ongoing transactions have been rolled back properly. The journal is the key to that, but how?

A lot of this depends on the database engine itself, and that has an impact on the file format you used in the journal. The way you work with the journal has also a lot to do with the internals of the database.

Let us take a relational database as an example. We want to insert a new row to a table. The size of the row is usually pretty small, a few dozen to a few hundred bytes. So it’s cool, but how does it play into the journal? Well, before the database can actually modify the data, it needs to write its intent to do so in the journal, and flush it. Only then can it proceed, knowing that there is a stable record in the case of an error.

One way of doing this is to write the following this record into the journal file:

{ position: 1234, value: [binary] }

After it is flushed, we can go ahead and modify the file itself. If we crash, during recovery we’ll see that we have this entry in the journal, and we’ll write it again to the same place in the data file. Either this will be a no-op, because it was previously applied, or we’ll re-apply the change. At any rate, as an external observer, the fact that the journal entry exists ensures that on recovery, the value will be the same. Hence, the fact the entry was written and flushed to the journal ensures that the transaction change was committed.

So this is the high level overview. The devil is in the details. The simplest journal file just records binary data and position, so we can write it out to the data file on recovery. But while that works, it is limiting the database engine in what it can do. If you have multiple concurrent transactions all of which are generating entries to the log, you don’t want to force them to be written to the journal only on transaction commit, you want them to be appended to the log and flushed (so you can amortize the cost), and you want to write them out immediately.

But this means that you are writing uncommitted changes to the journal file. So each of your entry also has a transaction number, and your transaction commit is another entry in the journal that orders to apply all operations from that particular transaction. At this point, you usually have different actors inside the database engine. You have an actor that is responsible for writing to the journal file, and one is typically merging multiple entries from multiple concurrent transactions to allow the maximum level of performance. Consider the implications of the following journal actor implementation:

All pending entries to the journal are actually written using buffered I/O, so they are very fast. However, there are a few things to notice here.

  • We keep a record of the hash (CRC, MD5, etc) of all the writes.
  • When we get a transaction commit entry, we write it to the disk, and then write the hash for all the entries that were written since the last transaction commit.
  • We flush to disk, ensuring that we are actually persisting on spinning rust.
  • We let the caller know the transaction has been safely journaled to disk.

This code has a lot of tiny implications that are not immediately obvious. For example, whenever we send a journal entry to be written, we don’t need to wait for it, we can proceed immediately to the next change in the transaction. And because the writes are buffered, even the I/O on the journal actor is very cheap.

However, when we send a commit transaction entry, we do a few more things. Writing the hash of the all the entries that were written since the last transaction allows us to verify that all entries that were written were written successfully. If there was a power failure while we were writing the transaction, we would know and realize that this transaction isn’t committed, and that is where the log ends.

And while I don’t need to wait for the journal entry to be flushed to disk, I do have to wait for the transaction itself to be written to disk before I let the user know that it has been committed. This type of structure also has the advantage of transactions sharing the load. If we have a long transaction that does something, its journal entries will be flushed to disk along with the other committing transactions. When we need to commit the long transaction, the amount of work we’ll actually have to do is a lot less.

The structure of the journal entry can range from the simple mode of writing what binary data to modify where, to actually have meaning for the database engine (SET [column id]=[value]). The latter is more complex, but it gives you more compact representation for journal entries, which means that you can pack more of them in the same space and I/O is at a premium.

For the technical reasons mentioned in previous posts (alignment, performance, etc.), we typically write only on page boundaries (in multiples of 4KB), so we need to consider whether it is actually worth our time to write things now, or wait a bit and see if there is additional data coming that can complete the target boundary.

What I described so far is one particular approach to handle that. It is used by databases such as PostgresSQL, Cassandra, etc. But it doesn’t work quite in this manner. In the case of something like LevelDB, the journal is written one transaction at a time, under the lock, and in multiples of 32KB. The values in the journal are operations to perform (add / delete from the database), and that is pretty much it. It can afford to do that because all of its data files are immutable. PostgresSQL uses 8KB multiples and only stores in the journal only the data it needs to re-apply the transaction. This is probably because of the MVCC structure it has.

SQL Server, on the other hand, stores both redo and undo information. It make sense, if you look at the journal actor above. Whenever we write an entry to the journal file, we don’t yet know if the transaction would be committed. So we store the information we need to redo it (if committed) and undo it (if it didn’t). Recovery is a bit more complex, and you can read about it the algorithm used most frequently (at least as the base) here: ARIES.

With Voron, however, we choose to go another way. Each transaction is going to keep track of all the pages it modified. And when the transaction commits, we take all of those pages and compress them (to reduce the amount of I/O we have to do), compute a hash of the compressed data, and write it on 4KB boundary. Recovery then becomes trivial. Run through the journal file, and verify each transaction hash at a time. Once this is done, we know that the transaction is valid, so we can decompress the dirty pages and write them directly to the data file. Instead of handling operations log, we handle full pages. This means we don’t actually care what modifications run on that page, we just care about its state.

It makes adding new facilities to Voron very simple, because there is a very strict layering throughout, changing the way we do something has no impact on the journaling / ACID layer.

Non reproducible / intermittent error handling

time to read 2 min | 370 words

We recently had to deal with a stress test that was failing (very) occasionally. We looked into that, and we figured out that this was actually exposing a high severity bug. So we looked into it pretty seriously.

And we keep hitting a wall.

There are a few reasons why you would have a non-reproducible error. The most common issue is if you have a race. For example, if two threads are modifying the same variable without proper synchronization, this will cause those kind of symptoms. We have a lot of experience in dealing with that, and all the signs were there. But we still hit a wall.

You see, the problem was that the entire section of the code we were looking at was protected by a lock. There was no possibility of an error because of threading, since this whole part of the code just couldn’t run concurrently anyway.

So why was it failing only occasionally? If it is single threaded, it should be predictable. In fact, the reason there was a lock there, instead of the more complex merge operations we used to have, was specifically to support reproducibility. The other kind of issue that can create this sort is I/O (which has its own timing issues), but the error happened in a part of the code that was operating on internal data structures, so it wasn’t that.

Eventually we figured it out. This was a long background operation, and because we needed to have the lock in place, we had a piece of code similar to this:

Because this is a long running operation, under lock, we do this in stages, and make sure that other things can run while we do that. But this was exactly what introduced the variability in the test results, and that made it so random and annoying. Once we figured that this was the cause for the difference, all we had to do was write the proper log of operations, and execute it again.

The rest was just finding out which of the 200,000 transactions executed actually caused the problem, mere detail work Smile.

Debugging CoreCLR applications in WinDBG

time to read 2 min | 208 words

One of the nicest tools that you have as a developer is the ability to debug. WinDBG isn’t what I call the best debugger in the world, but it is certainly among the most powerful. This post is meant just to walk you through setting up WinDBG with a CoreCLR application.

In particular, this is important when you are getting a crash dump from somewhere. In order to actually debug things properly, you need to load the crash dump into WinDBG and then run the following commands:

  1. .sympath srv*c:\Symbols*https://msdl.microsoft.com/download/symbols
  2. .load C:\Program Files\dotnet\shared\Microsoft.NETCore.App\1.0.0\sos.dll .
  3. .reload

The first step is to setup the symbols, so you can see method names, instead of addresses. The second will use the CoreCLR SOS dll (note, if you have different versions, you might need to get the sos.dll from the machine that the user is running along with the dump). And finally you are reloading the symbols (this will be slow on the first time).

That is it, now you can start working with WinDBG relatively normally, although I did notice some some of the commands have better UI. Looks like SOS and SOSEX had a meeting Smile.


  1. Voron’s internals: MVCC - All the moving parts - one day from now
  2. Voron internals: Cleaning up scratch buffers - 4 days from now
  3. Voron internals: The transaction journal & recovery - 5 days from now
  4. Voron internals: I/O costs analysis - 6 days from now
  5. Benchmarking with Go - 7 days from now

And 3 more posts are pending...

There are posts all the way to Sep 06, 2016


  1. Database Building 101 (8):
    25 Aug 2016 - Graph querying over large datasets
  2. Production postmortem (16):
    23 Aug 2016 - The insidious cost of managed memory
  3. Digging into the CoreCLR (3):
    12 Aug 2016 - Exceptional costs, Part II
  4. The Guts n’ Glory of Database Internals (20):
    08 Aug 2016 - Early lock release
View all series



Main feed Feed Stats
Comments feed   Comments Feed Stats