Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:


+972 52-548-6969


Externalizing the HttpClient internals for fun & profit

time to read 2 min | 382 words

In many respects, HttpClient is much better than using the old WebRequest API. It does a lot more for you and it is much easier to use in common scenarios.

In others, the API is extremely constraining. One such example is when you want to generate the request incrementally (maybe based on other things that are going on). Using WebRequest, this is trivial: you get the request stream and just start writing to it. With HttpClient, the actual request stream is buried several layers deep, well out of reach.

Sure, you can use PushStreamContent to generate the data to write to the stream, but it doesn't help if you need to be called with more information. For example, let us imagine the following interface:
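The interface was shown as an image in the original post; a reconstruction along these lines fits the description (the names here are my guesses, not the original code):

```csharp
using System.Threading.Tasks;

// Hypothetical reconstruction: Init sets the target URL, Upload is called
// once per file, and Done finishes the request and returns the reply.
public interface IFileUploader
{
    void Init(string url);
    void Upload(string filePath);
    Task<string> Done();
}
```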


It is a pretty silly one, but it should explain things. We first call Init, passing it the URL we want to POST to, and then we upload multiple files to the server. Using HttpClient, the usual way would be to gather all the file names during the Upload method and then use PushStreamContent to push them all to the server in the Done method.

This is awkward if we have a lot of files, or if we want to generate them and delete them after the upload. Luckily, we can cheat and get the same behavior we had with WebRequest. Let us examine the code:
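The code was shown as an image in the original post; the core of the trick is a custom HttpContent along these lines (an illustrative sketch, with my own names):

```csharp
using System.IO;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

// The content exposes its output stream to the caller and keeps the
// request open until Complete() is called.
public class ExposedStreamContent : HttpContent
{
    private readonly TaskCompletionSource<Stream> _stream = new TaskCompletionSource<Stream>();
    private readonly TaskCompletionSource<object> _done = new TaskCompletionSource<object>();

    // The caller awaits this to get direct access to the request stream.
    public Task<Stream> GetStreamAsync() => _stream.Task;

    // Signals that we have finished writing the body.
    public void Complete() => _done.TrySetResult(null);

    protected override Task SerializeToStreamAsync(Stream stream, TransportContext context)
    {
        _stream.TrySetResult(stream);
        return _done.Task; // "we are busy now, don't finish the request yet"
    }

    protected override bool TryComputeLength(out long length)
    {
        length = -1; // unknown; forces chunked transfer encoding
        return false;
    }
}
```

Init would then start `client.PostAsync(url, content)` without awaiting it, Upload would write each file to the stream returned by GetStreamAsync(), and Done would call Complete() and await the response.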

The first thing we do is spin up a POST request to the server, but we are doing something strange: instead of generating the data when we are called on SerializeToStreamAsync, we expose the stream to the outside and return another task. Effectively, we are telling the HttpContent that we are busy now, and it shouldn't bug us with details until we let it know.

Then we wait to get the stream, and we can start uploading each file in turn. At the end, we need to let the HttpClient know that we are done sending data to the server, at which point we just wait for the server response, and we are done.

Getting 600 times better performance

time to read 1 min | 73 words

During the design process for the next release of RavenDB, we set ourselves a pretty crazy goal. We wanted to get a tenfold performance improvement across the board…

This is how my article about gaining a 600x performance improvement in the new DZone Guide to Performance: Optimization & Monitoring starts. For the rest, head there and read it all.

What did all this optimization give us?

time to read 2 min | 332 words

I’ve been writing a lot about performance and optimizations, and mostly I’ve been giving out percentages, because they are useful for comparing against where we were before the optimizations.

But when you start looking at the raw numbers, you see a whole different picture.

On the left, we have RavenDB 4.0 doing work (import & indexing) over about 4.5 million documents. On the right, you have RavenDB 3.5, doing the same exact work.

We are tracking allocations here, and this is part of the work we have been doing to measure our relative change in costs. In particular, we focused on the cost of using strings.

A typical application will use about 30% of memory just for strings, and you can see that RavenDB 3.5 (on the right) is no different.


On the other hand, RavenDB 4.0 is using just 2.4% of its memory for strings. But what is even more interesting is to look at the total allocations. RavenDB 3.5 allocated about 300 GB to deal with the workload, and RavenDB 4.0 allocated about 32GB.


Note that those are allocations, not total memory used, but the gains show up on just about every metric. Take a look at those numbers:


RavenDB 4.0 is spending less time overall in GC than RavenDB 3.5 spends just on blocking collections.

Amusingly enough, here are the saved profile runs:


GoTo based optimizations

time to read 2 min | 372 words

One of the things that we did recently was go over our internal data structures in RavenDB and see if we can optimize them. Some of those changes are pretty strange if you aren’t following what is actually going on. Here is an example:
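The before/after snippets were images in the original post; the transformation was along these lines (an illustrative sketch, not the actual RavenDB code):

```csharp
using System;

public class NumericStack
{
    private readonly long[] _array = new long[16];
    private int _size;

    public void Push(long value) => _array[_size++] = value;

    // Before: the throw-helper call sits in the middle of the hot path.
    public long PopBefore()
    {
        if (_size == 0)
            ThrowForEmptyStack();
        return _array[--_size];
    }

    // After: the hot path falls straight through; the failure path is
    // reached via goto, and because ThrowForEmptyStack "returns" a value,
    // the code generated for the method stays very short.
    public long PopAfter()
    {
        if (_size == 0)
            goto Empty;
        return _array[--_size];
        Empty:
        return ThrowForEmptyStack();
    }

    private long ThrowForEmptyStack()
    {
        throw new InvalidOperationException("Stack is empty");
    }
}
```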

Before After




What is the point in this kind of change? Well, let us look at the actual assembly generated by this, shall we?

Before After

As you can see, the second option is much shorter, and in the common case it involves no actual jumping. This ends up being extremely efficient. Note that because we return a value from ThrowForEmptyStack, the assembly generated for it is extremely short, since we can rely on the caller to clean up after us.

This was run in release mode, CoreCLR, x64. I got the assembly from the debugger, so it is possible that there are some optimizations that haven't been applied because the debugger is attached, but it is fairly close to what should happen for real, I think. Note that ThrowForEmptyStack is inlined, even though it is an exception-only method. Using [MethodImpl(MethodImplOptions.NoInlining)] will stop that, but the goto version will still generate better code.

The end result is that we are running much less code, and that makes me happy. In general, a good guide for assembly reading is that shorter == faster, and if you are reading assembly, you are very likely in optimization mode, or debugging the compiler.

I’m pretty sure that the 2.0 release of CoreCLR has already fixed this kind of issue, by the way, and it should allow us to write more idiomatic code that generates very tight machine code.

Fast Dictionary and struct generic arguments

time to read 2 min | 281 words

One of the most common issues that come up in performance tuning is that dictionaries are expensive. It isn't so much that a single dictionary lookup is expensive, it is the sheer number of them. Dictionaries are used everywhere, and they are often used in very hot code paths (such as caching).

Numerous times we have dealt with that by trying to avoid the dictionary access (often favoring an array-based lookup if we can get away with it). But at some point we decided to implement our own dictionary. Here is how it looks:


The actual dictionary implementation is very close to the standard one, but that isn't what makes it fast. Note the generic argument? If we pass a struct implementing IEqualityComparer<TKey> as the comparer generic argument, then in most cases the compiler and the JIT are going to generate code that eliminates all the virtual calls. And if the equality comparison is trivial, that means you can eliminate the calls entirely and inline the whole thing inside that generic dictionary implementation.
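A much simplified sketch of the shape of such a dictionary (illustrative, not the actual RavenDB implementation):

```csharp
using System.Collections.Generic;

// Struct comparer: constrained to a value type so the JIT can specialize
// the dictionary per comparer and devirtualize (often inline) the calls.
public struct LongComparer : IEqualityComparer<long>
{
    public bool Equals(long x, long y) => x == y;
    public int GetHashCode(long obj) => (int)(obj ^ (obj >> 32));
}

public sealed class FastDictionary<TKey, TValue, TComparer>
    where TComparer : struct, IEqualityComparer<TKey>
{
    private struct Entry { public TKey Key; public TValue Value; public bool Used; }

    private readonly Entry[] _entries = new Entry[64]; // fixed size: a sketch, no resizing
    private readonly TComparer _comparer = default(TComparer);

    public void Add(TKey key, TValue value)
    {
        var i = Bucket(key);
        while (_entries[i].Used && !_comparer.Equals(_entries[i].Key, key))
            i = (i + 1) % _entries.Length; // linear probing
        _entries[i] = new Entry { Key = key, Value = value, Used = true };
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        var i = Bucket(key);
        while (_entries[i].Used)
        {
            if (_comparer.Equals(_entries[i].Key, key))
            {
                value = _entries[i].Value;
                return true;
            }
            i = (i + 1) % _entries.Length;
        }
        value = default(TValue);
        return false;
    }

    private int Bucket(TKey key) =>
        (_comparer.GetHashCode(key) & 0x7fffffff) % _entries.Length;
}
```

Because TComparer is a struct type parameter, each instantiation such as `FastDictionary<long, string, LongComparer>` gets its own specialized code, with direct (inlinable) calls to Equals and GetHashCode instead of interface dispatch.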

In other words, we eliminate a minimum of two virtual calls per key lookup, and in some cases we can eliminate even the method calls themselves, which turns out to be quite important when the number of key lookups is in the billions.


And this is from midway through the optimizations.

1st class patching in RavenDB 4.0

time to read 2 min | 317 words

One of the most common operations in RavenDB is to load a document, make a simple change and save it back. Usually, we tell users to just rely on the change tracking in the session and save the document, but while that is the easiest way, it isn't always the best. If I have a large document, I might not want to send it all the way back to the server just for a small change. That is why RavenDB has had a Patch operation for a long time. But while we had the feature, it was always a bit clumsy: it either required you to build a patch request using a somewhat cryptic and limited object graph, or to write your own inline JavaScript to make the modifications.

With RavenDB 4.0, we are introducing patching as a first class concept, baked directly into the session, for example:
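The snippet was an image in the original; from memory of the RavenDB 4.0 client API it looked roughly like this (BlogPost, the document id, and the LastModified property are assumptions for illustration):

```csharp
// Assumes an initialized IDocumentStore in `store` and a BlogPost class
// with a LastModified property; reconstructed from memory of the 4.0 API.
using (var session = store.OpenSession())
{
    session.Advanced.Patch<BlogPost, DateTime>(
        "blogposts/1", x => x.LastModified, DateTime.UtcNow);
    session.SaveChanges(); // only the small patch request goes over the wire
}
```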

In this case, we'll send the server a request to update the last modified date when the SaveChanges method is called. The syntax is not all that I wish it could be, but we have to operate within the limitations of what Linq syntax can accept.

A more interesting case is when you want to use patching not to reduce the network load, but to allow multiple concurrent operations on the document. Let us consider the case of adding a comment to this blog post. It is fine if two users post a comment at the same time, and we can express that using:
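Again reconstructed from memory of the 4.0 client (the Comment class and the document id are illustrative): a patch that appends to a collection, so two concurrent writers don't conflict:

```csharp
// Assumes the same `store`, plus a Comments collection on BlogPost;
// sketched from memory of the 4.0 client API.
using (var session = store.OpenSession())
{
    session.Advanced.Patch<BlogPost, Comment>(
        "blogposts/1",
        x => x.Comments,
        comments => comments.Add(new Comment { Author = "ayende", Text = "Nice!" }));
    session.SaveChanges();
}
```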

This gives us an easy way to express such things and exposes a RavenDB capability that too few users have taken advantage of. Behind the scenes, the Linq expressions are turned into JavaScript patches, which are then used to send just the right commands for the server to work with.

It’s a really cool feature, even if I do say so myself.

The memory leak in the network partition

time to read 2 min | 354 words

RavenDB is meant to be a service that just runs and runs, for very long periods of time and under pretty much all scenarios. That means that as part of our testing, we put a lot of emphasis on its long-term behavior: amount of CPU used, memory utilization, etc. And we do that in all sorts of scenarios, because getting the steady state working doesn't help if you have an issue and that issue kills you. So we put the system into a lot of weird states to see not only how it behaves, but what the second order effects are.

One such configuration was a very slow network with a very short timeout setting, so effectively we would always be getting timeouts and would need to respond accordingly. We had a piece of code that waits for something to happen (an internal event, a read from the network, or a timeout) and then acts accordingly. This is implemented as follows:
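A reconstruction of the problematic pattern (the original snippet is not preserved here; names and timeout values are illustrative):

```csharp
using System;
using System.Threading.Tasks;

public class Listener
{
    private readonly TaskCompletionSource<object> _internalEvent =
        new TaskCompletionSource<object>();

    public void ProcessEvents(Func<Task> readFromNetwork)
    {
        while (true)
        {
            var read = readFromNetwork();
            // Each WaitAny call registers a new completion action on
            // _internalEvent.Task; if that task never completes, the
            // registrations pile up and memory usage keeps growing.
            var index = Task.WaitAny(
                new Task[] { _internalEvent.Task, read },
                millisecondsTimeout: 100);
            if (index == -1)
                continue; // timed out, go around again
            // ... handle the internal event or the finished read ...
            break;
        }
    }
}
```

The fix described below keeps a single Task.WhenAny(...) task around across iterations and calls Wait(timeout) on it, so the registration on the internal event happens only once.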

This is obviously extremely simplified, but it will reproduce the issue. If you run this code, it will start using more and more memory. But why? On the face of it, this looks like perfectly reasonable code.

What is actually happening is that WaitAny calls CommonCWAnyLogic, which calls AddCompletionAction on each task, which tracks it, so we have a growing list of completion actions there. If we have a lot of waits on the same task, we end up tracking every one of those waits.

Here is what it looks like after a short while in the debugger.


And there is our memory leak.

The solution, by the way, was not to call WaitAny each time, but to call WhenAny once, then call Wait() on the resulting task, and keep that task around until it completes, so we only register on the original event once.

Data checksums in Voron

time to read 7 min | 1285 words

Every time that I think about this feature, I am reminded of this song. This is a feature that is only ever going to be used when everything fails. In fact, it isn't a feature, it is an early warning system, whose sole purpose is to tell you when you are screwed.

Checksums in RavenDB (actually, Voron, but for the purpose of discussion, there isn’t much difference) are meant to detect when the hardware has done something bad. We told it to save a particular set of data, and it didn’t do it properly, even though it had been very eager to tell us that this can never happen.

The concept of a checksum is pretty simple: whenever we write a page to disk, we'll hash it and store the hash in the page. When we read the page from disk, we'll check whether the hash matches the data we actually read. If not, there is a serious error. It is important to note that this isn't related to the way we recover from failures midway through transactions.

That is handled by the journal, and the journal is also protected by a checksum, on a per transaction basis. However, handling this sort of error is both expected and well handled. We know where the data is likely to fail and we know why, and we have the information required (in the journal) to recover from it.

This is different: this is validating that data we have successfully written to disk, and flushed successfully, is actually still resident in the form that we are familiar with. This can happen because the hardware outright lied to us (which usually happens with cheap hardware) or because of some failure (cosmic rays are just one of the many options you can run into). In particular, on crappy hardware this can happen just because of overheating or too much load on the system. As a hint, another name for crappy hardware is a cloud machine.

There are all sorts of ways that this can happen, and the literature makes for very sad reading. In a CERN study, about 900 TB were written in the course of six months, and about 180 MB resulted in errors.

The following images are from a NetApp study showing that over a period of 2.5 years, 8.5% of disks had silent data corruption errors. You can assume that those are not cheap off-the-shelf disks. Some of the causes make great reading if you are a fan of mysteries and puzzles, but they are kind of depressing if you build databases for a living (or rely on databases in general).


Those are just the failures that had interesting images, mind you; there are a lot more there. But from the point of view of the poor database, it ends up being the same thing: the hardware lied to me. And there is very little that a database can do to protect itself against such errors.

Actually, that is a lie. There is a lot that a database can do to protect itself. It used to be common to store critical pages in multiple locations on disk (usually making sure that they are physically far away from one another) as a way to reduce the impact of the inevitable data corruption. This way, things like the pages that describe where all the rest of the data in the system resides tend to be safe from the most common errors, and you can at least recover a bit.

As you probably guessed, Voron does checksums, but it doesn't bother to duplicate information; that is already handled by RavenDB itself. Most of the storage systems that deal with data duplication (ZFS has this notion with the copies property, for example) were typically designed to work primarily on a single node (such as file systems that don't have distribution capabilities). Given that RavenDB replication already does this kind of work for us, there is no point duplicating it at the storage layer. Instead, the checksum feature is meant to detect a data corruption error and abort any future work on the suspect data.

In a typical cluster, this will generate an error on access, and the node can be taken down and repaired from a replica. This serves as both an early warning system and as a way to make sure that a single data corruption in one location doesn’t “infect” other locations in the database, or worse, across the network.

So now that I have written oh so much about what this feature is, let us talk a bit about what it actually does. Typically, a database will validate the checksum whenever it reads the data from disk, and then trust the data in memory (it isn't really safe to do that either, but let us not pull up the research on that, otherwise you'll be reading the next post on papyrus) as long as it resides in its buffer pool.

This is simple, easy, and reduces the number of validations you need to do. But Voron doesn't work in this manner. Instead, Voron maps the entire file into memory and accesses it directly. We don't have a concept of reading from disk, or a buffer pool to manage. Instead of doing the OS's work, we assume that it can do what it is supposed to do and concentrate on other things. But it does mean that we don't control when the data is loaded from disk. Technically speaking, we could have tried to hook into the page fault mechanism and do the checks there, but that is so far outside my comfort zone that it gives me the shivers. "Wanna run my database? Sure, just install this rootkit and we can operate properly."

I’m sure that this would be a database administrator’s dream. I mean, sure, I can package that in a container, and then nobody would probably mind, but… the insanity has to stop somewhere.

Another option would be to validate the checksum on every read. That is possible, and quite easy, but it would incur a substantial performance penalty just to ensure that something that shouldn't happen didn't happen. That doesn't seem like a good tradeoff to me.

What we do instead is make the best of it. We keep a bitmap of all the pages in the data file, and we validate each page the first time we access it (there is a bit of complexity here regarding concurrent access, but we are racing it to success, and at worst we'll end up validating the page multiple times). Afterward, we know we don't need to do it again. Once we have loaded the data to memory even once, we assume that it isn't going to change beneath our feet. This isn't an axiom, and there are situations where a page can be loaded from disk, be valid, and then become corrupted on disk; the OS will discard it at some point and then read the corrupt data again. But this is a much rarer circumstance than before.
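A sketch of the scheme just described (illustrative, not Voron's actual code; a simple FNV-1a hash stands in for the real hash function):

```csharp
using System;
using System.Threading;

// Each page carries a checksum in its first 8 bytes, and a bitmap records
// which pages were already validated, so the check runs only on first access.
public class PageValidator
{
    public const int PageSize = 4096;
    private readonly int[] _validated; // one bit per page

    public PageValidator(int pageCount)
    {
        _validated = new int[(pageCount + 31) / 32];
    }

    public static ulong Hash(byte[] page)
    {
        ulong h = 14695981039346656037UL; // FNV-1a offset basis
        for (int i = 8; i < PageSize; i++) // skip the 8 checksum bytes
            h = (h ^ page[i]) * 1099511628211UL;
        return h;
    }

    public static void WriteChecksum(byte[] page) =>
        BitConverter.GetBytes(Hash(page)).CopyTo(page, 0);

    public void EnsureValidated(int pageNumber, byte[] page)
    {
        int word = pageNumber >> 5, bit = 1 << (pageNumber & 31);
        if ((Volatile.Read(ref _validated[word]) & bit) != 0)
            return; // already validated once, skip the work

        if (BitConverter.ToUInt64(page, 0) != Hash(page))
            throw new InvalidOperationException(
                "Checksum mismatch on page " + pageNumber);

        // Racing threads may validate the same page twice; that is harmless.
        int current;
        do
        {
            current = _validated[word];
        } while (Interlocked.CompareExchange(
                     ref _validated[word], current | bit, current) != current);
    }
}
```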

The fact that we recently verified that the page is valid is a good indication that it will remain valid, and anything else has too much overhead for us to be able to use (and remember that we also have those replicas for the extremely rare cases).

Independent of this post, I just found this article, which injected errors into the data files of multiple databases and examined how they behaved. Fascinating reading.

The occasionally failing test

time to read 2 min | 211 words

This piece of code is part of a test that runs a scenario, and checks that the appropriate errors are logged. Very occasionally, this test would fail, and it would be nearly impossible to figure out why.

I’ve extracted the offending code from the test. Note that ReadFromWebSocket returns a Task<string>, since that isn't obvious from the code. Do you see the error?


Think about it, what is this doing?

This is reading from a socket, and there is absolutely no guarantee about the amount of data that will go over the wire at any particular point. Because this test assumes that the entire string will be read in a single call from the socket, if the expected value we are looking for is actually returned across two calls, we'll miss it, and this will never return, leading to sadness and worry everywhere*.

* At least everywhere that cares about our tests.

The fix was to remember the previous values and compare against all the data read from the socket, not just the values returned by the last call.
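In sketch form (illustrative, assuming a ReadFromWebSocket-style delegate):

```csharp
using System;
using System.Text;
using System.Threading.Tasks;

public static class SocketWait
{
    // Buggy: only the latest read is inspected, so a value split across
    // two reads is never matched and the loop never exits.
    public static async Task WaitForBuggy(Func<Task<string>> read, string expected)
    {
        while (true)
        {
            var chunk = await read();
            if (chunk.Contains(expected))
                return;
        }
    }

    // Fixed: accumulate everything read so far and search the whole buffer.
    public static async Task WaitForFixed(Func<Task<string>> read, string expected)
    {
        var buffer = new StringBuilder();
        while (true)
        {
            buffer.Append(await read());
            if (buffer.ToString().Contains(expected))
                return;
        }
    }
}
```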

