Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:


+972 52-548-6969

, @ Q c

Posts: 6,229 | Comments: 46,326

filter by tags archive

Performance analysisSimple indexes

time to read 2 min | 374 words

I outlined the scenario in my previous post, and I wanted to focus on each scenario one at a time. We’ll start with the following simple map index:


In RavenDB 4.0, we have done a lot of work to make this scenario as fast as possible. In fact, this has been the focus of a lot of architectural optimizations. When running a simple index, we keep very few things in managed memory, and even those are relatively transient. Most of our data is in memory mapped files. That means, no high GC cost because we have very few objects getting pushed to Gen1/Gen2. Mostly, this is telling Lucene “open wide, please” and shoving as much data inside as we can.

So we are seeing much better performance overall. But that isn’t to say that we don’t profile (you always do). So here are the results of profiling a big index run:

And this is funny, really funny. What you are seeing there is that about 50% of our runtime is going into those two functions.

What you don’t know is why they are here. Here is the actual code for this:


The problem is that the write.Delete() can be very expensive, especially in the common case of needing to index new documents. So instead of just calling for it all the time, we first check if we have previously indexed this document, and only then we’ll delete it.

As it turns out, those hugely costly calls are still a very big perf improvement, but we’ll probably replace them with a bloom filter that can do the same job, but with a lot less cost.

That said, notice that the runtime cost of those two functions together is 0.4 ms under profiler. So while I expect bloom filter to be much better, we’ll certainly need to double check that, and again, only the profiler can tell.

Performance analysisThe cost of StackOverflow indexes

time to read 5 min | 947 words

In this post, we are going to look at the relevant costs we have when testing out different indexes on the stack overflow data dump. In this case, we have the following two collections, and sample documents from each.


The first index we’ll look at is a simple full text index, simply mapping the users and asking to search over their data.


This index is pretty simple, both in how it works and what it would cause RavenDB to do. We do a simple scan over all the Users documents, and just index those two properties. Nothing much to see here, let  us move on.

Here we have a map/reduce index, over users, to see how many signups StackOverflow had per month.


This gets more interesting as a performance analysis goes. While a map only index just needs to spill everything to Lucene, and let it do its work (more on that later), a map/reduce index has to keep track of quite a lot of internal state. And as it runs out, the costs associated with the index vary according to what it does, and the access pattern it has.

In this case, we’re going to be scanning the users in roughly the order they were inserted into the database, and that would match pretty closely to the aggregation by the registration month. That means that this is likely to be a pretty cheap index, since we scan the data and it is mostly going to be naturally grouped along the same lines that we are going to group it. It means that the working set we’ll need for this index is going to be fairly small. Once we have passed by a particular month, we won’t be accessing it again.

The number of reduce keys (the distinct number of values that we group by) is also going to be fairly small, in the order of 100 keys or so.

Now, let us move to a more complex index:


This one is aggregating information over questions twice, once for the questions, and once for the answers. In terms of sheer amount of data it works on, it is pretty big. However, in terms of actual work that is done, we still somewhat benefit so significantly from the ordering of the data. This is the case since while questions and answers are usually temporally close to one another, that isn’t always the case. That said, for the vast majority of cases, we’re likely to see quite a bit of locality, which will help the index. There is also the fact that we still have very few reduce key, only about 100 or so, which also helps.


And now we are talking about putting the db through its paces. The problem here is that we are scanning all the indexes, and outputting an entry for each tag. There are 46,564 tags, and in total this index outputs 37,122,139 entries.  However, there are just 3,364 tags that have more than a thousand questions, and they are responsible for 32,831,961 of the questions, so while there is a significant long tail, it is pretty small, in terms of actual number of questions.

Note that we are still scanning in a manner consistent with the index, so we’ll typically go over a set of questions that all belong to the same month. And each month will have roughly 3 million questions (obviously less the further back in time we go). So complex, but not too badly so. The number of reduce keys will be in the hundreds of thousands, but the number of entries per reduce key will be relatively small.

This one, however, is much more problematic:


On the face of it, it looks like it would be a simpler index than the previous two, but it lacks a very important property, it isn’t actually limited by anything related to the scanning order we have. What this means is that whenever we will scan the data for this index, we are going to find a lot of tags, so each batch will have to touch a lot of reduce keys, and they are going to be big reduce keys (with lots of entries), so that means that we’ll need to re-reduce large number of large reduce keys often.

Effectively, this index is forcing us to scatter values all over the place, then aggregate all the information about each of those values. It is likely to be one of the worst scenarios for us to work with. Now that we have the groundwork laid out, we can talk about how we are actually going to approach performance optimizations here.

Stack arena memory allocation context

time to read 4 min | 705 words

Controlling memory allocations is something that any performant service need to do. In RavenDB 4.0 we have taken that to heart, and most of our memory usage is manual, instead of relying on the GC.

Now, there are a lot of really good reasons to want to avoid manual memory management. We have decades of usage telling us that this is a hard problem, with some nasty issues associated with it. But when we are talking about performance different of orders of magnitude, I’ll accept that burden.

The first thing that we did when making this decision is to see if we can change the way we approach unmanaged memory management to make it easier to avoid all of the common issues. We made several decisions:

  • All native allocations must go through a context.
  • A context is a unit of work that is single threaded.
  • A context is bounded in scope.
  • A context is reusable.

By placing those sort of limits on the native memory usage, we managed to gain a lot of simplicity. To start with, because the context is single threaded, we don’t have to implement an allocation strategy that is thread safe. Instead, we can just allocate a bunch of memory in the context, and parcel the memory from there. This looks like this:


The arena has allocated a certain size from the operation system. Allocating from that point is just a matter of bumping the next allocation pointer and handing that memory to the caller.

The nice thing about is that we don’t have the possibility for memory leaks, when the context scope ends, we can reset the next allocation pointer to the start, and we have “freed” all that memory. And we can start using it  again, without having to release it to the OS and get it back again. There are a whole bunch of edge cases (what happens when the arena is completely full, etc, but they are pretty obvious).

Not having to deal with multiple threads also present an interesting optimization. Once we are done with the memory, we can return it to the arena, but instead of maintaining free lists, etc, we do the following. If the returned memory is at the top of the arena (the last thing allocated), then we just bump the next allocation down, and the next allocation can reuse that memory. If that isn’t the case, we’ll do nothing, and rely on the context scope to handle this.

That, in turn, means that if I’m careful to free in the reverse order of allocations, I can get, within a single scope, memory reuse very cheaply. That is important, as we’ll see in a bit.

The context is also scoped to a certain operation, such as processing a request or an index batch. Because of that, even if we need to grow the arena, that is limited to however many operations we have to do inside that scope. And once the scope is over, we aren’t disposing of the context and releasing the memory, instead, we pool them so the next action will be able to use them. So a single OS level allocation is used across many actions, and the memory free cost is just setting a pointer.

But some scope need to do a lot of work, consider an index that is processing a big batch, in order to reduce the amount of memory that is being used, that is where the ability to free the top memory allocation comes in handy, because even inside the same scope, we can get memory reuse.

All in all, the allocation strategy was chosen to play to our strengths, and allow us to avoid doing multi threaded work in the hot paths, it also means that we get quite a lot of memory reuse, both within a scoped context and in different scopes, which reduce the overall amount of memory that we need to allocate. And the really great thing, it doesn’t come with a big pile of time spent in GC.

Voron Performance: A snapshot

time to read 2 min | 221 words

Yesterday the perf team let me know that they managed to get ~18% improvement on Voron by utilizing another on disk data structure. I’ll post more on that when we have it merged and working.

Talking about this, we decided to run a few benchmarks on our current status.

Nitpicker – this post talks about the performance of low level storage engines, the benchmark used was storing sequential values with 8 bytes key and 8 bytes value.

Here are the results for writing a billion values (16 bytes total) in ten thousands transactions.

  • 1,000,000,000 values  (1 billion)
  • 9.13 minutes
  • 1.824 million writes / sec
  • 31 GB final size

Then we spiked a small optimization that should allow us to defer major I/O costs, and we got:

  • 100,000,000 values (hundred million)
  • 40 seconds
  • 2.449 million writes / sec
  • 4.1 GB final size

We pretty much cheat here, because we defer the I/O cost under load to a later time, by not purging the journals, but that is something that I’ll post in detail once we have this merged.

Oh, and those are the number before the additional 18% optimization Smile. And they were run on a single commodity hardware node over SSD drive.

N+1 queries are hardly a feature

time to read 3 min | 531 words

I run into this article, talking about N+1 issues from a “fresh” perspective. In the article, there is a quote by DHH there that goes like this:

If you have N+1 query it means you’re executing one SQL query per element so if you have 50 emails in an inbox, that’d be 50 SQL calls, right? That sounds like a bug. Well in a Russian doll caching setup, it’s not a bug, it’s a feature. The beauty of those individual calls are that they’re individually cached, on their own timeline, and that they’re super simple.

Now, I have been involved with OR/Ms and databases for over a decade now, and I have spent large parts of my career working to resolve performance problems in database driven applications.

In a word, the idea that having a larger amount of simpler queries is better is nonsense. In particular, it completely ignores the cost of going to the database. Sure, a more complex query may require the database to do additional work, and if you are using caching, then you’ll not have the data in the cache in neat “cache entry per row”. But in practice, this leads to applications doing hundreds of queries per page view, absolute reliance on the cache and tremendous cost at startup.

In RavenDB, most queries are so fast that when we measured, the network component was the largest cost we had to face.

Let us do the numbers, shall we? Let us assume that we are talking about the best case, we have the database machine and the web machine in the same datacenter. Cost of roundtrip in the same datacenter is 0.5 ms. Now, let us go back to the 50 emails example above, shall we? We need to send 50 queries to the database. We’ll assume that we have a perfect database that takes no time at all to answer. That is still 25 ms wasted just on roundtrip times.

And the problem is that this is usually a lot more than 50 queries per page when you adopt this kind of silliness. You typically see hundreds or thousands of them, and the database isn’t really able to answer you in no time, so expect to see much bigger delays in practice.

But wait, I hear you say, this is all about caching, you are ignoring that part. Well, no, I’m not. If you are using a distributed cache, the costs over the network are exactly the same. So this is only relevant if you are using a cache on the same server, but then you run into issues when you have different caches on different machines hold different information. Not to mention that you are now in the wonderful world of having to worry about cache invalidation strategies and aligning them with the business requirements.

And for fun, what happens one of your cache nodes goes down? You got it, you just created a DoS attach on your own database.

On the other hand, you can actually create proper queries, get the data in as few roundtrips as possible and trust the database engine to do its work.

Digging into the CoreCLRExceptional costs, Part II

time to read 3 min | 429 words

Note, this post was written by Federico.

In my last post we talked about the cost of throwing and catching exceptions at the caller site. Some are straightforward and easily seen like the code complexity, but others are a bit deeper, like for instance how the code ends up like that (we will talk about that, but not just yet). Today we will focus on what to do when we control the call-site and are in a performance sensitive hot-spot.

There are 2 important assumptions here:

  • You own the code
  • You are in a hot-spot

If either one of these two is not true, then this is of zero importance or you are screwed. Smile

So let's modify our code a bit (check in yesterday's post if you don’t remember the details). In order to achieve the same result we will resort to a very well-known pattern that I like to call TryXXX. Many instance of such optimizations are visible in the .Net Framework like the famous int.TryParse method. Apparently someone during the course of using v1.0 (or v1.1) of the Framework figured out that the cost of exception handling for certain scenarios was a bit too much. We probably won’t know who was, but we can all be glad they have fixed it; even though we have to live with an exception based implementation (borrowed from Java style?) as obsolete code since then.

So let's see how the code would look. 


Pretty straightforward I might say. Now the interesting thing is what happens at the assembler level:


Even under shallow review, we can conclude that this code is definitely faster than the alternative. Now what did we win against the try-catch version? Essentially, we don't have a prolog and an epilog in case of the choosing the exceptional path, that’s faster than having to execute such code. The exception case also does not have to deal with non-local effects caused by unwinding the stack; but we are forced to have a hierarchy of TryXXX methods if that goes deep (the alternative of using exceptions for readability is not great either).

Now in this code we have the first glimpse of evidence of a few JIT design choices (and some current restrictions too) that are important performance wise and we will discuss them in future posts.

Digging into the CoreCLRExceptional costs, Part I

time to read 4 min | 624 words

Note, this post was written by Federico.

One guideline which is commonly known is: "Do not use exceptions for flow control." You can read more about it in many places, but this is good compendium of the most common arguments. If you are not acquainted with the reasons, give them a read first; I’ll wait.

Many of the reasons focus on the readability of the code, but remember, my work (usually) revolves around writing pretty disgusting albeit efficient code. So even though I care about readability it is mostly achieved through very lengthy comments on the code on why you shouldn't touch something if you cannot prove something will be faster.

Digression aside the question is still open. What is the impact of using exceptions for control flow (or having to deal with someone else throwing exceptions) in your performance sensitive code? Let's examine that in detail.

For that we will use a very simple code to understand what can happen. 


This is a code that is simple enough so that the assembler won’t get too convoluted, but at the same time sport at least some logic we can use as markers.

Let's first inspect the method CanThrow, in there what we can see is how the throwing of exceptions happen:


As you can see there is a lot of things to be done just to throw the exception. There in the last call we will use jump to the proper place in the stack and continue in the catch statement that we hit.


So here is the code of our simple method. At the assembler level, our try statement has a very important implication. Each try-catch forces the method to deal with a few control flow issues. First it has to store the exception handler in case anything inside would throw, then it has to do the actual work. If there is no exception (the happy path) we move forward and end. But what happen if we have an exception? We first need to remove the handler (we don't want to recheck this handler if we end up throwing inside the catch, right?) Then execute the catch and be done.

But now let’s contrast that to the generated code if no try-catch statement happens. The avid reader will realize that the happy path will never be executed because we are throwing, but don’t worry, the code is the same if there is no inlining happening.


We will talk about why the code ends up like this in a follow up post, but suffice to say that all this trouble cannot beat a check for a Boolean if you needed to fail (and could do something about it). 

It is also important to remember that this kind of work is only relevant if you are in the hot path. If you are not calling a function at least a few tens of thousands a second, don’t even bother, your performance costs are elsewhere. This is micro optimization land.

Digging into the CoreCLRJIT Introduction

time to read 3 min | 401 words

Note, this post was written by Federico.

imageThe .Net Just-In-Time compiler (or JIT for short) is one of those marvels you don't even realize is there, unless you have to smooth talk it to do your bidding. At the Voron level, most routines are so tight that the assembler emitted becomes important. It is not uncommon to have to reason about how the instructions are going to be decoded by the CPU, if the microarchitecture is able to execute two instructions independently or not or we are memory-bound (having too many cache misses).

Those with a C/C++ background know the optimizing compiler is actually quite good at its job; and even so, tricks are required to squeeze the last performance bits. The JIT is no different, with the addition that while the C++ compiler has all the time in the world (sort-of); the restrictions imposed by compiling while your code is executing are far more severe. However, even on a budget, the CoreCLR does a pretty amazing job at optimizing. Until, of course, you expect the performance feats of the offline C++ compiler from it.

So the question you may probably have is: "Why bother? Why don't you just go and write the database in C++?”. I won't go in detail on that, since we covered that already . What I can tell you is, that when you dig into the JIT world, it is pretty easy to achieve performance on par with C++ if the JIT by design doesn't get in your way with unsupported features. Good thing is the CoreCLR team is quite open to tackle those  issues if presented with a detailed analysis on the cost-effectiveness of your optimization.

In this series we will talk about the tricks we had to teach our code to squeeze those last bits of micro-optimizations. From specific optimizations requested to the development team and workarounds until they are available, to tricks we use to ensure the assembler code generated is of the highest quality possible.

One word of caution: The JIT is ever improving so it is important to understand that the optimizations opportunities presented here may not apply tomorrow or just be done automatically in future versions of CoreCLR.

Reducing allocations and resource usages when using Task.Delay

time to read 2 min | 399 words

In my previous posts, I talked about tri state waiting, which included the following line:


And then I did a deep dive into how timers on the CLR are implemented. Taken together, this presents me somewhat of a problem. What is the cost of calling a Task.Delay? Luckily, the code is there, so we can check.

The relevant costs are here. We allocate a new DelayPromise, and a Timer instance. As we previously saw, actually creating a Timer instance will take a global lock, and the cost of actually firing those timers is proportional to the number of timers that we have.

Let’s consider the scenario for the code above. We want to support a very large number of clients, and we typically expect them to be idle, so every second we’ll have the delay fire, we send them a heartbeat, and go on with the waiting. What will happen in such a case for the code above?

All of those async connections are going to be allocating several objects per run, and each of them is going to contend on the same lock. All of them are also going to run a lot more slowly than they should.

In order to resolve this issue, given what we know now about the timer system on the CLR, we can write the following code:

In this case, we have a single static timer (so there is no lock contention per call), and we have a single task allocation per second. All of the connections will use the same instance. In other words, if there are 10,000 connections, we just saved 20K allocations per second, as well as 10K lock contentions, not to mention that instead of waking up every single connection one at a time, we have just a single task instance is completed once, resulting in all the the timing out tasks being woken up.

This code does have the disadvantage of allocating a task every second, even if no one is listening. It is a pretty small price to pay, and fixing it can be left as an exercise for the reader Smile.

The cadence of tests speed

time to read 3 min | 402 words

RavenDB has quite a bit of tests. Over five thousands of them, on the last count. They test common things (like saving a document works) and esoteric things (like spatial query behavior near the equator). They are important, but they also take quite a lot of time to run. In our current setup, running the full test suite can take 3 – 6 hours, depending on which build agent is running and what is the configuration we have.

This means that in terms of actually running the tests, developers can’t really get good feedback. We have taken some steps to address that, by creating a core set of tests that will examine the major and most important features, fast, which developers can run in reasonable time as part of their work, and then we let the build server sort out the pull requests on its own time.

For RavenDB 4.0, we decided to go in a different direction. We split the tests into two separate projects. FastTests and SlowTests. The requirement for the FastTest is that they will run in under 1 minute. In order to do that we use xunit’s parallel test support, which allows us to run concurrent tests (this actually required a lot of changes in the test code, which assumed that we are running in a serial fashion and that each test is the sole owner of the entire system).

Part of making sure that we are running in under 1 minute is knowing when we breached that. Because the tests run concurrently, it is a bit difficult to know what is taking so long. Thankfully, it is easy to get a report out. We use the following script to do that.

We now have the top slowest tests (some of them takes tens of seconds to run, but we don’t usually care, as long as other test run concurrently parallel and the whole suite is fast), and we can figure out whatever we can optimize the test or move it to the SlowTest project.

Another change we made was to move a lot of the test code to use the async API, this allows us to run more tests concurrently without putting too much of a burden on the system. Our current tests times range between 30 – 55 seconds, which is enough to run them pretty much on every drop of a hat.


  1. Optimizing read transaction startup time: Every little bit helps, a LOT - 4 hours from now

There are posts all the way to Oct 27, 2016


  1. Optimizing read transaction startup time (6):
    26 Oct 2016 - The performance triage
  2. RavenDB Retrospective (4):
    17 Oct 2016 - The governors
  3. Timing the time it takes to parse time (2):
    11 Oct 2016 - Part II
  4. Performance analysis (2):
    04 Oct 2016 - Simple indexes
  5. Interview question (3):
    29 Sep 2016 - Stackoverflow THAT
View all series



Main feed Feed Stats
Comments feed   Comments Feed Stats