Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:


+972 52-548-6969

Posts: 7,461 | Comments: 50,984

Privacy Policy Terms
filter by tags archive
time to read 4 min | 723 words

Deep inside of the Corax indexing engine inside of RavenDB there is the notion of a posting list. A posting list is just an ordered set of entry ids that contains a particular term. During the indexing process, we need to add and remove items from that posting list. This ends up being something like this:

For fun, go and ask ChatGPT to write you the code for this task.

You can assume that there are no duplicates between the removals and additions, and that adding an existing item is a no-op (so just one value would be in the end result). Here is a quick solution for this task (not actually tested that much, mind, but sufficient to understand what I’m trying to do):

If you look at this code in terms of performance, you’ll realize that this is quite expensive. In terms of complexity, this is actually pretty good, we iterate over the arrays just once, and the number of comparisons is also bounded to the lengths of the list.

However, there is a big issue here, the number of branches that you have to deal with. Basically, every if and every for loop is going to add a tiny bit of latency to the system. This is because these are unpredictable branches, which are pretty nasty to deal with.

It turns out that the values that we put in the posting list are actually always a multiple of 4, so the bottom 2 bits are always cleared. That means that we actually have a different way to deal with it. Here is the new logic:

This code was written with an eye to being able to explain the algorithm, mind, not performance.

The idea goes like this. We flag the removals with a bit, then concatenate all the arrays together, sort them, and then do a single scan over the whole thing, removing duplicates and removals.

In the real code, we are using raw pointers, not a List, so there are no access checks, etc.

From an algorithmic perspective, this code makes absolutely no sense at all. We concatenate all the values together, then sort them (O(NlogN) operation) then scan it again?!

How can that be faster than a single scan across all three arrays? The answer is simple, we have a really efficient sort primitive (vxsort) that is able to sort things really fast (GB/sec). There is a really good series of posts that explain how that is achieved.

Since we consider sorting to be cheap, the rest of the work is just a single scan on the list, and there are no branches at all there. The code plays with the offset that we write into, figuring out whether we need to overwrite the current value (duplicate) or go back (removal), but in general it means that it can execute very quickly.

This approach also has another really important aspect. Take a look at the actual code that we have in production. This is from about an hour worth of profiling a busy indexing session:


And the more common code path:


In both of them, you’ll notice something really important. There isn’t a call to sorting at all in here. In fact, when I search for the relevant function, I find:


That is 25 ms out of over an hour.

How can this be? As efficient as the sorting can be, we are supposed to be calling it a lot.

Well, consider one scenario, what happens if:

  • There are no removals
  • All additions happen after the last existing item in the list

In this case, I don’t need to do anything beyond concatenate the lists. I can skip the entire process entirely, just copy the existing and additions to the output and call it a day.

Even when I do have a lot of removals and complicated merge processes, the code structure means that the CPU can get through this code very quickly. This isn’t super friendly for humans to read, but for the CPU, this is chump change.

time to read 3 min | 479 words

At some point in any performance optimization sprint, you are going to run into a super annoying problem: The dictionary.

The reasoning is quite simple. One of the most powerful optimization techniques is to use a cache, which is usually implemented as a dictionary. Today’s tale is about a dictionary, but surprisingly enough, not about a cache.

Let’s set up the background, I’m looking at optimizing a big indexing batch deep inside RavenDB, and here is my current focus:


You can see that the RecordTermsForEntries take 4% of the overall indexing time. That is… a lot, as you can imagine.

What is more interesting here is why. The simplified version of the code looks like this:

Basically, we are registering, for each entry, all the terms that belong to it. This is complicated by the fact that we are doing the process in stages:

  1. Create the entries
  2. Process the terms for the entries
  3. Write the terms to persistent storage (giving them the recorded term id)
  4. Update the entries to record the term ids that they belong to

The part of the code that we are looking at now is the last one, where we already wrote the terms to persistent storage and we need to update the entries. This is needed so when we read them, we’ll be able to find the relevant terms.

At any rate, you can see that this method cost is absolutely dominated by the dictionary call. In fact, we are actually using an optimized method here to avoid doing a TryGetValue() and then Add() in case the value is not already in the dictionary.

If we actually look at the metrics, this is actually kind of awesome. We are calling the dictionary almost 400 million times and it is able to do the work in under 200 nanoseconds per call.

That is pretty awesome, but that still means that we have over 2% of our total indexing time spent doing lookups. Can we do better?

In this case, absolutely. Here is how this works, instead of doing a dictionary lookup, we are going to store a list. And the entry will record the index of the item in the list. Here is what this looks like:

There isn’t much to this process, I admit. I was lucky that in this case, we were able to reorder things in such a way that skipping the dictionary lookup is a viable method.

In other cases, we would need to record the index at the creation of the entry (effectively reserving the position) and then use that later.

And the result is…


That is pretty good, even if I say so myself. The cost went down from 3.6 microseconds per call to 1.3 microseconds. That is almost 3 folds improvement.

time to read 1 min | 114 words

RavenDB is a multi-primary database, which means that it allows you to write to multiple nodes at the same time, without needing synchronization between them.

This ability to run independently from the other nodes in the cluster (or even across clusters) makes RavenDB highly suitable for running on the edge.

We have recently published a guide on using RavenDB from Cloudflare Workers, as well as a full template so you can get up to speed in a few minutes.

The ability to run in a Cloudflare Worker (and use a nearby RavenDB server) means that your logic is running closer to the client, which can greatly reduce your overall latency and improve the overall user experience.

time to read 2 min | 380 words

I was looking into reducing the allocation in a particular part of our code, and I ran into what was basically the following code (boiled down to the essentials):

As you can see, this does a lot of allocations. The actual method in question was a pretty good size, and all those operations happened in different locations and weren’t as obvious.

Take a moment to look at the code, how many allocations can you spot here?

The first one, obviously, is the string allocation, but there is another one, inside the call to GetBytes(), let’s fix that first by allocating the buffer once (I’m leaving aside the allocation of the reusable buffer, you can assume it is big enough to cover all our needs):

For that matter, we can also easily fix the second problem, by avoiding the string allocation:

That is a few minutes of work, and we are good to go. This method is called a lot, so we can expect a huge reduction in the amount of memory that we allocated.

Except… that didn’t happen. In fact, the amount of memory that we allocate remained pretty much the same. Digging into the details, we allocate roughly the same number of byte arrays (how!) and instead of allocating a lot of strings, we now allocate a lot of character arrays.

I broke the code apart into multiple lines, which made things a lot clearer. (In fact, I threw that into SharpLab, to be honest). Take a look:

This code: buffer[..len] is actually translated to:

char[] charBuffer= RuntimeHelpers.GetSubArray(buffer, Range.EndAt(len));

That will, of course, allocate. I had to change the code to be very explicit about the types that I wanted to use:

This will not allocate, but if you note the changes in the code, you can see that the use of var in this case really tripped me up. Because of the number of overloads and automatic coercion of types that didn’t happen.

For that matter, note that any slicing on arrays will generate a new array, including this code:

This makes perfect sense when you realize what is going on and can still be a big surprise, I looked at the code a lot before I figured out what was going on, and that was with a profiler output that pinpointed the fault.

time to read 3 min | 533 words

Measuring the length of time that a particular piece of code takes is a surprising challenging task. There are two aspects to this, the first is how do you ensure that the cost of getting the start and end times won’t interfere with the work you are doing. The second is how to actually get the time (potentially many times a second) in as efficient way as possible.

To give some context, Andrey Akinshin does a great overview of how the Stopwatch class works in C#. On Linux, that is basically calling to the clock_gettime system call, except that this is not a system call. That is actually a piece of code that the Kernel sticks inside your process that will then integrate with other aspects of the Kernel to optimize this. The idea is that this system call is so frequent that you cannot pay the cost of the Kernel mode transition. There is a good coverage of this here.

In short, that is a very well-known problem and quite a lot of brainpower has been dedicated to solving it. And then we reached this situation:


What you are seeing here is us testing the indexing process of RavenDB under the profiler. This is indexing roughly 100M documents, and according to the profiler, we are spending 15% of our time gathering metrics?

The StatsScope.Start() method simply calls Stopwatch.Start(), so we are basically looking at a profiler output that says that Stopwatch is accounting for 15% of our runtime?

Sorry, I don’t believe that. I mean, it is possible, but it seems far-fetched.

In order to test this, I wrote a very simple program, which will generate 100K integers and test whether they are prime or not. I’m doing that to test compute-bound work, basically, and testing calling Start() and Stop() either across the whole loop or in each iteration.

I run that a few times and I’m getting:

  • Windows: 311 ms with Stopwatch per iteration and 312 ms without
  • Linux: 450 ms with Stopwatch per iteration and 455 ms without

On Linux, there is about 5ms overhead if we use a per iteration stopwatch, on Windows, it is either the same cost or slightly cheaper with per iteration stopwatch.

Here is the profiler output on Windows:


And on Linux:


Now, that is what happens when we are doing a significant amount of work, what happens if the amount of work is negligible? I made the IsPrime() method very cheap, and I got:


So that is a good indication that this isn’t free, but still…

Comparing the costs, it is utterly ridiculous that the profiler says that so much time is spent in those methods.

Another aspect here may be the issue of the profiler impact itself. There are differences between using Tracing and Sampling methods, for example.

I don’t have an answer, just a lot of very curious questions.

time to read 1 min | 83 words

I’m going to QCon San Francisco and will be teaching a full day workshop where we’ll start from a C compiler and  an empty file and end up with a functional storage engine, indexing and more.

Included in the minimum requirements are implementing transactions, MVCC, persistent data structures, and indexes.

The workshop is going to be loosely based on the book, but I’m going to condense things so we can cover this topic in a single day.

Looking forward to seeing you there.

time to read 3 min | 417 words

A customer called us, quite upset, because their RavenDB cluster was failing every few minutes. That was weird, because they were running on our cloud offering, so we had full access to the metrics, and we saw absolutely no problem on our end.

During the call, it turned out that every now and then, but almost always immediately after a new deployment, RavenDB would fail some requests. On a fairly consistent basis, we could see two failures and a retry that was finally successful.

Okay, so at least there is no user visible impact, but this was still super strange to see. On the backend, we couldn’t see any reason why we would get those sort of errors.

Looking at the failure stack, we narrowed things down to an async operation that was invoked via DataDog. Our suspicions were focused on this being an error in the async machinery customization that DataDog uses for adding non-invasive monitoring.

We created a custom build for the user that they could test and waited to get the results from their environment. Trying to reproduce this locally using DataDog integration didn’t raise any flags.

The good thing was that we did find a smoking gun, a violation of the natural order and invariant breaking behavior.

The not so good news was that it was in our own code. At least that meant that we could fix this.

Let’s see if I can explain what is going on. The customer was using a custom configuration: FastestNode. This is used to find the nearest / least loaded node in the cluster and operate from it.

How does RavenDB know which is the fastest node? That is kind of hard to answer, after all. It checks.

Every now and then, RavenDB replicates a read request to all nodes in the cluster. Something like this:

The idea is that we send the request to all the nodes, and wait for the first one to arrive. Since this is the same request, all servers will do the same amount of work, and we’ll find the fastest node from our perspective.

Did you notice the cancellation token in there? When we return from this function, we cancel the existing requests. Here is what this looks like from the monitoring perspective:


This looks exactly like every few minutes, we have a couple of failures (and failover) in the system and was quite confusing until we figured out exactly what was going on.

time to read 1 min | 110 words

I had a great time talking with Scott Hanselman about how we achieve great performance for RavenDB with .NET.

You can listen to the podcast here, as usual, I would love your feedback.

In this episode, we talk to Oren Eini from RavenDB. RavenDB is a NoSQL document database that offers high performance, scalability, and security. Oren shares his insights on why performance is not just a feature, but a service that developers and customers expect and demand. He also explains how RavenDB achieves fast and reliable data access, how it handles complex queries and distributed transactions, and how it leverages the cloud to optimize resource utilization and cost efficiency!


    No future posts left, oh my!


    1. Challenge (72):
      19 Sep 2023 - Spot the bug
    2. Filtering negative numbers, fast (4):
      15 Sep 2023 - Beating memcpy()
    3. Recording (9):
      28 Aug 2023 - RavenDB and High Performance with Oren Eini
    4. Production postmortem (50):
      24 Jul 2023 - The dog ate my request
    5. Podcast (4):
      21 Jul 2023 - Hansleminutes - All the Performance with RavenDB's Oren Eini
    View all series


    Main feed Feed Stats
    Comments feed   Comments Feed Stats