Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.

Optimizing read transaction startup time: Unicode ate my perf and all I got was

time to read 3 min | 469 words

As an aside, I wonder how much stuff will break because the title of this post has a Unicode character in it.

The topic of this post is the following profiler output. GetSliceFromKey takes over 6% of our performance, or about 24.5 seconds out of the total run. That kinda sucks.


What is the purpose of this function? Well, RavenDB’s document ids are case insensitive, so we need to convert the key to lower case and then do a search on our internal index. That has quite a big cost associated with it.

And yes, we are aware of the pain. Luckily, we are already working with a highly optimized codebase, so we aren’t seeing this function dominate our costs, but still…

Here is the breakdown of this code:


As you can see, over 60% of this function is spent in just converting to lower case, which sucks. Now, we have some additional knowledge about this function. For the vast majority of cases, we know that this function will handle only ASCII characters, and that Unicode document ids are possible, but relatively rare. We can utilize this knowledge to optimize this method. Here is what this will look like, in code:


Basically, we scan through the string, and if there is a character whose value is over 127 we fall back to the slower method (slower being relative, that is still a string conversion in less than 25 μs).

Then we just check whether a character is in the upper case range and convert it to lower case (ASCII bit layout is funny, it was intentionally designed to be used with bit masking, and all sorts of tricks are possible there) and store it in the buffer, or just store the original value.
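
To make that concrete, here is a rough sketch of what such a fast path can look like (illustrative only, not the actual RavenDB method):

    // Sketch of an ASCII-only fast path for lower casing a key. Anything outside the
    // ASCII range falls back to the general purpose (and correct) conversion.
    private static string ToLowerKey(string key)
    {
        var buffer = new char[key.Length];
        for (int i = 0; i < key.Length; i++)
        {
            char c = key[i];
            if (c > 127)
                return key.ToLowerInvariant(); // non-ASCII, take the slow path

            // ASCII 'A'..'Z' becomes 'a'..'z' by setting the 0x20 bit
            buffer[i] = (c >= 'A' && c <= 'Z') ? (char)(c | 0x20) : c;
        }
        return new string(buffer);
    }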

The result?


This method went from taking 5.89% to taking just 0.6%, and we saved over 22 seconds(!) in the run. Again, this is under the profiler, and the code is heavily multi-threaded. In practice, this means that the run took 3 seconds less.

Either way, we are still doing pretty well, and we don’t have that pile of poo, so I’d say we’re good.

Optimizing read transaction startup time: Don’t ignore the context

time to read 2 min | 364 words

We focused on opening the transaction, but we also have the context management to deal with. In this case, we have:

And you can see that it cost us over 11.5% of the total request time. When we started this optimization phase, by the way, it took about 14% of the request time, so along the way our previous optimizations have also helped us to reduce it from 42 μs to just 35 μs.

Drilling into this, it became clear what is going on:



The problem is that on each context release we will dispose the ByteStringContext, and on each operation allocation we’ll create a new one. That was done as part of performance work aimed at reducing memory utilization, and it looks like we were too aggressive there. We do want to keep those ByteStringContexts around if we are going to immediately use them again, after all.

I changed it so the actual disposal will only happen when the context is using fragmented memory, and the results were nice.
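
For illustration, the shape of the change is roughly this (a sketch; the pool and the HasFragmentedMemory check are my assumptions, not the actual RavenDB API):

    // Keep ByteStringContext instances for reuse instead of disposing them on every
    // release; only throw one away when its memory has become fragmented.
    private readonly ConcurrentQueue<ByteStringContext> _contextPool =
        new ConcurrentQueue<ByteStringContext>();

    public ByteStringContext Allocate()
    {
        ByteStringContext context;
        if (_contextPool.TryDequeue(out context))
            return context;             // common case after warm up: no allocation

        return new ByteStringContext(); // only allocate when the pool is empty
    }

    public void Release(ByteStringContext context)
    {
        if (context.HasFragmentedMemory) // assumption: some way to detect fragmentation
        {
            context.Dispose();           // fragmented contexts are not worth keeping
            return;
        }
        _contextPool.Enqueue(context);   // otherwise reuse it for the next operation
    }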


This is a much nicer profiler output to look at, to be frank. Overall, we took the random reads scenario and moved it from a million reads in 342 seconds (under the profiler) to 264 seconds (under the profiler). Note that those times are cumulative; since we run that in multiple threads, the actual clock time for this (in the profiler) is just over a minute.

And that just leaves us with the big silent doozy of GetDocumentsById, which now takes a whopping 77% of the request time.

Optimizing read transaction startup time: Getting frisky

time to read 7 min | 1268 words

As in the last post, I’m focusing on reducing the startup time for transactions. In the last post, we focused on structural changes (removing Linq usage, avoiding O(N^2) operations) and we were able to reduce our cost by close to 50%.

As a reminder, this is what we started with:

And this is where we stopped on the last post:

Now, I can see that we spend quite a bit of time in the AddIfNotPresent method of the HashSet. Since we previously removed any calls to write-only transactional state, this means that we have something in the transaction that uses a HashSet and, in this scenario, adds just one item to it. Inspecting the code showed us that this was the PagerStates variable.

Transactions need to hold the PagerState so they can ensure that the pagers know when the transaction starts and ends. And we do that by calling AddRef / Release on it at the appropriate times. The nice thing about this is that we don’t actually care if we hold the same PagerState multiple times. As long as we called the same number of AddRef / Release, we are good. Therefore, we can just drop the HashSet in favor of a regular list, which gives us:
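
The shape of that change is roughly the following (a sketch with approximate member names, not the actual Voron code):

    // A plain list is enough: duplicates are harmless as long as every AddRef has a
    // matching Release, and List.Add is much cheaper than HashSet.Add.
    private readonly List<PagerState> _pagerStates = new List<PagerState>();

    public void EnsurePagerStateReference(PagerState state)
    {
        state.AddRef();           // balanced by the Release in Dispose below
        _pagerStates.Add(state);  // no hashing, no duplicate check
    }

    public void Dispose()
    {
        foreach (var state in _pagerStates)
            state.Release();
    }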


So that is about a second and a half we just saved in this benchmark. But note that we still spend quite a bit of time on the List.Add method. Looking deeper into this, we can see that all of this time is spent here:


So the first Add() requires an allocation, which is expensive.

I decided to benchmark two different approaches to solving this. The first is to just define an initial capacity of 2, which should be enough to cover most common scenarios. This resulted in the following:


So specifying the capacity upfront had a pretty major impact on our performance, dropping it by another full second. The next thing I decided to try was to see if a linked list would be even better. This list is typically very small, and the only iteration we do on it is during disposal, anyway (and it is very common to have just one or two entries).

That said, I’m not sure that we can beat the List performance when we have specified the size upfront. A LinkedList.Add() requires allocation, after all, and a List.Add just sets a value.


So… nope, we won’t be using this optimization.

Now, let us focus back on the real heavyweights in this scenario: GetPagerStatesOfAllScratches and GetSnapshots. Together they take about 36% of the total cost of this scenario, and that is just stupidly expensive. Here we can utilize our knowledge of the code and realize that those values can only ever be changed by a write transaction, and never by a read transaction. That gives us an excellent opportunity to do some caching.

Here is what this looks like when we move the responsibility of creating the pager states of all scratches to the write transaction:
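
Conceptually, the change looks something like this (a sketch under my own naming assumptions, not the actual code):

    // Only a write transaction can change the scratch files, so it refreshes the
    // cached array while it holds the write lock; read transactions just reuse it.
    private PagerState[] _pagerStatesOfAllScratches; // published by the write transaction

    // called from the write transaction, e.g. while committing
    private void RefreshPagerStatesOfAllScratches()
    {
        var states = new List<PagerState>(_scratchBuffers.Count); // assumption: the scratch file pool
        foreach (var scratch in _scratchBuffers)
            states.Add(scratch.Value.File.PagerState);
        _pagerStatesOfAllScratches = states.ToArray();
    }

    // called when opening a read transaction: no enumeration, no allocation
    public PagerState[] GetPagerStatesOfAllScratches()
    {
        return _pagerStatesOfAllScratches;
    }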


Now let us do the same for GetSnapshots()… which gives us this:


As a reminder, LowLevelTransaction.ctor started out with 36.3 seconds in this benchmark; now we are talking about 6.6 seconds. So we reduced that cost by over 82%.

And the cost of a single such call is down to 7 microseconds under the profiler.

That said, the cost of OpenReadTransaction started out at 48.1 seconds, and we dropped it to 17.6 seconds. So we had a 63% reduction in cost, but it looks like we now have more interesting things to look at than the LowLevelTransaction constructor…


The first thing to notice is that EnsurePagerStateReference ends up calling _pagerStates.Add(), and it suffers from the same cost issue, because it needs to increase the list’s capacity.


Increasing the initial capacity resulted in a measurable gain.


With that, we can move on to analyze the rest of the costs. We can see that the TryAdd on the ConcurrentDictionary is really expensive*.

* For a given value of “really”. It takes just under 3 microseconds to complete, but that is still a big chunk of what we can do here.

The reason we need this call is that we need to track the active transactions. This is done because we need to know which is the oldest running transaction, for MVCC purposes. The easiest thing to do there was to throw that into a concurrent dictionary, but that is expensive for these kinds of workloads. I have switched it to a dedicated class that allows us to do better optimizations around it.

The design we ended up going with is a bit complex (more after the profiler output), but it gave us this:


So we are at just over a third of the cost of the concurrent dictionary. And we did that using a dedicated array per thread, so we don’t have contention. The problem is that we can’t just do that; we need to read all of those values, and we might be closing a transaction from a different thread. Because of that, we split the logic up. We have an array per thread that contains a wrapper class, and we give the transaction access to that wrapper class instance. So when it is disposed, it will clear the value in the wrapper class.

Then we can reuse that instance later in the original thread, once the memory write has reached the original thread. And until then, we’ll just have a stale read on that value and ignore it. It is more complex, and took a bit of time to get right, but the performance justifies it.
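
To give a sense of the design, here is a simplified sketch (this is not the actual RavenDB class; the slot limit, the Id property and the reuse policy are my assumptions):

    public sealed class ActiveTransactionHolder
    {
        // null means the slot is free; other threads may see the write late,
        // which is fine, since readers tolerate stale values
        public volatile LowLevelTransaction Transaction;
    }

    public sealed class ActiveTransactionsRegistry
    {
        private const int MaxTransactionsPerThread = 64; // assumption

        // one array of holders per thread, so registration has no contention;
        // trackAllValues lets us scan every thread's holders when we need the oldest tx
        private readonly ThreadLocal<ActiveTransactionHolder[]> _holders =
            new ThreadLocal<ActiveTransactionHolder[]>(CreateHolders, trackAllValues: true);

        private static ActiveTransactionHolder[] CreateHolders()
        {
            var holders = new ActiveTransactionHolder[MaxTransactionsPerThread];
            for (int i = 0; i < holders.Length; i++)
                holders[i] = new ActiveTransactionHolder();
            return holders;
        }

        public ActiveTransactionHolder Register(LowLevelTransaction tx)
        {
            var local = _holders.Value;
            for (int i = 0; i < local.Length; i++)
            {
                // a slot freed from another thread may still look taken here (stale read);
                // we simply skip it and reuse it later, once the write becomes visible
                if (local[i].Transaction == null)
                {
                    local[i].Transaction = tx;
                    return local[i];
                }
            }
            throw new InvalidOperationException("Too many concurrent transactions on this thread");
        }

        // called from the transaction's Dispose, possibly on a different thread
        public void Unregister(ActiveTransactionHolder holder)
        {
            holder.Transaction = null;
        }

        public long OldestActiveTransactionId()
        {
            long oldest = long.MaxValue;
            foreach (var holders in _holders.Values)
            foreach (var holder in holders)
            {
                var tx = holder.Transaction; // may be stale, acceptable for this purpose
                if (tx != null && tx.Id < oldest)
                    oldest = tx.Id;
            }
            return oldest;
        }
    }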

Current status is that we started at 48.1 seconds for this benchmark, and now we are at 14.7 seconds for the OpenReadTransaction. That is a good day’s work.

Optimizing read transaction startup time: The low hanging fruit

time to read 5 min | 849 words

The benchmark in question deals with serving 1 million random documents, as fast as possible. In my previous post, I detailed how we were able to reduce the cost of finding the right database to about 1.25% of the cost of the request (down from about 7% – 8%). In this post, we are going to tackle the next problem:


So 14% of the request time goes into just opening a transaction?! Well, that makes sense, surely that means that there is a lot going on there, some I/O, probably. The fact that we can do that in less than 50 microseconds is pretty awesome, right?

Not quite, let us look at what is really costing us here.


Just look at all of those costs. None of this is stuff that you just have to deal with because there is I/O involved. Let us look at GetPagerStatesOfAllScratches, shall we?


I’ll just sit down and cry now, if you don’t mind. Here is what happens when you take this code and remove the Linqness.
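
For flavor, the transformation is along these lines (member names are made up for illustration; the real method works over Voron’s scratch buffers):

    private List<PagerState> GetPagerStatesOfAllScratches()
    {
        // before: return _scratchBuffers.Select(x => x.Value.File.PagerState).ToList();

        // after: a plain loop, pre-sized, with no iterator or delegate allocations
        var pagerStates = new List<PagerState>(_scratchBuffers.Count);
        foreach (var scratch in _scratchBuffers)
            pagerStates.Add(scratch.Value.File.PagerState);
        return pagerStates;
    }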


This is purely a mechanical transformation, we have done nothing to really change things. And here is the result:


Just this change improved the performance of this method call by over 100%! As this is no longer our top-rated function, we’ll look at ToList now.

This method is called from here:


And here is the implementation:


And then we have:



Well, at least my job is going to be very easy here. The interesting thing is that the Any() call can be removed / moved to DEBUG only. I changed the code to pass the JournalSnapshots into GetSnapshots, saving us an allocation and all those Any() calls. That gave us:


So far, we have managed to shave ten seconds off the cost of opening a transaction. We have done this not by being smart or doing anything complex. We just looked at the code and fixed the obvious performance problems in it.

Let’s see how far that can take us, shall we?

The next observation I had was that Transaction is actually used for both read and write operations, and there is quite a bit of state in the Transaction that is only used for writes. However, this is a benchmark measuring pure read speed, so why should we be paying all of those costs needlessly? I basically moved all the field initializers to the constructor, and only initialize them if we are opening a write transaction. Just to give you some idea, here is what I moved:
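
The real change moved half a dozen fields; a reduced sketch of the pattern (with invented field names) looks something like this:

    // Before: these were field initializers, so every transaction paid for them.
    // After: they are only allocated when the transaction is actually a write transaction.
    private Dictionary<long, long> _dirtyPages;
    private List<PagerState> _pagerStatesToRelease;
    private Dictionary<int, PagerState> _scratchPagerStates;

    public Transaction(TransactionFlags flags)
    {
        if (flags != TransactionFlags.ReadWrite)
            return; // a read transaction never touches any of this state

        _dirtyPages = new Dictionary<long, long>();
        _pagerStatesToRelease = new List<PagerState>();
        _scratchPagerStates = new Dictionary<int, PagerState>();
    }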


So those are six non-trivial allocations that have been moved off the hot path, and a bunch of memory we won’t need to collect. As for the impact?


We are down to about half of our initial cost! And we haven’t really done anything serious yet. This is quite an awesome achievement, but we are not done. In my next post, we’ll actually move to utilizing our knowledge of the code to make more major changes in order to increase overall performance.

Multiple optimization passes with case insensitive routing

time to read 5 min | 938 words

The following benchmark is from a small database containing about 500K documents, doing random load by id. As you can see, I highlighted a problematic area:


We spent a lot of time optimizing routing as much as possible, so I was pretty annoyed when I saw a profiler output showing that we spend 7% – 8% of our time handling routing.

Actually, that is a lie. We spent most of our time looking up what database we should be using.

I decided to simplify the code a lot, to get down to the essentials, and this is the hotspot in question:
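
A minimal stand-in for that hotspot (not the actual routing code) looks something like this, where the cached value type is whatever the database handle happens to be:

    // Lookup of databases by name, where the name comparison is case insensitive.
    public class CaseInsensitiveResourceCache<T>
    {
        private readonly Dictionary<string, T> _byName =
            new Dictionary<string, T>(StringComparer.OrdinalIgnoreCase);

        public void Add(string name, T value)
        {
            _byName[name] = value;
        }

        public bool TryGetValue(string name, out T value)
        {
            // every lookup pays for a case insensitive hash and comparison
            return _byName.TryGetValue(name, out value);
        }
    }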


We can’t really test this in something like BenchmarkDotNet, because we need to see how it works when using multiple threads and concurrent access. We care more about the overall performance than about a single invocation.

So I tested it by spinning up 32 threads that would call the class above (initialized with 10 different values) with the following keys:

  • Oren
  • oren
  • oRen

Each of the threads would process as many of those calls as it could in the span of 15 seconds, and we would then tally up the results. The code above gives me 89 million calls per second, which is impressive. Except that this is actually able to utilize GetCaseInsensitiveHash, which is an internal call (written in C++) that is extremely efficient. My string segment code, on the other hand, is far slower.


However, if I give up on OrdinalIgnoreCase in the code above, we get 225 million operations/sec, so there is definitely performance left on the table.

The first attempt was to introduce a bit of smarts: if we have a match by case, we can check that first and still be faster than the case insensitive version. The code looks like this:
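
Sketched out, the idea is along these lines (illustrative, not the production code; Database stands in for whatever we actually cache):

    // Check an exact-case dictionary first; only fall back to the (more expensive)
    // case insensitive dictionary on a miss.
    private readonly Dictionary<string, Database> _byExactCase =
        new Dictionary<string, Database>(StringComparer.Ordinal);
    private readonly Dictionary<string, Database> _byAnyCase =
        new Dictionary<string, Database>(StringComparer.OrdinalIgnoreCase);

    public bool TryGetValue(string name, out Database value)
    {
        if (_byExactCase.TryGetValue(name, out value)) // ordinal hash + compare: cheap
            return true;

        return _byAnyCase.TryGetValue(name, out value); // the slow, case insensitive path
    }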

This gave a performance of 76 million ops/sec when running a mixed-case workload, and 205 million ops/sec when the case always matched. That was awesome, but we still missed something. This optimization will kick in only if you actually have an exact case match, and it is very common to miss that. In fact, we noticed this because after we applied this optimization, we created a different benchmark where we got a case mismatch, and hit the same perf issue.

So the next attempt was to actually learn on the fly. The basic idea is that we still have the two dictionaries, but when we have a miss at the first level, we’ll add the entry to the case sensitive dictionary based on what was searched. In this way, we can learn over time, and then most of the calls would be very fast. Here is the code:
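
Building on the two dictionaries from the sketch above, the “learn as you go” version is roughly this (again, not the actual implementation; thread safety is ignored here, which is part of why the real code later moved to a ConcurrentDictionary):

    public bool TryGetValue(string name, out Database value)
    {
        if (_byExactCase.TryGetValue(name, out value))
            return true;

        if (_byAnyCase.TryGetValue(name, out value) == false)
            return false;

        // remember the exact casing that was just used, so the next lookup
        // with the same casing hits the fast path
        _byExactCase[name] = value;
        return true;
    }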

And we get 167 million ops/sec using it.

Moving the code to a ConcurrentDictionary upped this to 180 million ops/sec.

And this is the final result of actually implementing this:


Cost went down from 29.2 seconds to 6.3 seconds! There is still significant cost here around using the concurrent dictionary, but drilling down shows that we are stuck:


This is all high end framework code. But we can do better. Instead of calling the framework, and passing this through multiple chains of calls, we can just compare the memory values directly, like so:
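
Something along these lines (a sketch over plain strings; the real code works on its own string segment type):

    // Compare the raw characters of the two strings directly, eight bytes at a time,
    // instead of going through the framework's comparison machinery.
    public static unsafe bool MemoryEquals(string a, string b)
    {
        if (a.Length != b.Length)
            return false;

        fixed (char* pa = a)
        fixed (char* pb = b)
        {
            long* la = (long*)pa;
            long* lb = (long*)pb;
            int longCount = (a.Length * sizeof(char)) / sizeof(long);

            for (int i = 0; i < longCount; i++)
            {
                if (la[i] != lb[i])
                    return false;
            }

            // compare whatever characters are left after the 8 byte chunks
            for (int i = longCount * (sizeof(long) / sizeof(char)); i < a.Length; i++)
            {
                if (pa[i] != pb[i])
                    return false;
            }
        }
        return true;
    }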


And this results in:


So we dropped the cost from 6.3 seconds (29.2 initially!) to 5 seconds.

Although, let us take a deeper look at this, shall we?


It looks like the actual costs we have here for finding the data are now dominated by the call to ResourceCache.TryGetValue. A small change there gave us:


So we saved over 250 ms in our test run, and a total of 6.36% of our runtime cost.

What was the change?


The parts outlined in yellow are new. So instead of having a pretty big method, we now have a very small one that does the happy case, and the rest is in the unlikely method that will be called rarely.
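
The pattern itself, stripped of the details, is simply this (names are invented; _cache and LoadDatabase are placeholders for the real lookup and creation logic):

    public Database Resolve(string name)
    {
        // the tiny happy-path method: a single cheap lookup, easy for the JIT to inline
        Database db;
        if (_cache.TryGetValue(name, out db))
            return db;

        return UnlikelyResolve(name);
    }

    [MethodImpl(MethodImplOptions.NoInlining)] // keep the bulky, rare code out of the hot path
    private Database UnlikelyResolve(string name)
    {
        var db = LoadDatabase(name); // the expensive part: load / create the database
        _cache.Add(name, db);
        return db;
    }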

Timing the time it takes to parse time: Part II

time to read 3 min | 536 words

There are times when you write clean, easy to understand code, and there are times when you see 50% of your performance going into DateTime parsing, at which point you’ll need to throw nice code out the window, put on some protective gear and seek out that performance hit that you need so much.

Note to the readers: This isn’t something that I recommend you do unless you have considered it carefully, you have gathered evidence in the form of actual profiler results that show that it is justified, and you have covered it with good enough tests. The only reason that I was actually able to do anything here is that I know so much about the situation. The dates are strictly formatted, the values are stored as UTF8 and there are no cultures to consider.

With that said, it means that we are back to C-style number parsing and processing:
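
A compressed sketch of the approach (the real routine handles more formats and edge cases, and also checks that the digit positions actually contain digits): the dates are strictly formatted, e.g. "yyyy-MM-ddTHH:mm:ss.fffffff" in UTF8, so we can validate the separators up front and then read each number from its fixed offset.

    public static bool TryParseDate(byte[] buffer, out DateTime dt)
    {
        dt = default(DateTime);
        if (buffer.Length < 27 ||
            buffer[4] != '-' || buffer[7] != '-' || buffer[10] != 'T' ||
            buffer[13] != ':' || buffer[16] != ':' || buffer[19] != '.')
            return false;

        int year   = ParseDigits(buffer, 0, 4);
        int month  = ParseDigits(buffer, 5, 2);
        int day    = ParseDigits(buffer, 8, 2);
        int hour   = ParseDigits(buffer, 11, 2);
        int minute = ParseDigits(buffer, 14, 2);
        int second = ParseDigits(buffer, 17, 2);
        int ticks  = ParseDigits(buffer, 20, 7); // 100ns ticks within the second

        dt = new DateTime(year, month, day, hour, minute, second, DateTimeKind.Utc)
                .AddTicks(ticks);
        return true;
    }

    private static int ParseDigits(byte[] buffer, int start, int count)
    {
        int value = 0;
        for (int i = start; i < start + count; i++)
            value = value * 10 + (buffer[i] - '0');
        return value;
    }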

Note that the code is pretty strange: we do upfront validation of the string, then parse all those numbers, then plug it all together.

The tests we run are:

Note that I’ve actually realized that I’ve been forcing the standard CLR parsing to go through a conversion from byte array to string on each call. This is actually what we need to do in RavenDB to support this scenario, but I decided to test it without the allocations as well.

All the timings here are in nanoseconds.


Note that the StdDev for those tests is around 70 ns, and this usually takes about 2,400 ns to run.

Without allocations, things are better, but not by much. StdDev goes down to 50 ns, and the performance is around 2,340 ns, so there is a small gain from not doing allocations.

Here are the final results of the three methods:


Note that my method’s entire runtime is about the same as the StdDev of the alternative, with an average of 90 ns or so and a StdDev of 4 ns. Surprisingly, LegacyJit on X64 was the fastest of them all, coming in at almost 60% of the LegacyJit on X86 time, and 20% faster than RyuJit on X64. Not sure why, and dumping the assembly at this point is quibbling, honestly. Our perf cost just went down from 2,400 ns to 90 ns. In other words, we are now able to do the same work at 3.66% of the cost. Figuring out how to push it further down to 2.95% seems to insult the 96% perf that we gained.

And anyway, that does leave us with some spare performance on the table if this ever becomes a hotspot again*.

* Actually, the guys on the performance team are going to read this post, and I’m sure they won’t be able to resist improving it further.

Timing the time it takes to parse time: Part I

time to read 2 min | 344 words

This is a small part from a larger benchmark that we run:


The index in question is using a DateTime field, and as you can see, quite a lot of time is spent in translating that. 50% of our time, in fact. That is… not so nice.

The question now is why do we do it? Well, let us look at the code:
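
The code in question is roughly of this shape (reconstructed for illustration, not the exact source):

    private static bool TryGetDateValue(byte[] buffer, int size, out DateTime dt)
    {
        // the value arrives as UTF8 bytes, so we allocate a string on every call...
        var value = Encoding.UTF8.GetString(buffer, 0, size);

        // ...and then pay for parsing the pattern and the value, every single time
        return DateTime.TryParseExact(
            value,
            "yyyy-MM-ddTHH:mm:ss.fffffff",
            CultureInfo.InvariantCulture,
            DateTimeStyles.RoundtripKind,
            out dt);
    }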


Here we can see several things. First, there is the small issue of us allocating the string to check if it is a date, but that isn’t where the money is. That is located in the TryParseExact.

This method is actually quite impressive. Given a pattern, it parses the pattern, then it parses the provided string. And if we weren’t calling it hundreds of thousands of times, I’m sure that it wouldn’t be an issue. But we are, so we are left with writing our own routine to do this in a hard-coded manner.

I built the following benchmark to test this out:


As you can see, this is pretty much identical to our code, and should tell us how good we are. Here are the benchmark results:


Mean             StdDev
2,458.2915 ns    102.7071 ns
2,506.7353 ns    142.7946 ns
2,443.4806 ns     51.4903 ns

In my next post, I’ll show what I came up with that can beat this.

The beauty of a fully utilized machine

time to read 4 min | 617 words

System administrators like to see graphs with server utilization sitting at the very low end of the scale. That means that they don’t need to worry about spikes, capacity or much of anything; they are way over-provisioned, and that means less waking up at night.

That works very well, until you hit a real spike, hit some sort of limit, and then have to scramble to upgrade your system while under fire, so to speak. [I have plenty of issues with the production team behavior as described in this post, but that isn’t the topic for this post.]

So one of the things that we regularly test is a system being asked to do something that is beyond its limits. For example, having 6 indexes running at the same time, indexing at different speeds, over a dataset that is 57GB in size.

The idea is that we will force the system to do a lot of work that it could typically avoid by having everything in memory. Instead, we’ll be forced to page data, and we need to see how we behave under such a scenario.

Here is what this looks like from the global system behavior.


If you show this to most admins, they will feel faint. That usually means Something Is About To Break Badly.

But things are slightly better when we look at the details:


So what do we have here? We have a process that (at the time of the run) has mapped about 67 GB of files and has allocated 8.5 GB of RAM. However, only about 4.5 GB of that is actively used, and the rest of the working set is actually the memory mapped files. That leads to an interesting observation: if most of your work is local and transient (so you scan through sections of the file, like we do during indexing), the operating system will load those pages from disk and keep them around until there is memory pressure, at which point it will look at all of those nice pages that are just sitting there, unmodified and with a source on disk.

That means that the operating system can immediately discard them, without having to page them out. So they are very cheap. Oh, we’ll still need to load the data from disk into them, but we’ll have to do that anyway, since we can’t fit the entire dataset into memory.

So that means that our allocation strategy basically goes something like this:

  • Ignore the actual free space the operating system reports.
  • Instead, take into account the private working set and compare it to the actual working set.

The private working set is what would go into the page file; it mostly consists of managed memory and whatever unmanaged allocations we have to do during the indexing. So by comparing the two, we can tell how much of the used memory is actually taken up by memory mapped files. We are careful to ensure that we leave about 25% of the system memory to the memory mapped files (otherwise we’ll do a lot of paging), but that still gives us leeway to use quite a lot of memory to speed things up, and we can negotiate between the threads to see who is faster (and thus deserves more memory).
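
In rough C# terms, the budgeting described above boils down to something like this (a simplified sketch; the names and the 25% threshold handling are illustrative, not the actual memory manager):

    // Given the counters discussed above, work out how much of the working set is
    // really just memory mapped files, and how much we can hand to indexing threads.
    public static void AnalyzeMemory(
        long totalPhysicalMemory,   // what the machine has
        long workingSet,            // total working set of the process
        long privateWorkingSet,     // what would go to the page file (managed + unmanaged)
        out long mappedFileMemory,
        out long indexingBudget)
    {
        // the part of the working set the OS can discard cheaply under pressure
        mappedFileMemory = workingSet - privateWorkingSet;

        // keep roughly a quarter of the machine for the memory mapped data
        long reservedForMappedFiles = totalPhysicalMemory / 4;

        // the rest is what the indexing threads get to negotiate over
        indexingBudget = Math.Max(0, totalPhysicalMemory - reservedForMappedFiles - privateWorkingSet);
    }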

The nine hours micro benchmark

time to read 4 min | 739 words

We’ve been running all sorts of benchmarks on RavenDB, ranging from micro optimizations to a single heavily used routine (focused on re-ordering the instructions so the CPU can execute them in parallel) to large scale load testing and data processing.

This tale is about the latter. We have defined 6 indexes over a data set of over 18 million documents with a total size of 57GB. We then ran it on a system that had 16GB of RAM (and typically had a lot less actually available).

When we started, we very quickly ran out of memory; in fact, our allocations exceeded the database size (so we allocated over 60 GB) before we decided to focus on fixing that. The problem was the scope of the batch: it was too long, and we didn’t reuse the memory inside the same batch. Once that was fixed, we were able to run the performance test successfully.

Total time to run it? Over 9 hours. Granted, this is talking about 6 indexes each needing to go over the entire dataset, so we expect it to take a while, and it is much faster than in previous versions, but that is still far too much.

I should probably explain that the whole point of doing something like that is to see the interference effects. What happens when you have a bunch of competing indexes over the same resources (CPU, memory, disk)?

Turns out, there is a lot going on there, and you can’t really get good profiling results from this kind of run (to start with, the profiling overhead would push this into a multi day effort), and while we captured some profiler runs of shorter scenarios, we weren’t really able to pinpoint the blame.

So we added the tooling we needed to figure it out, and then put that in the studio. The end result was… not so good.


Yes, that is the studio trying to display the debug information for that benchmark. It… did not go so well for us. In other words, before we could fix the benchmark, we had to optimize the code that would tell us where we were spending all that time in the benchmark.

As it turned out, the problem was that our code was optimizing for problems we no longer had. Basically, in RavenDB 3.x we have to take into account our memory usage, and pretty much all of it is managed memory. That leads to some interesting problems, to wit: we don’t know how much memory we use. The fact that a document is such and such a size when serialized to disk means exactly squat with regards to how much it is actually going to take in memory. And because too many allocations lead to higher GC costs down the road, we put limits on the size of the indexing batches we’ll process.

Effectively, we will take a certain number of documents to process in a batch, and then complete the batch. All the memory we used in the batch will go away, and hopefully we didn’t push too much of it into the higher generations. But the real costs that we have in RavenDB 4.x are very different. To start with, we do far fewer managed operations, and we use Voron for pretty much everything. That means that our costs have now shifted: instead of worrying about releasing the memory as quickly as possible, we need to worry about reducing the number of times we go to disk.

As it turns out, artificially reducing the batch size results in us processing more batches, which requires us to hit the disk a lot more. The same logic applies to RavenDB 3.x (and we have users who have configured RavenDB to have very long batches for exactly that reason), but there it comes at a GC cost that simply does not exist in RavenDB 4.0.

The immediate solution was to remove all the batch limits and see what would happen. Overall performance doubled. We were able to process the same amount of information in about half the time. And that is before we did a deep dive with a profiler to seek out inefficiencies.

Performance analysis: Simple indexes

time to read 2 min | 374 words

I outlined the scenarios in my previous post, and I want to focus on each of them one at a time. We’ll start with the following simple map index:


In RavenDB 4.0, we have done a lot of work to make this scenario as fast as possible. In fact, this has been the focus of a lot of architectural optimizations. When running a simple index, we keep very few things in managed memory, and even those are relatively transient. Most of our data is in memory mapped files. That means, no high GC cost because we have very few objects getting pushed to Gen1/Gen2. Mostly, this is telling Lucene “open wide, please” and shoving as much data inside as we can.

So we are seeing much better performance overall. But that isn’t to say that we don’t profile (you always do). So here are the results of profiling a big index run:

And this is funny, really funny. What you are seeing there is that about 50% of our runtime is going into those two functions.

What you don’t know is why they are here. Here is the actual code for this:


The problem is that the write.Delete() call can be very expensive, especially in the common case of needing to index new documents. So instead of just calling it all the time, we first check if we have previously indexed this document, and only then do we delete it.
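
The pattern is essentially this (a sketch with invented names, not the actual indexing code; LuceneIndexWriter here stands for RavenDB’s own writer wrapper):

    private readonly HashSet<string> _previouslyIndexed = new HashSet<string>();

    private void Index(LuceneIndexWriter writer, string key, object document)
    {
        // write.Delete() is very expensive, so only pay for it when the document
        // could actually have been indexed before (the post suggests a bloom filter
        // could eventually take this set's place)
        if (_previouslyIndexed.Contains(key))
            writer.Delete(key);

        writer.Add(key, document);
        _previouslyIndexed.Add(key);
    }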

As it turns out, those hugely costly calls are still a very big perf improvement overall, but we’ll probably replace them with a bloom filter that can do the same job at a much lower cost.

That said, notice that the runtime cost of those two functions together is 0.4 ms under the profiler. So while I expect a bloom filter to be much better, we’ll certainly need to double check that, and again, only the profiler can tell.

