Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.

Optimizing read transaction startup time: Unicode ate my perf and all I got was 💩

time to read 3 min | 469 words

As an aside, I wonder how much stuff will break because the title of this post has 💩 in it.

The topic of this post is the following profiler output. GetSliceFromKey takes over 6% of our performance, or about 24.5 seconds out of the total run. That kinda sucks.


What is the purpose of this function? Well, RavenDB’s document ids are case insensitive, so we need to convert the key to lower case and then do a search on our internal index. That has quite a big cost associated with it.

And yes, we are aware of the pain. Luckily, we are already working with a highly optimized codebase, so we aren’t seeing this function dominate our costs, but still…

Here is the breakdown of this code:


As you can see, over 60% of this function is spent in just converting to lower case, which sucks. Now, we have some additional knowledge about this function: for the vast majority of cases, it will handle only ASCII characters. Unicode document ids are possible, but relatively rare. We can utilize this knowledge to optimize this method. Here is what this will look like, in code:


Basically, we scan through the string, and if there is a character whose value is over 127 we fall back to the slower method (slower being relative; it is still a full string conversion in less than 25 μs).

Then we just check whether a character is in the upper case range and convert it to lower case (the ASCII bit layout is funny, it was intentionally designed to be used with bit masking, and all sorts of tricks are possible there) and store it in the buffer, or just store the original value.
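To make that concrete, here is a minimal sketch of the fast path (the real method writes into an unmanaged buffer and falls back to the existing Unicode-aware conversion; the names here are assumptions, not RavenDB’s actual code):

```csharp
internal static class AsciiLowerCase
{
    // Sketch only: returns false when a non-ASCII character is found, so the caller
    // can fall back to the slower, Unicode-aware lower-casing.
    public static bool TryLowerCaseAscii(string key, char[] buffer)
    {
        for (int i = 0; i < key.Length; i++)
        {
            char c = key[i];
            if (c > 127)
                return false; // Unicode id, let the slow path handle it

            // 'A'..'Z' differs from 'a'..'z' only by bit 0x20, so setting that bit lower-cases it
            buffer[i] = (c >= 'A' && c <= 'Z') ? (char)(c | 0x20) : c;
        }
        return true;
    }
}
```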

The result?


This method went from taking 5.89% to taking just 0.6%, and we saved over 22 seconds(!) in the run. Again, this is under the profiler, and the code is heavily multi-threaded. In practice, this means that the run took 3 seconds less.

Either way, we are still doing pretty well, and we don’t have that pile of poo anymore, so I guess we’re good.

Optimizing read transaction startup time: Don’t ignore the context

time to read 2 min | 364 words

We focused on opening the transaction, but we also have the context management to deal with. In this case, we have:

And you can see that it cost us over 11.5% of the total request time. When we started this optimization phase, by the way, it took about 14% of the request time, so along the way our previous optimizations have also helped us reduce it from 42 μs to just 35 μs.

Drilling into this, it becomes clear what is going on:



The problem is that on each context release we will dispose the ByteStringContext, and on each allocation we’ll create a new one. That was done as part of performance work aimed at reducing memory utilization, and it looks like we were too aggressive there. We do want to keep those ByteStringContexts around if we are going to immediately reuse them, after all.

I changed it so the actual disposal will only happen when the context is using fragmented memory, and the results were nice.
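Something along these lines, as a rough sketch (the type and member names are assumptions; the real Voron code is more involved):

```csharp
using System;
using System.Collections.Concurrent;

// Stand-in for the real Voron allocator type, just so the sketch compiles.
public class ByteStringContext : IDisposable
{
    public bool HasFragmentedMemory { get; set; }
    public void Dispose() { }
}

public class ContextPool
{
    private readonly ConcurrentQueue<ByteStringContext> _pool = new ConcurrentQueue<ByteStringContext>();

    public ByteStringContext Allocate()
    {
        // Reuse a pooled context when we can, create a fresh one otherwise.
        ByteStringContext context;
        return _pool.TryDequeue(out context) ? context : new ByteStringContext();
    }

    public void Release(ByteStringContext context)
    {
        if (context.HasFragmentedMemory)
        {
            context.Dispose(); // only pay the disposal cost when the memory is fragmented
            return;
        }
        _pool.Enqueue(context); // otherwise keep it warm for the next operation
    }
}
```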


This is a much nicer profiler output to look at, to be frank. Overall, we took the random reads scenario and moved it from a million reads in 342 seconds (under the profiler) to 264 seconds (under the profiler). Note that those times are cumulative; since we run that in multiple threads, the actual clock time for this (in the profiler) is just over a minute.

And that just leaves us with the big silent doozy of GetDocumentsById, which now takes a whopping 77% of the request time.

Optimizing read transaction startup time: Getting frisky

time to read 7 min | 1268 words

As in the last post, I’m focusing on reducing the startup time for transactions. In the last post, we focused on structural changes (removing Linq usage, avoiding O(N^2) operations) and we were able to reduce our cost by close to 50%.

As a reminder, this is what we started with:

And this is where we stopped on the last post:

Now, I can see that we spend quite a bit of time in the AddIfNotPresent method of the HashSet. Since we previously removed any calls to write-only transactional state, this means that we have something in the transaction that uses a HashSet and, in this scenario, adds just one item to it. Inspecting the code showed us that this was the PagerStates variable.

Transactions need to hold the PagerState so they can ensure that the pagers know when the transaction starts and ends. And we do that by calling AddRef / Release on it at the appropriate times. The nice thing about this is that we don’t actually care if we hold the same PagerState multiple times. As long as we call AddRef / Release the same number of times, we are good. Therefore, we can just drop the HashSet in favor of a regular list, which gives us:


So that is about a second and a half we just saved in this benchmark. But note that we still spend quite a bit of time in the List.Add method. Looking deeper into this, we can see that all of this time is spent here:


So the first Add() requires an allocation, which is expensive.

I decided to benchmark two different approaches to solving this. The first is to just define an initial capacity of 2, which should be enough to cover most common scenarios. This resulted in the following:


So specifying the capacity upfront had a pretty major impact on our performance, dropping it by another full second. The next thing I decided to try was to see if a linked list would be even better. This is typically very small, and the only iteration we do on it is during disposal, anyway (and it is very common to have just one or two of those).

That said, I’m not sure that we can beat the List performance when we have specified the size upfront. A LinkedList.Add() requires allocation, after all, and a List.Add just sets a value.
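As a small illustration of that tradeoff (PagerState here is just a stand-in for the real Voron type):

```csharp
using System.Collections.Generic;

public class PagerState { }

public class LowLevelTransactionSketch
{
    // Before: new List<PagerState>() — the very first Add() has to allocate the backing array.
    // After: a capacity of 2 covers the common case of one or two pager states up front.
    private readonly List<PagerState> _pagerStates = new List<PagerState>(capacity: 2);

    public void EnsurePagerStateReference(PagerState state)
    {
        // Just writes into the pre-allocated array; a LinkedList would allocate a node here.
        _pagerStates.Add(state);
    }
}
```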


So… nope, we won’t be using this optimization.

Now, let us focus back on the real heavyweights in this scenario: GetPagerStatesOfAllScratches and GetSnapshots. Together they take about 36% of the total cost of this scenario, and that is just stupidly expensive. Here we can utilize our knowledge of the code and realize that those values can only ever be changed by a write transaction, and are never changed by a read transaction. That gives us an excellent opportunity to do some caching.

Here is what this looks like when we move the responsibility of creating the pager states of all scratches to the write transaction:
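Roughly, the idea looks like this (a sketch with assumed names, not the actual Voron code): the write transaction rebuilds a cached list whenever it changes the scratch files, and read transactions just pick up the current cached instance.

```csharp
using System.Collections.Generic;

public class PagerState { }

public class ScratchBufferPool
{
    // Rebuilt only by write transactions, which are single-writer by design.
    private List<PagerState> _cachedPagerStates = new List<PagerState>();

    // Called from the write transaction whenever the set of scratch files changes.
    public void UpdateCacheForPagerStatesOfAllScratches(List<PagerState> newStates)
    {
        _cachedPagerStates = newStates; // publish a fresh list; readers never mutate it
    }

    // Called when a read transaction starts: no enumeration, no allocation.
    public List<PagerState> GetPagerStatesOfAllScratches()
    {
        return _cachedPagerStates;
    }
}
```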


Now let us do the same for GetSnapshots()… which gives us this:


As a reminder, LowLevelTransaction.ctor started out with 36.3 seconds in this benchmark, now we are talking about 6.6. So we reduced the performance cost by over 82%.

And the cost of a single such call is down to 7 microseconds under the profiler.

That said, the cost of OpenReadTransaction started out at 48.1 seconds, and we dropped it to 17.6 seconds. So we had a 63% reduction in cost, but it looks like we now have more interesting things to look at than the LowLevelTransaction constructor…


The first thing to notice is that EnsurePagerStateReference ends up calling _pagerStates.Add(), and it suffers from the same cost issue, because it needs to increase the capacity.


Increasing the initial capacity resulted in a measurable gain.


With that, we can move on to analyze the rest of the costs. We can see that the TryAdd on the ConcurrentDictionary is really expensive*.

* For a given value of really. It takes just under 3 microseconds to complete, but that is still a big chunk of what we can do here.

The reason we need this call is that we need to track the active transactions. This is done because we need to know who is the oldest running transaction for MVCC purposes. The easiest thing to do there was to throw that into a concurrent dictionary, but that is expensive for this kind of workload. I switched it to a dedicated class that allows us to do better optimizations around it.

The design we ended up going with is a bit complex (more after the profiler output), but it gave us this:


So we are just over a third of the cost of the concurrent dictionary. And we did that using a dedicated array per thread, so we don’t have contention. The problem is that we can’t just do that: we need to read all of those values, and we might be closing a transaction from a different thread. Because of that, we split the logic up. We have an array per thread that contains a wrapper class, and we give the transaction access to that wrapper class instance. So when it is disposed, it will clear the value in the wrapper class.

Then we can reuse that instance later in the original thread, once the memory write has reached the original thread. And until then, we’ll just have a stale read on that value and ignore it. It is more complex, and took a bit of time to get right, but the performance justifies it.
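A rough sketch of that scheme (all names are assumptions, and the real code deals with growth, reuse and memory ordering much more carefully):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Each thread registers transactions in its own slot array, so the hot path has no
// contention; disposing from another thread only clears the wrapper, and the owning
// thread reuses the slot once it sees the cleared value.
public class ActiveTransactions
{
    public class Node
    {
        public long TransactionId = -1; // -1 means the slot is free
    }

    // Global view, used to find the oldest active transaction for MVCC.
    private static readonly ConcurrentBag<Node[]> AllThreadSlots = new ConcurrentBag<Node[]>();

    [ThreadStatic]
    private static Node[] _mySlots;

    public static Node Register(long txId)
    {
        if (_mySlots == null)
        {
            _mySlots = new Node[128];
            for (int i = 0; i < _mySlots.Length; i++)
                _mySlots[i] = new Node();
            AllThreadSlots.Add(_mySlots); // published once per thread
        }

        foreach (var node in _mySlots)
        {
            if (node.TransactionId == -1) // a stale value just looks busy and is skipped
            {
                node.TransactionId = txId;
                return node;
            }
        }
        throw new InvalidOperationException("Too many concurrent transactions for this sketch");
    }

    public static void Unregister(Node node)
    {
        // May run on a different thread than Register; the owning thread will
        // eventually observe the cleared value and reuse the slot.
        Volatile.Write(ref node.TransactionId, -1);
    }

    public static long OldestActiveTransaction()
    {
        long oldest = long.MaxValue;
        foreach (var slots in AllThreadSlots)
            foreach (var node in slots)
            {
                var id = Volatile.Read(ref node.TransactionId);
                if (id != -1 && id < oldest)
                    oldest = id;
            }
        return oldest;
    }
}
```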

Current status is that we started at 48.1 seconds for this benchmark, and now we are at 14.7 seconds for the OpenReadTransaction. That is a good day’s work.

Optimizing read transaction startup time: The low hanging fruit

time to read 5 min | 849 words

The benchmark in question deals with serving 1 million random documents, as fast as possible. In my previous post, I detailed how we were able to reduce the cost of finding the right database to about 1.25% of the cost of the request (down from about 7% – 8%). In this post, we are going to tackle the next problem:


So 14% of the request time goes into just opening a transaction?! Well, that makes sense, surely that means that there is a lot going on there, some I/O, probably. The fact that we can do that in less than 50 microseconds is pretty awesome, right?

Not quite, let us look at what is really costing us here.


Just look at all of those costs; none of this is stuff that you just have to deal with because there is I/O involved. Let us look at GetPagerStatesOfAllScratches, shall we?


I’ll just sit down and cry now, if you don’t mind. Here is what happens when you take this code and remove the Linqness.
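To give a sense of what such a transformation looks like (the types here are stand-ins, not the actual Voron code):

```csharp
using System.Collections.Generic;
using System.Linq;

public class ScratchFile
{
    public object PagerState;
}

public static class DeLinqExample
{
    // LINQ version: hides an iterator, delegate invocations and a list resize or two.
    public static List<object> GetPagerStatesOfAllScratches_Linq(Dictionary<int, ScratchFile> scratches)
    {
        return scratches.Values.Select(x => x.PagerState).ToList();
    }

    // De-LINQed version: same behavior, one pre-sized list, a plain foreach.
    public static List<object> GetPagerStatesOfAllScratches(Dictionary<int, ScratchFile> scratches)
    {
        var results = new List<object>(scratches.Count);
        foreach (var scratch in scratches.Values)
            results.Add(scratch.PagerState);
        return results;
    }
}
```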


This is purely a mechanical transformation, we have done nothing to really change things. And here is the result:


Just this change cut the cost of this method call by more than half! As this is no longer one of our top functions, we’ll look at the ToList now.

This method is called from here:


And here is the implementation:


And then we have:



Well, at least my job is going to be very easy here. The interesting thing is that the Any() call can be removed, or moved to DEBUG builds only. I changed the code to pass the JournalSnapshots into GetSnapshots, saving us an allocation and all those Any calls. That gave us:


So far, we have managed to shave ten seconds off the cost of opening a transaction. We have done this not by being smart or doing anything complex. We just looked at the code and fixed the obvious performance problems in it.

Let’s see how far that can take us, shall we?

The next observation I had was that Transaction is actually used for both read and write operations, and that there is quite a bit of state in the Transaction that is only used for writes. However, this is a benchmark measuring pure read speed, so why should we be paying all of those costs needlessly? I basically moved all the field initializers into the constructor, and only initialize them if we are opening a write transaction. Just to give you some idea, here is what I moved:
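The pattern itself looks roughly like this (the field names below are made up for illustration; they are not the actual six fields):

```csharp
using System.Collections.Generic;

public class Transaction
{
    private readonly bool _isWrite;

    // Write-only state: no field initializers, so read transactions never allocate these.
    private Dictionary<long, object> _modifiedPages;
    private HashSet<string> _deletedTrees;

    public Transaction(bool isWrite)
    {
        _isWrite = isWrite;
        if (_isWrite == false)
            return; // read transactions stay allocation free here

        _modifiedPages = new Dictionary<long, object>();
        _deletedTrees = new HashSet<string>();
    }
}
```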


So those are six non-trivial allocations that have been moved off the hot path, and a bunch of memory we won’t need to collect. As for the impact?


We are down to about half of our initial cost! And we haven’t really done anything serious yet. This is quite an awesome achievement, but we are not done. In my next post, we’ll actually move to utilizing our knowledge of the code to make more major changes in order to increase overall performance.

Multiple optimization passes with case insensitive routing

time to read 5 min | 938 words

The following benchmark is from a small database containing about 500K documents, doing random load by id. As you can see, I highlighted a problematic area:


We spent a lot of time to optimize routing as much as possible, so I was pretty annoyed when I saw a profiler output showing that we spend 7% – 8% of our time handling routing.

Actually, that is a lie. We spent most of our time looking up what database we should be using.

I decided to simplify the code a lot, to get down to the essentials, and this is the hotspot in question:


We can’t really test this in something like Benchmark.NET, because we need to see how it works when using multiple threads and concurrent access. We care more about the overall performance than a single invocation.

So I tested it by spinning 32 threads that would call the class above (initialized with 10 different values) with the following keys:

  • Oren
  • oren
  • oRen

Each of the threads would process as many of those calls as it could in the span of 15 seconds, and we then tally up the results. The code above gives me 89 million calls per second, which is impressive. Except that this is actually able to utilize GetCaseInsensitiveHash, which is an internal call (written in C++) that is extremely efficient. On the other hand, my string segment code is far slower.
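A sketch of that harness (the class under test is reduced here to a read-only dictionary with an OrdinalIgnoreCase comparer, standing in for the routing cache):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class RoutingBenchmark
{
    public static void Main()
    {
        var cache = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        for (int i = 0; i < 10; i++)
            cache["database-" + i] = i;
        cache["Oren"] = 42;

        var keys = new[] { "Oren", "oren", "oRen" };
        long total = 0;
        using (var cts = new CancellationTokenSource(TimeSpan.FromSeconds(15)))
        {
            var threads = new Task[32];
            for (int t = 0; t < threads.Length; t++)
            {
                threads[t] = Task.Factory.StartNew(() =>
                {
                    long count = 0;
                    int value;
                    while (cts.IsCancellationRequested == false)
                    {
                        foreach (var key in keys)
                            cache.TryGetValue(key, out value); // concurrent reads only, no writes
                        count += keys.Length;
                    }
                    Interlocked.Add(ref total, count);
                }, TaskCreationOptions.LongRunning);
            }
            Task.WaitAll(threads);
        }

        Console.WriteLine("{0:N0} calls / sec", total / 15);
    }
}
```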


On the other hand, if I give up on OrdinalIgnoreCase in the code above, we get 225 million operations / sec, so there is definitely performance left on the table.

The first attempt was to introduce a bit of smarts: if we have an exact case match, we can check for that first and still be faster than the case insensitive version. The code looks like this:

This gave a performance of 76 million ops / sec when running a mixed-case workload, and 205 million / sec when always matching the case exactly. That was awesome, but we still missed something. This optimization will kick in only if you actually have an exact case match, and it is very common to miss that. In fact, we noticed this because after we applied the optimization, we created a different benchmark where we got a case mismatch, and hit the same perf issue.

So the next attempt was to actually learn on the fly. The basic idea is that we still have the two dictionaries, but when we have a miss at the first level, we’ll add the entry to the case sensitive dictionary based on what was actually searched for. In this way, we can learn over time, and then most of the calls will be very fast. Here is the code:
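The idea can be sketched like this (shown directly with ConcurrentDictionary, which is where the post ends up; the names are mine, not the actual RavenDB code):

```csharp
using System;
using System.Collections.Concurrent;

public class CaseInsensitiveCache<TValue>
{
    private readonly ConcurrentDictionary<string, TValue> _byExactCase =
        new ConcurrentDictionary<string, TValue>(StringComparer.Ordinal);
    private readonly ConcurrentDictionary<string, TValue> _byIgnoreCase =
        new ConcurrentDictionary<string, TValue>(StringComparer.OrdinalIgnoreCase);

    public void Add(string key, TValue value)
    {
        _byIgnoreCase[key] = value;
        _byExactCase[key] = value;
    }

    public bool TryGetValue(string key, out TValue value)
    {
        if (_byExactCase.TryGetValue(key, out value))
            return true; // fast path: this exact casing was seen before

        if (_byIgnoreCase.TryGetValue(key, out value) == false)
            return false;

        _byExactCase.TryAdd(key, value); // learn this casing for next time
        return true;
    }
}
```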

And we get 167 million ops / second using it.

Moving the code to using a ConcurrentDictionary upped this to 180 million ops / sec.

And this is the final result of actually implementing this:


Cost went down from 29.2 seconds to 6.3 seconds! There is still significant cost here around using the concurrent dictionary, but drilling down shows that we are stuck:


This is all high end framework code. But we can do better. Instead of calling into the framework and passing this through multiple chains of calls, we can just compare the memory values directly, like so:
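What “compare the memory values directly” can look like, as an illustrative sketch (ASCII-only, ordinal-ignore-case, no globalization machinery; this is not the actual RavenDB code):

```csharp
public static class FastCompare
{
    public static bool EqualsIgnoreCaseAscii(string a, string b)
    {
        if (a.Length != b.Length)
            return false;

        for (int i = 0; i < a.Length; i++)
        {
            int x = a[i];
            int y = b[i];
            if (x == y)
                continue;

            // Fold ASCII letters to lower case before giving up.
            if (x >= 'A' && x <= 'Z') x |= 0x20;
            if (y >= 'A' && y <= 'Z') y |= 0x20;
            if (x != y)
                return false;
        }
        return true;
    }
}
```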


And this results in:


So we dropped the cost from 6.3 seconds (29.2 initially!) to 5 seconds.

Although, let us take a deeper look at this, shall we?


It looks like the actual costs we have here for finding the data are now dominated by the call to ResourceCache.TryGetValue. A small change there gave us:


So we saved over 250 ms in our test run, and a total of 6.36% of our runtime cost.

What was the change?


The parts outlined in yellow are new. So instead of having a pretty big method, we now have a very small one that does the happy case, and the rest is in the unlikely method that will be called rarely.
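The pattern, in sketch form (names are made up; the point is that the small happy-path method is cheap for the JIT to inline, while the rarely taken path is kept out of it):

```csharp
using System;
using System.Collections.Concurrent;
using System.Runtime.CompilerServices;

public class ResourceCache<TResource>
{
    private readonly ConcurrentDictionary<string, TResource> _cache =
        new ConcurrentDictionary<string, TResource>(StringComparer.OrdinalIgnoreCase);

    public bool TryGetValue(string name, out TResource resource)
    {
        if (_cache.TryGetValue(name, out resource))
            return true; // the happy case, small enough to be inlined into callers

        return UnlikelyTryGet(name, out resource);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private bool UnlikelyTryGet(string name, out TResource resource)
    {
        // The slow path: normalization, loading the database, etc. would live here.
        resource = default(TResource);
        return false;
    }
}
```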

That is my memory you’re freeing, you foreign thread!

time to read 4 min | 615 words

RavenDB is a pretty big project, and it has been around for quite a while. That means that we have run into a lot of strange stuff over the years. In particular, support incidents are something that we track and try to learn from. Today’s post is about one such lesson. We want to be able to track, on a per thread basis, how much memory is in use. Note that when we say that, we talk about unmanaged memory.

The idea is, once we track it, we can manage it. Here is one such example:


Note that this has already paid for itself when it showed us very clearly (and without using special tools) exactly who was allocating too much memory.

Memory allocation / de-allocation is often a big performance problem, and we are trying very hard not to get painted into performance corners. So a lot of our actual memory usage is allocate once, then keep it around in the thread for additional use. This turns out to be quite useful. It also means that for the most part, we really don’t have to worry about thread safety. Memory allocations happen in the context of a thread, and are released to the thread once an operation is done.

This gives us high memory locality and it avoids having to take locks to manage memory. Which is great, except that we also have quite a bit of async request processing code. And async request processing code will quite gladly jump threads for you.

So that leads to a situation where you allocate memory in thread #17 at the beginning of the request, it waits for I/O, and when it finally completes, the request finishes processing in thread #29. In this case, we keep the memory we got for the next usage in the finishing thread. This is based on the observation that we typically see the following patterns:

  • Dedicated threads for tasks, which do no thread hopping; each has a unique memory usage signature, and will eventually settle on the memory it needs to process everything properly.
  • Pools of similar threads that share roughly the same tasks with one another, and have thread hopping. Over time, things will average out and all threads will have roughly the same amount of memory.

That is great, but it does present us with a problem: how do we account for that? If thread #17 allocated some memory, and it is now sitting in thread #29’s bank, who is charged for that memory?

The answer is that we always charge the thread that initially allocated the memory, even if it currently doesn’t have that memory available. This is because it is frequently the initial allocation that we need to track, and usage over time just means that we are avoiding constant malloc/free calls.

It does present a problem: what happens if thread #29 is freeing memory that belongs to thread #17? Well, we could just decrement the allocated value, but that would force us to always use thread-safe operations, which are more expensive.

Instead, we do this:


If the freeing thread is the same as the allocation thread, we just use simple subtraction, crazy cheap. But if it was allocated from another thread, we do the thread-safe thing. Then we smash both values together to create the final, complete picture.
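A sketch of that accounting scheme (field and type names are assumptions):

```csharp
using System.Threading;

// Frees on the owning thread use plain arithmetic; frees from a foreign thread go
// through Interlocked on a separate counter, and the two are combined on read.
public class ThreadMemoryStats
{
    public readonly int ThreadId;
    private long _allocations;              // touched only by the owning thread
    private long _releasesFromOtherThreads; // touched via Interlocked by anyone

    public ThreadMemoryStats(int threadId)
    {
        ThreadId = threadId;
    }

    public void Allocate(long size)
    {
        _allocations += size; // single writer, no synchronization needed
    }

    public void Release(long size)
    {
        if (Thread.CurrentThread.ManagedThreadId == ThreadId)
        {
            _allocations -= size; // crazy cheap
            return;
        }
        Interlocked.Add(ref _releasesFromOtherThreads, size);
    }

    public long TotalAllocated
    {
        // Smash both values together for the final, complete picture.
        get { return Volatile.Read(ref _allocations) - Interlocked.Read(ref _releasesFromOtherThreads); }
    }
}
```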

RavenDB Retrospective: The governors

time to read 5 min | 860 words

RavenDB’s core philosophy is that It Just Works and that means that we try very hard to get things right. Conversely, that means that we are also trying to make it hard to do the wrong thing. Basically, we want to push you hard into the pit of success.

Part of that approach is what we call the governors. It is a set of features that will detect and abort known bad behavioral patterns. I have already talked about Unbounded Result Sets, and I recently ran into this post, which shows how nasty a problem that can be, and how invisible.

Another governor we have in place is the session’s maximum request limit. A session is meant to be a scope; it has a very short duration and is typically used for a single request / processing a single message, etc. It is supposed to live as long as the business transaction. Because the session is scoped, we can reason that a single session that is making a lot of database operations is probably doing something pretty bad.

For example, it might be calling the database in a loop. Those kinds of issues can be truly insidious. Let us look at the following code (taken from here):
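The shape of the problem is something like this (a sketch using the RavenDB client API and a made-up Order class, not the code from that link):

```csharp
using System;
using Raven.Client; // RavenDB client library

public class Order
{
    public string Id { get; set; }
    public decimal Total { get; set; }
}

public static class OrdersPage
{
    // One remote call per iteration: 200 ids means 200 round trips to the server.
    public static decimal SumOrderTotals(IDocumentSession session, string[] orderIds)
    {
        decimal sum = 0;
        foreach (var id in orderIds)
        {
            var order = session.Load<Order>(id); // a database call inside a loop
            sum += order.Total;
        }
        return sum;
        // Loading all the ids in a single call would be one request instead of N,
        // and the session's request limit is what surfaces this problem early.
    }
}
```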



This kind of thing is a silent performance killer. No one is likely to see this happening, and it will silently increase the number of database operations that your application makes, leading to increased DB load, higher page load times and all sorts of problems associated with it.

In one particular case, I saw a single page load generate 17,000 queries to the database. The software in question grew over time, and people assumed that this was just what it took to run the software. Their database server was a true monster (this was about a decade ago), with dedicated RAM disks, high CPU count and a truly ridiculous amount of memory. Just to explain, we are talking about something like this:


But that was a decade ago, and it had quite a bit of space. Now, this kind of beasty can do 500K IOPS (I’m drooling just thinking about it), but it is damn expensive. Just to put things in perspective, I spent several weeks at that company working on this particular problem; the cost of those weeks of work didn’t even cover the cost of the drive on that machine.

And on that monster, we were seeing page load times in the tens of seconds, and extremely high system load. I was able to bring it down to about 70 queries per page load, and their database server has pretty much idled ever since (IIRC, they turned that machine into a VM host for all the rest of their software, actually).

This is something that can bite.

To avoid that, we have the maximum number of requests per session, which will abort an excessive amount of database chatter. This has two important effects:

  • It follows the principle of “better to let one bad request die than take down the entire application”.
  • It puts a budget on the number of calls that you can make.

Now, that budget is actually really interesting. Because we have it, we need to think about how we can reduce the number of database calls that we need to process the request. That led to a whole bunch of features around that: lazy requests, includes and transformers, to name just a few.

That had a positive unintended consequence. RavenDB is fast, really fast, but it is also typically deployed as a network database; that means that each database call actually goes over the network, and we all remember our fallacies of distributed computing, right?


In our profiling, we found that most often, the real cost in a RavenDB application was the back & forth chatter with the database. Reducing the number of requests we make to the server has an immediate benefit. And RavenDB allows you to do that by pipelining requests with Lazy, predicting requests with Includes or running the whole thing on the server side with Transformers.

And, like all governors, you can control it. RavenDB allows you to decide what the limit should be (on that particular session or globally) based on your actual needs and environment.
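From the client side, that looks roughly like this (API names as I recall them from the RavenDB client of this era; treat it as a sketch rather than exact documentation):

```csharp
using Raven.Client;
using Raven.Client.Document;

public static class GovernorConfiguration
{
    public static void Configure(DocumentStore store)
    {
        // Global default for every session opened from this store.
        store.Conventions.MaxNumberOfRequestsPerSession = 30;

        using (var session = store.OpenSession())
        {
            // Or override it for one particular session that legitimately needs more.
            session.Advanced.MaxNumberOfRequestsPerSession = 100;
        }
    }
}
```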

RavenDB Retrospective: Explicit indexes & auto indexes

time to read 3 min | 492 words

RavenDB doesn’t provide any way for queries to do table scans*.

* That isn’t actually true, we have Data Exploration, which does just that, but we don’t provide an explicit API for it, and it is more of a DBA-driven feature (I wanna get this report with a minimum of fuss, without regard to how much it is going to cost me) than an API that is exposed.

What this means is that the cost of query operations in RavenDB is always going to be O(logN), instead of O(N). How does this relate to the topic of RavenDB retrospectives?

One of the things that I kept seeing over and over as a database consultant was that databases are complex, and that it is easy to write a query that works perfectly fine for a period of time, then falls over completely as the size of the data goes over a certain threshold. In particular, queries that use table scans are especially vulnerable to this issue.

One of the design goals for RavenDB was to avoid that, completely. We did it by simply forbidding any query that doesn’t have an index. Initially, that was a pretty annoying requirement, because every time that you needed a new query, you needed to go ahead and create an index. But early on we got the Auto Indexes feature.

Basically, it means that you can query RavenDB without specifying which index you want to use, at which point the query optimizer will inspect the query and decide which index can serve it. The most interesting point here is that if there isn’t an index that can serve this query, the query optimizer is going to create one on the fly. See the previous post about BASE indexes and how we can afford to do that.

The fun part here is that the query optimizer is actually learning over time, and it will shape its indexes to best fit the kind of queries you are doing. It also makes RavenDB much more robust against New Version Degradation effects. NVD is what happens when you push out a new version which has slightly different queries, which make previously used indexes ineffective, forcing all your queries to become full table scans. Here is an example of the kind of subtle issues that this can cause. With RavenDB, when you use auto indexes (in other words, when you don’t explicitly state which index to use), the query optimizer will take care of that, and it will create all the appropriate indexes (and retire the unused ones) for you.

This in particular is a feature that I’m really proud of; it requires very little from the user, and it gets the Right Thing Done.

The Red Alert Sleeper Agent Bug

time to read 6 min | 1170 words

Today I started out like most recent days, I was working on improving performance and running benchmarks. I made a small change in how we handle file allocation and mapping inside Voron. This is the kind of change that should have no observable effects. And indeed, except for making us run faster, everything worked.

Except that later today I merged some stuff from a colleague and suddenly I started getting invalid memory accesses. After quickly blaming my colleague for the issue, we eventually figured out that it was my change that caused it.

Unfortunately, the problem was a lot more serious than it immediately appeared. It wasn’t just that I needed to fix my code, what was happening there was that I made a certain situation (a new file mapping, and thus, exercising the cleanup routine) a lot more frequent. Which is all well and good, except that this is something that will happen routinely in Voron anyway, it just means that this is now much more likely.

And the problem is that we couldn’t for the life of us figure out why it was failing. Oh, we quickly figured out that we are accessing memory that has been unmapped, but how? The Voron codebase is really careful about such things, and we have quite enough production usage to know that this doesn’t really happen. But again, it might just be a sleeper…

The real problem wasn’t actually with the access violation, that was pretty obvious and would have come to our immediate attention. The problem was that the error looked like the Page Translation Table had a race condition. In this case, because we are much more eager about cleanup, it was obvious that we are accessing old information, but without this to trigger our attention, the fear was that we are actually racy, and that the Page Translation Table will serve incorrect information.

That means that Voron would violate its consistency rule, we’ll effectively be returning random garbage to the user and… Bad Things to Follow.

At various points during the day, we had five different people working on it in three continents, because it is that kind of bug. And we couldn’t figure it out. We traced the code that did that every which way. It is old code, that has been worked on repeatedly, and it has been stable for years. And none of us could figure out what was going on. Theories ranging from cosmic rays to the wrath of Murphy have been thrown out.

Something was very rotten in Voron. Okay, after all of this exposition, let me explain what was going on. We started out with a Page Table that looked like this:


So the first number is the page number, and the second is the page number inside the scratch file (#1 or #2, above).

Basically, this means that when Transaction #3 asks for Page #0, it will actually get Page #238 from scratch #1. And when Transaction #4 asks for the same page, it will get it from Page #482 on scratch #2. If you got lost with the numbers, don’t worry, we did too.

The problem was… the failing issue was in transaction #5, and the problem was that it was accessing page #412 on scratch #1. And due to my change, we actually closed scratch #1. The problem is that we couldn’t figure out how this thing could actually happen. Crazy stuff. We tried reproducing this in all sorts of crazy ways, but it would only fail on the most trivial of tests, and very unpredictably. And then we finally figured it out. Basically by tracing everything, putting locks on everything that moved (or looked like it would move if I kicked it), and plain head against wall, rinse & repeat.

Eventually we focused on what happened around the location of the error. It always happened during a query, that was very consistent, when it happened. And finally we figured it out. We now use Lucene indexes stored inside Voron, and Lucene has some funky ideas about how it should be able to access the data. So we have to put a Voron transaction around the whole thing. And we have to flow the same transaction across multiple Lucene index input instances. So we put the transaction inside a thread local variable. And the query method is async.

I think that you can figure out what happened from here, right? When the async machinery jumped us threads, we would end up with a totally foreign transaction, our old transaction would be gone, and all of the carefully thought out premises that we have for transaction scope went out the window. Much cursing was to be heard.

So we did a quick fix and changed the ThreadLocal<Transaction> to be an AsyncLocal<Transaction>, so it would flow through the async calls. And then we ran the tests, confident this would solve it. But it didn’t; in fact, we got the exact same error, in the exact same place, and we went back to head-butting the wall to see who is smarter.
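The gist of that quick fix, as a sketch (Transaction here is a stand-in for the real Voron type):

```csharp
using System.Threading;

public class Transaction { /* stand-in for the real Voron transaction type */ }

public static class CurrentTransaction
{
    // Before: ThreadLocal<Transaction> — the value is keyed by the physical thread,
    // so it is lost whenever an await resumes on a different thread.
    // private static readonly ThreadLocal<Transaction> _current = new ThreadLocal<Transaction>();

    // After: AsyncLocal<Transaction> — the value flows with the async execution context.
    private static readonly AsyncLocal<Transaction> _current = new AsyncLocal<Transaction>();

    public static Transaction Value
    {
        get { return _current.Value; }
        set { _current.Value = value; }
    }
}
```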

And then I realized that we were doing something else there. Lucene has the notion of cloning an input, which allows for multi-threaded usage of an input. When we do that, we check if we are in the old transaction scope and can reuse previous work, or if we need to do the initialization for a new transaction. The problem was that we were doing this check by id.

Now, two transactions with the same id will always show the same data, period. However, how they do it is very different. Let us take a look at the Page Table diagram above. It shows that Page #412 is located on scratch #2 in position #8327. Now, we have a flushing background process that will take the data from the scratch file and move it to the data file. So the new Page Table will look like this:


Note that because the data on the data file in position #412 and scratch #2 in position #8327 is the same, that doesn’t actually matter. Except that when you have started in one transaction and started reading from scratch #2, then were bumped to another thread, and now are trying to keep on reading from the same place, you end up blowing up entirely.

Once we have fixed this problem as well, all was well with the world. The sky wasn’t falling, and I was writing a blog post at midnight for relaxation.

RavenDB Retrospective: BASE Indexes

time to read 7 min | 1257 words

RavenDB was designed from the get go with an ACID document store and BASE indexes. ACID stands for Atomic, Consistent, Isolated, Durable, and BASE stands for Basically Available, Soft state, Eventually consistent.

That design was conceived by twin competing needs. First, and obviously, a database should never lose data. Second, we want to ensure that the system remains responsive even under load. It is quite common to have spikes in production traffic, and we wanted to be able to handle them with aplomb.

In particular, the kind of promises that are made by RavenDB queries allow us to perform quite a few performance optimizations. In databases that require that all indexes be up to date on transaction commit, you’ll find that there is a very high cost to adding indexes to the system, because each additional index means additional work is needed at transaction commit time. It also makes things such as aggregating indexes (map/reduce, in RavenDB terms) a lot harder to build.

By having BASE indexes, we gain the ability to batch multiple writes into a single index update operation. It also allows us to defer writing the indexes to the disk, avoiding costly I/O operations. But most importantly, by changing the kind of promise that we give to users, we are able to avoid a lot of locks, complexity and hardship inside RavenDB. This may seem like a small thing, but it is actually quite important. Take a look at this study:


In fact, there are a lot of studies on the overhead of locking in database systems, and that has been a hot research topic for many years. By choosing a different architecture, we can avoid a lot of those costs and complexities.

So far, that was the explanation from the point of view of the database creator. What about the users?

Here the tradeoff is more nuanced. On the one hand, there is a certain level of complexity in dealing with the notion that queries on just-inserted data might not include it (stale queries); on the other hand, it means that queries are consistently fast and we can handle spikes in traffic and load much more easily and consistently.

But it is a mental model that can be hard to follow, even when you are familiar with it. Probably the most common issue with RavenDB’s BASE indexes is the case of Post / Redirect / Get. Let us look at how this may play out:

In here, we actually have two requests, one that adds a new order to the system, and the other that fetches the details. If you have redirected to the new order page, everything is going to work as expected, and you won’t notice anything even if the indexes are stale at the time of the request. But a pretty common scenario is to add the new order, and then go and look at the list of orders for this customer, and if the index didn’t have the chance to update between those two requests (which typically happen very quickly) then the customer will not see the new order.

That particular scenario is responsible for the vast majority of the pain we have seen from our users around BASE indexes.

Now, one of the great things about BASE indexes is that the user gets to choose whether they want to wait for the up-to-date results or whether they want whatever is there right now. And we have had mechanisms to control this at a very granular level (including options for personal consistency control, so different customers will have different waits depending on their own previous behavior). But we have found that this puts a lot of responsibility on the developer to control the flow of their users through their applications.

So in RavenDB 3.5 we have changed things a bit. Now, instead of processing the write requests as soon as possible, you can ask the server to wait until the relevant indexes have processed them:


In other words, when you call SaveChanges, it will wait until the indexes have been updated, so when you return from the call, you can be certain that the results of any future queries will include all the changes in that transaction. This moves the responsibility to the write side and makes such scenarios much easier to handle.
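On the client side this looks roughly like the following (the exact signature may differ between client versions, and Order is a made-up document class):

```csharp
using System;
using Raven.Client;

public static class SaveWithIndexes
{
    public class Order
    {
        public string Company { get; set; }
    }

    public static void AddOrder(IDocumentStore store)
    {
        using (var session = store.OpenSession())
        {
            session.Advanced.WaitForIndexesAfterSaveChanges(
                timeout: TimeSpan.FromSeconds(30),
                throwOnTimeout: false);

            session.Store(new Order { Company = "companies/1" });

            // SaveChanges will not return until the relevant indexes have caught up,
            // so any query this user issues next will already see the new order.
            session.SaveChanges();
        }
    }
}
```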

Given all of that, and our experience with RavenDB for the past 8 years or so, we spiked what it would look like with ACID indexes, at least for certain things. The problem is that this pretty much takes out of the equation a lot of the power and flexibility that we get from Lucene (more on why you can’t do that in Lucene in a bit) and forces us to offer what are essentially B+Tree indexes. Those are so limited that we would have to offer:

  • B+Tree indexes – ACID (simple property / range queries). With different indexes needed for different queries and ordering options.
  • Lucene indexes – BASE, full text, spatial, facets, etc queries. Much more flexible and easy to use.
  • Map/reduce indexes – BASE (because you aren’t going to run the full map/reduce during the original transaction).

The problem is that then we would have the continuous burden of explaining when to use which index type, and how to deal with the different limitations. It would also make things much more complex if you have a query that can use multiple indexes, and there are problems associated with creating new ACID indexes on live systems. So it would generate a lot of confusion and complexity for users, for a fairly small benefit that we can already address with the “wait on save” option.

As for why we can’t do it all via Lucene anyway, the problem is that this wouldn’t be sustainable. Lucene isn’t really meant for individual operations; it shines when you push large amounts of data through it. It also doesn’t really have the facilities to be transactional; we have actually solved that particular problem in RavenDB 4.0, but it was neither pretty nor easy, and it doesn’t alleviate the issue of “we do best in large batches”. RavenDB’s BASE indexes are actually designed to take advantage of that particular aspect. Because under load, we’ll process bigger batches and reap the performance benefits that they bring.

BASE indexes also make for much simpler operations. You can define a new index without fearing locking the database, and it enables scenarios such as side by side indexing to update index definitions without impacting the running system.

Finally, a truly massive benefit of BASE indexes is that they allow us to change the following statement: more indexes means faster reads, slower writes. Fewer indexes means slower reads, faster writes. By moving the actual indexing work to a background task, we let the writes go through as fast as they possibly can.

Indexes still have a cost, and the more indexes you have, the higher the cost (we still have to do the work at some point). But in the vast majority of cases, we can squeeze this kind of work in between writes, at times when the database would otherwise be idling.

What that means is that you can have more indexes at the same cost, and that your queries are going to be using those indexes and are going to be fast.


