Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.

Production postmortem: The case of the “it is slow on that machine (only)”

time to read 4 min | 661 words

A customer called with a major issue: on one of their machines, a particular operation took too long. In fact, it took quite a bit more than too long. Instead of the expected few milliseconds (or, at worst, seconds), the customer saw values in the many minutes.

At first, we narrowed it down to an extreme load on the indexing engine. The customer had a small set of documents that were referenced, using LoadDocument, by a large number of other documents. That meant that whenever one of those documents was updated, we would need to reindex all the referencing documents.
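
To make this concrete, here is a minimal sketch of what such an index might look like (the Order/Customer model and the index name are hypothetical, not the customer’s actual code):

// Hypothetical model, for illustration only.
public class Orders_ByCustomerName : AbstractIndexCreationTask<Order>
{
    public Orders_ByCustomerName()
    {
        Map = orders => from order in orders
                        let customer = LoadDocument<Customer>(order.CustomerId)
                        select new
                        {
                            // Any change to the referenced Customer forces
                            // re-indexing of every Order that loaded it.
                            CustomerName = customer.Name
                        };
    }
}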

In their case, that meant tens to hundreds of thousands of referencing documents in some cases. So an update to a single document could force re-indexing of a quarter million documents. Except… that this wasn’t actually the case. What drove everyone crazy was that this was a reasonable, truthful and correct answer. And yet on one machine the exact same thing took 20 – 30 seconds, while on the customer machine the process took 20 minutes.

The customer also assured us that the documents that everyone referenced were very rarely, if ever, touched or modified, so that shouldn’t have been the issue.

The machine with the problem was significantly more powerful than the one without the problem. The issue had also started to occur recently, out of the blue. Tracing the resource utilization in the system showed moderate CPU usage, low I/O and memory consumption, and nothing much really going on. We looked at the debug logs, and we couldn’t really figure out what it was doing. There were very large gaps in the log where nothing seemed to be happening. % Time in GC was low, so that ruled out a long GC pause that would explain the gaps in the logs.

This is in version 2.5, which predates all of our introspection efforts, so figuring out what was going on was pretty hard. I’ll have another post talking about that in this context later.

Eventually we gained access to the machine, were able to reproduce this, and took a few mini dumps along the way. Looking at the stack traces, we found this:


And now it all became clear. Suggestions in RavenDB is a really cool feature, which allows you to ask RavenDB to figure out what the user actually meant to ask. It is also extremely CPU intensive during indexing, which is really visible when you try to pump a large number of documents through it. And it is a single threaded process.
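
For reference, this is roughly how suggestions were enabled in an index definition with the client API of that era (the index and the exact method names here are illustrative; the API varied a bit between versions):

public class Users_ByName : AbstractIndexCreationTask<User>
{
    public Users_ByName()
    {
        Map = users => from user in users
                       select new { user.Name };

        // Marks the Name field for suggestions. Once this ships in an index,
        // it stays in effect even if the application never queries for suggestions.
        Suggestion(x => x.Name);
    }
}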

Except that the customer wasn’t using Suggestions in their application…

So, what happened? In order to hit this issue, the following things all needed to happen:

  • Suggestions had to be enabled on the relevant index/indexes. Check – while the customer wasn’t currently using the feature, they had used it in the past, and unfortunately that setting stuck.
  • A very large number of documents needed to be indexed. Check – that happened when they updated one of the commonly referenced documents.
  • A commonly referenced document needed to be modified. Check – that happened when they started work for next year, which modified those rarely touched documents.

Now, why didn’t it manifest itself on the other machines? Simple: on those machines, they used the current version of the application, which didn’t use suggestions. On the problematic machines, they had upgraded to the new version, so even though they weren’t using suggestions, the setting was still in effect, and still running.

According to a cursory check, those suggestions had been running there for over 6 months, and no one noticed, because you needed the confluence of all three aspects to actually hit this issue.

Removing the suggestions once we knew they were there was very easy, and the problem was resolved.

Configuration is user input too

time to read 3 min | 587 words

Every programmer knows that input validation is important for good application behavior. If you aren’t validating the input, you will get… interesting behavior, to say the least.

The problem is that developers generally don’t consider that the system configuration is also user input, and it should be treated with suspicion. It can be something as obvious as a connection string that is malformed, or just wrong. Or it can be that the user has put executionFrequency=“Monday 9AM” into the field. Typically, at least, those are fairly easy to discover and handle. Although most systems have fairly poor behavior when their configuration doesn’t match their expectations (you typically get some error, usually with very few details, and frequently the system won’t start, so you need to dig into the logs…).

Then there are the perfectly valid and legal configuration items, such as dataDirectory=”H:\Databases\Northwind”, when the H: drive is inaccessible for some reason (a SAN drive that was snatched away). Or listenPort=”1234” when the port is currently busy and cannot be used. Handling those gracefully is something that devs tend to forget, especially if this is an unattended application (a service, for example).
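
A minimal sketch of the kind of up-front validation this calls for (illustrative only, not RavenDB’s actual startup code): check the perfectly “legal” values against reality before the system commits to them.

using System;
using System.IO;
using System.Net;
using System.Net.Sockets;

static class ConfigurationValidator
{
    public static void Validate(string dataDirectory, int listenPort)
    {
        // A syntactically valid path is worthless if the drive isn't there.
        if (Directory.Exists(dataDirectory) == false)
            throw new InvalidOperationException(
                "dataDirectory '" + dataDirectory + "' does not exist or is not accessible");

        try
        {
            // Probe the port now, so "port already in use" is a clear startup
            // error instead of a mysterious failure on the first request.
            var listener = new TcpListener(IPAddress.Loopback, listenPort);
            listener.Start();
            listener.Stop();
        }
        catch (SocketException e)
        {
            throw new InvalidOperationException(
                "listenPort " + listenPort + " cannot be used: " + e.Message, e);
        }
    }
}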

The problem is that we tend to assume that the people modifying the system configuration actually know what they are doing, because they are the system administrators. That is not always the case, unfortunately, and I’m not just talking about issues of a wrong path or a busy port.

In a recent case, we had a sys admin call us about high memory usage when creating an index. As it turned out, they had set the minimum number of items to load from disk to 10,000. And they had large documents (in the MBs range). This configuration meant that before we could index, we had to load 10,000 documents into memory (each of them about 1 MB on average), and only then could we start indexing. That means 10GB of documents to load before indexing could even begin (and indexing has its own memory costs). That resulted in pushing other stuff out of memory, and in general slowed things down considerably, because each indexing batch had to be at least 10GB in size.

We also couldn’t reduce the memory usage by reducing the batch size (as would normally be the case under memory pressure), because the minimum amount was set so high.
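
The effect is easy to see in a sketch of how an adaptive batch size might be chosen (illustrative only, not RavenDB’s actual tuning code):

// Under memory pressure we want to shrink the batch, but the configured
// minimum always wins: set it to 10,000 large documents and every batch stays huge.
static int NextBatchSize(int currentBatchSize, bool underMemoryPressure,
                         int configuredMinimum, int configuredMaximum)
{
    var next = underMemoryPressure
        ? currentBatchSize / 2
        : Math.Min(currentBatchSize * 2, configuredMaximum);

    return Math.Max(next, configuredMinimum);
}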

In another case, a customer was experiencing a high I/O write rate. When we investigated, it looked like this was because of a very high fan out rate in the indexes. There is a limit on the number of fan out entries per document, and it is there for a reason, but it is something that we allow the system admin to configure. They had disabled this limit and went on with very high fan out rates, with predictable results.

So now the problem is what to do?

On the one hand, accepting administrator input about how to tune the system is pretty much required. On the other hand, to quote a recent ops guy I spoke to: “How this works? I have no idea, I’m in operations :-), I just need it to work”, referring to a service that his company wrote.

So the question is: how do you handle such a scenario? And no, writing warnings to the log won’t do; no one reads those.

How saving a single byte cost me over a day (and solved with a single byte)

time to read 3 min | 516 words

I’m currently doing something that I would rather not do. I’m doing significant refactoring in the lowest level of RavenDB, preparing some functionality that I need in order to develop some user visible features. But in the meantime, I’m mucking about in the storage layer, dealing with how Voron actually persists and manages the information we have. And I’m making drastic changes. That means that stuff breaks, but we have got the tests to cover for us, so we are good in that regard, yeah!

The case for this post is that I decided to break apart a class that had too many responsibilities. The class in question is:


As you can see, there are multiple concerns here. There are actually several types of Page in the system, but changing this would have been a major change, so I didn’t do that at the time, and just tucked more responsibility into this poor class.

When I finally got around to actually handling this, it was part of a much larger body of work, but I felt really great that I was able to do that.

Until I ran the tests, and a few of them broke. In particular, only the tests that deal with a large amount of information (over four million entries) broke. And they broke in a really annoying way; it looked like utter memory corruption was happening. It was scary, and it took me a long while to figure out what was going on.

Here is the fix:


This error happens when adding a new entry to a tree. In particular, this piece of code will only run when we have a page split on a branch page (that is, we need to move from 2 levels in the tree to 3 levels).

The issue turned out to be because when we split the Fixed Size Tree from the Variable Size Tree, I was able to save a single byte in the header of the fixed size tree page. The previous header was 17 bytes in size, and the new header was 16 bytes in size.

The size of BranchEntrySize is 16 bytes… and I think you get the error now, right?

Before, with a page size of 4KB, we had a PageMaxSize of 4079 bytes. So on the 255th entry, we would be over the limit, and split the page. By shaving a single byte from the header, PageMaxSize became 4080, and because we only checked for greater than, we thought that we could write to that page, but we ended up writing into the next page, corrupting it.

The fix, ironically enough, was to add another byte, to check for greater than or equal :-).
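
Here is a sketch of the boundary check in question (the names are illustrative, not Voron’s actual code):

const int BranchEntrySize = 16;
const int PageMaxSize = 4096 - 16; // 4080 with the new 16-byte header

// Buggy: 255 entries need exactly 4080 bytes, and 4080 > 4080 is false,
// so we "fit" and the write spills into the next page.
static bool NeedsSplit(int requiredSize)
{
    return requiredSize > PageMaxSize;
}

// Fixed: an exactly-full page forces a split as well.
static bool NeedsSplitFixed(int requiredSize)
{
    return requiredSize >= PageMaxSize;
}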

Production postmortem: The case of the Unicode Poo

time to read 4 min | 635 words

We got an error report from a customer about a migration issue from 2.5 to 3.0. A particular document appeared to have been corrupted, and caused issues.

We have an explicit endpoint to expose the raw database bytes to the client, so we can troubleshoot exactly those kinds of errors. For fun, this is a compressed database, so the error was hiding beneath two levels of indirection, but that is beside the point.

When looking at the raw document’s bytes, we saw:


Which was… suspicious. It took a while to track down, but the end result was that this error would occur when you have:

  • A large string (over 1KB), and it is the first large string in the document.
  • At byte position 1023 of the string, a multi-byte, multi-character value.

In those cases, we wouldn’t be able to read the document.

The underlying reason was an optimization we made in 3.0 to reduce buffer allocations during deserialization of documents. In order to do that, we used an Encoding Decoder directly, without any intermediate buffers. This works great, except in this scenario, because of the way JSON.Net calls us.

When JSON.Net finds a large string, it will repeatedly read characters from the stream until it reaches the end of the stream, and only then will it process it. If the string size is more than the buffer size, it will increase the buffer.

Let us imagine the following string: “ABC💩”, that is, the letters A, B, C followed by U+1F4A9 (the pile of poo emoji).


When we serialize it, it looks like this:

var bytes = new byte[] { 65, 66, 67, 0xF0, 0x9F, 0x92, 0xA9 };

And let us say that we want to read that in a buffer of 4 characters. We’ll use it like so:

var decoder = Encoding.UTF8.GetDecoder(); // a UTF-8 decoder, needed for the snippet to run
int bytesPos = 0;
int charsPos = 0;
var chars = new char[4];
while (bytesPos < bytes.Length) // read whole buffer
{
    while (charsPos < chars.Length) // process chars in chunks
    {
        int bytesUsed;
        int charsUsed;
        bool completed;
        decoder.Convert(bytes, bytesPos, bytes.Length - bytesPos,
            chars, charsPos, chars.Length - charsPos, false,
            out bytesUsed, out charsUsed, out completed);
        bytesPos += bytesUsed;
        charsPos += charsUsed;
    }
    Console.WriteLine(new string(chars));
}

On the first call, the Convert will convert the first three bytes into three characters, and stop. The JSON.Net code will then ask it to fill to the end of the buffer (simulated by the inner loop), but at that point, the Convert method will throw, because it has just one character available in the buffer to write to, but it can’t write that character.

Why is that? Look at the poo string above. How many characters does it take?

If you answered four, you are correct visually, and wrong in the buffer sense. This string actually takes 5 characters to represent. As I mentioned, in order to hit this error, we had to have a particular set of things align just right (or wrong). Even a single space difference would align things so that no multi-byte character would span the 1KB boundary.
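
A quick way to convince yourself of the count (the emoji is written with its UTF-16 escape sequence here):

using System;
using System.Text;

class Program
{
    static void Main()
    {
        var s = "ABC\uD83D\uDCA9"; // "ABC" followed by U+1F4A9, the pile of poo

        Console.WriteLine(s.Length);                      // 5 - the emoji is a surrogate pair (2 chars)
        Console.WriteLine(Encoding.UTF8.GetByteCount(s)); // 7 - matches the byte array above
    }
}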

The solution, by the way, was to drop the optimization. Sadly, we’ll probably revisit this at a later time, but for now we’ll have a way to confirm that this scenario is also covered.

Critique this code: The memory mapped file

time to read 4 min | 654 words

We got an error in the following code, in production. We try hard to make sure that we have good errors, which allow us to troubleshoot things easily.

In this case, the code… wasn’t very helpful about it. Why? Take a look at the code, I’ll explain why below…

public CodecIndexInput(FileInfo file, Func<Stream, Stream> applyCodecs)
{
    try
    {
        this.file = file;
        this.applyCodecs = applyCodecs;

        fileHandle = Win32NativeFileMethods.CreateFile(file.FullName,
            Win32NativeFileShare.Read | Win32NativeFileShare.Write | Win32NativeFileShare.Delete,
            /* ... */);

        if (fileHandle.IsInvalid)
        {
            const int ERROR_FILE_NOT_FOUND = 2;
            if (Marshal.GetLastWin32Error() == ERROR_FILE_NOT_FOUND)
                throw new FileNotFoundException(file.FullName);
            throw new Win32Exception();
        }

        mmf = Win32MemoryMapNativeMethods.CreateFileMapping(fileHandle.DangerousGetHandle(), IntPtr.Zero,
            Win32MemoryMapNativeMethods.FileMapProtection.PageReadonly,
            0, 0, null);
        if (mmf == IntPtr.Zero)
            throw new Win32Exception();

        basePtr = Win32MemoryMapNativeMethods.MapViewOfFileEx(mmf,
            0, 0, UIntPtr.Zero, null);
        if (basePtr == null)
            throw new Win32Exception();

        stream = applyCodecs(new MmapStream(basePtr, file.Length));
    }
    catch (Exception)
    {
        /* ... */
    }
}

Did you see it?

This code has multiple locations in which it can throw Win32Exception. The problem is that a Win32Exception in this mode is pretty much just an error code, and we have multiple locations inside this method that can throw it.

When that happens, if we don’t have the PDB files deployed, we have no way of knowing, just from the stack trace (without line numbers), which of the method calls had caused the error. That is going to lead to some confusion after the fact.

We solved this by adding descriptive text for each of the failure cases, including additional information that will let us know what is going on. In particular, we included not only the operation that failed but, even more important, the file on which it failed.
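
For illustration, this is the general shape of that change (a sketch, not the exact patch): pass the last Win32 error explicitly and say what we were doing and on which file.

// Sketch only; the same pattern is applied to each failing call in the constructor.
if (mmf == IntPtr.Zero)
    throw new Win32Exception(Marshal.GetLastWin32Error(),
        "Could not create file mapping for " + file.FullName);

if (basePtr == null)
    throw new Win32Exception(Marshal.GetLastWin32Error(),
        "Could not map view of file " + file.FullName);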

Production postmortem: The case of the man in the middle

time to read 3 min | 553 words

One of the most frustrating things when you are dealing with production issues is when the problem is not in our product, but elsewhere. In particular, this post is dedicated to the hard work done by many anti virus products to make our life harder.

Let us take a look at the following quote, taken from the ESET NOD32 Anti Virus knowledge base (emphasis mine):

By default, your ESET product automatically detects programs that are used as web browsers and email clients, and adds them to the list of programs that the internal proxy scans. This can cause loss of internet connectivity or other undesired results with applications that use network features but are not web browsers/email clients.

Yes, it can. In fact, it very often does.

Previously, we looked at a similar issue with Anti Virus slowing down I/O enough to cause us to slowly die. But in this case, the issue is a lot more subtle.

Because it is doing content filtering, it tends to put a much higher overhead on the system resources, which means that as far as the user is concerned, RavenDB is slow. We actually developed features specifically to handle this scenario. The traffic watch mode will tell you how much time you spend on the server side, and we have added a feature that will make RavenDB account for the internal work each query is doing, so we can tell where the actual cost is.

You can enable that by issuing:

GET databases/Northwind/debug/enable-query-timing

And once that is set up, you can get a good idea about what is costly in the query, as far as RavenDB is concerned. Here is an example of a very slow query:


You can see that the issue is that we are issuing a very wide range query, so most of the time is spent inside Lucene. Other examples might be ridiculously complex queries, which result in high parsing time (we have seen queries in the hundreds of KB range). Or loading a lot of big documents, or… you get the drift. If we see that the server thinks that a query is fast, but the overall time is slow, we know to blame the network.

But an even more insidious issue is that this would drop requests, consistently and randomly (and yes, I know that those are contradictions; it was consistently dropping requests in a random pattern that seemed explicitly designed to thwart figuring out what is going on). Leading to things breaking, and escalated support calls. “RavenDB is broken” leads to a lot of headache, and a burning desire to hit something when you figure out that not only isn’t it your fault, but the underlying reason is actively trying to prevent you from figuring it out (I assume that is to deal with viruses that try to shut it off), which leads to really complex fact finding sessions.

That is even more annoying because it seems that the issue there was a bug in respecting keep-alive sessions for authenticated requests under some scenarios, in the AV product in question! Absolutely not fun!

Pointer arithmetic and dynamic HTML generation FTW–at 2 AM

time to read 2 min | 388 words

I just finished writing the following code, the time is 2 AM local time, and I am chasing a conceptual issue deep in the guts of Voron (RavenDB’s internal storage system).


By any measure, this code should frighten anyone who reads it, for what I think we can all safely assume are self-evident reasons.

So why am I so very happy about this kind of code? Because the output of this code is this:


This is a pretty trivial tree view of the internal data structures inside of Voron.

But the point here isn’t that this code is probably going to be a no-go zone by tomorrow morning. This is code explicitly designed for that. In fact, the calling method for this is:


So this is a piece of code that we are trying very hard to ensure will never be called unless we are actively requesting it in a debug session. In other words, this isn’t production code. It is scaffolding code. The code that you’ll call when you’ve lost your way, and can no longer figure out what is going on using the standard tools.

This kind of visualization is a cheap & dirty trick, but it means that instead of having to reason about this kind of thing, I can immediately see what is going on.

For that matter, if you look closely at the visualization, you might be able to see the bug in question. It is pretty obvious, once we are able to look at the entire data set in one go. The more complex the algorithm, the harder it is to conceptualize it. But in most cases, if we could but look at the data in a way that would make sense to us, we will be able to immediately see what we did wrong.
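
For the curious, this is the general shape of such scaffolding (a hypothetical Node type, not Voron’s actual debug code): walk the structure and dump it as nested HTML lists so the whole data set can be eyeballed at once.

using System.Collections.Generic;
using System.Net;
using System.Text;

class Node
{
    public string Label;
    public List<Node> Children = new List<Node>();
}

static class TreeDumper
{
    public static string Render(Node root)
    {
        var html = new StringBuilder("<ul>");
        RenderNode(root, html);
        return html.Append("</ul>").ToString();
    }

    private static void RenderNode(Node node, StringBuilder html)
    {
        html.Append("<li>").Append(WebUtility.HtmlEncode(node.Label));
        if (node.Children.Count > 0)
        {
            html.Append("<ul>");
            foreach (var child in node.Children)
                RenderNode(child, html);
            html.Append("</ul>");
        }
        html.Append("</li>");
    }
}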

Production postmortem: The case of the intransigent new database

time to read 3 min | 480 words

A customer called us to tell us that they had a problem with RavenDB. As part of their process for handling new customers, they would create a new database, set up indexes, and then direct all the queries for that customer to that database.

Unfortunately, this system that has worked so well in development died a horrible death in production. But, and this was strange, only for new customers, and only in the create new customer stage. The problem was:

  • The user would create a new database in RavenDB. This just creates a db record and its location on disk. It doesn’t actually initialize the database.
  • On the first request, we initialize the db, creating it if needed. The first request will wait until this happens, then proceed.
  • On their production systems, that first request (which they used to create the indexes they require) would time out with an error.

Somehow, the creation of a new database would take way too long.

The first thought we had was that they were creating the database on the path of an already existing database, maybe a big one that had a long initialization period, or maybe one that required recovery. But the customer confirmed that they were creating the database in an empty folder.

We looked at the logs, and the logs just showed a bunch of time where there was no activity. In fact, we had a single method call to open the database that took over 15 seconds to run. Except that on a new database, this method just creates a bunch of files to start things out and finishes really quickly.

That is the point that led us to suspect that the issue was environmental. Luckily, as the result of many such calls, RavenDB comes with a pretty basic I/O test tool. I asked the customer to run this on their production system, and I got the following:

And now everything was clear. They were running on an I/O constrained system (a cloud machine), and they were running into an interesting problem. When RavenDB creates a database, it pre-allocates some files for its transactional journal.

Those files are 64MB in size, and the total write for a new Esent RavenDB database with the default configuration is just over 65MB. If your write throughput is less than 1MB/sec sustained, that will be problematic: at that rate, just writing those 65MB takes more than a minute, far longer than the timeout on that first request.

I let the customer know about the configuration option to take less space at startup (Esent RavenDB databases can go as low as 5MB, Voron RavenDB starts at 256Kb), but I also gave them a hearty recommendation to make sure that their I/O rates improved, because this isn’t going to be the only case where slow I/O will kill them.

Production postmortem: The case of the infected cluster

time to read 8 min | 1592 words

RavenDB is a critical system for many of our clients, which means that we get our share of “the sky is falling” stories, and we have a team of people we train to be able to diagnose and troubleshoot issues in production. This series of posts is about the kinds of issues that we run into, how they were resolved, and what the underlying issue was.

The scenario: a customer had a total system meltdown because RavenDB would stop responding to requests. Restarting the server would result in everything working, but within about 10 minutes, they would run into the same issue.

Complicating matters was the fact that the customer was running in a high availability cluster, with two quite powerful nodes that had failover setup. However, the problem would typically impact a single server, eventually leading to it stopping processing requests, while the second server would be humming along beautifully. When the first server would stop, the second server would continue for a few more minutes, then die in the same manner.

The issue was clearly related to the load on those servers. The website is pretty busy, with millions of queries to RavenDB per hour, but we had no reason to think that the system was over capacity. Why is that?

  • The system had been running for over a month on the same configuration, with moderate memory consumption and low CPU usage throughout that time.
  • The last upgrade to the RavenDB servers happened about a week ago, and no new traffic patterns should cause that load.
  • The last RavenDB client upgrade happened that morning, but it was a minor change and should reduce load on the server. But the problem would show up within 10 – 20 minutes, and the client upgrade happened several hours before it started happening.
  • The operations team swore that there were no changes made to either server recently.

The behavior was puzzling because we didn’t have high CPU, high memory or paging, the typical indications that the system is under too much load. Furthermore, I/O rates were quite good, and testing the disk speed and throughput showed that everything was all right.

We tried to revert all recent changes, to both clients and servers. We monitored the servers to see what happened when the problem recurred, and eventually we figured out what the probable issue was. The affected server would slowly show a growth pattern in the number of threads being used, slowly climbing upward of 700 threads. The non-impacted server would sit at around 100 threads or so, until the first server failed, at which point all load would hit the okay server. It would be fine for a while, then its thread count would start climbing, request timings would increase, and eventually things would stop.

One thing that was very interesting was that RavenDB would still be responding (albeit a bit badly) under those conditions, and as far as it was concerned there was no indication of any issue. But under this failure condition we did see that we started getting a rather strange error.

The server is too busy, could not acquire transactional access

This rather poorly worded message is an indication that we hit an internal limit inside RavenDB. By default, we have a limit of 512 concurrent requests inside the server. That is done to make sure that we aren’t swamping the server with too many requests that would take it down. The good news was that this is a configurable value, and we were able to set Raven/MaxConcurrentServerRequests to a higher value to get a little bit more breathing space. Once we had done that, the system would still become slower, but it would stabilize at some point, even though it was still much slower than what we had come to expect of it.
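
To illustrate the idea behind that limit (a minimal sketch, not RavenDB’s actual implementation): once a fixed number of requests are already in flight, reject new ones outright instead of letting an unbounded backlog take the whole server down.

using System;
using System.Threading;

class ConcurrentRequestGate
{
    private readonly SemaphoreSlim semaphore;

    public ConcurrentRequestGate(int maxConcurrentRequests)
    {
        semaphore = new SemaphoreSlim(maxConcurrentRequests, maxConcurrentRequests);
    }

    // Call at the start of each request; dispose the result when the request completes.
    public IDisposable Enter()
    {
        if (semaphore.Wait(TimeSpan.Zero) == false)
            throw new InvalidOperationException(
                "The server is too busy, could not acquire transactional access");
        return new Releaser(semaphore);
    }

    private class Releaser : IDisposable
    {
        private readonly SemaphoreSlim semaphore;
        public Releaser(SemaphoreSlim semaphore) { this.semaphore = semaphore; }
        public void Dispose() { semaphore.Release(); }
    }
}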

Intermediary conclusions were:

  1. Something is causing requests to take a lot longer than they should.
  2. Because they take longer than they should, we see a convoy of requests.
    • We can’t handle requests as fast as they come, so they start to pile up.
    • The thread pool (which dispatches requests) is handling this by slowly increasing the number of threads.
    • But we still aren’t able to catch up with the number of requests coming in.
    • At some point, the number of concurrent requests hits the limit, and we start rejecting some requests
  3. We would see major slowdowns because the thread pool would increase the number of threads in the pool only at a slow and measured pace.
  4. We don’t have any idea what is causing us to slow down.

Remember, this is a system that has been in production for a while now. In fact, the current traffic numbers are actually only about one third of the peak that was hit shortly after the system was released, and we handled that load with no issue.

We came back to the drawing board, and started looking at pretty much anything suspicious. Memory was reasonable, with a few dozen GB still free. CPU utilization was moderate, with over half of the system capacity unused. I/O rates from the disk, according to performance monitor, were excellent. Everything was good. So we took a dump of the process and proceeded to try to figure out what was going on from that.

We saw something like this (spread over 150+ threads and in various modes, but this gets the point across):


95% of the threads were stuck in various I/O related methods. Like NtCreateFile, NtCloseFile, etc. Other threads were hanging in .NET related code that was also doing file I/O. So we can pretty conclusively state that the problem is I/O related.

But what? We had tested the disk I/O, and it was good. We ran I/O tests and they were fine. The measurements we got from the monitoring tools all said that the I/O was fine.

And yet we could see that I/O was the culprit. What the hell?!

At this point, we started suspecting that the issue was with something else in the system. Now, each of the servers was capable of handling the entire load (and then some) on its own. But both machines exhibited the same signs. That led us to believe that this wasn’t some sort of hardware issue. It was something that both machines had.

A common troublemaker in this area is any kind of tool that hooks into sys calls. For example, anti virus software. In fact, that was one of the very first questions that we asked: does this machine have anti virus on it? We got a negative answer early on. But after seeing those results, we went in and double checked. The machine did have an anti virus installed, and that is a huge red flag in those circumstances.

Anti virus software likes to hook into all kinds of system calls, and it is commonly associated with problems because of that. We asked the operations team to disable the anti virus on those servers to see if it would have an impact on the issue. Indeed, they stopped the anti virus and restarted the RavenDB servers, and then we waited. 10 minutes later, we saw no increase in the number of threads, and 25 minutes later, we were holding steady at the same performance that we had come to expect.

Today, the customer sent us their monitoring system reports, and there is a huge drop from 30% – 50% CPU usage to single digits on the day in question, followed by a flat graph for the rest of the week.

Our hypothesis is that the anti virus either auto updated itself, or was actually updated by ops as part of a routine maintenance task, and for some reason it took a liking to the kind of I/O RavenDB was generating. So it inspected us. And as it did so, it added overhead to each and every I/O call we made. And considering that we are a database, I/O is kind of important for us. That additional overhead would cause us to respond much more slowly, which would mean that additional requests would queue up, all the way until we started rejecting them. And we didn’t see this in the I/O metrics for the system because the overhead happens after the I/O is already completed.

I also suspect that this runs for all I/O operations, including things that should be buffered (and therefore fast), so it has a big impact on our assumptions (this I/O call should be buffered, so it is fast; but it is checked by the AV, so it is not, leading to the convoy).

Because the Anti Virus was installed on both servers, they were both impacted, and the more load we put on them, the more the AV had to do, the more it slowed us down, the more requests we had queued, the bigger the trouble.

In other words, be very aware of what is actually going on in your system. As part of the post mortem analysis, we realized that we should probably have run the fltmc.exe tool to check which file system filters were installed; that probably would have let us figure out that this was an Anti Virus issue much faster.

