Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.

Production postmortem: The case of the “it is slow on that machine (only)”

time to read 4 min | 661 words

A customer called with a major issue: on a customer machine, a particular operation took too long. In fact, it took quite a bit more than too long. Instead of the expected few milliseconds (or, at worst, seconds), the customer saw values in the many minutes.

At first, we narrowed it down to an extreme load on the indexing engine. The customer had a small set of documents that were referenced, using LoadDocument, by a large number of other documents. That meant that whenever one of those documents was updated, we would need to reindex all the referencing documents.

In their case, that was tens to hundreds of thousands of referencing documents in some cases, so an update to a single document could force re-indexing of a quarter million documents. Except… that this wasn’t actually the case. What drove everyone crazy was that this was a reasonable, truthful and correct answer, and yet on one machine the exact same thing took 20 – 30 seconds, while on the customer machine the process took 20 minutes.

The customer also assured us that the documents everyone referenced are very rarely, if ever, touched or modified, so that shouldn’t be the issue.

The machine with the problem was significantly more powerful than the one without the problem. The issue had also started to occur recently, out of the blue. Tracing resource utilization on the system showed moderate CPU usage, low I/O and memory consumption, and not much really going on. We looked at the debug logs, and we couldn’t really figure out what it was doing. There were very large gaps in the log where nothing seemed to be happening. % Time in GC was low, which ruled out a long GC pause that would explain the gaps in the logs.

This was in version 2.5, which predates all of our introspection efforts, so figuring out what was going on was pretty hard. I’ll have another post talking about that in this context later.

Eventually we gained access to the machine, were able to reproduce the problem, and took a few mini dumps along the way. Looking at the stack traces, we found this:


And now it all became clear. Suggestions in RavenDB is a really cool feature, which allows you to ask RavenDB to figure out what the user actually meant to ask. It is also extremely CPU intensive during indexing, which is really visible when you try to pump a large number of documents through it. And it is a single-threaded process.

Except that the customer wasn’t using Suggestions in their application…

So, what happened? In order to hit this issue, the following things all needed to happen:

  • Suggestions had to be enabled on the relevant index/indexes. Check – while the customer wasn’t currently using the feature, they had used it in the past, and unfortunately the setting stuck.
  • A very large number of documents needed to be indexed. Check – that happened when they updated one of the commonly referenced documents.
  • A commonly referenced document needed to be modified. Check – that happened when they started work for the next year, which modified those rarely touched documents.

Now, why didn’t it manifest itself on the other machines? Simple: on those machines, they used the current version of the application, which didn’t use suggestions. On the problematic machines, they had upgraded to the new version, so even though they weren’t using suggestions, the setting was still in effect, and still running.

According to a cursory check, those suggestions had been running there for over 6 months, and no one noticed, because you needed the confluence of all three aspects to actually hit this issue.

Removing the suggestions once we knew they were there was very easy, and the problem was resolved.

RavenDB in Action

time to read 3 min | 427 words

Writing books takes a lot of time, and quite a bit of effort. Which is why I was delighted when Itamar Syn-Hershko decided to write the RavenDB in Action book.

You can now get a 39% discount for the book (both physical and electronic versions) using the coupon code: 39ravendb

Itamar has worked at Hibernating Rhinos for several years, mostly dealing with Lucene integration, but really all over the place. Off the top of my head, Itamar is responsible for spatial searches in RavenDB, he wrote the first periodic backup implementation as well as the “backup to cloud” functionality, implemented the server-side document JSON (which supports cheap cloning and is key for some very interesting performance optimizations), and in general has worked all over the RavenDB codebase.

In other words, this is a person who knows RavenDB quite well, and he has done an excellent job of passing on that knowledge with the RavenDB in Action book. This book is likely to be the most up to date resource for getting started with RavenDB.

Itamar has covered working with RavenDB, how indexes work, and how to work with indexes (two distinctly different issues) and, most importantly in my opinion, document based modeling.

Users bringing relational modeling to RavenDB is usually the most common reason they run into trouble. So document modeling is quite important, and Itamar did a good job covering it, both as an independent concept and by contrasting it with the relational model, discussing the pros and cons as they relate to RavenDB.

And, naturally, the book also covers the fun parts of working with RavenDB:

  • Replication for high availability and load balancing
  • Sharding for scalability and increased throughput
  • Geo spatial queries
  • Full text, more like this and reporting queries
  • The advanced API options and how (and when) to use them.

There is even a full chapter talking about how you can extend RavenDB on both the client side and the server side.

Overall, I highly recommend the RavenDB in Action book. If you are using RavenDB, or even if you just want to learn about it, this is a great resource.

And remember that you can get a 39% discount for the book (both physical and electronic versions) using the coupon code: 39ravendb

Configuration is user input too

time to read 3 min | 587 words

Every programmer knows that input validation is important for good application behavior. If you aren’t validating the input, you will get… interesting behavior, to say the least.

The problem is that developers generally don’t consider that the system configuration is also user input, and it should be treated with suspicion. It can be something as obvious as a connection string that is malformed, or just wrong. Or it can be that the user has put executionFrequency=“Monday 9AM” into the field. Typically, at least, those are fairly easy to discover and handle. Although most systems have fairly poor behavior when their configuration doesn’t match their expectations (you typically get some error, usually with very few details, and frequently the system won’t start, so you need to dig into the logs…).

Then there are the perfectly valid and legal configuration items, such as dataDirectory=”H:\Databases\Northwind” when the H: drive is inaccessible for some reason (a SAN drive that was snatched away), or listenPort=”1234” when the port is currently busy and cannot be used. Handling those gracefully is something that devs tend to forget, especially if this is an unattended application (a service, for example).
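Handling those cases gracefully boils down to validating the configuration eagerly at startup, before anything depends on it. A minimal sketch (a hypothetical illustration, not RavenDB’s actual code; the function and parameter names are made up):

```python
import os
import socket

def validate_config(data_directory, listen_port):
    """Validate configuration eagerly at startup and report every problem found."""
    errors = []
    # A syntactically valid path is not enough; the drive may have been snatched away.
    if not os.path.isdir(data_directory):
        errors.append(f"dataDirectory '{data_directory}' is not accessible")
    # Actually try to bind the port instead of assuming it is free.
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind(("127.0.0.1", listen_port))
    except OSError as e:
        errors.append(f"listenPort {listen_port} cannot be used: {e}")
    return errors
```

The point is to collect every problem and fail fast with a clear message, instead of dying later with a cryptic error buried in the logs.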

The problem is that we tend to assume that the people modifying the system configuration actually know what they are doing, because they are the system administrators. That is not always the case, unfortunately, and I’m not talking just about issues of a wrong path or a busy port.

In a recent case, a sys admin called us about high memory usage when creating an index. As it turned out, they had set the minimum number of items to load from disk to 10,000, and they had large documents (in the MBs range). This configuration meant that before we could index, we had to load 10,000 documents into memory (each of them about 1 MB on average), and only then could we start indexing. That means 10GB of documents to load, and then start indexing (which has its own memory costs). That resulted in pushing other stuff from memory, and in general slowed things down considerably, because each indexing batch had to be at least 10GB in size.
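The arithmetic behind that minimum batch is simple enough to sketch (numbers taken from the paragraph above):

```python
min_items_to_load = 10_000   # the misconfigured minimum number of items per batch
avg_document_size_mb = 1     # large documents, roughly 1 MB each on average

# Each indexing batch must load at least this much data before indexing can start.
min_batch_size_mb = min_items_to_load * avg_document_size_mb
min_batch_size_gb = min_batch_size_mb / 1024
print(f"minimum batch size: {min_batch_size_gb:.1f} GB")  # ~10 GB, as in the post
```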

We also couldn’t reduce memory usage by reducing the batch size (as would normally be the case under memory pressure), because the minimum amount was set so high.

In another case, a customer was experiencing a high I/O write rate. When we investigated, it turned out this was because of a very high fan out rate in the indexes. There is a limit on the number of fan out entries per document, and it is there for a reason, but it is something that we allow the system admin to configure. They had disabled this limit and went on with very high fan out rates, with the predictable result of issues because of it.

So now the problem is: what to do?

On the one hand, accepting administrator input about how to tune the system is pretty much required. On the other hand, to quote a recent ops guy I spoke to, referring to a service that his own company wrote: “How this works? I have no idea, I’m in operations, I just need it to work.”

So the question is, how do you handle such a scenario? And no, writing warnings to the log won’t do; no one reads those.

Speaking: Oredev 2015

time to read 1 min | 113 words

I’m going to be at Oredev again this year. In fact, several members of the RavenDB core team are going to be in a booth there, and we are going to be talking about what RavenDB does, what we are doing with it, and where it is going.

We are also going to be giving away a ton of stuff (I know, I have to lug it).

I’ll also be speaking about system architecture with a non relational database, which presents different trade offs and requires a change in thinking to get the best out of your system.

Raw data sections in Voron

time to read 5 min | 810 words

When laying out the design work for RavenDB 4.0, it became clear that we needed more from our storage layer. That required us to modify the way we store the information on disk. Voron is inspired by the LMDB project, and that project basically has a single data type, the B+Tree.

Now, LMDB does some really crazy stuff with this: a single LMDB file can contain any number of B+Trees, and you can also have trees whose values are themselves trees, but that is basically the only thing you can use.

This is roughly what this looks like:


Note that there are several types of values here. Small items can be stored directly inside the B+Tree, while bigger values require separate storage, and we only store the reference to the data in the tree.

As it turns out, this is actually quite an important aspect of how you can optimize behavior inside B+Trees. If you’ve wondered about the difference between clustered and non-clustered indexes, that is exactly it. In a clustered index, the data is directly in the B+Tree. In a non-clustered index, you don’t have the data directly, but need to follow the reference.

Using C#, here are the differences:

// Clustered index: the data is stored directly in the tree
var entry = clusteredIndex.Find(key);
return entry.Data;

// Non-clustered index: the tree stores a reference, requiring a second lookup
var nonClusteredEntry = nonClusteredIndex.Find(key);
var entry = clusteredIndex.Find(nonClusteredEntry.Data);
return entry.Data;

I’m skipping on quite a lot here, but that should give you the gist of it.

Now, where is the problem? The problem is that if all you have is trees, this is pretty much all that you can do. Now, let us consider the case of storing documents inside Voron. What kind of questions do we need to answer?

  • Load documents by id
  • Scan documents by id prefix
  • Scan documents by etag

As it stands, doing this using just trees would require us to:

  • Define a tree to hold the data, where the key is the document id. This allows us to find a document both by id and by key prefix.
  • Define an index for the etag, where the key is the etag, and the value is the document id.

So, let us assume that we have a piece of code that needs to read documents by etag (for indexing, replication, export, etc.):

var enumerator = etagsIndex.ScanFrom(etag); // O(log N)
foreach (var entry in enumerator) // O(~1) for each MoveNext()
{
    var actualDocument = documentsIndex.Find(entry); // O(log N)
    // do something with this
}

So, if I want to read 32,000 documents based on their etags in a database that has a hundred million documents, that is going to perform close to 900,000 comparison operations. (log2(100 M) = 27, and we issue 32K O(logN) queries on the documentsIndex).
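That arithmetic can be checked directly (a quick sketch using the numbers from the paragraph above):

```python
import math

documents_in_db = 100_000_000  # a hundred million documents
docs_to_read = 32_000          # documents fetched by etag

# Each Find() on the documents B+Tree costs about log2(N) comparisons.
comparisons_per_lookup = math.ceil(math.log2(documents_in_db))
total_comparisons = docs_to_read * comparisons_per_lookup
print(comparisons_per_lookup, total_comparisons)  # 27 864000
```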

That can get expensive. And this is a relatively simple example. In some cases, a data item might have five covering indexes, and the access rate and size get very high.

In order to resolve that, we are going to change how we lay out the data on disk. Instead of storing the data directly in the tree, we are going to move the data into separate storage. We call this raw data sections.

A raw data section can be small (for values that are less than 2KB in size) or large. Small values are grouped together in sections that are 2 – 8 MB in length, while each larger value is going to be placed independently, rounded up to the next page size (4KB, by default). The trees will not store the data itself, but references to the data on disk (effectively, the file position of the data).

Note that this has a few implications:

  • Small data is grouped together, and the likely access method will group together commonly inserted values.
  • We won’t need to move the data if something changes (as long as it doesn’t grow too much), so updating a value will usually mean that we don’t need to update all the indexes (because the location on disk didn’t change), unless the value they are covering did change.
  • Trees are going to be smaller, allowing more of them to reside in memory, and it allows us to do interesting tricks, such as PrefetchVirtualMemory calls on the locations from the index before accessing them, so we only pay the I/O cost (if any) once.
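The allocation rule described above (small values packed together, large values rounded up to a page boundary) can be sketched as follows. This is an illustration only; the names are made up, and real sections also carry headers and bookkeeping:

```python
PAGE_SIZE = 4 * 1024          # default page size from the post
SMALL_VALUE_LIMIT = 2 * 1024  # values below this go into shared 2-8 MB sections

def allocated_size(value_size):
    """Disk space taken by a value in a raw data section (sketch)."""
    if value_size < SMALL_VALUE_LIMIT:
        # Small values are packed together with others, so they cost their own size.
        return value_size
    # Each large value is placed independently, rounded up to the next page boundary.
    return ((value_size + PAGE_SIZE - 1) // PAGE_SIZE) * PAGE_SIZE

print(allocated_size(5000))  # 8192: a 5,000 byte value occupies two 4KB pages
```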

This is also quite exciting, because it means that we can share indexes on the data very easily. For example, we can have a “etags per collection” index for each collection, which will be much faster to run because it doesn’t need all the extra lookups.

How saving a single byte cost me over a day (and solved with a single byte)

time to read 3 min | 516 words

I’m currently doing something that I would rather not do. I’m doing significant refactoring at the lowest level of RavenDB, preparing some functionality that I need to develop some user visible features. In the meantime, I’m mucking about in the storage layer, dealing with how Voron actually persists and manages the information we have. And I’m making drastic changes. That means that stuff breaks, but we have got the tests to cover for us, so we are good in that regard, yeah!

The case for this post is that I decided to break apart a class that had too many responsibilities. The class in question is:


As you can see, there are multiple concerns here. There are actually several types of Page in the system, but changing this would have been a major change, so I didn’t do that at the time, and just tucked more responsibility into this poor class.

When I finally got around to actually handling this, it was part of a much larger body of work, but I felt really great that I was able to do that.

Until I ran the tests, and a few of them broke. In particular, only the tests that deal with large amounts of information (over four million entries) broke. And they broke in a really annoying way; it looked like utter memory corruption was happening. It was scary, and it took me a long while to figure out what was going on.

Here is the fix:


This error happens when adding a new entry to a tree. In particular, this piece of code will only run when we have a page split on a branch page (that is, we need to move from 2 levels in the tree to 3 levels).

The issue turned out to be that when we split the Fixed Size Tree from the Variable Size Tree, I was able to save a single byte in the header of the fixed size tree page. The previous header was 17 bytes in size, and the new header was 16 bytes in size.

The size of BranchEntrySize is 16 bytes… and I think you get the error now, right?

Before, with a page size of 4KB, we had a PageMaxSize of 4079 bytes. So on the 255th entry, we would be over the limit, and split the page. By reducing a single byte from the header, the PageMaxSize became 4080, and because we only checked for greater than, we thought that we could write to that page, but we ended up writing to the next page, corrupting it.
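The off-by-one can be checked with a small numeric sketch (a Python illustration with hypothetical names; the actual check lives in Voron’s page-split code):

```python
PAGE_SIZE = 4096
BRANCH_ENTRY_SIZE = 16  # size of a branch entry, per the post

def needs_split(next_write_position, page_max_size, inclusive):
    """The page-split check; 'inclusive' models checking >= instead of >."""
    if inclusive:
        return next_write_position >= page_max_size
    return next_write_position > page_max_size

# Old 17-byte header: PageMaxSize = 4079, so the 255th entry
# (ending at byte 255 * 16 = 4080) correctly triggers a page split.
assert needs_split(255 * BRANCH_ENTRY_SIZE, PAGE_SIZE - 17, inclusive=False)

# New 16-byte header: PageMaxSize = 4080, and 4080 > 4080 is false,
# so the check passes and the write spills into the next page.
assert not needs_split(255 * BRANCH_ENTRY_SIZE, PAGE_SIZE - 16, inclusive=False)

# Checking for greater than *or equal* catches the exactly-full page.
assert needs_split(255 * BRANCH_ENTRY_SIZE, PAGE_SIZE - 16, inclusive=True)
```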

The fix, ironically enough, was to add another byte: to check for greater than or equal.

Recording available - Highly Available & Scalable Solutions with RavenDB

time to read 1 min | 71 words

In this session, Oren explored how RavenDB deals with scaling under load and remains highly available even under failure conditions. He showed how RavenDB's data-driven sharding allows you to increase the amount of data in your cluster without giving up the benefits of data locality, and how to work with complex distributed map-reduce queries on a sharded cluster, giving you lightning-fast responses over very large data volumes.

Production postmortem: The case of the Unicode Poo

time to read 4 min | 635 words

We got an error report from a customer about a migration issue from 2.5 to 3.0. A particular document appeared to have been corrupted, and caused issues.

We have an explicit endpoint to expose the raw database bytes to the client, so we can troubleshoot exactly those kinds of errors. For fun, this is a compressed database, so the error was hiding beneath two levels of indirection, but that is beside the point.

When looking at the raw document’s bytes, we saw:


Which was… suspicious. It took a while to track down, but the end result was that this error would occur when you have:

  • A large string (over 1KB), and it is the first large string in the document.
  • At position 1023 of the string (byte-wise), you have a multi-byte, multi-character value.

In those cases, we wouldn’t be able to read the document.

The underlying reason was an optimization we made in 3.0 to reduce buffer allocations during deserialization of documents. In order to handle that properly, we used an Encoding’s Decoder directly, without any intermediate buffers. This works great, except in this scenario, given the way JSON.Net calls us.

When JSON.Net finds a large string, it will repeatedly read characters from the stream until it reaches the end of the string, and only then will it process it. If the string size is more than the buffer size, it will increase the buffer.

Let us imagine the following string: “ABC💩” (ABC followed by U+1F4A9, the pile of poo emoji).


When we serialize it, it looks like this:

var bytes = new byte[] { 65, 66, 67, 0xF0, 0x9F, 0x92, 0xA9 };

And let us say that we want to read it into a buffer of 4 characters. We’ll use it like so:

int bytesPos = 0;
int charsPos = 0;
var chars = new char[4];
while (bytesPos < bytes.Length) // read whole buffer
{
    while (charsPos < chars.Length) // process chars in chunks
    {
        int bytesUsed;
        int charsUsed;
        bool completed;
        decoder.Convert(bytes, bytesPos, bytes.Length - bytesPos,
            chars, charsPos, chars.Length - charsPos, false,
            out bytesUsed, out charsUsed, out completed);
        bytesPos += bytesUsed;
        charsPos += charsUsed;
    }
    Console.WriteLine(new string(chars));
}

On the first call, Convert will convert the first three bytes into three characters, and stop. The JSON.Net code will then ask it to fill to the end of the buffer (simulated by the inner loop), but at that point the Convert method will throw, because it has just one character slot available in the buffer to write to, and it can’t write the next character there.

Why is that? Look at the poo string above. How many characters does it take?

If you answered four, you are correct visually, and wrong in the buffer sense. This string actually takes 5 characters to represent. As I mentioned, in order to hit this error, we had to have a particular set of things align just right (or wrong). Even a single space difference would shift things so that no multi-byte character would span the 1KB boundary.
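You can verify the five-character claim directly (a Python sketch; a .NET string is a sequence of UTF-16 code units, which is what the utf-16-le encoding models here):

```python
text = "ABC\U0001F4A9"  # "ABC" followed by U+1F4A9, the pile of poo emoji

# UTF-8 on the wire: three single-byte characters plus one four-byte sequence,
# matching the serialized bytes shown above.
assert list(text.encode("utf-8")) == [65, 66, 67, 0xF0, 0x9F, 0x92, 0xA9]

# In UTF-16 (what a .NET char[] holds), the emoji needs a surrogate pair,
# so the string occupies 5 char slots even though only 4 symbols are visible.
utf16_chars = len(text.encode("utf-16-le")) // 2
print(utf16_chars)  # 5
```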

The solution, by the way, was to drop the optimization. Sadly. We’ll probably revisit this at a later time, but for now we have a way to confirm that this scenario is also covered.

