Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:


+972 52-548-6969


Posts: 6,229 | Comments: 46,315


That is my memory you’re freeing, you foreign thread!

time to read 4 min | 615 words

RavenDB is a pretty big project, and it has been around for quite a while. That means that we have run into a lot of strange stuff over the years. In particular, support incidents are something that we track and try to learn from. Today’s post is about one such lesson. We want to be able to track, on a per thread basis, how much memory is in use. Note that when we say that, we talk about unmanaged memory.

The idea is, once we track it, we can manage it. Here is one such example:


Note that this has already paid for itself when it showed us very clearly (and without using special tools), exactly who is allocating too much memory.

Memory allocation / de-allocation is often a big performance problem, and we are trying very hard to not get painted into performance corners. So a lot of our actual memory usage is allocate once, then keep around in the thread for additional use. This turns out to be quite useful. It also means that for the most part, we really don’t have to worry about thread safety. Memory allocations happen in the context of a thread, and are released to the thread once an operation is done.

This gives us high memory locality and it avoids having to take locks to manage memory. Which is great, except that we also have quite a bit of async request processing code. And async request processing code will quite gladly jump threads for you.

So that led to a situation where you allocate memory in thread #17 at the beginning of the request and it waits for I/O, so when it finally completes, the request finishes processing in thread #29. In this case, we keep the memory we got for the next usage in the finishing thread. This is based on the observation that we typically see the following patterns:

  • Dedicated threads for tasks, which do no thread hopping. Each has a unique memory usage signature, and will eventually settle into the memory it needs to process everything properly.
  • Pools of similar threads that share roughly the same tasks with one another, and have thread hopping. Over time, things will average out and all threads will have roughly the same amount of memory.

That is great, but it does present us with a problem: how do we account for that? If thread #17 allocated some memory, and it is now sitting in thread #29’s bank, who is charged for that memory?

The answer is that we always charge the thread that initially allocated the memory, even if it currently doesn’t have that memory available. This is because it is frequently the initial allocation that we need to track, and usage over time just means that we are avoiding constant malloc/free calls.

It does present a problem: what happens if thread #29 is freeing memory that belongs to thread #17? Well, we can just decrement the allocated value, but that would force us to always do thread safe operations, which are more expensive.

Instead, we do this:


If the freeing thread is the same as the allocation thread, just use simple subtraction, crazy cheap. But if it was allocated from another thread, do the thread safe thing. Then we smash both values together to create the final, complete, picture. 
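The original listing is not reproduced here, so the following is only a minimal sketch of the idea as described, not RavenDB's actual code. The class name `ThreadAllocations` and its fields are hypothetical, and a lock stands in for whatever thread safe primitive the real implementation uses.

```python
import threading


class ThreadAllocations:
    """Accounting for the unmanaged memory charged to one thread (hypothetical sketch)."""

    def __init__(self, owner_id):
        self.owner_id = owner_id
        self.allocated = 0            # only ever modified by the owning thread
        self.released_by_others = 0   # modified by foreign threads, under the lock
        self._lock = threading.Lock()

    def allocate(self, size):
        # Assumed to be called only on the owning thread, so no locking is needed.
        self.allocated += size

    def free(self, size):
        if threading.get_ident() == self.owner_id:
            # Same thread that allocated: plain subtraction, crazy cheap.
            self.allocated -= size
        else:
            # A foreign thread is freeing our memory: do the thread safe thing.
            with self._lock:
                self.released_by_others += size

    @property
    def total(self):
        # Smash both values together to get the complete picture.
        with self._lock:
            return self.allocated - self.released_by_others
```

The owning thread never pays for synchronization on its own allocate/free path; only the (rarer) cross-thread free does.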

RavenDB Retrospective: Unbounded result sets

time to read 4 min | 750 words

We spent some time recently looking into a lot of our old design decisions. Some of them make very little sense today (json vs. blittable as a good example), but made perfect sense at the time, and were essential to actually getting the product out.

Some of those design decisions, however, are still something that I very firmly believe in.  This series of posts is going to explore those decisions, their background and how they played out in the real world. So, without further ado, let us talk about unbounded result sets.

The design of RavenDB was heavily influenced by my experience as That NHibernate Guy (I got started with NHibernate over a decade ago, if you can believe that), where I saw the same types of errors, repeated over and over again. I then read Release It!, and I suddenly discovered that I wasn’t alone fighting those kinds of demons. When I designed RavenDB, I set out explicitly to prevent as many of those as I possibly could.

One of the major issues that I wanted to address was Unbounded Result Sets. Simply put, this is when you have:

SELECT * FROM OrderLines WHERE OrderID = 1555

And you don’t realize that this order has three million line items (or, which is worse, that most of your orders have a few thousand line items, so you are generating a lot of load on the database, only to throw most of them away).

In order to prevent this type of issue, RavenDB has the notion of mandatory page sizes.

  • On the client side, if you don’t specify a limit, we’ll implicitly add one (by default set to 128).
  • On the server side, there is a database wide maximum page size (by default set to 1024). The server will trim all page sizes to the max if they are larger.
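The two rules above can be sketched as a pair of small functions. This is an illustration only: the function and constant names are hypothetical and are not part of the RavenDB client or server API; only the defaults (128 and 1024) come from the text.

```python
DEFAULT_CLIENT_PAGE_SIZE = 128    # implicit limit added by the client
DEFAULT_SERVER_MAX_PAGE_SIZE = 1024  # database wide maximum on the server


def client_page_size(requested=None, default=DEFAULT_CLIENT_PAGE_SIZE):
    """Client side: add an implicit limit when the caller did not specify one."""
    return default if requested is None else requested


def server_page_size(requested, maximum=DEFAULT_SERVER_MAX_PAGE_SIZE):
    """Server side: trim any page size that exceeds the configured maximum."""
    return min(requested, maximum)


if __name__ == "__main__":
    for asked in (None, 50, 5000):
        print(asked, "->", server_page_size(client_page_size(asked)))
```

Both values are parameters here because, as noted below, both limits are configurable.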

I think that this is one of the more controversial decisions in RavenDB design, and one that got a lot of heated discussion. But I still think that this is a good idea, because I have seen what happens when you don’t do that. And the arguments are mostly about “RavenDB should trust developers to know what they are doing” and a particularly irate guy called me while I was out shopping to complain how I broke the sacred contract of Linq with regards to “queries should return all by default, even if this is ten billion results”. I pointed out that this is actually configurable, and if he wanted to set the default to any size he wanted, he could do that, but apparently it is supposed to be a “shoot my own foot first, then think” kind of deal.

Even though I still think that this is a really good idea, we have added some features over the years to make it easy for people to access the entire dataset when they need it. Streaming has been around since 2.5 or so, giving you a dedicated API to stream unbounded results. Streams were built to make it efficient to process large sets of data, and they allow both client & server to process the data in parallel, instead of batching huge responses on the server, then consuming ridiculous amounts of memory on the client before giving you the full result set. Instead, you can get each result as soon as it arrives from the server, and you can process it and send it further.

In 4.0, we are going to change the behavior of the paging limits so:

  • If you don’t specify a limit, we’ll supply a limit clause of 25 items. If there are more than 25 items, we’ll throw an exception (unless you asked otherwise in the conventions).
  • If you supply a limit explicitly, it will work as expected and page through the data.
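A rough sketch of the planned 4.0 behavior, under stated assumptions: `run_query`, `TooManyResultsError` and the `throw_if_exceeded` flag are hypothetical names standing in for the real client conventions, and a plain list stands in for the query results.

```python
class TooManyResultsError(Exception):
    """Raised when an unbounded query matches more than the implicit limit."""


def run_query(results, limit=None, implicit_limit=25, throw_if_exceeded=True):
    """Apply the paging rules to an in-memory list that stands in for a query."""
    if limit is not None:
        # Explicit limit: works as expected and pages through the data.
        return results[:limit]
    if len(results) > implicit_limit and throw_if_exceeded:
        # No limit given and more matches than the implicit limit: fail loudly.
        raise TooManyResultsError(
            f"query matched {len(results)} items but no limit was specified "
            f"(implicit limit is {implicit_limit})"
        )
    return results[:implicit_limit]
```

Failing loudly on the first unbounded query is what surfaces the problem in development rather than in production.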

The idea is that we want to reduce the surprise for users, and give them that experience to draw upon early on. Another thing that we’ll do is make sure that the operations guys can also change that, likely with an environment variable or something like that. If you need to modify the conventions on the fly, you usually have a hard time deploying a new version, and an immediate action is needed.

In this manner, we can help users avoid expensive requests to the server, and they can be explicit with what they need to do.

Debug & Operations as a feature: Tracking allocations costs

time to read 2 min | 310 words

One of the things that we have learned from supporting RavenDB in production is that, by default, everything is a black box into which you have exactly zero input. And in order to figure out what the problems are, you need to use expert tools (WinDBG or VMMap for example) that are typically more focused on developers, and not usually available in production.

In RavenDB 4.0, we have started from the get go with the notion that everything we do must be exposed, tracked and monitored. Here are the results of the latest effort in that direction.




There are several important things here. First, you can see that we are tracking the managed and unmanaged allocations that are happening in the system. More than that, we are now able to track down exactly which part of the system is responsible for that.

In the screenshots above, you can see that the UsageIpAndQuantity index has allocated about 65 MB of unmanaged memory, and that we have a few memory mapped files storing the data for index #3.

The idea is that we can now glance at this endpoint and tell very quickly what is going on. And this is something that can be done in production. In fact, that is something that we’ll expose in the studio so you can see those values change over time.

We are also waiting for the CoreCLR to expose the managed allocations on a per thread basis, which will give us even better metrics.

Meeting the Joel Test 2.0

time to read 7 min | 1216 words

I ran into this post, which updates the (by now) venerable Joel Test to our modern age. I remember reading the Joel Test (as well as pretty much anything by Joel) at the beginning of my career and I’m pretty sure that it influenced the way I choose employers and designed software. Seeing this post, I decided to see how Hibernating Rhinos would rank on this test today. I put both the original and updated version, and my comments are below.

Original / Updated

  1. Do you use source control?
  2. Can you make a build in one step? / Can you build and deploy your software in one step?
  3. Do you make daily builds? / Do you build on every commit?
  4. Do you have a bug database?
  5. Do you fix bugs before writing new code?
  6. Do you have an up-to-date schedule? / Do you measure your progress in terms of value delivered?
  7. Do you have a spec? / Do you have a runnable spec?
  8. Do programmers have quiet working conditions? / Does your environment foster collaboration?
  9. Do you use the best tools money can buy?
  10. Do you have testers? / Is testing everyone's responsibility?
  11. Do new candidates write code during their interview?
  12. Do you do hallway usability testing?


  • Source control – Yes, thankfully, I think that the days of anyone not using source control for anything but a scratch project are behind us. Now the arguments are which source control.
  • Build & deploy in one step – Yes, the build process runs on a TeamCity server, and while it sometimes requires some TLC (I’m looking at you, NuGet), it pretty much runs without us having to pay much attention to it.
  • Build & verify on every commit – No. But yes. What we do is have the build server run the full suite on every Pull Request, which is our unit of integration. Commits are far less important, because we break them apart to make them easier to review.
  • Bug database – Yes, but see also the next topic. We mostly use it for bugs we find, and features / improvements, not so much for customer bugs.
  • Do you fix bugs before writing new code – No. But yes. The reason this is complex to answer is how you define bugs. A customer issue is typically handled from A to Z on the spot. We have a rotating support engineer role that handles such scenarios, and they prioritize that over their routine work.
  • Do you have a schedule / do you measure progress in terms of value – We have a rough schedule, with guidelines about what is a hard deadline and what is a nice deadline. A hard deadline is about meeting outside commitments, typically. Nice deadlines are about things we would like to do, but we won’t kill ourselves doing them. We do have a sense of what is important and what isn’t. By that I mean that we have criteria for “we should be chasing after this” and “this is for when we run out of things to do”.
  • Do you have a (runnable) spec? – Yes, we have a spec. It isn’t runnable, and I’m not sure what a runnable spec for a database would be. The spec outlines things like the data format and how we do data fetches for indexes, architectural considerations and rough guidelines into where we are going. It isn’t detailed to the point of being runnable, and I don’t like the idea very much.
  • Developers have quiet working conditions / environment encourages collaboration – The typical setup we have is a separate office for every two developers. I typically see people move around the offices and collaborate on all sorts of stuff. If it bugs the other dev in the room, they usually have headphones to deal with it, but that isn’t happening enough to be a major problem. A common issue for people who leave their workstation unattended and use headphones is that by the time they get back and put the headphones on, the music has been changed to something suitably amusing, such as this one.
  • Best tools that money can buy – Procurement in Hibernating Rhinos is a process, it involves sending an email with “I need this tool”, and you must include the link to the tool. Then you have to wait anywhere between 15 minutes and 24 hours (depending on when you asked), and you’ll get the license. I have seen far too many stupid decisions of “oh, we don’t have a budget for this $200 tool but we’ve no problem paying the $2000 that it would cost us in time” to suffer that.
  • Testers / testing is everyone’s responsibility – Yes. Every single dev is writing tests, and every single PR is sent after tests have been run locally, and then on the build server.
  • Candidates write code in interview – Yes, oh yes they do.
  • Hallway usability testing – See below, too complex to answer here.

RavenDB has multiple levels of “user interface”. The most obvious one is the RavenDB studio, but the one that we spend the most time on is the external (and internal) APIs. For the API, we have a review process in place to make sure that we are consistent and make sense. Most of the time we are doing things that follow the same design line as before, so there is not much to think about. For big things, we typically also solicit feedback from the community, to make sure that we aren’t looking at it through colored glasses.

For our actual user interface, the Studio, we used to just have the other devs look at the new functionality.  But that led to a lot of stuff that worked, but the amount of attention we actually paid to the UI used to be really variable. Some features we would iterate over for multiple weeks, getting them just right (the most common operations, as we see them). But other stuff was just “we need to expose this functionality, let us do this”, which led to almost one to one mapping of the server side concept to the UI, which isn’t always helpful for the users.

We have started with a full UX study of the RavenDB Studio, and we are going to be doing full design analysis on each of our views with an eye to improve it significantly by 4.0.

Implementing Omni Search

time to read 3 min | 573 words

We have run a UX study on the RavenDB studio, and we have learned some interesting things about our own software. I’ll probably blog about it more in the future. In this post, I want to focus on one of the issues that was raised in the UX study: the search function. Here is how it looks now:


It is a pretty simple feature. Given a prefix of a document id, show all the matches, and allow to go directly to the document.

In the UX study, the users utterly ignored the help text when the search box is empty and tried to put index names there to quickly find the relevant index.


This behavior makes… absolute sense. Of course they would assume that this is something that you can do.

So now we have new requirements for the search box:

  1. Allow to search for indexes or transformers or documents.
  2. Allow to search using contains, rather than starts with.
  3. Allow to search for functionality inside the studio.

This will allow the user to use the search box as the go to location to quickly do things in RavenDB.

The first item is pretty easy to explain, right? I can search for UsersIndex or Users/1. The second is a bit more problematic. In particular, given a database with several million documents, doing a contains query on the id is not practical. Oh, we can do a whole bunch of tricks around ngrams, preparing ahead of time, etc. But they aren’t worth it. This is a small feature, and we can’t have it costing us a lot during normal operations.

So we came up with the following design for how the search will work:

  • Searching for indexes and transformers will use contains, because there are usually very few of those and it is cheap to do.
  • Searching for documents will first:
    • Try to find using the exact prefix (“users/” will find “users/1”, “users/2”, etc)
    • Then get the collection names and prepend them to the search term (so “123” will find “users/123”, “companies/123”, etc)
    • Then get the collection names and prepend them to the search term and do a prefix query (so “123” will find “users/1234”, “companies/12345”)

The idea is that all of those are pretty cheap to do, and we can do them without running into high costs all over the place. And it will give you good way to jump around in your database and find the relevant stuff easily.
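The lookup order above can be sketched as follows. This is a simplified model, not the Studio's implementation: `omni_search` and `ids_with_prefix` are hypothetical names, and a sorted in-memory list stands in for the server's document id index (which is what makes prefix lookups cheap).

```python
from bisect import bisect_left


def ids_with_prefix(sorted_ids, prefix, limit=10):
    """Cheap prefix scan over a sorted id list (stand-in for a server side prefix query)."""
    matches = []
    for i in range(bisect_left(sorted_ids, prefix), len(sorted_ids)):
        if not sorted_ids[i].startswith(prefix) or len(matches) >= limit:
            break
        matches.append(sorted_ids[i])
    return matches


def omni_search(term, index_names, collections, sorted_ids, limit=10):
    # Indexes and transformers: few of them, so a contains search is cheap.
    lowered = term.lower()
    indexes = [name for name in index_names if lowered in name.lower()]

    # Documents, step 1: exact prefix of the search term.
    docs = ids_with_prefix(sorted_ids, term, limit)
    # Step 2: collection name + term as an exact id.
    for collection in collections:
        candidate = f"{collection}/{term}"
        pos = bisect_left(sorted_ids, candidate)
        if pos < len(sorted_ids) and sorted_ids[pos] == candidate:
            docs.append(candidate)
    # Step 3: collection name + term as a prefix.
    for collection in collections:
        docs.extend(ids_with_prefix(sorted_ids, f"{collection}/{term}", limit))

    # Remove duplicates while keeping the order in which matches were found.
    return {"indexes": indexes, "documents": list(dict.fromkeys(docs))}
```

Every document lookup here is either a point lookup or a prefix scan, so nothing requires walking all the ids the way a contains query would.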

Finally, we have the 3rd requirement. What is that about?

Well, one of the things that we have found is that RavenDB has enough features that navigating through them has become a problem. For example, if you want to do a db restore, you need to go to Manage Your Server, then Restore. And it is something that users need to hunt for. The idea is that they can just put “backup” in the search box, and the option that will pop up is the backup screen. So you can skip the hunting through screens and “who moved my cheese” moments.

Benchmarking with Go

time to read 3 min | 469 words

Every now and then, you really need to get out of your comfort zone. I decided that what I want to do is to play a bit with Go, which I haven’t done yet. Oh, I have read Go code, quite a lot of it, but it isn’t the same as writing and actually using it.

We are doing a lot of performance work recently, and while some of that is based entirely on micro benchmarks and focused on low level details such as the number of retired instructions at the CPU level, we also need to see the impact of larger changes. We have been using WRK to do that, but it is hard to get it running on Windows and we had to do some nasty things in Lua scripting to get what we wanted.

I decided that I’ll take the GoBench tool and transform it into a dedicated tool for benchmarking RavenDB.

Here is what we want to test:

  1. Read document by id
  2. Query documents by index
  3. Query map/reduce results
  4. Write new documents (single document)
  5. Write new documents (multiple documents in tx)
  6. Update existing documents

This is intended to be a RavenDB tool, so we won’t be trying to do anything generic, we’ll be writing specialized code.

In terms of the interface, gobench is using command line parameters to control itself, but I think that I’ll pass a configuration file instead. I started to write about the format of the configuration file, when I realized that I’m being absolutely stupid.

I don’t need to do that, I already have a good way to specify what I want to do in the code. It is called the code. The actual code to run the HTTP requests is here. But this is basically just getting a configuration object and using it to generate requests.

Of far more interest to me is the code that actually generates the requests themselves. Here is the piece that tests read requests:

We just spin off a number of goroutines, each of which does a portion of the work. This gives us concurrent clients and the ability to hammer the server. And the amount of code that we need to write for this is minimal.

To compare, here is the code for writing to the database:

And then we are left with just deciding on a particular benchmark configuration. For example, here is us running a simple load test for both reads and writes.


I think that this matches the low overhead for configuration, readability and high degree of flexibility quite well.

Database Building 101: Graph querying over large datasets

time to read 3 min | 522 words

I mentioned that maintaining physical ids is important for performance reasons in my previous post, but I skipped on exactly why. The short answer is that if I have physical ids, it is much easier to implement locality and much easier to implement parallel locality.

Let us imagine a database whose size is about 100GB, running on a machine that has 6 GB of RAM. You need to run some sort of computation that traverses the graph, but doing so naively will likely cause you to thrash quite a lot, as you page memory in and out of the disk, only to jump far away in the graph, paging even more, and effectively killing all your performance.

Instead, we can do something like this. Let us imagine that you have a machine with 4 cores on it, and the previously mentioned setup. You then start 4 threads (each marked with a different color in the image) and start processing nodes.


However, there is a trick here: each thread has a queue, and only ids that fall within the area of responsibility of the thread will arrive there. But we aren’t done. Inside a thread we define additional regions, and route requests to process each region into its own queue.

Finally, within each thread, we process one region at a time. So the idea is that while we are running over a region, we may produce work that will need to run on other regions (or even other threads), but we don’t care, we queue that work and continue emptying the work that exists on our own region. Only once we have completed all work in a particular region will we move to the next one. The whole task completes when, in all threads, there are no more regions with work to be done.

Note that the idea here is that each thread is working on one region at a time, and that region maps to a section of the database file that was memory mapped. So we keep that area of the page cache alive and well.

When we move between regions, we can hint to the memory manager that we are going to need the next region, etc. We can’t escape the need to process the same region multiple times, because processing in one region may lead us to processing in another, and then back, but assuming we run the regions using least recently accessed, we can take advantage of the stuff remaining in the page cache (hopefully) from the previous run and use that.

Which is why the physical location on disk is important.
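To make the region-at-a-time idea concrete, here is a small single-threaded sketch. It models only what happens inside one thread; routing work between threads and the memory manager hints are left out. The names `REGION_SIZE`, `region_of` and `traverse_by_region` are hypothetical, and the assumption that a region is a contiguous range of physical ids is mine.

```python
from collections import deque

REGION_SIZE = 1000  # hypothetical: number of physical ids covered by one region


def region_of(node_id):
    """Physical ids that are close together land in the same region."""
    return node_id // REGION_SIZE


def traverse_by_region(start, neighbors):
    """Visit everything reachable from start, draining one region's queue at a time.

    Returns the visit order and how many times we moved to a (new) region.
    """
    queues = {region_of(start): deque([start])}
    visited = {start}
    order = []
    region_switches = 0
    while queues:
        # Dicts keep insertion order, so a region that gets new work after being
        # drained goes to the back: roughly "least recently accessed first".
        region = next(iter(queues))
        region_switches += 1
        work = queues[region]
        while work:
            node = work.popleft()
            order.append(node)
            for nxt in neighbors(node):
                if nxt not in visited:
                    visited.add(nxt)
                    # Work for other regions is only queued, never run right away.
                    queues.setdefault(region_of(nxt), deque()).append(nxt)
        del queues[region]
    return order, region_switches
```

In the small test graph used to check this sketch, node 4 lives in region 0 but is discovered from region 1, so region 0 is visited twice. That is the "process the same region multiple times" cost mentioned above.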

Note that the actual query that we run is less important. Typical graph queries fall into one of two categories:

  • Some sort of Breadth First Search or Depth First Search and walking through the graph.
  • Finding a sub-graph in the larger graph that matches some criteria.

In both cases, we can process such queries using the aforementioned process, and the reduction in random work that the database has to do is big.

Database Building 101: Stable node ids

time to read 4 min | 647 words

A few posts ago, I talked about the problem of having unstable ids, in particular, ids that can be reused. That leads to quite a lot of complexity, as anyone who ever had to deal with Lucene document ids knows.

So we are willing to pay something toward stable ids, the question is: what?

One way of doing that is to just store the physical id (unstable) and a virtual id (stable) in a B+Tree (actually, a pair of them, since you’ll need to refer to them back and forth). That means that for the most part, internally to the engine, we’ll use the physical id (with its nice property of having O(1) access time), but externally we’ll expose the stable virtual id (probably sequential numbering, since that is easiest).

Note that I still want to use the physical ids, I’ll discuss exactly why that is important in my next post, for now, let us just say that it is an important component of ensuring high performance for large datasets.

The problem with using B+Tree is that the cost of finding the virtual <—> physical id mapping is O(logN), which for 10 million nodes and 100 million edges is about 23 and 27 respectively. Except that this isn’t the real cost function for B+Tree.

Assuming that we have 255 items per page, we actually would need to do 4 page lookups, and a total of 54 comparisons to find the right value. For the edges, we would need 5 page look ups and over 60 comparisons.  Note that this isn’t an issue on its own, but it is an issue when we are talking about having this kind of cost in the hot path of the application. And this is very likely going to be in the hot path.

Oh, there are ways around it, we can only translate back and forth at the edges of the database, so internally we’ll always use the physical address, and only translate it out when we are done. But that is hard to actually do properly, since you need the virtual address for a whole lot of stuff all over the place.

We can steal the idea of page translation tables from the processor. Something like this:


Effectively, we’ll lazily allocate segments of pages and pull them together into a hierarchy. So finding out the physical address of id 84 would involve looking at the root, finding the next page down with a mod operation, and so forth until we find the right value and check there. This has the advantage of being simple, O(1) and obvious. It is also pretty good in terms of space saving, since the virtual id can be “stored” without taking any space (it is the position of the physical id in the “array” we created).

This has one drawback: there is no way to recover space. Because the indexer into this data structure is meaningful, we can’t just compact things. Once space is allocated, that is it.  Now, to be fair, the cost in size here for all 100 million edges is about 0.75 GB, so not meaningful in the long run, but if we have a busy database that always writes and deletes, we have no way to recover the space.

The “proper” answer, by the way, is to implement an external hash table. That has the property of O(1), can grow and shrink as the amount of data changes. I’m not presenting it here mostly because it is something that we haven’t yet had the need to implement in Voron, so it isn’t something that I can just show and move on. Besides, it is fun to explore all the wrong ways of doing something.
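A minimal in-memory sketch of such a translation table, assuming a fixed fan-out per page and a fixed number of levels. The class name, the default fan-out of 1024 and the three levels are illustrative choices, not Voron's actual layout, and Python lists stand in for pages.

```python
class TranslationTable:
    """Maps stable virtual ids to physical ids through lazily allocated pages."""

    def __init__(self, fanout=1024, levels=3):
        self.fanout = fanout
        self.levels = levels
        self.root = self._new_page()

    def _new_page(self):
        return [None] * self.fanout

    def _path(self, virtual_id):
        """Slot to follow at each level, root first, computed with div/mod."""
        slots = []
        for _ in range(self.levels):
            virtual_id, slot = divmod(virtual_id, self.fanout)
            slots.append(slot)
        if virtual_id:
            raise OverflowError("virtual id does not fit in the table")
        return slots[::-1]

    def set(self, virtual_id, physical_id):
        *inner, leaf = self._path(virtual_id)
        page = self.root
        for slot in inner:
            if page[slot] is None:
                page[slot] = self._new_page()  # allocate the segment on first use
            page = page[slot]
        page[leaf] = physical_id

    def get(self, virtual_id):
        *inner, leaf = self._path(virtual_id)
        page = self.root
        for slot in inner:
            page = page[slot]
            if page is None:
                return None  # that segment was never allocated
        return page[leaf]
```

The lookup cost is the fixed number of levels, regardless of how many ids are stored, and the virtual id itself is never written anywhere: it is only the path to the slot.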

Production postmortem: The insidious cost of managed memory

time to read 3 min | 542 words

A customer reported that on a memory constrained system, a certain operation is taking all the memory and swapping hard. On a machine with just a bit more memory, the operation completed very quickly. It didn’t take long to figure out what was going on: we were reading too much, and we started swapping, and everything went to hell after that.

The problem is that we have code that is there specifically to prevent that, it is there to check that the size that we load from the disk isn’t too big, and that we aren’t doing something foolish. But something broke here.

Here is a sample document, it is simple JSON (without indentation), and it isn’t terribly large:


The problem happens when we convert it to a .NET object:


Yep, when we de-serialized it, it takes close to 13 times more space than the text format.

For fun, let us take the following JSON:


This generates a string whose size is less than 1KB.

But when parsing it:


The reason, by the way? It is the structure of the document:


So for each two bytes of object creation in JSON (the {}), we are allocating 116 bytes. No wonder this blows up so quickly.

This behavior is utterly dependent on the structure of the document, by the way, and is very hard to protect against, because you don’t really have a way of seeing how much you allocated.
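The same effect is easy to reproduce in other managed runtimes. The sketch below is an analogy in CPython, not a reproduction of the .NET numbers above: it builds a JSON array of empty objects and compares the text size with an approximate in-memory footprint measured via `sys.getsizeof`. The helper name `footprint` is made up for this example, and the exact ratio depends on the Python version.

```python
import json
import sys


def footprint(value):
    """Approximate memory used by a parsed JSON value and everything inside it."""
    size = sys.getsizeof(value)
    if isinstance(value, dict):
        size += sum(footprint(k) + footprint(v) for k, v in value.items())
    elif isinstance(value, list):
        size += sum(footprint(item) for item in value)
    return size


# 1,000 empty objects: two bytes each in the text, plus separators and brackets.
text = "[" + ",".join(["{}"] * 1000) + "]"
parsed = json.loads(text)

if __name__ == "__main__":
    in_memory = footprint(parsed)
    print(f"{len(text)} bytes of JSON -> about {in_memory} bytes in memory "
          f"({in_memory / len(text):.1f}x)")
```

A size check on the incoming text alone says very little about how much memory the parsed form will take, which is exactly why the guard described above failed.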

We resolved it by watching not only the size of the documents that we are reading, but also the amount of free memory available on the machine (aborting if it gets too low), but that is a really awkward way of doing that.  I’m pretty sure that this is also something that you can use to attack a server, forcing it to allocate a lot of memory by sending very little data to it.

I opened an issue on the CoreCLR about this, and we’ll see if there is something that can be done.

In RavenDB 4.0, we resolved that entirely by using the blittable format, and we have one-to-one mapping between the size of the document on disk and the allocated size (actually, since we map, there is not even allocation of the data, we just access it directly).

Database Building 101: High level graph operations

time to read 2 min | 247 words

I talked about high level and low level data operations. So far, all we have seen are very low level operations (get node, get edges for, etc).

Let us see how we’ll deal with a bigger challenge. In this case, we want to implement a classic graph operation, doing a depth first search, filtering by both nodes and edges.

Here is how we can implement this:
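The original listing did not survive on this page, so what follows is only a sketch of the general shape, not the code from the post. It assumes a low level `get_edges(node)` operation that returns `(edge_type, target)` pairs; that signature, like the function name, is hypothetical.

```python
def depth_first_search(start, get_edges, node_filter, edge_filter):
    """Iterative DFS that only walks through nodes and edges accepted by the filters.

    get_edges(node) is assumed to return an iterable of (edge_type, target) pairs.
    """
    visited = set()
    result = []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        if not node_filter(node):
            continue  # rejected nodes are neither returned nor traversed
        result.append(node)
        accepted = [edge for edge in get_edges(node) if edge_filter(edge)]
        # Push in reverse so that the first accepted edge is explored first.
        for _edge_type, target in reversed(accepted):
            stack.append(target)
    return result
```

All the graph knowledge lives in the two filters and in `get_edges`; the traversal itself is just layered on top of the low level operations.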

In the real world, we’ll need quite a bit more. On each node (and edge) we’ll need to decide whether to return it from the query, or just traverse through it, etc. And that is just to start with.

But I think this demonstrates the point of how to layer behavior on top of the lower level database operations.

There is one thing that we need to talk about still: this code will actually use a lot of individual transactions, one for each independent operation. That is quite expensive. Instead, we can open a single transaction and pass it to the functions we call, so there is just a single cost for the entire duration of the operation.

Another thing we can do is to explicitly designate specific scenarios as important and change the design so we can answer them very quickly (as in the O(1) cost for accessing nodes/edge data).


  1. Optimizing read transaction startup time: Don’t ignore the context - about one day from now
  2. Optimizing read transaction startup time: Unicode ate my perf and all I got was - 3 days from now
  3. Optimizing read transaction startup time: The performance triage - 4 days from now
  4. Optimizing read transaction startup time: Every little bit helps, a LOT - 5 days from now

There are posts all the way to Oct 27, 2016


  1. Optimizing read transaction startup time (6):
    21 Oct 2016 - Getting frisky
  2. RavenDB Retrospective (4):
    17 Oct 2016 - The governors
  3. Timing the time it takes to parse time (2):
    11 Oct 2016 - Part II
  4. Performance analysis (2):
    04 Oct 2016 - Simple indexes
  5. Interview question (3):
    29 Sep 2016 - Stackoverflow THAT