Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.

Production postmortem: The insidious cost of managed memory

time to read 3 min | 542 words

A customer reported that under a memory constrained system, a certain operation was taking all the memory and swapping hard. On a machine with just a bit more memory, the operation completed very quickly. It didn’t take long to figure out what was going on: we were reading too much, we started swapping, and everything went to hell after that.

The problem is that we have code that is there specifically to prevent that: it checks that the size we load from the disk isn’t too big, and that we aren’t doing something foolish. But something broke here.

Here is a sample document, it is simple JSON (without indentation), and it isn’t terribly large:


The problem happens when we convert it to a .NET object:


Yep, when we deserialize it, it takes close to 13 times more space than the text format.

For fun, let us take the following JSON:


This generates a string whose size is less than 1KB.

But when parsing it:


The reason, by the way? It is the structure of the document:


So for each two bytes spent on object creation in JSON (the {} ), we are allocating 116 bytes. No wonder this blows up so quickly.

This behavior is utterly dependent on the structure of the document, by the way, and is very hard to protect against, because you don’t really have a way of seeing how much you allocated.
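To see the same effect outside of .NET, here is a small Python sketch (my own, not from the original post) comparing the wire size of a structure-heavy document against a rough measurement of its parsed, in-memory form. `deep_size` is an approximation and the exact numbers vary by runtime, but the blow-up is the point:

```python
import json
import sys

def deep_size(obj, seen=None):
    """Rough recursive size, in bytes, of a parsed JSON object graph."""
    seen = seen if seen is not None else set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_size(k, seen) + deep_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, list):
        size += sum(deep_size(v, seen) for v in obj)
    return size

# A document that is mostly structure: 50 nested objects, almost no data.
doc = '{"a":' * 50 + 'null' + '}' * 50
parsed = json.loads(doc)
print(len(doc), deep_size(parsed))  # the parsed form is many times larger
```

Every `{}` pair costs a couple of bytes on the wire, but a full object header and table in memory, which is exactly the attack surface described above.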

We resolved it by watching not only the size of the documents that we are reading, but also the amount of free memory available on the machine (aborting if it gets too low), but that is a really awkward way of doing it.  I’m pretty sure that this is also something that you can use to attack a server, forcing it to allocate a lot of memory by sending very little data to it.

I opened an issue on the CoreCLR about this, and we’ll see if there is something that can be done.

In RavenDB 4.0, we resolved that entirely by using the blittable format, and we have a one-to-one mapping between the size of the document on disk and the allocated size (actually, since we map the file, there is not even an allocation for the data, we just access it directly).

Database Building 101: High level graph operations

time to read 2 min | 247 words

I talked about high level and low level data operations. So far, all we have seen are very low level operations (get node, get edges for, etc).

Let us see how we’ll deal with a bigger challenge. In this case, we want to implement a classic graph operation, doing a depth first search, filtering by both nodes and edges.

Here is how we can implement this:
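The actual snippet was an image that did not survive this copy; as a stand-in, here is a hedged Python sketch of the idea — a depth first search where both a node predicate and an edge predicate control the traversal (the `graph` shape and all names are mine, not the post's):

```python
def depth_first_search(graph, start, node_filter, edge_filter):
    """Yield nodes reachable from start, filtering by both nodes and edges."""
    visited = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in visited or not node_filter(node):
            continue
        visited.add(node)
        yield node
        # graph maps a node id to its outgoing (edge_type, neighbor) pairs
        for edge_type, neighbor in graph.get(node, []):
            if edge_filter(edge_type) and neighbor not in visited:
                stack.append(neighbor)

graph = {
    1: [("KNOWS", 2), ("WORKED_AT", 3)],
    2: [("KNOWS", 3)],
    3: [],
}
# Follow only KNOWS edges:
print(list(depth_first_search(graph, 1, lambda n: True,
                              lambda e: e == "KNOWS")))  # → [1, 2, 3]
```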

In the real world, we’ll need quite a bit more. On each node (and edge) we’ll need to decide whether to return it from the query, or just traverse through it, etc. And that is just to start with.

But I think this demonstrates the point of how to layer behavior on top of the lower level database operations.

There is one thing that we still need to talk about: this code will actually use a lot of individual transactions, one for each independent operation. That is quite expensive. We can open a single transaction and pass it to the functions we call, so there is just a single cost for the entire duration of the operation.
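A sketch of what passing a single transaction around might look like — the `Storage` and `Transaction` classes here are illustrative stand-ins, not Voron's API:

```python
from contextlib import contextmanager

class Transaction:
    def __init__(self, log):
        self.log = log
    def commit(self):
        self.log.append("commit")
    def rollback(self):
        self.log.append("rollback")

class Storage:
    """Toy stand-in for a storage engine with explicit transactions."""
    def __init__(self):
        self.log = []

    @contextmanager
    def transaction(self):
        # One begin/commit for the whole traversal, instead of one
        # transaction per get_node / get_edges_for call.
        self.log.append("begin")
        tx = Transaction(self.log)
        try:
            yield tx
            tx.commit()
        except Exception:
            tx.rollback()
            raise

storage = Storage()
with storage.transaction() as tx:
    pass  # every low level operation would receive tx as a parameter
print(storage.log)  # → ['begin', 'commit']
```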

Another thing we can do is to explicitly designate specific scenarios as important and change the design so we can answer them very quickly (as in the O(1) cost for accessing node / edge data).

Database Building 101: Graphs aren’t toys

time to read 2 min | 396 words


I keep calling this a toy database, and it is true for more reasons than the code existing mostly as unconnected snippets. When defining the storage layer, I breezed through quite a lot of stuff because it didn’t really matter for the point I was trying to make.

We’ll start by talking about node IDs. When we save a node to the database, we get an int64 ID back. What is that? We know that it gives us O(1) access time to the node (or the edge data), but that’s about it. Typically, we don’t expose the internal table ID in this manner, because the ID we get back from the Insert corresponds to the location on the disk where that data exists. So the node ID is arbitrary and doesn’t correlate to anything. It isn’t a sequential number, or something that the user defines, or anything like that.

That can lead to issues. In particular, if you want to look up a node by some property, you need to have a way to do so that will retrieve its ID, and only then you can do graph operations from there. The common solution is to use a Lucene index for searching by the properties and finding the root node(s) and proceed from there.

But what about deletes? Deleting a node is possible, and when you do that, the space that was reserved for that node will be freed, and can be reused, so you’ll have a different node with the same ID. This leads to some awkwardness (you can see that with the Lucene document IDs, which have the same problem, although for very different reasons).

Updates also pose a problem, because if you need to extend the size of the node, it might be relocated, which changes the ID. Deleting is a challenge, because you need to remove the deleted node ID from all the edges that reference it, instead of cleaning it up on the fly.

This leads us to either abstract the physical ID with a virtual one (and pay the O(logN) cost for resolving it) or find a way to deal with the above inside your API.
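A sketch of the first option — a virtual ID indirection layer. The names are mine, and a real engine would back the mapping with a B+Tree, which is where the O(logN) resolution cost comes from; a dict stands in for it here:

```python
import itertools

class NodeIdMap:
    """Stable virtual node IDs over relocatable physical IDs."""
    def __init__(self):
        self._ids = itertools.count(1)
        self._physical = {}   # stand-in for a B+Tree: virtual -> physical

    def register(self, physical_id):
        virtual_id = next(self._ids)
        self._physical[virtual_id] = physical_id
        return virtual_id

    def relocate(self, virtual_id, new_physical_id):
        # An update grew the node and moved it on disk; callers
        # holding the virtual ID are unaffected.
        self._physical[virtual_id] = new_physical_id

    def resolve(self, virtual_id):
        return self._physical[virtual_id]

ids = NodeIdMap()
node = ids.register(0x2000)   # physical location on disk
ids.relocate(node, 0x9000)    # node was updated and moved
print(ids.resolve(node))      # → 36864
```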

Database Building 101: The cost of graphing

time to read 2 min | 368 words


So far we looked into how we can store the nodes and the edges, and we explored some interesting data structures inside Voron. Now, let’s see how we can traverse the graph.

So getting the node is pretty easy, and remember that accessing the table by ID gives us an O(1) cost.

Now, what about finding all the edges from a node?

This is a lot heftier, but let’s try to break it into individual pieces. First, we find the tree holding all the edges of that particular type, then we access (by the ID of the source node) the fixed size tree that holds all the connected nodes and the edge data ID.

Finally, we access the edges data (if it exists) to get the data about the edge itself, after which, we return it all to the caller.

Unlike the previous method, here we can’t claim to have O(1) costs. Instead, our costs are composed of:

  1. O(logN) – where N is the number of edges types (typically, very few), to search for the right edges tree.
  2. O(logN) – where N is the number of source nodes (typically, very large, but the fixed size trees are very good at being fast, and they don’t have to do much).
  3. O(N) – where N is the number of connections between the source node and the destination node.

I’m excluding the cost of loading the edge data, since this is an O(1) operation and is directly relevant to the iteration over all nodes connected to the source node.

Ideally, we can find a way to turn the 2nd operation into an O(1) cost, but that should be more than good enough for the moment.

Now, this just gives us traversal of the nodes, but going from here to Breadth First Search / Depth First Search and doing them well is fairly trivial, although there are a few things that we'll need to talk about; that will serve as a separate post.

Database Building 101: The storage layer

time to read 3 min | 559 words

We are going to be using Voron to actually handle the storage of the data. Voron is a low level storage engine, which provides, among other things, full ACID semantics, MVCC, and high performance.

For today, we'll look at the following operations:

  • Add a node.
  • Add an edge between two nodes.
  • Traverse from a node to all its edges (cheaply)
  • Traverse from an edge to the other node (cheaply).

Here is the interface code that we will need to build:

I tried to make it as simple as it possibly could be. I’m using NameValueCollection because it is a trivial property bag to serialize / deserialize without bringing in any additional dependencies.

Let us focus on the initialization of this class. We need to create a StorageEnvironment in Voron, and setup some structures.

Voron has several different storage options for data, and in this case, we are using a table. A table (which is very much unlike an RDBMS table) is a way to store records that have some indexes, but in this case, we are doing something strange: we are defining a table that has no indexes (this is so strange that we need to explicitly allow it). This is useful because tables manage indexes for us automatically, but in this case we will use them strictly for storage. We’ll see why in just a bit.


The code here is a bit strange. Tables are actually meant to hold multiple values, and define indexes on this. So using them to store a single value is something that you wouldn’t normally do. Except that the id that we get back from the table has a very interesting property. It has O(1) cost of access. So given a node id, I can access it directly, regardless of how big my database is. That is a property that I want, because effectively random access of nodes is something that happens quite a lot and is highly desirable.

Now, let us see how we can connect two nodes with an edge, shall we? This code ignores all error handling, missing nodes, etc. It is meant to explain the core stuff; this is a toy database, after all.

Here we continue doing strange stuff. We already know that we use the empty schema table to have O(1) access to data, and we store the edge’s data there. But then we run into some new stuff. We create a B+Tree called “Edges_”+type, which holds all of the edges of a particular type. But the content of this B+Tree is not simple. Instead, it uses fixed size trees. Those are also B+Trees, but they have a well known size both for the keys (which must be longs) and the values (which must be small, < 256 bytes). Because they are very compact, we can pack quite a lot of data into them, and work with them efficiently.

The end result is that we are now storing the node data and accessing it at O(1) cost. We also store a B+Tree full of fixed size trees (each named after its source node id), whose keys are the destination nodes, and whose values are the edge data.

Confusing yet? Not much different than Dictionary<SourceNodeId, Dictionary<DestinationNodeId, EdgeDataId>>. That is quite enough for today, tomorrow I’ll show how we can traverse the data, and what kind of costs are associated with it.
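To make the analogy concrete, here is a Python sketch of that layout using plain dictionaries in place of the B+Tree and the fixed size trees (illustrative only — the real structures live on disk, and the names are mine):

```python
class EdgeStore:
    """edges["Edges_" + type][source_node_id][destination_node_id] = edge_data_id,
    mirroring the B+Tree-of-fixed-size-trees layout with plain dicts."""
    def __init__(self):
        self._edges = {}

    def connect(self, edge_type, src, dst, edge_data_id=None):
        by_source = self._edges.setdefault("Edges_" + edge_type, {})
        by_source.setdefault(src, {})[dst] = edge_data_id

    def edges_from(self, edge_type, src):
        # type tree lookup, then fixed size tree lookup by source node id
        return self._edges.get("Edges_" + edge_type, {}).get(src, {})

store = EdgeStore()
store.connect("KNOWS", 1, 2, edge_data_id=17)
store.connect("KNOWS", 1, 3)
print(store.edges_from("KNOWS", 1))  # → {2: 17, 3: None}
```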

Database Building 101: Let us graph this for real

time to read 3 min | 487 words

In the Guts n’ Glory of Database Internals posts series (which I’ll probably continue if people suggest new topics), I talked about the very low level things that are involved in actually building a database. From how to ensure consistency to the network protocols. But those are very low level concerns. Important ones, but very low level. In this series, I want to start going up a bit in the stack and actually implement a toy database on top of real production system, to show you what the database engine actually does.

In practice, we divide the layers of a database engine this way:

  1. Low level storage (how we save the bits to disk), journaling, ACID.
  2. High level storage (what kind of storage options do we have, B+Tree, nested trees, etc).
  3. Low level data operations (working on a single item at time).
  4. High level data operations (large scale operations, typically).
  5. Additional features (subscribing to changes, for example).

In order to do something interesting, we are going to be writing a toy graph database. I’m going to focus on levels 3 & 4 here, the kind of data operations that we need to provide the database we want, and we are going to build over a pre-existing storage solution that handles 1 & 2.

Selecting the storage engine – sometimes it makes sense to go elsewhere for the storage engine. Typical examples include using LMDB or LevelDB as embedded databases that handle the storage, while you build the data operations on top of that. This works, but it is limiting. You can’t do certain things, and sometimes you really want to. For example, LMDB supports the notion of multiple trees (and even recursive trees), while LevelDB has a flat key space. That has a big impact on how you design and build the database engine.

At any rate, I don’t think it will surprise anyone that I’m using Voron as the storage engine. It was developed to be a very flexible storage engine, and it works very well for the purpose.

We’ll get to the actual code in tomorrow’s post, but let’s lay out what we want to end up with.

  • The ability to store nodes (for simplicity, a node is just an arbitrary property bag).
  • The ability to connect nodes using edges.
    • Edges belong to types, so KNOWS and WORKED_AT are two different connection types.
    • An edge can be bare (no properties) or have data (again, for simplicity, just an arbitrary property bag).

The purpose of the toy database we build is to allow the following low level operations:

  • Add a node.
  • Add an edge between two nodes.
  • Traverse from a node to all its edges (cheaply)
  • Traverse from an edge to the other node (cheaply).

That is it, should be simple enough, right?
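As a rough preview of the shape of that API, here is a Python sketch of those operations over in-memory structures (the class and method names are mine, not the actual interface from the next post, and the node data is just a property bag):

```python
class ToyGraphDatabase:
    """The low level operations listed above, over in-memory structures."""
    def __init__(self):
        self._nodes = {}   # node_id -> property bag
        self._edges = {}   # (edge_type, source_id) -> {destination_id: property bag}
        self._next_id = 0

    def create_node(self, properties):
        self._next_id += 1
        self._nodes[self._next_id] = dict(properties)
        return self._next_id

    def connect_nodes(self, source, destination, edge_type, properties=None):
        self._edges.setdefault((edge_type, source), {})[destination] = properties

    def edges_for(self, node_id, edge_type):
        # cheap: a single dictionary lookup, destination nodes included
        return self._edges.get((edge_type, node_id), {})

db = ToyGraphDatabase()
arava = db.create_node({"name": "Arava"})
oscar = db.create_node({"name": "Oscar"})
db.connect_nodes(arava, oscar, "KNOWS")
print(db.edges_for(arava, "KNOWS"))  # → {2: None}
```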

Digging into the CoreCLR: JIT Introduction

time to read 3 min | 401 words

Note, this post was written by Federico.

The .NET Just-In-Time compiler (or JIT for short) is one of those marvels you don't even realize is there, unless you have to smooth talk it to do your bidding. At the Voron level, most routines are so tight that the assembler emitted becomes important. It is not uncommon to have to reason about how the instructions are going to be decoded by the CPU, whether the microarchitecture is able to execute two instructions independently, or whether we are memory-bound (having too many cache misses).

Those with a C/C++ background know the optimizing compiler is actually quite good at its job; and even so, tricks are required to squeeze the last performance bits. The JIT is no different, with the addition that while the C++ compiler has all the time in the world (sort-of); the restrictions imposed by compiling while your code is executing are far more severe. However, even on a budget, the CoreCLR does a pretty amazing job at optimizing. Until, of course, you expect the performance feats of the offline C++ compiler from it.

So the question you may probably have is: “Why bother? Why don't you just go and write the database in C++?” I won't go into detail on that, since we covered it already. What I can tell you is that when you dig into the JIT world, it is pretty easy to achieve performance on par with C++, as long as the JIT by design doesn't get in your way with unsupported features. The good thing is the CoreCLR team is quite open to tackling those issues if presented with a detailed analysis of the cost-effectiveness of your optimization.

In this series we will talk about the tricks we had to teach our code to squeeze those last bits of micro-optimizations. From specific optimizations requested to the development team and workarounds until they are available, to tricks we use to ensure the assembler code generated is of the highest quality possible.

One word of caution: The JIT is ever improving so it is important to understand that the optimizations opportunities presented here may not apply tomorrow or just be done automatically in future versions of CoreCLR.

Non reproducible / intermittent error handling

time to read 2 min | 370 words

We recently had to deal with a stress test that was failing (very) occasionally. We looked into that, and we figured out that this was actually exposing a high severity bug. So we looked into it pretty seriously.

And we kept hitting a wall.

There are a few reasons why you would have a non-reproducible error. The most common issue is if you have a race. For example, if two threads are modifying the same variable without proper synchronization, this will cause those kinds of symptoms. We have a lot of experience in dealing with that, and all the signs were there. But we still hit a wall.

You see, the problem was that the entire section of the code we were looking at was protected by a lock. There was no possibility of an error because of threading, since this whole part of the code just couldn’t run concurrently anyway.

So why was it failing only occasionally? If it is single threaded, it should be predictable. In fact, the reason there was a lock there, instead of the more complex merge operations we used to have, was specifically to support reproducibility. The other kind of issue that can create this sort of symptom is I/O (which has its own timing issues), but the error happened in a part of the code that was operating on internal data structures, so it wasn’t that.

Eventually we figured it out. This was a long background operation, and because we needed to have the lock in place, we had a piece of code similar to this:
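The original snippet was an image that did not survive this copy; the pattern it describes — doing the work in batches and releasing the lock between them — looks roughly like this Python sketch (all names are illustrative):

```python
import threading

lock = threading.Lock()
processed = []

def process(item):
    processed.append(item)

def long_background_operation(items, batch_size=3):
    """Do the work in stages, releasing the lock between batches
    so other operations can run in the meantime."""
    for start in range(0, len(items), batch_size):
        with lock:
            for item in items[start:start + batch_size]:
                process(item)
        # lock released here: anything that sneaks in now changes the
        # order of operations, which is where the nondeterminism crept in

long_background_operation(list(range(10)))
print(processed)  # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```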

Because this is a long running operation under a lock, we do it in stages, and make sure that other things can run while we do so. But this was exactly what introduced the variability in the test results, and that made it so random and annoying. Once we figured out that this was the cause for the difference, all we had to do was write a proper log of operations, and execute it again.

The rest was just finding out which of the 200,000 transactions executed actually caused the problem, mere detail work.

Tri state waiting with async tcp streams

time to read 4 min | 714 words

We recently had the need to develop a feature that requires a client to hold a connection to the server and listen to a certain event. Imagine that we are talking about a new document arriving to the database.

This led to a very simple design:

  • Open a TCP connection and let the server know about which IDs you care about.
  • Wait for any of those IDs to change.
  • Let the client know about it.

Effectively, it was:

Unfortunately, this simple design didn’t quite work. As it turns out, having a dedicated TCP connection per id is very expensive, so we would like to be able to use a single TCP connection in order to watch multiple documents. And we don’t know about all of them upfront, so we need to find a way to talk to the server throughout the process. Another issue that we have is the problem of steady state. If none of the documents we care about actually change for a long time, there is nothing going on over the network. This is going to lead the TCP connection to fail with a timeout.

Actually, a TCP connection that passes no packets is perfectly fine in theory, but the problem is that it requires resources that need to be maintained. As long as the systems involved are not busy, it will probably be fine, but the moment it reaches the real world, the proxies / firewalls / network appliances along the way use a very brutal policy, “if I’m not seeing packets, I’m not maintaining the connection”, and it will be dropped, usually without even a RST packet. That makes debugging this sort of thing interesting.

So our new requirements are:

  • Continue to accept IDs to watch throughout the lifetime of the connection.
  • Let the client know of any changes.
  • Make sure that we send the occasional heartbeat to keep the TCP connection alive.

This is much more fun to write, to be honest, and the code we ended up with was pretty; here it is:

There are a couple of things to note here. We are starting an async read operation from the TCP stream without waiting for it to complete, and then we go into a loop and wait for one of three options.

  1. A document that we watched has been changed (we are notified about that by the documentChanged task completion), in which case we notify the user. Note that we first replace the documentChanged task and then drain the queue of all pending document changes for this collection, after which we go back to waiting. On the doc changed event, we first enqueue the document that was changed, and then complete the task. This ensures that we won’t miss anything.
  2. New data is available from the client. In this case we read it and add it to the IDs we are watching, while starting another async read operation (for which we’ll wait on the next loop iteration). I’m creating a new instance of the IDs collection here to avoid threading issues, and also because the number of items is likely to be very small and rarely change. If there were a lot of changes, I would probably go with a concurrent data structure, but I don’t think it is warranted at this time.
  3. Simple timeout.

Then, based on which task has been completed, we select the appropriate behavior (send message to client, accept new doc ID to watch, send heartbeat, etc).
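The shape of that three-way wait can be sketched in Python with asyncio (the original code is C# with Tasks; the names here are mine, not the post's):

```python
import asyncio

async def tri_state_wait(data_task, change_task, heartbeat_seconds):
    """Wait for whichever comes first: data from the client, a document
    change notification, or the heartbeat timeout."""
    done, _pending = await asyncio.wait(
        {data_task, change_task},
        timeout=heartbeat_seconds,
        return_when=asyncio.FIRST_COMPLETED,
    )
    if not done:
        return ("heartbeat", None)        # nothing happened: keep alive
    if change_task in done:
        return ("notify", change_task.result())
    return ("watch", data_task.result())  # client sent a new ID to watch

async def main():
    loop = asyncio.get_running_loop()
    client_data = loop.create_future()    # client sends nothing
    change = asyncio.ensure_future(asyncio.sleep(0, result="users/1"))
    print(await tri_state_wait(client_data, change, 1.0))  # → ('notify', 'users/1')
    client_data.cancel()

asyncio.run(main())
```

The real loop would then restart whichever task completed, exactly as the list above describes.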

The nice thing about this code is that errors are also handled quite nicely. If the client disconnects, we will get an error from the read, and know that it happened and exit gracefully (well, we might be getting that just when we are writing data to the client, but that is pretty much the same thing in terms of our error handling).

The really nice thing about this code is that for the common cases, where there isn’t actually anything for us to do except maintain the TCP connection, this code is almost never in runnable state, and we can support a very large number of clients with very few resources.

Code reviewThe bounded queue

time to read 1 min | 64 words

The following code has just been written (never run, never tested).

Its purpose is to serve as a high speed, no-locking transport between two threads, one of them producing information, the other consuming it, in a bounded, non-blocking manner.

Note that this is done because the default usage of BlockingCollection<T> here generated roughly 80% of the load, which is not ideal.
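The reviewed code itself is not included in this copy of the post; as a point of comparison, here is a Python sketch of one classic way to build such a transport — a single-producer/single-consumer ring buffer with non-blocking try operations (the general technique, not the code under review):

```python
class BoundedQueue:
    """Single-producer/single-consumer bounded queue with no locks.

    Correct only when exactly one thread enqueues and one dequeues:
    each index is written by a single thread and read by the other.
    """
    def __init__(self, capacity):
        self._buffer = [None] * (capacity + 1)  # one slot kept empty
        self._read = 0
        self._write = 0

    def try_enqueue(self, item):
        next_write = (self._write + 1) % len(self._buffer)
        if next_write == self._read:
            return False                 # full: non-blocking, just refuse
        self._buffer[self._write] = item
        self._write = next_write         # publish only after the slot is filled
        return True

    def try_dequeue(self):
        if self._read == self._write:
            return (False, None)         # empty
        item = self._buffer[self._read]
        self._read = (self._read + 1) % len(self._buffer)
        return (True, item)

q = BoundedQueue(2)
print(q.try_enqueue("a"), q.try_enqueue("b"), q.try_enqueue("c"))  # → True True False
print(q.try_dequeue())  # → (True, 'a')
```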

