Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:


+972 52-548-6969

, @ Q c

Posts: 6,194 | Comments: 46,073

filter by tags archive

Production postmortemThe insidious cost of managed memory

time to read 3 min | 542 words

A customer reported that under memory constrained system, a certain operation is taking all the memory and swapping hard. On a machine with just a bit more memory, the operation completed very quickly. It didn’t take long to figure out what was going on, we were reading too much, and we started swapping, and everything went to hell after that.

The problem is that we have code that is there specifically to prevent that, it is there to check that the size that we load from the disk isn’t too big, and that we aren’t doing something foolish. But something broke here.

Here is a sample document, it is simple JSON (without indentation), and it isn’t terribly large:


The problem happens when we convert it to a .NET object:


Yep, when we de-serialized it, it takes close to 13 times more space than the text format.

For fun, let us take the following JSON:


This generates a string whose size is less than 1KB.

But when parsing it:


The reason, by the way? It is the structure of the document.

The reason, by the way:


So each two bytes for object creation in JSON ( the {} ) are holding, we are allocating 116 bytes. No wonder this blows up so quickly.

This behavior is utterly dependent on the structure of the document, by the way, and is very hard to protect against, because you don’t really have a way of seeing how much you allocated.

We resolved it by not only watching the size of the documents that we are reading, but the amount of free memory available on the machine (aborting if it gets too low), but that is a really awkward way of doing that.  I’m pretty sure that this is also something that you can use to attack a server, forcing it to allocate a lot of memory by sending very little data to it.

I opened an issue on the CoreCLR about this, and we’ll see if there is something that can be done.

In RavenDB 4.0, we resolved that entirely by using the blittable format, and we have one-to-one mapping between the size of the document on disk and the allocated size (actually, since we map, there is not even allocation of the data, we just access it directly Smile).

RavenDB 3.5 Release Candidate

time to read 2 min | 365 words

I’m very happy to announce that the Release Candidate of RavenDB is now available. We have been working hard for the past few months to incorporate feedback that we got from customers who started using the beta, and we are finally ready.

There have been quite a lot of fixes, improvements and enhancements, but in this post I want to focus on the most exciting three.

Waiting for write assurance

Take a look at the following code:

It asks RavenDB to save a new user, and then wait (on the server) for the newly saved changes to be replicated to at least another 2 servers, up to a maximum of 30 seconds.

This give you the option of choosing how safe you want to be on saves. For high value documents (a million dollar order, for example), we will accept the additional delay.

Waiting for indexes

In the same vein, we have this:

This feature is meant specific for PRG (Post, Redirect, Get) scenarios, where you add a new document and immediately query for it. In this case, you can ask the server to wait until the indexes are caught up with this particular write.

You can also set a timeout and whatever to throw or not. But more interestingly, you can specify specific indexes that you want to wait for. If you don’t specify anything, RavenDB will automatically select just the indexes that are impacted by this write.

Dynamic leader selection

In the RavenDB Conference, we showed how you can define a cluster of RavenDB servers and have them vote on a leader dynamically, then route all writes to the leader. In the presence of failure, we will select a new leader and stick to it as long as it is able.

Some of the feedback we got was related to how we select the leader. We now take into account the state of the databases on the server, and will not select a leader whose databases aren’t up to date with regards to replication. We also reduced the cost of replication in general, by significantly reducing the amount of metadata that we need to track.

On why RavenDB is written in C#

time to read 4 min | 664 words

Greg Young has an interesting reply to my post here. I’m going to reply to it here.

RavenDB nor EventStore should be written in C#.

That may be true for the EventStore, but it isn’t true for RavenDB. Being able to work with the .NET framework makes for a lot of things a lot simpler. Let us take a simple example, every day, I have to clear the unused auto indexes, but only if there has been queries to the collection in question. In C#, here is the code to give me the last query time for all collections in the database:


No, if we wanted to do this in C, as the go to example for system level programming, I would need to select which hash function to use. RavenDB being a non trivial project, I’ll probably have one, for simplicity sake, I selected this one, which gives us the following:


I’ve elided all error handling code, and I’ve introduced a silent potential “use after free” memory issue.

This is something that runs one a day, but the amount of code, effort and potential things to manage is quite a lot, compare to the simplicity we get in C#.

Now, this is somewhat of a straw man’s argument. Because the code would be much simpler in C++ or in Go (no idea about Rust, I’m afraid), but it demonstrate the disparity between the two options.

But Greg isn’t actually talking about code.

A different environment such as C/C++/Go/Rust would be far superior to haven been written in C#. Cross compiling C# is a pain in the ass to say the least whether we talk about supporting linux via mono or .netcore, both are very immature solutions compared to the alternatives. The amount of time saved by writing C# code is lost on the debugging of issues and having unusual viewpoints of them. The amount of time saved by writing in C# is completely lost from an operational perspective of having something that is not-so-standard. We have not yet talked about the amount of time spent in dealing with things like packaging for 5+ package managers and making things idiomatic for varying distributions.

I have one statement in response to this. Mono sucks, a lot.

I suspect that a lot of Greg’s pain has been involved directly with Mono. And while I have some frustrations with .NET Core, I don’t agree with lumping it together with Mono. And Greg’s very market was primarily Linux admins. While I would love to that be a major market for us, we see a lot more usage in Windows and Enterprise shops. The rules seems to be simpler there. I can deploy the .NET Core along with my application, no separate install necessary, so that makes it easier. What we have found customers want is docker instances so they can  just throw RavenDB up there and be happy.

A lot of the pain Greg has experienced isn’t relevant any longer.

What is more, he keep considering just the core product. There are quite a lot of additional things that needs to happen to take a project to a product. Some of this is external (documentation, for example) and some of this is internal (high level of management studio, good monitoring and debugging, etc). I have seen what “system level guys” typically consider sufficient into itself in those regards. Thank you, but no.

A lot of our time is actually invested in making sure the whole thing is better, spreading oil over the machinery, and the more lightweight and easy we can make it, the better. 

RavenDB Conference 2016–Slides

time to read 1 min | 145 words

The slides from the RavenDB Conference are now available, we’ll post videos in a few weeks, once we are done post processing them.

Day 1:

Day 2:

The design of RavenDB 4.0The client side

time to read 5 min | 926 words

We didn’t plan to do a lot of changes on the client side for RavenDB 4.0. We want to do most changes on the server side, and just adapt the client when we are doing something differently.

However, we run into an interesting problem. A while ago we asked Idan, a new guy at the company, to write a Python client for RavenDB as part of the onboarding process. Unlike with the JVM client, we didn’t want to do an exact duplication, with just the changes to match the new platform. We basically told him, go ahead and do this. And he went, and he did. But along the way he got to choose his own approach for the implementation, and he didn’t copy the same internal architecture. The end result is that the Python client is significantly simpler than the C# one.

That has a few reasons. To start with, the Python client doesn’t need to implement features such as async or Linq. But for the most part, it is because Idan was able to look at the entire system, grasp it, and then just implement the whole thing in one go.

The RavenDB network layer on the client side has gone through many changes over the years. In particular, we have the following concerns handled in this layer:

  • Actual sending of requests to the server.
  • High availability, Failover, load balancing and SLA matching.
  • Authentication of the requests
  • Caching server responses.

I’m probably ignoring a few that snuck in there, but those should be the main ones. The primary responsibility of the network layer in RavenDB is to send requests to the server (with all the concerns mentioned above) and give us the JSON object back.

The problem is that right now we have those responsibilities scattered all over the place. Each was added at a different time, and using several different ways, and the handling is done in very different locations in the code.  This leads to complexity, and seeing everything in one place in the Python client is a great motivation to simplify things. So we are going to do that.


Having a single location where all of those concerned will make things simpler for us. But we can do better. Instead of just returning a JSON object, we can solve a few additional issues. Performance, stability and ease of deployment.

We can do that by removing the the internal dependency on JSON.Net. A while ago we got tired from conflicting JSON.Net versions, and we decided to just internalize it in our code. That led to much simpler deployment life, but does add some level of complexity if you want to customize your entities over the wire and in RavenDB (because you have to use our own copy of JSON.Net for that). And it complicates getting new JSON.Net updates, which we want.

So we are going to do something quite interesting here. We are going to drop the dependency on JSON.Net entirely. We already have a JSON parser, and one that is extremely efficient, using the blittable format. More importantly, it can writes directly to native code. That give us a very important property, it make it very easy to store the data, already parsed, with a well known size. Let me try to explain my reasoning.

A very important feature of RavenDB is the idea that it can cache requests for you. When you make a request to the server, the server returns the reply as well as an etag. Next time you make this request, you’ll use the same etag, and if the response hasn’t changed, the server can just tell you to use the value in the cache. Right now we are storing the full string of the request in the cache. That lead to a few issues, in particular, while we saved on the I/O of sending the request, we still need to parse the JSON, and we need to keep relatively large (sometimes very large) .NET strings around for a long time.

But if we use blittable data in the client cache, then we have no need to do additional parsing, and we don’t need to involve the GC on the client side to collect things, we can manage that explicitly and directly. So the 2nd time you make a request, we’ll hit the server, learn that the value is still relevant, and then just use it.

That is great for us. Except for one thing. JSON.Net is doing really great job in translating between JSON and .NET objects, and that isn’t really something that we want to do. Instead, we are going to handle blittable objects throughout the RavenDB codebase. We don’t actually need to deal with the user .NET’s types until the very top layers of the system. And we can handle that by having a convention, that will call into JSON.Net (the only place that will happen) and translate .NET objects to JSON and back. The nice thing about it is that since this is just a single location where this happens, we can do that dynamically, without having a hard dependency on JSON.Net version.

That, in turn, also expose the ability to do JSON <—> .Net objects using other parsers, such as Jil, for example, or whatever the end user decides.

RavenDB 3.5 whirl wind tourGot anything to declare, ya smuggler?

time to read 1 min | 189 words

Here we have another aspect of making operations’ life easier. Supporting server-side import/export, including multiple databases, and using various options.


Leaving aside the UI bugs in the column alignment (which will be fixed long before you should see this post), there are a couple of things to note here. I have actually written about this feature before, although I do think that this is a snazzy feature.

What is more important is that we managed to get some early feedback on the released version from actual ops people and then noted that while this is very nice, what they actually want is to be able to script this. So this serves as both the UI for activating this feature, and also generating the curl script to execute it from a cron job.

As a reminder, we have the RavenDB Conference in Texas in a few months, where we’ll present RavenDB 3.5 in all its glory.


The design of RavenDB 4.0Replication from server side

time to read 12 min | 2270 words

Replication with RavenDB is one of our core features. Something that we had in the product from the very first release (although we did simplify things by several orders of magnitudes over the years). Replication is responsible for high availability, load balancing and several other goodies. For the most part, replication works quite well, and it is a lot less complex then some of the other things that grew over the years (LoadDocument, for example). That said, it doesn’t mean that it can’t be improved. And since this is such an important aspect of RavenDB, we spent quite a lot of time in seeing what we can do to improve it.

Here are the basic design guidelines:

  • RavenDB is going to remain a multi master system, where each node can accept writes and distribute it to its siblings.
    • We intend to use Raft for dynamic leader selection, but that is a layer on top of the basic replication.
    • That means that RavenDB is an AP system, and needs to handle conflicts.
  • We mostly deal with fully connected graphs of relatively small clusters (less than 10 nodes).
    • Higher number of nodes are quite frequent, but they don’t use a mesh topology, but typically go for a hierarchy.

This post is going to focus solely on the server side aspects of replication, I’ll do another post about changes from the clients perspective.

Probably the first thing that we intend to change is how we track the replication history. Currently, we track the last 50 changes made on a document. This has several problems:

  • if there have been more than 50 changes on a document between replication batches, we’ll get a false conflict.
  • if the documents are small, in many cases the replication metadata is actually bigger than the document itself.

We are going to move to an explicit vector clock implementation. This is a bit complex, because there are multiple concepts that we need to track concurrently here.

Every time that a document changes, the server generate an etag for that change. This etag is an int64 number that is always increasing. This is used for optimistic concurrently, indexing, etc. The etag value is per server, and cannot be used across servers. Each server has a unique identifier. Joining the two together, whenever a document is changed on a server directly (not via replication), we’ll stamp it with the server id and the current document etag.

In other words, let us imagine the following set of operations happening in a three node cluster.

Users/1 is created on Server A, it gets an etag of 1 and a vector clock of {A:1}. Users/2 is created on Server A, it gets an etag of 2 and a vector clock of {A:2}. Users/3 is created on Server C, it gets etag 1 (because etags are local per server) and its vector clock is {C:1}. Servers A and C both replicate to Server B, and to each other, resulting in the following cluster wide setup:

  Server A Server B Server C
Users/1 etag 1, {A:1} etag 1, {A:1} etag 2, {A:1}
Users/2 etag 2, {A:2} etag 3, {A:2} etag 3, {A:2}
Users/3 etag 3, {C:1} etag 2, {C:1} etag 1, {C:1}

Note that the etags assigned for each document are not consistent across the different servers, but that they are temporally consistent with respect to the writes. In other words, Users/1 will always have a lower etag than Users/2.

Now, when we modify Users/3 on server B, we’ll get the following cluster wide picture:

  Server A Server B Server C
Users/1 etag 1, {A:1} etag 1, {A:1} etag 2, {A:1}
Users/2 etag 2, {A:2} etag 3, {A:2} etag 3, {A:2}
Users/3 etag 4, {B:4,C:1} etag 4, {B:4,C:1} etag 4, {B:4,C:1}

As I said, only changed on the server directly (and not via replication) will impact the document vector clock, but any modification (replication or directly on the node) will modify a document’s etag.

Using such vectors clocks, we gain two major features. First, it is very easy to see if we have conflicting changes. {C:1, B:4} is obviously a parent of {C:4,B:6}, while {C:2,A:6} is a conflict. The other is that we can now form a very easy view of the kind of changes that we have received. We do that using a server wide vector clock. In the case of the table above, the server wide vector clock would be {A:2,B4,C:1}. In other words, it will contain the latest etag seen from each server.

We’ll get to why exactly this is important for us in a bit. For now, just accept that it does, because the next part is about how we are going to actually do the replication. In previous versions of RavenDB, we did each replication batch through a separate REST call to the remote server. This has a few disadvantages. It meant that we had to authenticate every single time, and we couldn’t make any assumptions about the state of the remote server.

In RavenDB 4.0, we intend to move replication to use pure Websockets only usage. On startup, a node will connect to all its siblings, and stay connected to them (retrying the connection on any interruption). This has the nice benefit of only doing authentication once, of course, but far more interesting from our perspective is the fact that it means that we can rely on the state of the server on the other side. TCP has a few very interesting properties here for us.  In particular, it guarantee that we’ll have ordered delivery of messages. Which means that we can assume that once we sent a message to a server on a TCP connection, it either got it, or the TCP connection will return an error at some point, forcing us to reconnect.

In other words, it isn’t just authentication that I can do just once, I can also query the remote server for its state (as it regards me), and since I’m the only person that can talk as myself, and I’m the one sending the details. As long as the connection lasts, I know what the other side knows about me. Confusing, isn’t it?

But basically it means that instead of having to ask on each batch what is the last document that the destination server saw of me, I can assume that the last document that I sent was received. That lasts until the connection breaks, in which case I can need to figure out what actually arrived. This seems like a small thing, but this will actually allow me to reduce the number of roundtrips for a batch by half. There are other aspects here that are really nice, I get to piggyback on TCP’s congestion protocol, so if the remote server is slow in accepting updates, it will (eventually) reflect as a blocking write on my end. That seems like a bad thing, right? But this is actually what I want.

Each destination server in RavenDB 4.0 is going to get its own dedicated thread. This thread will manage all outgoing communication with this server. That gives us several really important behaviors. It means that we can easily account for problems by just looking at the thread responsible (hm… I see that replication to node C is consuming a lot of CPU) and it also turn the entire replication process to a pretty simple single threaded operation. Because of the blittable format, we don’t need complex prefetching strategies or sharing of memory in the replication, and a slow node will not impact any other replication behavior. That, in turn, basically mean a thread per connection (see previous discussion on the expected number of nodes being relatively small) and a very simple programming / error handling / communication model.

The replication sending logic goes something like this:

Yes, my scratch pad language is still Boo (Python, if you aren’t familiar with it), and this is meant to convey how simple that thing is. All the complexity that we currently have to deal with is out. Of course, the real code will need to have error handling, reconnection logic, etc, but that is roughly all you’ll need.

Actually, that is a lie. The problem with the code above is that it doesn’t work well with multiple servers. In other words, it is perfect for two nodes, replicating to one another, but when you have multiple nodes, you don’t want a single document update to replication from each node to every other node. That is why we have the concept of vector clocks. At the document level, this serves as an easy way to detect conflicts and see what version of a document is casually later than another version of a document. But on the server level, we gather the latest writes from all nodes that we saw to get the server wide vector clock.

When a document is modified on a server, that server will immediately send that document to all its siblings. Because there is no way that they already have it. But if a document was replicated to a node, it will not start replicating right away. Instead, it will let a set amount of time go by (defaulting to once a minute) and then ask each sibling what is the latest server wide vector clock that it is aware of. If the remote vector clock is equal to or higher than the local server wide vector clock, then we know that they are up to date. In this case, the local server will let the remote server know that they are a match to the current etag on that server.

If, however, the local vector clock is smaller (or conflicting) from the remote server, then we need to send the relevant documents. We already know what is the last etag that the remote server has from us (we negotiated that when we established the connection, and we updated it every time we sent a document to the remote server. Since we have the current vector clock from the remote server, we aren’t going to just blindly send all documents after the last etag we sent to the remote server. Instead, we are going to check each of those to see if the vector clock for the document is larger (or conflicting) than the remote server vector clock. In this way, we can send the remote server only the documents that it doesn’t have.

What about delayed servers? If we had a new node in the cluster, and we just started replicating to it, what happens when a new document is being written. Above, I mentioned that the written to server will immediately write it to all its siblings, but that is an over simplification. An extremely important property of RavenDB replication is that documents are always replicated in the order the server saw them (either written to it directly, or the order they were replicated to it). If we allow a server to replicate documents directly to another server, that might break this guarantee. Looking at the code above, it will also require us to write a separate code path to handle such things. But that is the beauty in this design. All of this logic is actually encapsulated in WaitForMoreDocuments(). You can this of WaitForMoreDocuments() as a simple manual reset event. Whenever a document is written to a document directly, it will be set. But not when a document is replicated to us.

So WaitForMoreDocuments() will basically wait for a document to be written to us, or a timeout, in which case it will check with its sibling for new stuff that need to go over the wire because it was replicated to us. But the code is the same code, and the behavior is the same. If we are busy sending data to a new server? We’ll still set the event, but that will have no effect on the actual behavior. And when we are working with a fully caught up server, the act of writing a single document will immediately free the replication threads to start sending it to the sibling. All the desired behaviors, and very little actual complexity.

On the receiving end, we get just the documents we don’t have, as well as the last etag from that source server (which we’ll keep in persistent storage). Whenever we get a new document, we’ll check if it is conflicting. If so, we’ll mark the document as conflicting and allow the user to define default strategies to handle that (latest, resolve to remote, resolve to local). But we are also going to allow the user to define a Javascript function that will merge the conflicted documents directly. This way you can have your business logic for the resolution directly on the server, and you’ll never actually see any conflicts externally.

There are quite a lot of small details that I’m skipping, but this is already long enough, and should give you a pretty good idea about where we are headed.

RavenDB 3.5 whirl wind tourI'm no longer conflicted about this

time to read 2 min | 286 words

A natural consequence of RavenDB decision to never reject writes (a design decision that was influenced heavily by the Dynamo paper) is that it is possible for two servers to get client writes to the same document without coordination. RavenDB will detect that and handle it. Here is a RavenDB 2.5 in conflict detection mode, for example:


In RavenDB 3.0, we added the ability to have the server resolve conflicts automatically, based on a few predefined strategies.


This is in addition to giving you the option for writing your own conflict resolution strategies, which can apply your own business logic.

What we have found was that while some users deployed RavenDB from the get go with the conflict resolution strategy planned and already set, in many cases, users were only getting around to doing this when they actually had this happen in their production systems. In particular, when something failed to the user in such a way that they make a fuss about it.

At that point, they investigate, and figure out that they have a whole bunch of conflicts, and set the appropriate conflict resolution strategy for their needs. But this strategy only applies to future conflicts. That is why RavenDB 3.5 added the ability to also apply those strategies in the past:


Now you can easily select the right behavior and apply it, no need to worry.

The design of RavenDB 4.0Getting RavenDB running on Linux

time to read 4 min | 748 words

We have been trying to get RavenDB to run on Linux for the over 4 years. A large portion of our motivation to build Voron was that it will also allow us to run on Linux natively, and free us from dependencies on Windows OS versions.

The attempt was actually made several times, and Voron has been running successfully on Linux for the past 2 years, but Mono was never really good enough for our needs. My hypothesis is that if we were working with it from day one, it would have been sort of possible to do it. But trying to port a non trivial (and quite a bit more complex and demanding than your run of the mill  business app) to Mono after the fact was just a no go. There was too much that we did in ways that Mono just couldn’t handle. From GC corruption to just plain “no one ever called this method ever” bugs. We hired a full time developer to handle porting to Linux, and after about six months of effort, all we had to show for that was !@&#! and stuff that would randomly crash in the Mono VM.

The CoreCLR changed things dramatically. It still takes a lot of work, but now it isn’t about fighting tooth and nail to get anything working. My issues with the CoreCLR are primarily in the area of “I wanna have more of the goodies”. We had our share of issues porting, some of them were obvious, a very different I/O subsystem and behaviors. Other were just weird (you can’t convince me that the Out Of Memory Killer is the way things are supposed to be or the fsync dance for creating files), but a lot of that was obvious (case sensitive paths, / vs \, etc). But pretty much all of this was what it was supposed to be. We would have seen the same stuff if were working in C.

So right now, we have RavenDB 4.0 running on:

  • Windows x64 arch
  • Linux x64 arch

We are working on getting it running on Windows and Linux in 32 bits modes as well, and we hope to be able to run it on ARM (a lot of that depend on the porting speed of the CoreCLR to ARM, which seems to be moving quite nicely).

While there is still a lot to be done, let me take you into a tour of what we already have.

First, the setup instructions:


This should take care of all the dependencies (including installing CoreCLR if needed), and run all the tests.

You can now run the dnx command (or the dotnet cli, as soon as that become stable enough for us to use), which will give you RavenDB on Linux:


By and large, RavenDB on Windows and Linux behaves pretty much in the same manner. But there are some differences.

I mentioned that the out of memory killer nonsense behavior, right? Instead of relying on swap files and the inherent unreliability of Linux in memory allocations, we create temporary files and map them as our scratch buffers, to avoid the OS suddenly deciding that we are nasty memory hogs and that it needs some bacon. Windows has features that allow the OS to tell applications that it is about to run out of memory, and we can respond to that. In Linux, the OS goes into a killing spree, so we need to monitor that actively and takes steps accordingly.

Even so, administrators are expected to set vm.overcommit_memory and vm.oom-kill to proper values (2 and 0, respectively, are the values we are currently recommending, but that might change).

Websockets client handling is also currently not available on the CoreCLR for Linux. We have our own trivial implementation based on TcpClient, which currently supports on HTTP. We’ll replace that with the real implementation as soon as the functionality becomes available on Linux.

Right now we are seeing identical behaviors on Linux and Windows, with similar performance profiles, and are quite excited by this.


  1. Database Building 101: Stable node ids - 7 hours from now
  2. Database Building 101: Graph querying over large datasets - about one day from now
  3. Voron’s internals: MVCC - All the moving parts - 2 days from now
  4. Voron internals: Cleaning up scratch buffers - 5 days from now
  5. Voron internals: The transaction journal & recovery - 6 days from now

And 4 more posts are pending...

There are posts all the way to Sep 05, 2016


  1. Database Building 101 (8):
    22 Aug 2016 - High level graph operations
  2. Production postmortem (16):
    23 Aug 2016 - The insidious cost of managed memory
  3. Digging into the CoreCLR (3):
    12 Aug 2016 - Exceptional costs, Part II
  4. The Guts n’ Glory of Database Internals (20):
    08 Aug 2016 - Early lock release
View all series



Main feed Feed Stats
Comments feed   Comments Feed Stats