Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:


+972 52-548-6969

, @ Q c

Posts: 6,400 | Comments: 47,414

filter by tags archive

RavenDB 4.0The etag simplification

time to read 2 min | 357 words

A seemingly small change in RavenDB 4.0 is the way we implement the etag. In RavenDB 3.x and previous we used a 128 bits number, that was divided into 8 bits of type, 56 bits of restarts counter and 64 bits of changes within the current restart period. Visually, this looks like a GUID:  01000000-0000-0018-0000-000000000002.

The advantage of this format is that it is always increasing, very cheap to handle and requires very little persistent data. The disadvantage is that it is very big, not very human readable and the fact that the number of changes reset on every restart means that you can’t make meaningful deduction about relative sizes between any two etags.

In RavenDB 4.0 we shifted to use a single 64 bits number for all etag calculations. That means that we can just expose a long (no need for the Etag class) which is more natural for most usages. This decision also means that we need to store a lot less information, and etags are one of those things that we go over a lot.  A really nice side affect which was totally planned is that we can now take two etags and subtract them and get a pretty good idea bout the range that needs to be traversed.

Another important decision is that everything uses the same etag range. So documents, revisions, attachments and everything share the same etag, which make it very simple to scan through and find the relevant item just based on a single number. This make it very easy to implement replication, for example, because the wire protocol and persistence format remain the same.

I haven’t thought to write about this, seemed like too small a topic for post, but there was some interest about it in the mailing list, and enumerating all the reasons, it suddenly seems like it isn’t such a small topic.

Update: I forgot to mention, a really important factor of this decision is that we can do do this:


So we can give detailed information and expected timeframes easily.

Throw away the tests, I’m debugging this in production mode

time to read 2 min | 218 words

Prismatic-Sketched-Brain-Silhouette-300pxToday I had a really strange revelation. We had an issue that related to race conditions in a distributed group, and we could just not figure out what was going on from the tests.

Then we switch to a production mode, where each node was a separate process, and we debugged one of them. And it was easy, and it was obvious, and everything just worked.  That was a profound revelation.

What happened was that we build our system so we can service them in production, that includes the internal design and exposed instrumentation data that we have. When we were trying to figure things out from the tests, we were running everything in a single process, and the act of trying to debug a single thread would cause all other threads to stop. That would trigger cascade behaviors (since timeout would be fired, and the cluster would move into recovery mode).

With us debugging just a single node in the cluster, the rest of the cluster just thought it was slow, it didn’t have any impact, but allow us to observe the behavior of the system very easily. The solution was quite obvious once we got into that stage.

Fun with C# local functions

time to read 1 min | 199 words

The reason for this post is simple, this code is so brilliantly simple that I just had to write about it.

On the face of it, it isn’t doing much that is interesting, but it is showing off something very critical. It is both obvious and easy to reason about. And if you don’t have local functions, trying to do something like that will require you to jump through several hops and pretty much always generate code that the compiler and any static analysis tool will consider suspect.

The fact that the reference to the local function can be added and then remove also means that we can do things like this:

A self cleaning delegate, which is only usually possible with code trickery that force you to capture a variable that you have set to null and then set to the value you are trying to use.

I know that this isn’t a really a major thing, but it make certain very specific scenarios so much easier, so it is just a joy to see. And yes, the impetus for this code was actually seeing it used in our code and going Wow!

The 7 years old disk test machine

time to read 3 min | 465 words

rodentia-icons_fsguard-plugin-urgent-300pxWe are testing RavenDB on a wide variety of software and hardware, and a few weeks ago one of our guys came to me with grave concern. We had a major regression in performance on Linux. And major as in 75% slower than what it used to be a few weeks ago.

Testing at that point that showed that indeed, there is a big performance gap between the same benchmark on that Linux machine and a comparable machine running Windows. That was worrying, and took us a while to figure out what was going on. The problem was that we previously had that exact same scenario. The I/O pattern that are most suitable for Linux are pretty bad for Windows, and vice versa, so optimizing for each requires a delicate hand. The expectation was that we did something that would overload the system somehow and caused major regression.

A major discovery was that it wasn’t Linux per se that was slow. Testing the same thing on a significantly smaller machine showed much better performance. We still had to rule out a bunch of other things, such as specific setting / behavior that we would trigger on that particular machine, but it seemed promising. And that was the point when we looked at the hardware. That particular Linux machine is an old development machine that has gone through several developer upgrade cycles, and when it was rebuilt, we used the most easily available disk that we had on hand.

That turned out to be a Crucial SSD 128GB M22 disk. To those of you who don’t keep a catalog of all hard disks and their numbers, there is Google, which will tell you that this has been out for nearly a decade, and that particular disk has been shuffling bits in our offices for about 7 years or so. In its life, it has been subject to literally thousands of database benchmarks, reading and writing very large amount of data.

I’m frankly shocked that it is still working, and it is likely that there is a lot of internal error correction that is going on. But the end result is that it is predictably generate very unpredictable I/O patterns, and it is a great machine to test what happens when things start to fail in a very ungraceful manner (a write to the local disk that takes 5 seconds but also blocks all other I/O operations in the system, for example).

I’m aware of things like nbd & trickle, but it was a lot more fun to discover that we can just run stuff on that particular machine and find out what happens when a lot of our assumptions are broken.

Storing secrets in Linux

time to read 3 min | 545 words

We need to store an encryption key on Linux & Windows. On Windows, the decision is pretty much trivial, you throw that into DPAPI, and can rely on the operating system to handle that for us. In particular, it is very easy to analyze key scenarios such as “someone stole the hard disk” and say that either the thief wouldn’t be able to get the plain text key, or we can blame Microsoft for that Smile.

On Linux, the situation seems to be much more chaotic. There is libsecret, which seems to be much wider in scope than DPAPI. Whereas DPAPI has 2 methods (protect & unprotect), libsecret has a lot of moving pieces, which is quite scary. That is leaving aside the issue of having no managed implementation and having to dance around Gnome specific data types in the API (need to pass GCancellable & GError into it) which increase the complexity.

Other options include using some sort of hardware / software security modules (such as HashiCorp Vault), which is great in theory, but requires us to either take a dependency on something that might not be there, or try to support a wide variety of options (Keywhiz, Chef, Puppet, CloudHSM, etc). That isn’t a really good alternative from our point of view.

Looking into how Mono implemented the DPAPI on Linux, they did it by writing a master key to an XML file and relied on file system ACLs to prevent anyone from seeing that information. This end up being this:

chmod(path, (S_IRUSR | S_IWUSR | S_IXUSR) );

Which has the benefit of only allowing that user to access it, but given that I’ve gotten the physical disk, I’m able to easily mount that on machine that I control as root and access anything that I like. On Windows, by the way, the way this is prevented is that the user must have logged in, and a key that is derived from their password is used to decrypt all protected data as needed, so without the user logging in, you cannot decrypt anything. For that matter, even the administrator on the machine can’t recover the data if they want to, because resetting the user’s password will cause all such information to be lost.

There is the Gnome.Keyring project as well, which hasn’t been updated in 7 years, and obviously doesn’t support the kwallet (which libsecret does). OWASP seems to be throwing the towel there and just recommend to rely on the file system ACL.

The Linux Kernel has a Key Retention API, but it seems to be targeted primarily toward giving file systems access to the secrets they need, and it looks like it isn’t going to survive reboots (it is primarily a caching mechanism, it looks like?).

So after all this research, I can say that I don’t like libsecret, it seems too cumbersome and will need users to jump through some hoops in some cases (install additional packages, setup access, etc). 

Setting up the permissions via the ACL seems to be the common way to handle this scenario, but it doesn’t sit well with me.

Any recommendations?

Tripping on the TPL

time to read 2 min | 364 words

I thought that I found a bug in the TPL, but it looks like its working (more or less) by design. Basically, when a task is completed, all awaiting tasks will be notified on that, which is pretty much what you would expect. What isn’t usually expected is that those tasks can interfere with one another. Consider the following code:

We have two tasks, which accept a parent task and do something with it. What do you think will happen when we run the following code?

Unless you are very quick on the draw, running this code will result in a timeout message, but how? We know that we have a much shorter duration for the task than the timeout, so what is going on?

Well, effectively what is going on is that the parent task has a list of children that it will notify, and by default, it will do so synchronously and sequentially. If a child task blocks for whatever reason (for example, it might be processing a lot of work), the other children of the parent task will not be notified.

If there is a timeout setup, it will be triggered, even though the parent task was already completed. It took us a lot of time to figure out the repro in this issue, and we were certain that this is some sort of race condition in the TPL. I had a blog post talking all about it, but the Microsoft team is fast enough that they were able to literally answer my issue before I had the time to complete my blog post. That is really impressive.

I should note that the suggestion, using RunContinuationsAsynchronously, works quite well for creating a new Task or using TaskCompletionSource, but there is no way to specify that when you are using Task.Run. What is worse for us is that since this is not the default (for perfectly good performance reasons, by the way), this means that any code that we call into might trigger this. I would have much rather to be able to specify than when waiting on the task, rather than when creating it.

Distributed task assignment with failover

time to read 5 min | 818 words

When building a distributed system, one of the more interesting aspects is how you are going to distribute tasks assignment. In other words, given that you have multiple nodes, how do you decide which node will do what? In some cases, that is relatively easy, you can say “all nodes will process read requests”, but in others, this is more complex. Let us take the case where you have several nodes, and you need to have a regular backup of a database that is replicated between all those nodes. You probably don’t want to run the backup across all the nodes, after all, they are pretty much the same and you don’t want to backup the exact same thing multiple times. On the other hand, you probably don’t want to assign this work statically, if you do, and if the node that is responsible for the backup is down, you got no backup.

Another example of the problem can be seen when you have other processes that you would like to be sticky if possible, and only jump around if there is a failure. Brining up a new node online is a common thing to do in a cluster, and the ideal scenario in that case is that a single node will feed it all the data that it needs. If we have multiple nodes doing that, they are likely to overlap and they might very well overload the poor new server. So we want just one node to update its state, but if that node goes down midway, we need someone else to pick up the slack.. For RavenDB, those sort of tasks includes things like ETL processes, Subscriptions, backup, bootstrapping new servers and more. We keep discovering new things that can use this sort of behavior.

But how do we actually make this work?

One way of doing this is to take advantage of the fact that RavenDB is using Raft and have the leader do task assignment. That works quite well, but it doesn’t scale. What do I meant by that? I mean that as the number of tasks that we need to manage grows, the complexity in the task assignment portion of the code grows as well. Ideally, I don’t want to have to consider twenty different variables and considerations before deciding what operation should go on which server, and trying to balance that sort of work in one place has proven to be challenging.

Instead, we have decided to allocate work in the cluster based on simple rules. Each task in the cluster has a key (which is typically generated by hashing some of its parameters), and that task can be assigned to a group of servers. Given those two details, we can use Jump Consistent Hashing to spread the load around. However, that doesn’t handle failover. We have a heartbeat process that can detect and notify nodes that a sibling has went down, so combining those two, we get the following code:

What we are doing here is rely on two different properties. Jump Consistent Hashing to let us know which node is responsible for what, and the Raft cluster leader that keep track of all the live nodes and let us know when a node goes down. When we need to assign a task, we use its hashed key to find its preferred owner, and if it is alive, that is that. But if it currently down, we do two things, we remove the downed node from the topology and re-hash the key with the new number of nodes in the cluster. That gives us a new preferred node, and so on until we find a live one.

The reason we rehash on failover is that Jump Consistent Hashing is going to usually point to the same position in the topology (that is why we choose it in the first place, after all), so we rehash to get a different position so it won’t all fall unto the next node in the list. All downed node tasks are fairly distributed among the remaining live cluster members.

The nice thing about this is that aside from keeping the live/down list up to date, the Raft cluster doesn’t really need to do something. This is a consistent algorithm, so different nodes operating on the same data can arrive at the same result, so a node going down will result in another node picking up on updating the new server up to spec and another will start a backup process. And all of that logic is right where we want it, right next to where the task logic itself is written.

This allow us to reason much more effectively about the behavior of each independent task, and also allow each node to let you know where each task is executing.

I was wrong, reflecting on the .NET design choices

time to read 3 min | 480 words

I have been re-thinking about some of my previous positions with regards to development, and it appear that I have been quite wrong in the past.

In particular, I’m talking about things like:

Note that those posts are parts of a much larger discussion, and both are close to a decade old. They aren’t really relevant anymore, I think, but it still bugs me, and I wanted to outline my current thinking on the matter.

C# is non virtual by default, while Java is virtual by default. That seems like a minor distinction, but it has huge implications. It means that proxying / mocking / runtime subclassing is a lot easier with Java than with C#. In fact, a lot of frameworks that were ported from Java rely on this heavily, and that made it much harder to use them in C#. The most common one being NHibernate, and one of the chief frustrations that I kept running into.

However, given that I’m working on a database engine now, not on business software, I can see a whole different world of constraints. In particular, a virtual method call is significantly more expensive than a direct call, and that adds up quite quickly. One of the things that we routinely do is try to de-virtualize method calls using various tricks, and we are eagerly waiting .NET Core 2.0 with the de-virtualization support in the JIT (we already start writing code to take advantage of it).

Another issue is that my approach to software design has significantly changed. Where I would previously do a lot of inheritance and explicit design patterns, I’m far more motivated toward using composition, instead. I’m also marking very clear boundaries between My Code and Client Code. In My Code, I don’t try to maintain encapsulation, or hide state, whereas with stuff that is expected to be used externally, that is very much the case. But that give a very different feel to the API and usage patterns that we handle.

This also relates to abstract class vs interfaces, and why you should care. As a consumer, unless you are busy doling some mocking or so such, you likely don’t, but as a library author, that matters a lot to the amount of flexibility you get.

I think that a lot of this has to do with my view point, not just as an Open Source author, but someone who runs a project where customers are using us for years on end, and they really don’t want us to make any changes that would impact their code. That lead to a lot more emphasis on backward compact (source, binary & behavior), and if you mess it up, you get ricochets from people who pay you money because their job is harder.

A tricky bit of code

time to read 1 min | 111 words

I run into the following bit of code while doing a code review on a pull request:

This was very strange, because the code appeared to compile properly, but it shouldn’t. I mean, look at it. The generic parameter is not constrained, and I don’t have any extension methods on Object that can apply here, so why would this compile?

The secret was in the base class:

Basically, we specified the constraint on the abstract method, and then inherited it, which was really confusing to me until I figured it out.

You can’t do the same with interfaces, though, although explicit interface implementation does allow it.

Sometimes it really IS not our fault

time to read 1 min | 150 words

imageSo we got an emergency support call during the Passover holiday, and as you can imagine, it was a strange one. Our investigation of the error basically boiled down (cutting down a lot of effort in between): “This can’t be happening.”

I hate this kind of answer, because it usually means that we are missing something. Usually that can be a strange error code, some race condition or just something strange about the environment.

While we were working the problem, the customer came back with, “Oh, we found the issue. A memory unit went rogue, and the firmware wasn’t able to catch it.” When they updated the firmware, it apparently caught it immediately.

So I guess we can close this support incident. Smile


  1. Bug stories: The memory ownership in the timeout - one day from now
  2. We won’t be fixing this race condition - about one day from now
  3. Batch processing with subscriptions in RavenDB 4.0 - 5 days from now
  4. RavenDB 4.0: Unbounded results sets - 6 days from now
  5. The ghost of the zombie of revisions past - 7 days from now

There are posts all the way to Jul 05, 2017


  1. RavenDB 4.0 (8):
    13 Jun 2017 - The etag simplification
  2. Bug stories (3):
    28 Jun 2017 - How do I call myself?
  3. PR Review (2):
    23 Jun 2017 - avoid too many parameters
  4. Reviewing Noise Search Engine (4):
    20 Jun 2017 - Summary
  5. De-virtualization in CoreCLR (2):
    01 May 2017 - Part II
View all series


Main feed Feed Stats
Comments feed   Comments Feed Stats