Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:


+972 52-548-6969




Use cases for MADV_DONTNEED in Voron

time to read 2 min | 341 words

The rant in this video is an absolutely beautiful one. I ran into it while figuring out how MADV_DONTNEED works, and I thought I would give some context on why this behavior is exactly what I want. In fact, you can read my reasoning directly in the Linux kernel source code.

During a transaction, we need to put the memory pages modified by the transaction somewhere. We put them in temporary storage which we call scratch. Once a transaction is committed, we still use the scratch memory for a short period of time (due to MVCC) and then we flush these pages to the data file. At this point, we are never going to use the data on the scratch pages again. Just leaving them around means that the kernel needs to write them to the file they were mapped from. Under load, that can actually be a large portion of the I/O the system is doing.

We handle that by telling the OS that we don’t actually need this memory, and that it should throw it away and not write it to disk, using MADV_DONTNEED. We are still checking whether this causes us excessive reads when we do that (when the kernel tries to re-read the data from disk).

There are options that seem better, though. There is MADV_REMOVE, which will do the same and also (efficiently) zero the data on disk if needed, so it is not likely to cause page faults when reading it back again. The problem is that it is limited to certain file systems. In particular, SMB file systems are really common for containers, so that is something to take into account.

MADV_FREE, on the other hand, does exactly what we want, but it only works on anonymous maps. Our scratch files use actual files, not anonymous maps. This is because we want to give the memory a backing store in the case of memory overload (and to avoid the wrath of the OOM killer), so we explicitly define them as files (although temporary ones).

RavenDB Node.JS client updated (now supporting subscriptions)

time to read 1 min | 85 words

The Node.js RavenDB client had a milestone release last month, which I only now got around to talking about. The most important part of this release is that Node.js users can now use subscriptions to process data from RavenDB.

Here is how you can do this:
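The code snippet from the original post did not survive, so here is an approximate sketch of subscription usage with the ravendb Node.js client. The server URL, database name, and query are made up, and the exact API shape may differ between client versions:

```javascript
const { DocumentStore } = require("ravendb");

const store = new DocumentStore("http://localhost:8080", "Northwind");
store.initialize();

async function main() {
    // create a subscription definition on the server (one-time setup)
    const name = await store.subscriptions.create({ query: "from Orders" });

    // open a worker that keeps receiving batches of matching documents
    const worker = store.subscriptions.getSubscriptionWorker({
        subscriptionName: name
    });

    worker.on("batch", (batch, callback) => {
        for (const item of batch.items) {
            console.log("processing order", item.id);
        }
        callback(); // acknowledge so the server sends the next batch
    });

    worker.on("error", err => console.error(err));
}

main();
```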

This is one of the most powerful abilities RavenDB has: you will continuously get new orders in your script and be able to process them. This makes it very easy to build features such as data pipelines, background processing, etc.

Analyzing the GitHub outage

time to read 8 min | 1516 words

A couple of weeks ago, GitHub had a major outage, lasting over 24 hours and resulting in widespread disruption of many operations for customers. A few days after everything was fixed, they posted their analysis of what happened, which makes for a really good read.

The pebble that started all of this was a connection disruption that lasted 43 seconds(!). A couple of months ago I talked about people who say that you can assume that distributed failures are no longer meaningful. The real world will keep serving up weird / strange / nasty stuff to your production systems, and you need to handle that. Quoting from the original post:

Therefore: the question becomes: how much availability is lost when we guarantee consistency? In practice, the answer is very little. Systems that guarantee consistency only experience a necessary reduction in availability in the event of a network partition. As networks become more redundant, partitions become an increasingly rare event. And even if there is a partition, it is still possible for the majority partition to be available. Only the minority partition must become unavailable. Therefore, for the reduction in availability to be perceived, there must be both a network partition, and also clients that are able to communicate with the nodes in the minority partition (and not the majority partition). This combination of events is typically rarer than other causes of system unavailability.

So no, not really. There is a good point here on the fact that only the minority portion of the system must become unavailable, but given typical production deployments, any disconnect between data centers will cause a minority portion to be visible to clients and become unavailable.
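The arithmetic behind this is simple to sketch. With a typical three-node cluster spread over two data centers (the sizes below are illustrative, not GitHub’s actual topology), any inter-DC disconnect leaves some nodes stranded in a minority that cannot reach quorum:

```python
def has_quorum(reachable: int, cluster_size: int) -> bool:
    """A side of a partition can make progress only with a majority."""
    return reachable >= cluster_size // 2 + 1

# Three consensus nodes: two in DC-East, one in DC-West.
cluster_size = 3

# A partition between the data centers splits them 2 / 1.
print(has_quorum(2, cluster_size))  # majority side keeps working -> True
print(has_quorum(1, cluster_size))  # minority side stalls        -> False
```

Clients that can only reach the minority side see an unavailable system, which is exactly the scenario a two-data-center deployment makes routine.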

The actual GitHub issues that are discussed in the post are a lot more interesting. First, we have the obvious problem that most applications assume that their database access is fast, and they make multiple such calls during the processing of a single request (sometimes, many calls). This is just another example of the Fallacies of Distributed Computing in action. RavenDB has built-in detection for that and a host of features that allow you to go to the database server once, instead of multiple times. In such a case, even if you need to fail over to a remote server, you won’t pay the roundtrip costs multiple times.

However, this is such a common problem that I don’t think it deserves much attention. There isn’t much that you can do about it without careful consideration and support from the whole stack. Usually, this happens on projects where you have a strong leader who institutes a performance budget and enforces it. This has costs of its own, and usually it is cheaper to just not fail over across data center boundaries.

The next part that I find really interesting is that the system that GitHub uses for managing topologies is not consistent but is required to be. The problem is that there is an inherent delay between a failure actually occurring and their orchestrator re-organizing the cluster in response. That would have been fine if they had a way to successfully merge histories, but that is not the case. In fact, looking at just the information that they have published (and ignoring that I have the benefit of hindsight), the issue is glaringly obvious.

A deep dive (and a fascinating read) into how GitHub handles high availability talks about the underlying details and exposes the root cause. You cannot layer distinct distributed architectures on top of one another and expect to come up with a good result. Here is what happens in a master crash scenario:


  • The orchestrator nodes detect failures.
  • The orchestrator/raft leader kicks off a recovery. A new master gets promoted.
  • orchestrator/raft advertises the master change to all raft cluster nodes.
  • Each orchestrator/raft member receives a leader change notification. They each update the local Consul’s KV store with the identity of the new master.
  • Each GLB/HAProxy has consul-template running, which observes the change in Consul’s KV store, and reconfigures and reloads HAProxy.
  • Client traffic gets redirected to the new master.

I read this and feel a bit queasy, because the master crash scenario is not the interesting bit. That is the easy part. The really hard part is how you manage things when you have a network disruption, with both sides still up and functioning. In fact, that is exactly what happened to GitHub. In this case, on the minority side, their orchestrator cannot get a majority (and so cannot make any forward progress), which means the rest of the recovery cannot proceed; the whole thing stops at either the first or second stage.

That means that the rest of the system will continue to write to the old master, resulting in a conflict. And this is where things get complicated. The issue here is that with MySQL (and most other systems that rely on log replication) you must have a single master at any given time. That is an absolute requirement. If you get to the point where you have two masters with divergent histories, you are in for selecting which one you’ll accept (and what data you’ll discard) and trying to manually fix things after the fact.

The proper way to handle something like this would have been to use Raft to send the commands themselves to the servers. This ensures a consistent set of statements that run in the same order on all servers. Rqlite is a great example of this, where you get a consistent and distributed system on top of individual components. That would be the proper way to do it, mind you, not that anyone would actually do it that way.
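The core idea can be sketched in a toy model (this is the principle, not rqlite’s implementation): if Raft guarantees every replica sees the same ordered log of commands, deterministic replay of that log yields the same state everywhere, and there is no divergent history to reconcile.

```python
def apply_log(log):
    """Replay an ordered list of (key, value) set commands on an empty store."""
    state = {}
    for key, value in log:
        state[key] = value
    return state

# Raft's job is to make every replica agree on this exact sequence.
log = [("users/1", "create"), ("users/1", "update"), ("users/2", "create")]

replica_a = apply_log(log)
replica_b = apply_log(log)
assert replica_a == replica_b  # deterministic replay -> identical state
```

The price is that every write must go through consensus, which is where the throughput numbers in the next paragraph come from.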

You wouldn’t be able to get any reasonable performance from the system using this kind of approach. Rqlite, for example, talks about being able to get 10 – 200 operations per second. I’m going to assume that GitHub has a need for something better than that. So the underlying distributed architecture looks like this:

  • MySQL write master with multiple read-only secondaries using binlog.
  • Orchestrator to provide health monitoring and selection of new write primary (consistent using Raft)
  • Underlying infrastructure that uses (different) Raft to store routing configuration.

If you break Orchestrator’s ability to make decisions (easy, just create a partition), you take away the ability to change the write master, and if the failure mode you are dealing with is not a failed master (for example, a partition), you are going to keep accepting new writes to the old master. That completely breaks the whole idea of binlog replication, of course, so you are sort of stuck at that point. In short, I think that Orchestrator was meant to solve an entirely different problem. It was meant to deal with the failure of a single node, not to handle a full data center partition.

When looking at such incidents, I always compare to what would have happened if RavenDB were used instead. This is not really fair in this case, because RavenDB was designed upfront to be a distributed database. RavenDB doesn’t really have the concept of a write master. For simplicity’s sake, we usually try to direct all writes to a single node for each database, because this simplifies how you usually work. However, any node can accept writes and will distribute them to the rest of the nodes in the cluster. In a situation like the one GitHub faced, both sides of the partition would keep accepting writes (just like what happened in GitHub’s case with MySQL).

The difference is what happens when the partition is healed. Each side of the partition will update the other with the data it is missing. Any conflicting writes (by which I mean writes on both sides of the partition to the same document or documents) will be detected and resolved automatically. Automatic resolution is very important to keeping everything up and running. The resolution can follow a custom policy defined by the user or an arbitrary one chosen by RavenDB. Regardless of the conflict resolution policy, the administrator will be notified about the conflicts and can review the actions taken by RavenDB and decide what to do about them.
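A toy sketch of what heal-time merging looks like (this is illustrative only, not RavenDB’s actual algorithm or data model): compare the version each side produced for the same document id, auto-resolve conflicts with some policy (here, newest timestamp wins), and record the conflicting ids for the administrator to review.

```python
def merge_after_partition(side_a, side_b, resolve):
    """side_a/side_b map doc id -> (timestamp, body). Returns the merged
    documents plus the list of conflicting ids that were auto-resolved."""
    merged, conflicts = {}, []
    for doc_id in side_a.keys() | side_b.keys():
        a, b = side_a.get(doc_id), side_b.get(doc_id)
        if a is None or b is None or a == b:
            merged[doc_id] = a or b          # only one side wrote it
        else:
            conflicts.append(doc_id)          # notify the admin about these
            merged[doc_id] = resolve(a, b)    # pluggable resolution policy
    return merged, conflicts

latest = lambda a, b: max(a, b)               # resolve by newest timestamp

side_a = {"orders/1": (10, "shipped"), "orders/2": (11, "new")}
side_b = {"orders/1": (12, "cancelled")}

merged, conflicts = merge_after_partition(side_a, side_b, latest)
print(merged["orders/1"])   # -> (12, 'cancelled')
print(conflicts)            # -> ['orders/1']
```

The key property is that both sides keep accepting writes during the partition, and reconciliation is an automatic, auditable step afterwards rather than a manual salvage operation.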

In GitHub’s case, their busiest cluster had less than a thousand writes in the time period in question, most of which weren’t going to conflict. I would expect the timeline with RavenDB to be:

  • 2018 October 21 22:52 UTC – initial network partition, lasting 43 seconds
  • 2018 October 21 22:54 UTC – internal monitoring alert about any conflicts (but I consider this unlikely)
  • 2018 October 21 23:00 UTC – issue resolved, go home

The difference is mostly because RavenDB was designed to live in this kind of environment: deployed in multiple data centers and actually handling, in the real world and with very little assistance, the task of keeping applications up and running without blowing things up. It is quite literally one of the basic building blocks we have, so it shouldn’t be surprising that we are pretty good at it.

Graphs in RavenDB: Real world use cases

time to read 6 min | 1032 words

I talked a lot about how graph queries in RavenDB will work, but one missing piece of the puzzle is how they are going to be used. I’m going to use this post to discuss some of the options that the new feature enables. We’ll take the following simple model: issue tracking. Let’s imagine that a big (secret) project is going on and we need to track down the tasks for it. On the right, you have an image of the permissions graph that we are interested in.

The permission model is pretty standard, I think. We have the notion of users and groups. A user can be associated with one or more groups, and group memberships are hierarchical. An issue’s access is controlled either by giving access to a specific user or to a group. Assigning a group also grants access to all of the group’s parents and to any user associated with any of them.
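The model itself fits in a few lines of code. The sketch below uses invented document ids and my own guess at the post’s group layout (the original diagram image was lost): an issue grants access to its directly named users, plus to any user whose group appears in the upward closure (the group itself and all its ancestors via Parents links) of the issue’s groups.

```python
def group_closure(group_id, parents):
    """The group itself plus all ancestors reachable via Parents links."""
    seen, stack = set(), [group_id]
    while stack:
        g = stack.pop()
        if g not in seen:
            seen.add(g)
            stack.extend(parents.get(g, []))
    return seen

def has_access(user, issue, parents):
    if user["id"] in issue["users"]:
        return True                       # directly named, like Sunny
    reachable = set()
    for g in issue["groups"]:
        reachable |= group_closure(g, parents)
    return bool(set(user["groups"]) & reachable)

# Illustrative layout loosely following the post's example
parents = {"groups/project-x": ["groups/team-nati", "groups/execs"],
           "groups/execs": ["groups/board"]}
issue = {"users": ["users/sunny"], "groups": ["groups/project-x"]}

print(has_access({"id": "users/sunny", "groups": []}, issue, parents))                   # True
print(has_access({"id": "users/max", "groups": ["groups/project-x"]}, issue, parents))   # True
print(has_access({"id": "users/nati", "groups": ["groups/team-nati"]}, issue, parents))  # True
print(has_access({"id": "users/snoopy", "groups": ["groups/r-n-d"]}, issue, parents))    # False
```

The graph queries in the rest of this post express exactly this closure computation, but on the server and declaratively.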

Here is the issue in question:


Given the graph on the right, let’s see, for each user, how we can find out what issues they have access to.

We’ll start with Sunny, who is named directly in the issue as allowed access. Here is the query to find the issues on which we are directly named.


This was easy, I must admit. We could also write this without any need for graph queries, because it is so simple:


It gets a bit trickier when we have to look at Max. Unlike Sunny, Max isn’t directly named. Instead, he is a member in a group that is directly named in the issue. Let’s see how we can query on issues that Max has access to:


This is where things start to get interesting. If you look at the match clause, you’ll see that we have arrows that go both left and right. We are telling RavenDB that we want to find issues that have the same group (denoted as g in the query) as the user. This kind of query is one that you simply can’t express in RavenDB today without the graph query syntax. Another way to write this query is to use explicit clauses, like so:


In this case, all the arrows go in a single direction, but we have two clauses that are and-ed together. These two queries are exactly the same; in fact, RavenDB will translate the one with arrows going both ways into the one with the and.

The key here is that between any two clauses with an and, RavenDB will match the same alias to the same value.

So we now can find issues for Max and Sunny, but what about all the rest of the people? We can, of course, do this manually. Here is how we can find issues for people who are one group removed from the issue.


This query gives us all the issues for Nati that are one group removed from him. That works, but it isn’t how we want to do things. It is a good exercise in setting out the structure, though, because instead of hard coding the pattern, we are now going to use recursion.


The key to this query is the recursive element in the third line. We move from the issue to its groups, and then we recurse from the issue’s groups to their parents. We allow empty recursion and we follow all the paths in the graph.

On the other side, we go from the user and try to find a match from the user’s groups to any of the groups that end the recursion. In Nati’s case, we went from the project-x group to team-nati, which is a match, so we can return this issue.

Here is the final query that we can use, it is a bit much, even if it is short, so I will deconstruct it below:


We use a parameterized query here, to make things easier. We start from a user and find an issue if:

  • The issue directly names the user as allowed access. This is the case for Sunny.
  • The issue’s groups (or their parents) match the groups that the user belongs to (non recursively, mind).
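The query screenshots did not survive the capture of this post. As a very rough, hypothetical approximation of the experimental graph syntax from this period (the keywords and exact shape may differ from what RavenDB actually shipped), the combined query would look something along these lines:

```
with { from Users where id() = $uid } as u
match (Issues as i)-[Users]->(u) or
      ((i)-[Groups]->(Groups as g)
           -recursive (0, all) { [Parents]->(Groups as g) }
       and (u)-[Groups]->(g))
select i
```

Treat this as pseudocode for the two bullet points above rather than a runnable query.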

In the case of Max, for example, we have a match because the recursive element allows a zero length path, so the project-x group is allowed access and Max is a member of it.

In the case of Nati, on the other hand, we have to go from project-x to team-nati to get a match.

If we set $uid to users/23, which is Phoebe, all the way to the left in the graph, we’ll also have a match. We’ll go from project-x to execs to board and then find a match.

Snoopy, on the other hand, doesn’t have access. Look carefully at the direction of the arrows in the graph. Snoopy belongs to the r-n-d group, and that group is a child of execs, but the query we specified only goes up to the parents, not down to the children, so Snoopy is not included.

I hope that this post gave you some ideas about what kind of use cases are enabled by graph queries in RavenDB. I would love to hear about any scenarios you have for graph queries so we can play with them and see how they are answered by RavenDB.

The fear of an empty source file

time to read 4 min | 606 words

I have been writing software at this point for over twenty years, and I want to believe that I have learned a few things during that time.

And yet, probably the hardest thing for me is to start writing from scratch. If there is no code already there, it is all too easy to get lost in the details and not actually be able to get anywhere.

An empty source file is full of so many options, and any decision I make is going to have a very long lasting impact. Sometimes I look at the keyboard and just freeze, unable to proceed because I know, with 100% certainty, that whatever I produce isn’t going to be up to my own standards. In fact, it is going to suck, for sure.

I think that about 90% of the things I have written so far are things that I couldn’t write today. Not because I lack the knowledge, but because I have a far greater understanding of the problem space, and I know that trying to solve it all at once is such a big task that it isn’t possible for me to do so. What I sometimes need reminding of is that I did write those things, and eventually, they were able to accomplish all that was required of them.

A painter doesn’t just start by throwing paint on canvas, and a building doesn’t grow by people putting bricks wherever they feel like. In pretty much any profession, you need to iterate several times to get things actually done. A painter will typically do a drawing before actually putting paint on canvas; an architect will build a small scale model; etc.

For me, the hardest thing to do when I’m building something new is to actually allow myself to write it out as is. That means laying out the general structure of the code and ignoring all the other stuff that you must have in order to get to real production worthy code. This means flat out ignoring:

  • Error handling
  • Control of allocations and memory used
  • Selection of the underlying data structures and algorithms
  • Yes, that means that O(N^2) is just fine for now
  • Logging, monitoring and visibility
  • Commenting and refactoring the code for maintainability over time

All of these are important, but I literally can’t pay these taxes and build something new at the same time.

I like to think about the way I work as old style rendering passes. When I’m done with the overall structure, I’ll go back and add these details. Sometimes that can be a lot of work, but at that point, I actually have something to help me. At a minimum, I have tests that verify that things still work, and now I have a good understanding of the problem (and my solution), so I can approach things without having so many unknowns to deal with.

A large part of that is the fact that I didn’t pay any of the taxes of development. This usually means that the new thing is basically a ball of mud, but it is a small ball of mud, which means that if I need to change things around, I have to touch fewer moving parts. A lot fewer, actually. That allows me to explore and figure out what works and what doesn’t.

It also goes directly against all of my instincts and can be really annoying. I really want to write a certain piece of code properly, but focusing on perfecting a single door knob means that the whole structure will never see the light of day.

Graphs in RavenDB: Recursive queries

time to read 3 min | 565 words

Graph queries, as I have discussed them so far, give you the ability to search for patterns. On the right, you can see the family tree of the royal family of Great Britain, going back a few hundred years. That makes for an interesting subject for practicing graph queries.

A good question we might want to ask is who is the royal grandparent of Elizabeth II. We can do that using:


This is great, and nicely demonstrates how we can scan for specific patterns in the graph. However, it is limited by its rigidity. For example, let’s say that I want to find someone in the family tree when I’m not sure about the exact nature of the relationship.

“We are not amused” comes to mind, but off the top of my head and without consulting the chart, I don’t think that I would be able to figure it out. Luckily, I don’t have to, I can ask RavenDB to be so kind and tell me.


Note the use of the recursive element here. We are asking RavenDB to start at a particular document and go up the Parents links, trying to find an unamused royal. The recursion portion of the query can be zero to six steps in size and will abort as soon as we have any match. Somewhere along the zero to six parents, there should be one who is both a royal and unamused.

The Cypher syntax for what they call variable length queries is reminiscent of regular expressions, and I don’t mean that in a complimentary manner. Looking at the query above, you might have noticed that there is a distinct difference between it and the first one. The recursive query will go up the Parents link regardless of whether the parent is royal or not. RavenDB graph queries have what I believe to be a unique feature: the recursive pattern isn’t limited to a single step and can be as complex as you like.

For example, let’s ensure that we are only going to go up the chain of the royal parents.


The recursive element has a few knobs that you can tweak. The minimum and maximum distance are obvious examples, but the result criteria for the recursion are also interesting. In this query, we use shortest instead of lazy. This makes RavenDB work a bit harder and find the shortest recursive path that matches the query, whereas lazy stops on the first one that matches. The following options are available:

  • Lazy – stop on the first path that matches the pattern. Good for: “Am I related to Victoria?”
  • Shortest – find the shortest path that matches the pattern. Good for: “How am I related to Victoria?”
  • Longest – find the longest path that matches the pattern. Good for: “For how many generations has Victoria’s family been royal?”
  • All – find all the paths that match the pattern. Good for: finding every path in your ancestry that leads to Victoria.
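These strategies map onto familiar path-search behaviors. Here is a toy sketch over a parent graph (the family data is invented, and this is plain path enumeration, not RavenDB’s engine) showing the difference between taking the first path, the shortest, the longest, or all of them:

```python
def all_paths(start, goal, parents):
    """Every upward path from start to goal following parent links."""
    if start == goal:
        return [[start]]
    paths = []
    for p in parents.get(start, []):
        for rest in all_paths(p, goal, parents):
            paths.append([start] + rest)
    return paths

# invented mini-tree: two routes from 'me' up to 'victoria'
parents = {"me": ["mother", "father"],
           "mother": ["victoria"],
           "father": ["grandpa"],
           "grandpa": ["victoria"]}

paths = all_paths("me", "victoria", parents)   # the "all" strategy
lazy = paths[0]                                # first match found
shortest = min(paths, key=len)                 # "how am I related?"
longest = max(paths, key=len)                  # "how many generations?"
print(shortest)    # -> ['me', 'mother', 'victoria']
print(longest)     # -> ['me', 'father', 'grandpa', 'victoria']
print(len(paths))  # -> 2
```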

