Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:


+972 52-548-6969

, @ Q j

Posts: 6,739 | Comments: 48,778

filter by tags archive

Use cases for MADV_DONTNEED in Voron

time to read 2 min | 341 words

The rant in this video is an absolute beautiful one. I run into this rant figuring out how MADV_DONTNEED work and I thought I would give some context on why the behavior is exactly what I want. In fact, you can read my reasoning directly in the Linux Kernel source code.

During a transaction, we need to put the memory pages modified by the transaction somewhere. We put them in temporary storage which we call scratch. Once a transaction is committed, we still use the scratch memory for a short period of time (due to MVCC) and then we flush these pages to the data file. At this point, we are never going to use the data on the scratch pages again. Just leaving them around means that the kernel needs to write them to the file they were mapped from. Under load, that can actually be a large portion of the I/O the system is doing.

We handle that by tell the OS that we don’t actually need this memory and it should throw it away and not write it to disk using MADV_DONTNEED. We are still checking whatever this cause us excessive reads when we do that (when the kernel tries to re-read the data from disk).

There are things that seems better, though. There is MADV_REMOVE, which will do the same and also zero (efficiently) the data on disk if needed, so it is not likely to cause page faults when reading it back again. The problem is that this is limited to certain file systems. In particular, SMB file systems are really common for containers, so that is something to take into account.

MADV_FREE, on the other hand, does exactly what we want, but will only work on anonymous maps. Our scratch files use actual files, not anonymous maps. This is because we want to give the memory a backing store in the case of memory overload (and to avoid the wrath of the OOM killer). So we explicitly define them as file (although temporary ones). h

RavenDB Node.JS client updated (now supporting subscription)

time to read 1 min | 85 words

The node.js RavenDB client had a milestone release last month, which I only now got to talking about. The most important factor about this release is that now node.js users can use subscriptions to process data from RavenDB.

Here is how you can do this:

This is one of the most powerful abilities RavenDB have, you will continuously get new orders in your script and be able to process them. This make it very easy to build features such as data pipelines, background processing, etc.

Analyzing the GitHub outage

time to read 8 min | 1516 words

imageA couple of weeks ago, GitHUb had a major outage, lasting over 24 hours and resulted in wide spread disruption of many operations for customers. A few days after everything was fixed, they posted their analysis on what happened, which makes for a really good read.

The pebble that started all of this was a connection disruption that lasted 43 seconds(!). A couple of months ago I talked about people who say that you can assume that distributed failures are no longer meaningful. The real world will keep serving up examples of weird / strange / nasty stuff to your productions systems, and you need to handle that. Quoting from the original post:

Therefore: the question becomes: how much availability is lost when we guarantee consistency? In practice, the answer is very little. Systems that guarantee consistency only experience a necessary reduction in availability in the event of a network partition. As networks become more redundant, partitions become an increasingly rare event. And even if there is a partition, it is still possible for the majority partition to be available. Only the minority partition must become unavailable. Therefore, for the reduction in availability to be perceived, there must be both a network partition, and also clients that are able to communicate with the nodes in the minority partition (and not the majority partition). This combination of events is typically rarer than other causes of system unavailability.

So no, not really. There is a good point here on the fact only the minority portion of the system must become unavailable, but given typical production deployment, any disconnect between data centers will cause a minority portion to be visible to clients and become unavailable.

The actual GitHub issues that are discussed in the post are a lot more interesting. First, we have the obvious problem that most applications assume that their database access is fast and they make multiple such calls during the processing of a single request (sometimes, many calls). This is just another example of the Fallacies of Distributed Computing in action. RavenDB has a builtin detection for that and a host of features that allow you to go to the database server once, instead of multiple times. In such a case, even if you need to failover to a remote server, you won’t pay the roundtrip costs multiple times.

However, this is such a common problem that I don’t think that it deserve much attention. There isn’t much that you can do about it without careful consideration and support from the whole stack. Usually, this happens on projects when you have a strong leader that institute a performance budget and enforce that. This has costs of its own and usually it is cheaper to just not failover across data center boundaries.

The next part that I find really interesting is that the system that GitHub uses for managing topologies is not consistent but is required to be. The problem is that there is an inherent delay between their orchestrator re-organizing the cluster after a failure and when the failure actually occurs. That would have been fine, if they had a way to successfully merge histories, but that is not the case. In fact, looking at just the information that they have published (and ignoring that I have the benefit of hindsight) the issue is glaringly obvious.

A deep dive (and a fascinating read) into how GitHub handles high availability talks about the underlying details and expose the root cause. You cannot layer distinct distributed architectures on top of one another and expect to come up with a good result. Here is what happens in a master crash scenario:

In a master crash scenario:

  • The orchestrator nodes detect failures.
  • The orchestrator/raft leader kicks off a recovery. A new master gets promoted.
  • orchestrator/raft advertises the master change to all raft cluster nodes.
  • Each orchestrator/raft member receives a leader change notification. They each update the local Consul’s KV store with the identity of the new master.
  • Each GLB/HAProxy has consul-template running, which observes the change in Consul’s KV store, and reconfigures and reloads HAProxy.
  • Client traffic gets redirected to the new master.

I read this and feel a bit queasy, because the master crash scenario is not the interesting bit. That is the easy part. The really hard part is how you manage things when you have a network disruption, with both sides still up and functioning. In fact, that is exactly what happened to GitHub. In this case, on the minority side, their orchestrator cannot get a majority (so cannot make any forward process). However, the rest of the system cannot proceed, the whole thing stops at either the first or second stage.

That means that the rest of the system will continue to write to the old master, resulting in a conflict. And this is where things gets complicated. The issue here is that with MySQL (and most other systems that relies on log replication) you must have a single master at any given time. That is an absolute requirement. If you got to the point where you had two writes with divergent histories, you are in for selecting which one you’ll accept (and what data you’ll discard) and trying to manually fix things after the fact.

The proper way to handle something like this would have been to use Raft to actually send the commands themselves to the server. This ensures a consistent set of statements that run in the same order for all servers. Rqlite is a great example of this, where you can get consistent and distributed system on top of individual components. That would be the proper way to do it, mind, not the way anyone would do it.

You wouldn’t be able to get any reasonable performance from the system using this kind of approach. Rqlite, for example, talks about being able to get 10 – 200 operations per second. I’m going to assume that GitHub has a need for something better than that. So the underlying distributed architecture looks like this:

  • MySQL write master with multiple read-only secondaries using binlog.
  • Orchestrator to provide health monitoring and selection of new write primary (consistent using Raft)
  • Underlying infrastructure that uses (different) Raft to store routing configuration.

If you break Orchestrator’s ability to make decisions (easy, just create a partition), you take away the ability to change the write master, and if the failure mode you are dealing with is not a failed master (for example, you have partition) you are going to accept new writes to the old master.  That breaks completely the whole idea of binlog replication, of course, so you are sort of stuck at that point. In short, I think that Orchestrator is something that was meant to solve an entirely different problem, it was meant to deal with the failure of a single node, not to handle a full data center partition.

When looking at such incidents, I always compare to what would have happened if RavenDB was used instead. This is not really fair in this case because RavenDB was designed upfront to be a distributed database. RavenDB doesn’t really have the concept a a write master. For simplicity’s sake, we usually try to direct all writes to a single node for each database, because this simplify how you usually work. However, but any node can accept writes and will distribute it to the rest of the nodes in the cluster. In a situation like the one GitHub faced, both sides of the partition would keep accepting writes (just like happened in GitHub’s case with MySQL).

The difference is what will happen when the partition is healed. Both sides of the partition will update the other with the data that is missing on the other side. Any conflicting writes (by which I mean writes on both sides of the partition to the same document or documents) will be detected and resolved automatically. Automatic resolution is very important to keeping everything up and running. This can be a custom resolution policy defined by the user or arbitrary by RavenDB. Regardless of the conflict resolution policy, the administrator will be notified about the conflicts and can review the actions taken by RavenDB and decide what to do about that.

In GitHub’s case, their busiest cluster had less than a thousand writes in the time period in question. Most of which aren’t going to conflict. I would expect the timeline with RavenDB to be:

  • 2018 October 21 22:52 UTC – initial network partition, lasting 43 seconds
  • 2018 October 21 22:54 UTC – internal monitoring alert about any conflicts (but I consider this unlikely)
  • 2018 October 21 23:00 UTC – issue resolved, go home

The difference is mostly because RavenDB was designed to live in these kind of environment, deployed in multiple data centers and actually handling, in the real world and with very little assistance, the task of keeping applications up and running without blowing things up. It is quite literally one of the basic building blocks we have, so it shouldn’t be surprising that we are pretty good at it.

Graphs in RavenDBReal world use cases

time to read 6 min | 1032 words

imageI talked a lot about how graph queries in RavenDB will work, but one missing piece of the puzzle is how they are going to be used. I’m going to use this post to discuss some of the options that the new features enables. We’ll take the following simple model, issue tracking. Let’s imagine that a big (secret) project is going on and we need to track down the tasks for it. On the right, you have an image of the permissions graph that we are interested in.

The permission model is pretty standard I think. We have the notion of users and groups. A user can be associated to one or more groups. Groups memberships are hierarchical. An issue’s access is controlled either by giving access to a specific users or to a group. The act of assigning a group will also allow access to all the group’s parents and any user that is associated to any of them.

Here is the issue in question:


Given the graph on the right, let’s see, for each user, how we can find out what issues they have access to.

We’ll start with Sunny, who is named directly in the issue as allowed access. Here is the query to find the issues on which we are directly named.


This was easy, I must admit. We could also write this without any need for graph queries, because it is so simple:


It gets a bit trickier when we have to look at Max. Unlike Sunny, Max isn’t directly named. Instead, he is a member in a group that is directly named in the issue. Let’s see how we can query on issues that Max has access to:


This is where things gets start to get interesting. If you’ll look into the match clause, you’ll see that we have arrows that go both left and right. We are telling RavenDB that we want to find issues that have the same group (denoted as g in the query) as the user. This kind of query you already can’t express with RavenDB right now without the graph query syntax. Another way to write this query is to use explicit clauses, like so:


In this case, all the arrows go in a single direction, but we have two clauses that are being and together. These two queries are exactly the same, in fact, RavenDB will translate the one with arrows going both ways to the one with the and.

The key here is that between any two clauses with an and, RavenDB will match the same alias to the same value.

So we now can find issues for Max and Sunny, but what about all the rest of the people? We can, of course, do this manually. Here is how we can find issues for people a group removed from the issue.


This query gives us all the issues for Nati that are one group removed from him. That works, but it isn’t how we want to do things. It is a good exercise in setting out the structure, though. Because instead of hard coding the pattern, we are now going to use recursion.


The key for this query is the recursive element in the third line. We moved from the issue to its groups, and then we recurse from the issues’ groups to their parents. We allow empty recursion and we follow all the paths in the graph.

On the other side, we go from the user and try to find a match from the user’s group to any of the groups that end the query. In Nati’s case, we went from project-x group to team-nati, which is a match, so we can return this issue.

Here is the final query that we can use, it is a bit much, even if it is short, so I will deconstruct it below:


We use a parameterized query here, to make things easier. We start from a user and find an issue if:

  • The issue directly name the user as allowed access. This is the case for Sunny.
  • The issue’s groups (or their parents) match the groups that the user belong to (non recursively, mind).

In the case of Max, for example, we have a match because the recursive element allows a zero length path, so the project-x group is allowed access and Max is a member of it.

In the case of Nati, on the other hand, we have to go from project-x to team-nati to get a match.

If we’ll set $uid to users/23, which is Pheobe, all the way to the left in the graph, we’ll also have a match. We’ll go from project-x to execs to board and then find a match.

Snoopy, on the other hand, doesn’t have access. Look carefully at the direction of the arrows in the graph. Snoopy belongs to the r-n-d group, and that groups is a child of exces, but the query we specified only go up to the parents, not the children, so Snoopy is not included.

I hope that this post gave you some ideas about what kind of use cases are enabled by graph queries in RavenDB. I would love to hear about any scenarios you have for graph queries so we can play with them and see how they are answered by RavenDB.

The fear of an empty source file

time to read 4 min | 606 words

imageI have been writing software at this point for over twenty years, and I want to believe that I have learned a few things during that timeframe.

And yet, probably the hardest thing for me is to start writing from scratch. If there is no code already there, it is all too easy to get lost in the details and not actually be able to get anywhere.

An empty source file is full of so many options, and any decision that I’ll make is going to have very long lasting impact. Sometimes I look at the keyboard and just freeze, unable to proceed because I know, with a 100% certainty, that whatever I’ll produce isn’t going to be up to my own standards. In fact, it is going to suck, for sure.

I think that about 90% of the things I have written so far are stuff that I couldn’t write today. Not because I lack the knowledge, but because I have far greater understanding of the problem space and I know that trying to solve it all is such a big task that it is not possible for me to do so. What I need reminding, sometimes, is that I have written those things, and eventually, those things were able to accomplish all that was required of them.

A painter doesn’t just start by throwing paint on canvas, and a building doesn’t grow up by people putting bricks where they feel like. In pretty much any profession, you need to iterate several times to get things actually done. With painters, you’ll typically do a drawing before actually putting paint on canvas. With architects will build a small scale model, etc.

For me, the hardest thing to do when I’m building something new is to actually allow myself to write it out as is. That means, lay out the general structure of the code, and ignore all the other stuff that you must have in order to get to real production worthy code. This means flat our ignoring:

  • Error handling
  • Control of allocations and memory used
  • Select the underlying data structures and algorithms
  • Yes, that means that O(N^2) is just fine for now
  • Logging, monitoring and visibility
  • Commenting and refactoring the code for maintainability over time

All of these are important, but I literally can’t pay these taxes and build something new in the same time.

I like to think about the way I work as old style rendering passes. When I’m done with the overall structure, I’ll go back and add these details. Sometimes that can be a lot of work, but at that point, I actually have something to help me. At a minimum, I have tests that verify that things still work and now I have a good understanding of the problem (and my solution) so I can approach things without having so many unknown to deal with.

A large part of that is that the fact that I didn’t pay any of the taxes for development. This usually means that the new thing is basically a ball of mud, but it is a small ball of mud, which means that if I need to change things around, I have to touch fewer moving parts. A lot fewer, actually. That allow me to explore, figure out what works and doesn’t.

It is also going directly against all of my instincts and can be really annoying. I really want to do a certain piece of code properly, but focusing on perfecting a single door knob means that the whole structure will never see the light of day.

Graphs in RavenDBRecursive queries

time to read 3 min | 565 words

imageGraph queries as I discussed them so far gives you the ability to search for patterns. On the right, you can see the family tree of the royal family of Great Britain going back a few hundred years. That make for an interesting subject for practicing graph queries.

A good example we might want to ask is who is the royal grand parent of Elizabeth II. We can do that using:


This is great, and nicely demonstrate how we can scan for specific patterns in the graph. However, it is limited by its rigidity. For example, let’s say that I want to find someone in the family tree and I’m not sure about the exact nature of the relationship?

“We are not amused” comes to mind, but off the top of my head and without consulting the chart, I don’t think that I would be able to figure it out. Luckily, I don’t have to, I can ask RavenDB to be so kind and tell me.


Note the use of the recursive element here. We are asking RavenDB to start in a particular document and go up the parents, trying to find an unamused royal. The recursion portion of the query can be zero to six steps in size and should abort as soon as we have any match. Following the zero to six parents, there should be a parent that is both a royal an unamused.

The Cypher syntax for what they call variable length queries is reminiscent of regular expressions, and I don’t mean that in a complimentary manner. Looking at the query above, you might have noticed that there is a distinct difference between it and the first one. The recursive query will go up the Parents link, regardless of whatever that parent is royal or not. RavenDB Graph Queries has what I believe to be a unique feature. The recursive pattern isn’t limited to a single step and can be as complex as you like.

For example, let’s ensure that we are only going to go up the chain of the royal parents.


The recursive element has a few knows that you can tweak. The minimum and maximum distance, for example, are obvious examples, but the results criteria for the recursion is also interesting. In this query, we use the shortest, instead of the lazy. This will make RavenDB work a bit harder and find the shortest recursive path that matches the query, where as lazy stops on the first one that matches. The following options are available:

  • Lazy – stop on the first pattern that matches. Good for: “Am I related to Victoria?”
  • Shortest – find the shortest path that match the pattern. Good for: “How am I related to Victoria?”
  • Longest – find the longest path that match the pattern. Good for: “For how many generations has Victoria’s family been royals?”
  • All – find all the paths that match the pattern. Good for if you have multiple paths in your ancestry to Victoria.

Graphs in RavenDBInconsistency abhorrence

time to read 3 min | 462 words

In my previous post, I discussed some options for changing the syntax of graph queries in RavenDB from Cypher to be more in line with the rest of the RavenDB Query Language. We have now completed that part and can see the real impact it has on the overall design.

In one of the design review, one of the devs (who have built non trivial applications using Neo4J) complained that the syntax is now much longer. Here are the before and after queries to compare:


The key, from my perspective, is that the new form is more explicit and easier to read after the fact. Queries tend to grow more complex over time, and they are being read a lot more often than written). As such, I absolutely want to lean toward being readable over being terse.

The example above just show the extra characters that you need to write. Let’s talk about something that is a bit more complex:


Now we have a lot more text, but it is a lot easier to understand what is going on. Focus especially on the Lines edge, where we can very clearly separate what constitute the selection on the edge, the filter on the edge and what is the property that contains the actual linked document id.

The end result is that we now have a syntax that is a lot more consistent and approachable. There are other benefits, but I’ll show them off in the next post.

A major source of annoyance for me with this syntax was how to allow anonymous aliases. In the Cypher syntax we used, you could do something like:


There is a problem with how to express this kind of syntax of anonymous aliases with the Collection as alias mode. I initially tried to make it work by saying that we’ll look at the rest of the query and figure it out. But that just felt wrong. I didn’t like this inconsistency. I want a parse tree that I can look at in isolation and know what is going on. Simplifying the language is something that pays dividends over time, so I eventually decided that the query above will look this with the next syntax:


There is a lot of precedence of using underscore as the “I don’t care” marker, so that works nice and resolves any ambiguities in the syntax.

Graphs in RavenDBSelecting the syntax

time to read 5 min | 981 words

When we started building support for graph queries inside RavenDB, we looked at what is the state of the market in this regard. There seems to be two major options: Cypher and Gremlins. Gremlins is basically a fluent interface that represent a specific graph pattern while Cypher is a more abstract manner to represent the graph query. I don’t like Gremlins, and it doesn’t fit into the model we have for RQL, so we went for the Cypher syntax. Note the distinction between went for Cypher and went for Cypher syntax.

One of the major requirements that we have is fitting in into the pre-existing Raven Query Language, but the first concern we had was just getting started and getting some idea about our actual scenarios. We are now at the point where we have written a bunch of graph queries and got a lot more experience in how it mesh into the overall environment. And at this point, I can really feel that there is an issue in meshing Cypher syntax into RQL. They don’t feel the same at all. There are a lot of good ideas there, make no mistake, but we want to create something that would flow as a cohesive whole.

Let’s look at some of our queries and how we can better express them. The one I talked to about the most is this:


Let see what we have here:

  • match is the overall clause that apply a graph pattern query to the dataset.
  • () – is an indication of a node in the graph.
  • [] – is an indication of an edge.
  • a:Dogs, l:Likes and b:Dogs – this is an alias and a path specification.
  • -[]-> – is an indication of an edge between two nodes
  • (expression) – is a filter on a node or an edge

I’m ignoring the select statement here because it is just the usual RQL select statement.

The first thing that keeps biting us is the filter in (a:Dogs (id() = 'dogs/arava')), I keep being tripped by missing the closing ), so that has got to go. Luckily, is it very obvious what to do here:


We use an explicit where clause, instead of the () to express the inline filter. This fits a lot more closely with how the rest of RQL works.

Now, let’s look at the aliases: (b:Dogs). The alias:Collection syntax is pretty foreign to RQL, we tend to use the Collection as alias syntax. Let’s see how that would look like, shall we?


This looks a lot more natural to me, and it is a good fit into RQL in general. This syntax does bring a few things to the table. In particular, look a the edge. In Cypher, an anonymous edge would be: [:Likes], and using this method, we will have just [Likes].

However, as nice as this syntax is, we still run into a problem. The query above is actually just a shorthand way to write the full query, which looks like so:


In fact, we have two queries here, to show off the actual problem we have in parsing. In the first case, we have a match clause the only refers to explicit with statement. On the second case, we have a couple of explicit with statements, but also an implicit with edges expression (the Likes).

From the point of view of the parser, we can’t distinguish those two. Now, we can absolutely say that if the edge expression contains a single name, we’ll simply look for an edge with that name and otherwise assume that this is the path that will be used.

But this seems to be error prone, because you might have a small typo or remove a edge statement and get a completely different (and unexpected) meaning. I thought about adding some sort of prefix to help tell an alias from an implicit definition, but that looks very ugly, see:


And on the other hand, I really like the –[Likes]-> syntax in general. It is a lot cleaner and easier to read.

At this point, I don’t have a solution for this. I think we’ll go with the mode in which we can’t tell what the query is meant to say just from the parser, and look at the explicit with statements to figure it out (with the potential for mistakes that I pointed out earlier) until we can figure out something better.

One thing that I’m thinking about is that the () and [] which help distinguish between nodes and edges, aren’t actually required for us if we have an explicit statement. So we can write it like so:


In this manner, we can tell, quite easily, if you meant to define an implicit edge / node or refers to an explicitly defined alias. I’m not sure whatever this would be a good idea, though.

Another issue we have to deal with is:


Note that in this case, we have a filter expression on the edge as well. Applying the same process we have done so far, we get:


The advantages here is that this is very clear and obvious about what is going on. The disadvantage is that this takes quite a bit longer to express.

Graphs in RavenDBWhat’s the role of the middle man?

time to read 2 min | 279 words

imageAn interesting challenge with implementing graph queries is that you sometimes get into situations where the correct behavior is counter intuitive.

Consider the case of the graph on the right and the following query:


This will return:

  • Source: Arava, Destination: Oscar

But what would be the value of the Edge property? The answer to that is… complicated.  What we actually return is the edge itself. Let’s see what I mean by that.


And, indeed, the value of Edge in this query is going to be dogs/oscar.


This isn’t very helpful if we are talking about a simple edge like this. After all, we can deduce this from the Src –> Destination pair. This gets more interesting when the edge is more complex. Consider the following query:


What do you this should be the output here? In this case, the edge isn’t the Product property, it is the specific line that match the filter on the edge. Here is what the result looks like:


As you can imagine, knowing exactly what edge led you from one document to another can be very useful when you look at the query results.

Graphs in RavenDBI didn’t mean to build this feature!

time to read 2 min | 258 words

I was busy working on the implementation on filtering in graph queries, as discussed in my previous post. What I ended up implementing is a way for the user to tell us exactly how to handle the results. The actual query we ended up with is this:


And the key part here is the where clause, were we state that a and c cannot be the same dog. This also matches the behavior of SQL, and for that reason allow (predictably), that’s a good idea.

However, I didn’t just implement inequity, I implement full filtering capabilities, and you can access anything in the result. Which means that this query is now also possible:


I’ll let you a moment to analyze this query in peace. Try to de-chyper it (pun intended).

What this  query is doing is to compare the actual sale price and the regular price of product on a particular order, for products that match a particular set of categories.

This is a significant query because, for the first time in RavenDB, you have the ability to perform such a query (previous, you would have had to define a specific index for this query).

In other words, what graph query filtering brings to the table is joins. And I did not set out to build this feature and I’m feeling very strange about it.


No future posts left, oh my!


  1. Graphs in RavenDB (11):
    08 Nov 2018 - Real world use cases
  2. Challenge (54):
    28 Sep 2018 - The loop that leaks–Answer
  3. Reviewing FASTER (9):
    06 Sep 2018 - Summary
  4. RavenDB 4.1 features (12):
    22 Aug 2018 - MongoDB & CosmosDB Migration Wizards
  5. Reading the NSA’s codebase (7):
    13 Aug 2018 - LemonGraph review–Part VII–Summary
View all series


Main feed Feed Stats
Comments feed   Comments Feed Stats