Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 4 min | 621 words

So far we have tackled the idea of a large compute cluster and a large storage cluster. I mentioned that the problem with the large storage cluster is that it doesn’t handle consistency within itself. Two concurrent requests can hit two storage nodes and make concurrent operations that aren’t synchronized with each other. That is usually a good thing, since that is what you want for a high-throughput system. The less coordination you can get away with, the more you can actually do.

So far, so good, but that isn’t always suitable. Let us consider a case where we need a consistent approach, for some business reason. The typical example would be transactions in a bank, but I hate this example, because in the real world banks deal with inconsistency all the time; it is an explicit part of their business model. Let us talk about auctions and bids instead. We have an auction service, which allows us to run a large number of auctions.

For each auction, users can place bids, and it is important for us that bids are always processed sequentially per auction, because we have to know who placed a bid that was immediately rejected ($1 commission) or a winning bid that was later overbid (no commission except for the actual winner). We’ll leave aside the fact that this is something that we can absolutely figure out from the auction history, and say that we need it to be immediate and consistent. How do we go about doing this?

Remember, we have enough load on the system that we are running a cluster with a hundred nodes in it. The rough topology is still this:


We have the consensus cluster, which decides on the network topology. In other words, it decides which set of servers is responsible for which auction. What happens next is where it gets interesting.

Instead of just a set of cooperating nodes that share the data between them, each of which can accept both reads and writes, we are going to twist things a bit. Each set of servers is its own consensus cluster for that particular auction. In other words, we first go to the root consensus cluster to get the topology information, then we add another command to the auction’s log. That command goes through the same distributed consensus algorithm between the three nodes. The overall cluster is composed of many consensus clusters, one per auction.
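The two-step flow can be sketched as follows. This is a minimal illustration, not the post’s actual implementation: the names (`RootCluster`, `place_bid`, the replica sets) are made up, and the per-auction Raft/Paxos round is reduced to a plain log append just to show the sequential ordering.

```python
class RootCluster:
    """Stand-in for the root consensus cluster: it owns the topology,
    i.e. which replica set is responsible for which auction."""

    def __init__(self, topology):
        # topology: auction id -> list of node names, e.g. ["D", "E", "U"]
        self.topology = topology

    def replicas_for(self, auction_id):
        return self.topology[auction_id]


def place_bid(root, auction_logs, auction_id, bid):
    """Step 1: ask the root cluster which nodes own the auction.
    Step 2: append the bid command to that auction's own log.
    In a real system, step 2 would be a consensus round among the
    replicas; here it is a plain append, preserving sequential order."""
    replicas = root.replicas_for(auction_id)
    log = auction_logs.setdefault(auction_id, [])
    log.append(bid)
    return replicas, len(log)  # len(log) is the bid's sequence number


root = RootCluster({"auctions/123": ["D", "E", "U"]})
logs = {}
replicas, seq = place_bid(root, logs, "auctions/123", {"user": "a", "amount": 10})
```

Because every bid for a given auction funnels through that auction’s single log, the “who bid second” question the commission model needs has an unambiguous answer.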

This means that we have a fully consistent set of operations across the entire cluster, even in the presence of failure. Which is quite nice. The problem here is that you have to have a good way to distinguish between the different consensuses. In this case, the auction is the key for each consensus, but it isn’t always so easy to make such a distinction, and it is important that an auction cannot grow large enough to overwhelm the set of servers that it is actually using. In those cases, you can’t really do much beyond relaxing the constraints and going about it in a different manner.

For optimization purposes, you usually don’t run an independent consensus for each of the auctions. Or rather, you do, but you make sure to share the same communication resources, so for auctions/123 the nodes are D, E, U with E being the leader, while for auctions/321 the nodes are also D, E, U but U is the leader. This gives you the ability to spread processing power across the cluster, and the communication channels (TCP connections, for example) are shared between both auctions’ consensuses.
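One way to picture the leader spreading is to derive a leader deterministically from the auction id. This is purely illustrative; in a real Raft group the leader is elected, not computed, and `leader_for` is an invented name.

```python
import hashlib

def leader_for(auction_id, replicas):
    """Deterministically pick one node of the shared replica set, so that
    leadership for different auctions spreads across the same machines."""
    digest = hashlib.sha256(auction_id.encode("utf-8")).digest()
    return replicas[int.from_bytes(digest[:4], "big") % len(replicas)]

replicas = ["D", "E", "U"]
leaders = {a: leader_for(a, replicas) for a in ("auctions/123", "auctions/321")}
# Both auctions share nodes D, E, U, but their leaders may differ,
# which spreads processing load while reusing the same TCP connections.
```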

time to read 3 min | 464 words

In my previous post, I talked about how we can design a large cluster for compute bound operations. The nice thing about this is that the actual amount of shared data that you need is pretty small, and you can just distribute that information among your nodes, then let them do stateless computation on it, and you are done.

A much more common scenario is when we can’t just do stateless operations, but need to keep track of what is actually going on. The typical example is a set of users changing data. For example, let us say that we want to keep track of the pages each user visits on our site. (Yes, that is a pretty classic Big Table scenario; I’ll ignore the prior art issue for now.) How would we design such a system?

Well, we still have the same considerations. We don’t want a single point of failure, and we want to have a very large number of machines and make the most of their resources.

In this case, we are merely going to change the way we look at the data. We still have the following topology:


There is the consensus cluster, which is responsible for cluster-wide, immediately consistent operations. And there are all the other nodes, which actually handle processing requests and keeping the data.

What kind of decisions do we get to make in the consensus cluster? Those would be:

  • Adding & removing nodes from the entire cluster.
  • Changing the distribution of the data in the cluster.

In other words, the state that the consensus cluster is responsible for is the entire cluster topology. When a request comes in, the cluster topology is used to decide which set of nodes to direct it to.

Typically in such systems, we want to keep the data on three separate nodes, so when we get a request, we route it to one of the three nodes that hold the relevant data. This is done by sharding the data according to the actual user id whose page views we are trying to track.

Distributing the sharding configuration is done as described in the compute cluster example, and the actual handling of requests, or sending the data between the sharded instances is handled by the cluster nodes directly.
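The post doesn’t prescribe a particular sharding scheme, but one simple way to map a user id onto a fixed set of three nodes is rendezvous (highest-random-weight) hashing. A sketch, with made-up node names:

```python
import hashlib

def nodes_for_user(user_id, nodes, copies=3):
    """Rendezvous hashing: every node 'scores' the key, and the top
    `copies` scores win. The same user id always maps to the same three
    nodes, and removing a node only reshuffles the keys it held."""
    def score(node):
        return hashlib.sha256(f"{node}:{user_id}".encode("utf-8")).digest()
    return sorted(nodes, key=score, reverse=True)[:copies]

cluster = [f"node-{i:02d}" for i in range(100)]
replicas = nodes_for_user("users/42", cluster)
```

Any node receiving a request can run the same computation against the topology it got from the consensus cluster and arrive at the same three replicas.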

Note that in this scenario, you cannot ensure any kind of safety. Two requests for the same user might hit different nodes and do separate operations without being able to take the concurrent operation into account. Usually, that is a good thing, but it isn’t always the case. That, however, is an issue for the next post.

time to read 5 min | 844 words

I’m using a 100/99 node cluster as the example, but the discussion also applies to smaller clusters (dozens of nodes) and bigger ones (hundreds or thousands). Pretty much the only reason that you want to go with clusters of that size is that you want to scale out your processing in some manner. I’ve already discussed why a hundred node cluster isn’t a good option for safety reasons.

Consensus algorithms create a single consensus in the entire cluster, usually about an ordered set of operations that are fed to a state machine. The easiest such example would be a dictionary. But it makes no sense to have a single dictionary spread across a hundred nodes. Why would you need to do that? How would it let you make full use of all the power of all those nodes?

Usually nodes are used for either computing or storage purposes. Computing is much easier, so let us take that as a good example. A route calculation system needs to do a lot of computation on a relatively small amount of information (the map data). Whenever there is a change in the map (route blocked, new road opened, etc.), it needs to send the information to all the servers, and make sure that it isn’t lost.

Since calculating routes is expensive (we’ll ignore the options for optimizations and caching for now), we want to scale it to many nodes. And since the source data is relatively small, each node can have a full copy of the data. Under this scenario, the actual problem we have to solve is how to ensure that once we save something to the cluster, it is propagated to the entire cluster.

The obvious way to do this is with a hierarchy:


Basically, the big icons are the top cluster, each of which is responsible for updating a set of secondary servers, which is then responsible for updating the tertiary servers.

To be perfectly honest, this looks nice, and even reasonable, but it is going to cause a lot of issues. Sure, the top cluster is resilient to failures, but relying on a node being up to notify other nodes isn’t so smart. If one of the nodes in the top cluster goes down, then about 20% of our cluster doesn’t get the notification, which kind of sucks.

A better approach would be to go with a management cluster and background gossip:


In other words, the actual decisions are made by the big guys (literally, in this picture). This is a standard consensus cluster (Paxos, Raft, etc.). Once a decision has been made by the cluster, we need to send it to the rest of the nodes in the system. We can do that either by just sending the messages to all the nodes, or by selecting a few nodes and having them send the messages to their peers. The protocol for that is something like: “What is the last command id you have? Here is what I have after that.” Assuming that each processing node is connected to a few other servers, that means that we can spread the information very quickly through the entire cluster. And even if there are errors, the gossiping servers will correct them (note that there is an absolute order of the commands, ensured by the consensus cluster, so there isn’t an issue about agreeing on it, just distributing the data).

Usually the gossip topology follows the actual physical distribution. So the consensus cluster will notify a couple of servers on each rack, and let the servers in the rack gossip among themselves about the new values.
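Because the consensus cluster fixes an absolute order of commands, catching a peer up is just copying the tail of its log. A minimal sketch of one gossip round between two rack peers (the data structures here are assumptions for illustration, not a wire protocol):

```python
def gossip_exchange(log_a, log_b):
    """One round of 'what is the last command id you have? here is what
    I have after that.' Logs are lists of commands in the absolute order
    decided by the consensus cluster, so the shorter log is always a
    prefix of the longer one and can simply take the missing tail."""
    for src, dst in ((log_a, log_b), (log_b, log_a)):
        if len(src) > len(dst):
            dst.extend(src[len(dst):])

rack_peer_1 = ["cmd-1", "cmd-2", "cmd-3"]  # heard from the consensus cluster
rack_peer_2 = ["cmd-1"]                    # lagging behind
gossip_exchange(rack_peer_1, rack_peer_2)
# rack_peer_2 has now caught up through cmd-3.
```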

This means that once we send a command to the cluster, the consensus would agree on it, then we would distribute it to the rest of the nodes. There is a gap between the consensus confirming a command and its actual distribution to all the nodes, but that is expected in any distributed system. If it is important to apply a command at the same time across the entire cluster, the command is usually time activated (which requires clock sync, but that is something that we can blame on the ops team, so we don’t care :-) ).

With this system, we can have an eventually consistent set of data across the entire cluster, and we are happy.

Of course, this is something that is only relevant for compute clusters, the kind of thing where you compute a result, return it to the client, and that is about it. There are other types of clusters, but I’ll talk about them in my next post.

time to read 4 min | 705 words

The question crossed my desk, and it was interesting enough that I felt it deserved a post. The underlying scenario is this: we have distributed consensus protocols that are built to make sure that we can properly arrive at a decision and have the entire cluster follow it, regardless of failure. Those are things like Paxos or Raft. The problem is that those protocols are all aimed at a relatively small number of nodes, typically 3 – 5. What happens if we need to manage a large number of machines?

Let us assume that we have a cluster of 99 machines. What would happen under this scenario? Well, all consensus algorithms work on top of the notion of a quorum: at least (N/2+1) machines have to have the same data. For a 3 node cluster, that means that any decision that is on 2 machines is committed, and for a 5 node cluster, it means that any decision that is on 3 machines is committed. What about 99 nodes? Well, a decision would have to be on 50 machines to be committed.

That means making 196 requests (98 x 2: once for the command, then for the confirmation) for each command. That… is a lot of requests. And I’m not sure that I want to see what it would look like in terms of performance. So just scaling things out in this manner is out.

In fact, this is also a pretty strange thing to do. The notion of distributed consensus is that you reach agreement on a state machine. The easiest way to think about it is that you reach agreement on a set of values among all nodes. But why would you share those values among so many nodes? It isn’t for safety, that is for sure.

Assume that we have a cluster of 5 nodes, with each node having 99% availability (which translates to about 3.5 days of downtime per year). The probability of all 5 nodes being up at once is 95%, or about 18 days of downtime a year.

But we don’t need them to all be up. We just need any three of them to be up. That means that the math is going to be much nicer for us (see here for an actual discussion of the math).

In other words, here are the availability numbers if each node has a 99% availability:

Number of nodes | Quorum | Availability
3               | 2      | 99.97% (~2.5 hours of downtime per year)
5               | 3      | 99.999% (5 nines, ~5 minutes per year)
7               | 4      | 99.9999% (6 nines, ~12 seconds per year)
99              | 50     | ~100% (effectively always available)

Note that all of this is based on each node having about 3.5 days of downtime per year. If each node instead has 99.9% availability (about 9 hours of downtime a year), the availability story is:

Number of nodes | Quorum | Availability
3               | 2      | 99.9997% (~2 minutes per year)
5               | 3      | 99.999999% (8 nines, well under a second per year)
7               | 4      | ~100% (effectively always available)
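The numbers in both tables come from a straightforward binomial calculation: the cluster is available whenever at least a quorum of independently failing nodes is up. A quick way to check them:

```python
from math import comb

def cluster_availability(n, quorum, node_availability):
    """Probability that at least `quorum` of `n` nodes are up,
    assuming nodes fail independently."""
    p = node_availability
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(quorum, n + 1))

# For 3 nodes at 99% per-node availability with a quorum of 2:
three_node = cluster_availability(3, 2, 0.99)  # ~0.9997
```

Multiplying the unavailable fraction by the seconds in a year gives the downtime estimates in the tables.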

So in rough terms, we can say that going to a 99 node cluster isn’t a good idea. It is quite costly in terms of the number of operations required to ensure a commit, and from a safety perspective, you can get the same safety level at a drastically lower cost.

But there is now another question: what would we actually want to do with a 99 node cluster*? I’ll talk about this in my next post.

* A hundred node cluster only makes sense if you have machines with about 80% availability. In other words, they are down for about 2.5 months every year. I don’t think that this is a scenario worth discussing.

