## Large scale distributed consensus approachesCalculating a way out

time to read 4 min | 705 words

The question cross my desk, and it was interesting enough that I felt it deserves a post. The underlying scenario is this. We have distributed consensus protocols that are built to make sure that we can properly arrive at a decision and have the entire cluster follow it, regardless of failure. Those are things like Paxos or Raft. The problem is that those protocols are all aimed at relatively small number of nodes. Typically 3 – 5. What happens if we need to manage a large number of machines?

Let us assume that we have a cluster of 99 machines. What would happen under this scenario? Well, all consensus algorithm works on top of the notion of a quorum. That at least (N/2+1) machines have the same data. For a 3 nodes cluster, that means that any decision that is on 2 machines is committed, and for a 5 nodes cluster, it means that any decision that is on 3 machines is committed. What about 99 nodes? Well, a decision would have to be on 50 machines to be committed.

That means making 196 requests (98 x 2) (once for the command, then for the confirmation) for each command. That… is a lot of requests. And I’m not sure that I want to see what it would look like in term of perf. So just scaling things out in this manner is out.

In fact, this is also pretty strange thing to do. The notion of distributed consensus is that you will reach agreement on a state machine. The easiest way to think about it is that you reach agreement on a set of values among all nodes. But why are you sharing those values among so many nodes? It isn’t for safety, that is for sure.

Assuming that we have a cluster of 5 nodes, with each node having 99% availability (which translates to about 3.5 days of downtime per year). The availability of all nodes in the cluster is 95%, or about 18 days a year.

But we don’t need them to all be up. We just need any three of them to be up. That means that the math is going to be much nicer for us (see here for an actual discussion of the math).

In other words, here are the availability numbers if each node has a 99% availability:

 Number of nodes Quorum Availability 3 2 99.97% ~ 2.5 hours per year 5 3 99.999% (5 nines) ~ 5 minutes per year 7 5 99.9999% (6 nines) ~ 12 seconds per year 99 50 100%

Note that all of this is based around each node having about 3.5 days of downtime per year. If we can have availability of 99.9% (or about 9 hours a year), the availability story is:

 Number of nodes Quorum Availability 3 2 99.9997% ~ 2 minutes a year 5 3 99.999999% ( 8 nines ) ~ 30 seconds per year 7 5 100%

So in rough terms, we can say that going to 99 node cluster isn’t a good idea. It is quite costly in terms of the number of operation require to ensure a commit, and from a safety perspective, you can get the same safety level at the drastically lower cost.

But there is now another question, what would we actually want to do with a 99 node cluster*? I’ll talk about this in my next post.

A hundred node cluster only make sense if you have machines with about 80% availability. In other words, they are down for 2.5 months every year. I don’t think that this is a scenario worth discussing.