Unsafe operations are required in the real world
There is a pretty interesting discussion on the Raft mailing list about clarifying some aspects of the Raft protocol. This led to some in-depth discussion on the difference between algorithms in their raw state and the actual practice that you need in the real world.
In case you aren’t aware, Raft is a distributed consensus protocol. It allows a group of machines to reach a decision together (a gross oversimplification, but good enough for this).
In a recent post, I spoke about dedicated operations bypasses. This discussion surfaced another one. One of the things that make Raft much simpler to work with is that it explicitly handles topology changes (adding / removing nodes). Since that is part of the same distributed consensus algorithm, it means that it is safe to add or remove a node at runtime.
Usually when you build a consensus algorithm, you are very much focused on safety. So you make sure that all your operations are actually safe (otherwise, why bother?). Except that you must, in your design, explicitly allow the administrator to make inherently unsafe operations. Why?
Consider the case of a three node cluster that has been running along for a while now. Disaster strikes, and two of the nodes die horribly. This puts our cluster in a situation where it cannot get a majority, and nothing can happen until at least one machine is back online. But those machines aren’t coming back. So our plucky admin wants to remove the two dead servers from the cluster, so it will have one node only, and resume normal operations (then the admin can add additional nodes at leisure). However, the remaining node will refuse to remove any node from the cluster. It can’t, it doesn’t have a majority.
It sounds surprisingly silly, but you actually have to build into the system the ability to make those changes, with explicit admin consent, as unsafe operations. Otherwise, you might end up with a perfectly correct system that breaks down horribly in failure conditions. Not because it is buggy, but because it didn’t take into account that the real world sometimes requires you to break the rules.
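To make the arithmetic concrete, here is a minimal sketch of the majority check. The function names (`majority`, `can_commit`) are made up for illustration and are not taken from any Raft implementation. It only shows why a lone survivor of a three node cluster cannot approve anything, including its own topology change.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of votes that is more than half of the cluster."""
    return cluster_size // 2 + 1


def can_commit(cluster_size: int, reachable_nodes: int) -> bool:
    """Any decision, topology changes included, needs a majority
    of the *configured* cluster, not of the nodes that happen to be alive."""
    return reachable_nodes >= majority(cluster_size)


# Three configured nodes, two of them dead: only the survivor can vote.
print(majority(3))       # 2
print(can_commit(3, 1))  # False: the survivor cannot even remove the dead nodes
print(can_commit(1, 1))  # True: a one node cluster is its own majority
```

The last line is the state the admin wants to reach, but the only safe path to it goes through a vote that can no longer succeed.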
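One way such an escape hatch might look is sketched below. Everything here (the `Cluster` class, `force_remove_node`, the `i_know_this_is_unsafe` flag) is a hypothetical name invented for this example, not the API of any real Raft library. The point is only that the safe path and the unsafe path are separate, and the unsafe one demands explicit consent.

```python
class QuorumError(Exception):
    """Raised when an operation cannot gather a majority."""


class Cluster:
    def __init__(self, node: str, members: list[str]):
        self.node = node
        self.members = set(members)
        self.reachable = {node}  # nodes we can currently talk to, ourselves included

    def has_majority(self) -> bool:
        votes = len(self.reachable & self.members)
        return votes >= len(self.members) // 2 + 1

    def remove_node(self, target: str) -> None:
        """Safe path: a topology change must be approved by a majority."""
        if not self.has_majority():
            raise QuorumError(f"cannot remove {target}: no majority")
        self.members.discard(target)

    def force_remove_node(self, target: str, *, i_know_this_is_unsafe: bool = False) -> None:
        """Unsafe path: skips the vote entirely, so it requires explicit admin consent."""
        if not i_know_this_is_unsafe:
            raise PermissionError("unsafe operation requires explicit admin consent")
        self.members.discard(target)


survivor = Cluster("A", ["A", "B", "C"])
try:
    survivor.remove_node("B")
except QuorumError as err:
    print("refused:", err)

# The admin knows B and C are gone for good and overrides the protocol.
survivor.force_remove_node("B", i_know_this_is_unsafe=True)
survivor.force_remove_node("C", i_know_this_is_unsafe=True)
print(survivor.members, survivor.has_majority())
```

The keyword-only flag is a design choice for the sketch: it makes the unsafe call impossible to issue by accident and easy to spot in a script or an audit log.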
Comments
Why aren't nodes that can't communicate with the group (and therefore cannot form a consensus) automatically unsubscribed from the group? Removing a dead node shouldn't be a manual action. But why does Raft not allow a consensus of 1 node when that single node forms a majority? A poll with only a single vote is still valid. It even reached a unanimous vote (100% voted the same thing).
Seems to me that the Raft algorithm is incomplete. Because if you can't remove dead nodes from the surviving node because it could not reach a consensus, doesn't that also mean that Raft cannot distribute its state when the cluster consists of only one node? Witnesses (on disk) are often used to determine whether a node is really the only surviving node, or is just unable to communicate with the other nodes. In the latter case it should not perform any tasks until the witness(es) also can't communicate with the other nodes.
Or did I misinterpret your post?
Dave, This is a consensus system. You can't drop a node that isn't communicating with you. Consider a 3 node cluster. A single node is cut off, and as far as it is concerned, the other two have dropped from the network. If it will automatically unsubscribe them from the cluster, it will result in a single node cluster, which is its own majority. At that point, it will accept writes! At the same time, the two other nodes have settled on a leader, and are also accepting writes! Now you have a conflict.
The idea with a consensus algorithm is to prevent such a thing from ever happening. You need a majority to reach any decision, including removing or adding a node. As far as the nodes can tell, being unable to reach another node could mean anything from "node is down for a few seconds" to "someone set the node on fire, it isn't coming back". From the inside, those cases are indistinguishable.
Great example. Byzantine fault tolerance was originally designed to prevent forged messages between generals so people don't get killed unnecessarily. It's basically the dual passcodes, dual physical keys (separated by double arm's length) nuclear warhead launch scenario. Admins should be able to bypass unless this is the actual use-case of the system :)
@Dave, if you just allowed it to decide that it would shift cluster size automatically, you would break the entire concept of consensus. As an example, I have a cluster of 3 nodes and we have a partition, so now there is a majority partition of 2 nodes (they elect a new master) and a minority partition with 1 node in it. If we just said "well, only 1 node, so let's just do as 1 node" then that node would consider itself a master and start accepting transactions. This is known as "split brain".
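A toy sketch of the partition described above, assuming nodes named A, B and C and a made-up helper `accepts_writes`. It compares the normal rule (majority of the agreed membership) with the "automatically unsubscribe whoever I can't reach" rule.

```python
def accepts_writes(membership_view: set[str], reachable: set[str]) -> bool:
    """A side of the partition may accept writes only if it holds
    a majority of the membership it believes in."""
    votes = len(reachable & membership_view)
    return votes >= len(membership_view) // 2 + 1


members = {"A", "B", "C"}
side_a = {"A"}         # A is cut off and sees only itself
side_bc = {"B", "C"}   # B and C can still see each other

# Normal rule: both sides judge themselves against the agreed membership.
print(accepts_writes(members, side_a))   # False
print(accepts_writes(members, side_bc))  # True

# Auto-unsubscribe rule: A drops every node it cannot reach.
a_view = members & side_a
print(accepts_writes(a_view, side_a))    # True: now both sides accept writes
```

With the second rule, A and the B/C pair both believe they are a healthy cluster at the same time, which is exactly the conflict a consensus algorithm exists to prevent.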
Cheers,
Greg
Terminology difference, but I wouldn't actually call that "unsafe". IMO, "unsafe" is when you allow someone to break an invariant in order to accomplish something. However consensus algorithms operate fine with an N of 1, it just happens to defeat the purpose of them. I'm not sure of a word that captures that. Maybe we could disambiguate between "unsafe" and "dangerous" :) Or maybe I'm just making stuff up, I dunno.
Orbitz, It is an unsafe operation. You are breaking an invariant. Removing a node from the cluster is an operation that requires a quorum. You are doing it without a quorum, so that is unsafe, violates the algorithm's requirements, etc.
No, I don't think you are. The Raft paper, or even the Paxos paper, does not require more than one node in the cluster in order to work, it just happens to be a silly setup. Nothing in the algorithm has to change. By removing nodes from the cluster you are not running without a quorum from the algorithm's point of view, because it always has a quorum now.
Consider this situation:
I have a cluster of 3 nodes, 2 of them go down so I cannot reach quorum but I want to continue processing requests until I fix the nodes and I'm going to keep an eye on my one machine. I could:
Orbitz,
Now consider what is going on _with the other 2 machines_. They lost connectivity to the other side. But they are still a majority, and can continue running.
By manually removing the nodes from the cluster, you have created an unsafe situation. There is now a majority in the cluster that thinks that the cluster is different than what it really is.
In Raft, for example, the steps for changing the topology are very clear, and here you are taking a hammer to them. That is okay, _if you know what you are doing and have a great admin_, but it is an unsafe operation, and it violates the safety guarantees.
Good point Oren, the unspoken assumption I was making and should have been explicit about was that the other nodes are dead and they are performing the cluster topology change because, as the operator, they have a global view.