Practical considerations for implementing Raft
RavenDB has been using the Raft protocol for the past years. In fact, we have written three or four different implementations of Raft along the way. I implemented Raft using pure message passing, on top of async RPC and on top of TCP. I did that using actor model and using direct parallel programming as well as the usual spaghettis mode.
The Raft paper is beautiful in how it explain a non trivial problem in a way that is easy to grok, but it is also something that can require dealing with a number of subtleties. I want to discuss some of the ways to successfully implement it. Note that I’m assuming that you are familiar with Raft, so I won’t explain anything here.
A key problem with Raft implementations is that you have multiple concurrent things happening all at once, on different machines. And you always have the election timer waiting in the background. In order to deal with that, I divide the system into independent threads that each has their own task.
I’m going to talk specifically about the leader mode, which is the most complex aspect, usually. In this mode, we have:
- Leader thread – responsible for determining the current progress in the cluster.
- Follower thread – once per follower – responsible for communicating with a particular follower.
In addition, we may have values being appended to our log concurrently to all of the above. The key here is that the followers threads will communicate with their follower and push data to it. The overall structure for a follower thread looks like this:
What is the idea? We have a dedicated thread that will communicate with the follower. It will either ping the follower with an empty AppendEntries (every 1/3 of the election timeout) or it will send a batch of up to 50 entries to update the follower. Note that there is nothing in this code about the machinery of Raft, that isn’t the responsibility of the follower thread. The leader, on the other hand, listen to the notifications from the followers threads, like so:
The idea is that each aspect of the system is running independently, and the only communication that they have with each other is the fact that they can signal the other that they did some work. We then can compute whatever that work changed the state of the system.
Note that the code here is merely drafts, missing many details. For example, we aren’t sending the last commit index on AppendEntries, and committing the log is an asynchronous operation, since it can take a long time and we need to keep the system in operation.
Comments
How important is it in the raft protocol for the different nodes to have synced clocks?
Also is there such a thing in the cloud as each node actually using a shared clock? If it was technically possible, it seems to me to be a useful feature.
@Peter - One of the best parts of Raft protocol is that it doesn't require clock sync at all. This is one of the goals of a consensus algorithm such as this, to have different nodes agree on ceratin order of events without relying on clocks.
@peter - to further expand on what @michael said, pretty much every well known algorithm in the distributed computing world was conceived to avoid relying on sync'd or shared clocks :)
Peter,
Raft absolutely does not require a shared clock.
It is common for the Raft impl (and RavenDB does this) for the leader to send its clock to the followers, so they can observe time from an authoritative source.
A related question - in AWS, Azure etc, and also ravendb cloud - when you spin up nodes, do you have control over whether or not the nodes are on the same hardware? Or is the cloud smart enough to always default the nodes each to a different physical server? Otherwise you lose the advantage of having separate nodes.
peter,
For RavenDB Cloud, we are ensuring that the instances in the cluster are running on different _availability zones_. That means, not just not on the same physical hardware, but a completely different location.
Consensus algorithms often utilize clock synchronization. Raft doesn't. Instead it uses a monotonically increasing value, known as a term. in Raft. Comments wouldn't do the topic justice...
Comment preview