Raft, as the Raven flies


The interesting thing about the etcd / go-raft review wasn’t so much what they did (they did things quite beautifully) but the things that I wanted them to do that they didn’t do. The actual implementation was fascinating, and for the purposes of what is required, more than adequate. I, however, have a very specific goal in mind, and that means that this is really not going to work with the given implementations.

I’d better explain first. I’m not sure if go-raft was written by the same people as etcd, but they do “smell” very much the same. It occurred to me that both projects are on GitHub, so I checked, and it appears that I was correct: xiangli-cmu, philips, benbjohnson and bcwaldon (among many others) have worked on both projects. The reason this is important is that it reinforces my feeling that go-raft was written mainly to serve the needs of etcd. Anyway…

The reason I think that is that the implementation is very much optimized for the level at which etcd needs it. What I mean is that there is a lot of work there to support multiple concurrent operations that are batched out to all the nodes in the cluster, then applied to the in-memory model. This works because etcd is a set of key/value pairs shared among many nodes, but the values are usually very small, and there is a very small number of keys. I would be surprised if there were hundreds of thousands of keys in a single etcd installation. That just isn’t the type of thing it is meant to do. And that is reflected throughout.
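To make that concrete, here is a minimal sketch (in Go, with names I made up, not actual etcd code) of the kind of state machine this design implies: tiny, independent commands applied in order to an in-memory map.

```go
// A minimal sketch of an etcd-style in-memory key/value state machine.
// Command and KVStore are invented names, not etcd's actual types.
package main

import "fmt"

// Command is a single replicated operation; a hypothetical shape.
type Command struct {
	Op    string // "set" or "delete"
	Key   string
	Value string
}

// KVStore is the in-memory model the committed Raft log is applied to.
type KVStore struct {
	data map[string]string
}

func NewKVStore() *KVStore {
	return &KVStore{data: make(map[string]string)}
}

// ApplyBatch applies a batch of committed commands in order. Because
// every command is tiny and independent, batching many of them into a
// single replication round trip is cheap and effective.
func (s *KVStore) ApplyBatch(batch []Command) {
	for _, c := range batch {
		switch c.Op {
		case "set":
			s.data[c.Key] = c.Value
		case "delete":
			delete(s.data, c.Key)
		}
	}
}

func main() {
	store := NewKVStore()
	store.ApplyBatch([]Command{
		{Op: "set", Key: "/config/port", Value: "4001"},
		{Op: "set", Key: "/config/host", Value: "localhost"},
	})
	fmt.Println(store.data)
}
```

With values this small, pushing hundreds of such commands per round trip is nearly free, which is exactly the case the implementation optimizes for.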

For example, there is an inherent assumption throughout the codebase that the data set is small, that each value is small, etc. When sending a snapshot over the wire, there is the assumption that it can all fit in memory and that it is fine to do so. Even more to the point, the approach of only pushing entries to the nodes on heartbeats is great when you are trying to batch multiple changes together, with each change being independent of all the others (usually). However, that really doesn’t work when what you actually need is the sort of large, sequential writes that we usually deal with.
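The push-on-heartbeat pattern looks roughly like the following; this is a toy sketch under my own assumptions, not the actual go-raft code. The point is that Propose() doesn’t send anything; data only moves when the heartbeat fires.

```go
// A toy sketch of push-on-heartbeat batching; not go-raft's actual API.
package main

import "fmt"

type Entry struct{ Data []byte }

type Leader struct {
	pending []Entry
}

// Propose queues an entry; note that nothing is sent to followers yet.
func (l *Leader) Propose(e Entry) {
	l.pending = append(l.pending, e)
}

// heartbeat flushes whatever accumulated since the last tick, as one
// AppendEntries message carrying the whole batch.
func (l *Leader) heartbeat(send func([]Entry)) {
	if len(l.pending) > 0 {
		send(l.pending)
		l.pending = nil
	}
}

func main() {
	l := &Leader{}
	for i := 0; i < 5; i++ {
		l.Propose(Entry{Data: []byte(fmt.Sprintf("op-%d", i))})
	}
	// Simulate the heartbeat timer firing once.
	l.heartbeat(func(batch []Entry) {
		fmt.Printf("sent one AppendEntries with %d entries\n", len(batch))
	})
}
```

Great for many small independent writes; a poor fit when a single large write needs to go out immediately.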

I want to emphasize that I’m really evaluating the way etcd & go-raft are built and finding them wanting for my own purposes; I think that they do quite well for the kind of problem they are meant to solve.

Let us consider the kind of databases that we have. We usually have to deal with much larger values: typical documents are in the KB range, and are frequently hundreds of KB in size. We also tend to have quite a lot of them. Trying to put them all in memory (the etcd way) would be… suboptimal. Now, one option would be to try to do just that anyway, and make each operation in RavenDB (put/delete documents) a state machine command in Raft. The problem is that this gets complicated very quickly, leaving aside the fact that it isn’t a general solution, and we really want to get to a general solution.

We don’t want to solve this once for RavenDB, and then for RavenFS, and then for… etc. And the reason that working at that level is tricky is that you need to push everything through the state machine, and everything in this case really does mean everything. That leads to a very different database design and implementation. It is also probably not really necessary. We can drop a level or two down and, instead of dealing with this at the database level, deal with it at the storage layer. This is basically just an extension of the log shipping idea. Instead of using Raft for high level commands, just use Raft to create consensus around the sequence of writes to the journal. That means that all the nodes will have the same journal, and thus the same data in storage.
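Here is a rough sketch of what consensus at the journal level could look like. All the names are hypothetical, but the core of the idea is that the Raft entry is just the next opaque chunk of the journal, and applying it is a sequential append.

```go
// A sketch of journal-level replication; JournalWrite and
// applyToJournal are invented names, not an existing API.
package main

import (
	"fmt"
	"os"
)

// JournalWrite is the unit we ask Raft to reach consensus on.
type JournalWrite struct {
	Seq  uint64 // position in the journal; must be applied in order
	Data []byte // the raw bytes to append, possibly several MB
}

// applyToJournal is what each node runs for a committed entry.
// The journal is purely sequential, so apply is just an append.
func applyToJournal(f *os.File, w JournalWrite) error {
	if _, err := f.Write(w.Data); err != nil {
		return err
	}
	return f.Sync() // durable before we acknowledge
}

func main() {
	f, err := os.CreateTemp("", "journal-*.log")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	// Any high level operation (put/delete document, etc.) first
	// becomes journal bytes, then goes through Raft as an opaque blob.
	writes := []JournalWrite{
		{Seq: 1, Data: []byte("tx-1: put doc users/1\n")},
		{Seq: 2, Data: []byte("tx-2: put doc users/2\n")},
	}
	for _, w := range writes {
		if err := applyToJournal(f, w); err != nil {
			panic(err)
		}
	}
	fmt.Println("journal written:", f.Name())
}
```

Because the journal bytes are identical on every node, replaying them produces identical storage, regardless of which high level operations generated them.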

That in turn means that an “entry” can actually be pretty large (multi MB range, sometimes). And because the journal is purely sequential, we need to issue that command to all the nodes as soon as it is submitted to the Raft state machine.

Now, the reason the go-raft implementation waits for the heartbeat is that batching things up nicely really speeds up the general case. But we don’t need to do that for what we are going to do. Or, to be more exact, we have already done just that: it is basically what Voron’s transaction merging is all about. And since we can only handle writes as the leader, and since we will merge concurrent writes anyway, this is going to end up behaving very similarly under load, and faster without load. At least, that is what the initial thinking says.
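To illustrate, here is a hedged sketch of the transaction merging idea in this context (the names are mine, not Voron’s actual API): concurrent writers queue their work, and a single loop drains whatever is pending into one entry, so a single consensus round can carry many writes.

```go
// A sketch of transaction merging in front of Raft; invented names,
// not Voron's actual implementation.
package main

import "fmt"

type Write struct {
	Data []byte
	Done chan error
}

// mergeLoop drains all currently-pending writes into one batch, which
// then goes through Raft as a single (possibly large) journal entry.
func mergeLoop(pending chan Write, commit func([]byte) error) {
	for w := range pending {
		batch := []Write{w}
		merged := append([]byte(nil), w.Data...)
		// Grab everything else that queued up while we were busy.
	drain:
		for {
			select {
			case next := <-pending:
				batch = append(batch, next)
				merged = append(merged, next.Data...)
			default:
				break drain
			}
		}
		err := commit(merged) // one consensus round for the whole batch
		for _, b := range batch {
			b.Done <- err
		}
	}
}

func main() {
	pending := make(chan Write, 16)
	go mergeLoop(pending, func(data []byte) error {
		fmt.Printf("committed %d bytes in one entry\n", len(data))
		return nil
	})

	done := make(chan error)
	pending <- Write{Data: []byte("tx-1"), Done: done}
	pending <- Write{Data: []byte("tx-2"), Done: done}
	<-done
	<-done
}
```

Under load the queue fills up and each commit carries more work, which is the same effect the heartbeat batching gives go-raft; without load, nothing waits for a timer.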

Snapshots, also quite important for Raft, can be handled very simply by just doing a backup/restore over the network, streaming all the data.
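A minimal sketch of that, assuming we have a connection to the node that needs the snapshot: it is just streaming a file over a socket, so memory use stays constant no matter how big the data is.

```go
// A sketch of snapshot-as-streamed-backup; the function names are
// hypothetical, but io.Copy never holds more than a small buffer.
package snapshot

import (
	"io"
	"net"
	"os"
)

// sendSnapshot streams the whole data file to a follower that fell
// too far behind for plain log replication to catch it up.
func sendSnapshot(conn net.Conn, path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	// Constant memory regardless of database size.
	_, err = io.Copy(conn, f)
	return err
}

// receiveSnapshot restores by writing the stream straight to disk.
func receiveSnapshot(conn net.Conn, path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, conn)
	return err
}
```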

And the rest is just implementing Raft. 🙂