On testing crazy race conditions in a distributed system


RavenDB is a highly concurrent distributed database. That means that we take the idea of race conditions, multiply that by network hiccups and then raise it to the power of hair pulling. Now, we have an architectural structure that helps with a lot of that, but sometimes you need to write a test that verifies what happens when a particular sequence of events occurs in a five node cluster. For fun, you may need to orchestrate a particular order of operations across multiple disparate processes (sometimes on different machines). As you can imagine, that is… challenging.

I wanted to give you a hint of some of the techniques that we use to handle this. We have code that looks like this, sprinkled throughout our code base (Rachis is the name of our Raft cluster implementation):

This is where a leader connects to a follower to set up their relationship:

image

This is called during leader election:

image

These methods are implemented in the following manner:

image

In other words, they will set a ManualResetEvent that we set up as part of our testing infrastructure. The code isn’t even run in production releases, but it allows us to very carefully structure the exact sequence of events that we need to expose specific behaviors in the system.
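To make the idea concrete, here is a minimal sketch of the pattern. It is not the Rachis code. It is written in Python, with `threading.Event` standing in for .NET's `ManualResetEvent`, and every name in it (`ClusterTestHooks`, `leader_connected_to_follower`, `election_in_progress`, `run_node`, `demo`) is made up for illustration. The assumption is that the hooks do nothing unless a test has installed an event, which roughly mirrors code that is not active in production.

```python
import threading


class ClusterTestHooks:
    """Hypothetical test-only hooks (not the real Rachis API).

    No events are installed by default, so every hook is a no-op
    unless a test opts in by assigning a threading.Event.
    """

    def __init__(self):
        self.leader_connected = None   # set by a test, otherwise None
        self.election_started = None   # set by a test, otherwise None

    def leader_connected_to_follower(self):
        # Signal the test that the leader/follower link was established.
        if self.leader_connected is not None:
            self.leader_connected.set()

    def election_in_progress(self):
        # Signal the test that an election has begun.
        if self.election_started is not None:
            self.election_started.set()


def run_node(hooks, log):
    """Stand-in for cluster code running on another thread."""
    log.append("connect")
    hooks.leader_connected_to_follower()
    log.append("elect")
    hooks.election_in_progress()


def demo():
    hooks = ClusterTestHooks()
    hooks.leader_connected = threading.Event()
    hooks.election_started = threading.Event()
    log = []

    worker = threading.Thread(target=run_node, args=(hooks, log))
    worker.start()

    # The test blocks until the node reports each milestone. The timeout
    # keeps a broken run from hanging the test suite forever.
    assert hooks.leader_connected.wait(timeout=5)
    assert hooks.election_started.wait(timeout=5)

    worker.join()
    return log
```

The point of the sketch is that the test never sleeps and hopes for the best. It waits on an explicit signal that the code under test has reached a known point, and only then takes its next step. A real cluster test would presumably do something interesting between the two waits, such as dropping a connection.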