Surviving Black Friday: Or designing for failure

time to read 3 min | 432 words

imageIt’s the end of November, so like almost every year around this time, we have the AWS outage impacting a lot of people. If past experience is any indication, we’re likely to see higher failure rates in many instances, even if it doesn’t qualify as an “outage”, per se.

The image on the right shows the status of an instance of a RavenDB cluster running in useast-1. The additional load is sufficient to disrupt the connection between the members of the cluster. Note that this isn’t load on the specific RavenDB cluster, this is the environment load. We are seeing busy neighbors and higher network latency in general, enough to cause a disruption of the connection between the nodes in the cluster.

And yet, while the cluster connection is unstable, the  individual nodes are continuing to operate normally and are able to continue to work with no issues. This is part of the multi layer design of RavenDB. The cluster is running using a consensus protocol, which is sensitive to network issues and require a quorum to progress. The databases, on the other hand, uses a separate, gossip based protocol to allow for multi master distributed work.

What this means is that even in the presence of increased network disruption, we are able to run without needing to consult other nodes. As long as the client is able to reach any node in the database, we are able to serve reads and writes successfully.

In RavenDB, both clients and servers understand the topology of the system and can independently fail over between nodes without any coordination. A client that can’t reach a server will be able to consult the cached topology to know what is the next server in line. That server will be able to process the request (be it read or write) without consulting any other machine.

The servers will gossip with one another about misplaced writes and set everything in order.

RavenDB gives you a lot of knobs to control exactly this process works, but we have worked hard to ensure that by default, everything should Just Work.

Since we released the 4.0 version of RavenDB, we have had multiple Black Fridays, Cyber Monday and Crazy Tuesdays go by peacefully. Planning, explicitly, for failures has proven to be a major advantage. When they happen, it isn’t a big deal and the system know how to deal with them without scrambling the ready team. Just another (quite) day, with 1000% increase in load.