Building complex systems with duplication and multiple overlapping fields of fire

time to read 3 min | 582 words

Complex systems are considered problematic, but I’m using this term explicitly here. For reference, see this wonderful treatise on the subject. Another way to phrase this is how to create robust systems with multiple layers of defense against failure.

When something is important, you should prepare in advance for it. And you should prepare for failure. One of the typical (and effective) methods for handle failures is the good old retry. I managed to lock myself out of my car recently, had to wait 15 minutes before it was okay for me to start it up. But a retry isn’t going to help you if the car run out of fuel, or there is a flat tire. In some cases, a retry is just going to give you the same failing response.

Robust systems do not have a single way to handle a process, they have multiple, often overlapping, manners to do their task. There is absolutely duplication between these methods, which tend to raise the hackles of many developers. Code duplication is bad, right? Not when it serve a (very important) purpose.

Let’s take a simple example, order processing. Consider the following example, an order made on a website, which needs to be processed in a backend system:

image

The way it would work, the application would send the order to a payment provider (such as PayPal) which would process the actual order and then submit the order information via web hook to the backend system for processing.

In most such systems, if the payment provider is unable to contact the backend system, there will be some form of retry. Usually with a exponential back-off strategy. That is sufficient to handle > 95% of the cases without issue. Things get more interesting when we have a scenario where this breaks. In the above case, assume that there is a network issue that prevent the payment provider from accessing the backend system. For example, a misconfigured DNS entry means that external access to the backend system is broken. Retrying in this case won’t help.

For scenarios like this, we have another component in the system that handles order processing. Every few minutes, the backend system queries the payment provider and check for recent orders. It then processes the order as usual. Note that this means that you have to handle scenarios such as an order notification from the backend pull process concurrently with the web hook execution. But you need to handle that anyway (retrying a slow web hook can cause the same situation).

There is additional complexity and duplication in the system in this situation, because we don’t have a single way to do something.

On the other hand, this system is also robust on the other direction. Let’s assume that the backend credentials for the payment provider has expired. We aren’t stopped from processing orders, we still have the web hook to cover for us.  In fact, both pieces of the system are individually redundant. In practice, the web hook is used to speed up common order processing time, with the backup pulling recent orders as backup and recovery mechanism.

In other words, it isn’t code duplication, it is building redundancy into the system.

Again, I would strongly recommend reading: Short Treatise on the Nature of Failure; How Failure is Evaluated; How Failure is Attributed to Proximate Cause; and the Resulting New Understanding of Patient Safety.