In the previous post, I showed a very simple request router that would always select the fastest node. That worked for a long while, until it didn’t, and the challenge is figuring out why.
As it turns out, the issue is a simple one of spooky action at a distance. Here is what happens. Assume that we have three servers and 10 clients. Each server is sized to handle 4 clients, so the system has 12 client slots for 10 clients. So far, so good: there is capacity to spare.
The problem is in the manner in which clients detect the fastest node in the cluster. The only thing that is considered is the state of the node at the time of selection. At that moment, we may end up with all the clients selecting the same node as the fastest one.
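To make that concrete, here is a minimal sketch of what such a selection routine might look like. This is not the actual router code; the `measure_latency` helper is a hypothetical stand-in for whatever probe the client uses.

```python
# A minimal sketch of the naive "pick the fastest node" selection.
# `measure_latency` is assumed to ping a node and return its current
# response time; it is not part of the original router.

def fastest_node(nodes, measure_latency):
    """Pick the node with the lowest latency *right now*.

    The snapshot is taken once, at selection time, and never revisited.
    Every client running this at roughly the same moment observes the
    same (idle) node as the fastest and piles onto it.
    """
    return min(nodes, key=measure_latency)
```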
In other words, we have three servers: two of them have no clients talking to them, and one has all ten clients talking to it. That node goes down under the load, obviously. The clients then react appropriately and select a new node to talk to. All of them do that, find the fastest node, and… bring it down as well. Rinse & repeat.
The issue can be stated as Time of Check vs. Time of Use, but also as a race condition, where all the clients end up performing a synchronized “wave” that kills the system.
How do you prevent this?
You introduce randomness into the system. You don’t check the status once; you re-check on a regular basis, so you can respond to shifting load. You also add randomness to the process itself, so the clients won’t all re-evaluate at exactly the same time and end up converging on the same node.
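Here is one way this could look in practice. This is a sketch under my own assumptions, not the original router: the `measure_latency` helper, the intervals, and the 1.5x “close enough” tolerance are all made up for illustration.

```python
import random
import threading
import time

class NodeSelector:
    """Re-checks node latencies periodically, with per-client jitter,
    and picks randomly among the nodes that are close to the fastest."""

    def __init__(self, nodes, measure_latency,
                 base_interval=30.0, jitter=10.0, tolerance=1.5):
        self.nodes = nodes
        self.measure_latency = measure_latency
        self.base_interval = base_interval   # seconds between re-checks
        self.jitter = jitter                 # random slack added per check
        self.tolerance = tolerance           # "close enough to fastest" factor
        self.current = random.choice(nodes)  # start somewhere at random
        threading.Thread(target=self._recheck_loop, daemon=True).start()

    def _recheck_loop(self):
        while True:
            # Each client sleeps a slightly different amount, so the
            # cluster never re-evaluates in lockstep.
            time.sleep(self.base_interval + random.uniform(0, self.jitter))
            self._pick()

    def _pick(self):
        timings = {n: self.measure_latency(n) for n in self.nodes}
        best = min(timings.values())
        # Any node within `tolerance` of the fastest is a candidate; picking
        # one at random spreads the clients instead of stampeding the winner.
        candidates = [n for n, t in timings.items() if t <= best * self.tolerance]
        self.current = random.choice(candidates)
```

The two knobs matter for different reasons: the jittered interval keeps the clients from re-checking in a synchronized wave, and the random pick among “good enough” nodes keeps them from all landing on the same winner even when they do check at the same time.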