Production Test Run: The self flagellating server
Sometimes you see the impossible. In one of our scenarios, we saw a cluster that had such a bad case of split brain that it came near to fracturing the very boundaries of space & time.
In a three node cluster, we had one node that looked fine. It connected to all the other nodes and was the cluster leader. The other two nodes, however, were not in the cluster, and in fact they were showing signs that they never had been.
What was really strange was that we took the other two machines down and the first node was still showing a successful cluster. We looked deeper and realized that it wasn’t actually a healthy situation. In fact, this node was very rapidly switching between leader and follower mode.
It took a bit of time to figure out what was going on, but the root cause was DNS. We had the three nodes on separate DNS names (a.oren.development.run, b.oren.development.run, c.oren.development.run), and they were set up to point to the three machines. However, we had previously used the same domain names to run a cluster on the first machine only. Because the old DNS entries were still cached, whenever the machine at a.oren.development.run tried to connect to b.oren.development.run, it would actually connect to itself.
At this point, A would tell B that it is the leader. But A is B, so A would respond by becoming a follower (because it was told it should, by itself). Because it became a follower, it disconnected from itself. After a timeout, it would become leader again, and the cycle would continue.
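To make the cycle concrete, here is a toy simulation of it. This is only a sketch of the behavior described above, not our actual cluster code: the `Node` class, its method names, and the dictionary standing in for the stale DNS cache are all made up for illustration.

```python
class Node:
    """Toy cluster node (hypothetical; not the real implementation)."""

    def __init__(self, name):
        self.name = name
        self.role = "follower"
        self.history = []  # every role change, in order

    def _set_role(self, role):
        self.role = role
        self.history.append(role)

    def election_timeout(self, resolve):
        # No leader heard from in time: become leader and tell the peer.
        self._set_role("leader")
        peer = resolve("b.oren.development.run")
        peer.accept_leader()

    def accept_leader(self):
        # Someone claims to be the leader, so step down.
        self._set_role("follower")


def simulate(rounds):
    a = Node("a")
    # Stale cache entry: the name for B still points at machine A.
    stale_dns_cache = {"b.oren.development.run": a}
    for _ in range(rounds):
        a.election_timeout(stale_dns_cache.__getitem__)
    return a.history


print(simulate(3))
```

Each timeout produces one "leader" entry immediately followed by one "follower" entry, because the node that receives the announcement is the node that sent it.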
Every time that the server would get up, it would whip itself down again. “I’m a leader”, “No, I’m a leader”, etc.
This is a fun thing to discover. We had to trace pretty deep to figure out that the problem was in the DNS cache (since the DNS itself was properly updated).
We fixed things so that we now recognize when we are talking to ourselves and raise a proper error.
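A minimal sketch of that kind of check might look like the following. It assumes each node carries a persistent identifier and sends it in a hello message when a connection is opened. The names `ClusterNode`, `hello`, `check_peer` and `SelfConnectionError` are hypothetical, and the random GUID is only a stand-in for whatever persistent id the real check compares.

```python
import uuid


class SelfConnectionError(Exception):
    """Raised when a node discovers that its peer is itself."""


class ClusterNode:
    def __init__(self, node_id=None):
        # Assumption: in a real system this id would be persisted to disk,
        # not regenerated on every start.
        self.node_id = node_id or str(uuid.uuid4())

    def hello(self):
        # First message sent on any new connection.
        return {"node_id": self.node_id}


def check_peer(local, peer_hello):
    """Reject the connection if the peer reports our own id."""
    if peer_hello["node_id"] == local.node_id:
        raise SelfConnectionError(
            "peer reports our own id %s; check DNS for this address"
            % local.node_id
        )
    return peer_hello["node_id"]
```

With a check like this, the stale DNS entry would fail loudly on the first connection attempt instead of turning into a leader/follower loop.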
More posts in "Production Test Run" series:
- (26 Jan 2018) Overburdened and under provisioned
- (24 Jan 2018) The self flagellating server
- (22 Jan 2018) Rowhammer in Voron
- (18 Jan 2018) When your software is configured by a monkey
- (17 Jan 2018) Too much of a good thing isn’t so good for you
- (16 Jan 2018) The worst is yet to come
Comments
Out of curiosity, by using some sort of autogenerated guid per node or what?
njy, Pretty much, we are checking the topology id, which is persistent, but the same general idea.