In a three node cluster, we have one node that looked to be fine. It connected to all the other nodes and was the cluster leader. The other two nodes, however, were not in the cluster and in fact, they were showing signs that they never were in the cluster.
What was really strange was that we took the other two machines down and the first node was still showing a successful cluster. We looked deeper and realized that it wasn’t actually a healthy situation, in fact, this node was very rapidly switching between leader and follower mode.
It took a bit of time to figure out what was going on, but the root cause was DNS. We had the three nodes on separate DNS (a.oren.development.run, b.oren.development.run, c.oren.development.run) and they were setup to point to the three machines. However, we have previously used the same domain names to run a cluster on the first machine only. Because of the way DNS updates, whenever the machine at a.oren.development.run would try to connect to b.oren.development.run it would actually connect to itself.
At this point, A would tell B that it is the leader. But A is B, so A would respond by becoming a follower (because it was told it should, by itself). Because it became a follower, it disconnected from itself. After a timeout, it would become leader again, and the cycle would continue.
Every time that the server would get up, it would whip itself down again. “I’m a leader”, “No, I’m a leader”, etc.
This is a fun thing to discover. We had to trace pretty deep to figure out that the problem was in the DNS cache (since the DNS itself was properly updated).
We fixed things so we now recognize if we are talking to ourselves and error properly.