The state of a failure condition
I’m looking over a bunch of distributed algorithm discussion groups, and I recently saw several people making the same bad assumption. The issue is that in a distributed system, you have to assume that any communication between systems can fail.
Because that is taken into account in any distributed algorithm, there is a school of thought that believes that errors shouldn’t generate replies. That is horrifying to me.
Let me give a concrete example. In the Raft algorithm, nodes will participate in an election in order to decide who is the leader. A node can decide to vote for a certain candidate, to reject a candidate, or it may be down and unresponsive. Since we have to handle the unresponsive node anyway, it is easy to assume that we only need to reply to the candidate when we actually vote for it. After all, no reply is a negative reply already, no?
The issue with this design decision is that it is indeed correct, but it is also boneheaded*. There are two reasons here. The minor one is that a missing reply will force us to wait until a pre-configured timeout happens, after which we can go into failure handling. But actually sending a reply when we know that we refuse to vote for a node can give that node more information, and cut down the time it takes for the node to respond to negative replies.
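To make the timing argument concrete, here is a minimal sketch of how a candidate could tally votes when rejections are sent explicitly. The names `VoteReply` and `election_outcome` are made up for this illustration and do not come from any particular Raft implementation. The point is only that with explicit rejections the candidate can conclude it has lost as soon as a majority becomes unreachable, instead of sitting out the timeout.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class VoteReply:
    """Hypothetical reply message: who answered and whether they voted for us."""
    voter: str
    granted: bool


def election_outcome(cluster_size: int, replies: Iterable[VoteReply]) -> Optional[bool]:
    """Return True if the election is won, False if it is lost,
    and None if it is still undecided (we must keep waiting)."""
    majority = cluster_size // 2 + 1
    granted = 1   # the candidate votes for itself
    rejected = 0
    for reply in replies:
        if reply.granted:
            granted += 1
        else:
            rejected += 1
        if granted >= majority:
            return True
        # Even if every node that has not rejected us voted yes,
        # we could no longer reach a majority: give up right away.
        if cluster_size - rejected < majority:
            return False
    return None


# Five nodes, three explicit rejections: the loss is known immediately.
lost = election_outcome(5, [VoteReply("b", False),
                            VoteReply("c", False),
                            VoteReply("d", False)])
# Five nodes, one rejection, everyone else silent: nothing to do but wait.
undecided = election_outcome(5, [VoteReply("b", False)])
```

If the rejecting nodes stayed silent instead, the first case would look exactly like the second one, and the candidate would have to wait for the timeout to find out.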
As important as that is, this isn’t really my main concern. My main concern here is that not sending a reply leaves the administrator trying to figure out what is going on with essentially zero data. On the other hand, if the node sends a “you are missing X, Y and Z for me to consider you applicable” reply, that is something that can be traced, that can be shown and acted upon.
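As a rough sketch of what such a reply could look like, here is a vote handler that always answers and always says why. The rejection rules are the standard Raft ones (stale term, already voted in this term, candidate log not up to date), but the type names, the `reason` field and the message wording are assumptions made for this example, not part of the Raft specification or of any specific library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoteRequest:
    candidate: str
    term: int
    last_log_term: int
    last_log_index: int


@dataclass
class VoteReply:
    granted: bool
    term: int
    reason: str  # hypothetical field: human-readable, meant for logs and tracing


@dataclass
class NodeState:
    current_term: int
    voted_for: Optional[str]
    last_log_term: int
    last_log_index: int


def handle_vote_request(state: NodeState, req: VoteRequest) -> VoteReply:
    """Always reply, and always explain the decision."""
    if req.term < state.current_term:
        return VoteReply(False, state.current_term,
                         f"stale term {req.term}, my term is {state.current_term}")
    if req.term > state.current_term:
        # A newer term resets whatever vote we cast in the old one.
        state.current_term = req.term
        state.voted_for = None
    if state.voted_for not in (None, req.candidate):
        return VoteReply(False, state.current_term,
                         f"already voted for {state.voted_for} in term {state.current_term}")
    # Raft's up-to-date check: compare last log term first, then last log index.
    if (req.last_log_term, req.last_log_index) < (state.last_log_term, state.last_log_index):
        return VoteReply(False, state.current_term,
                         f"candidate log is behind mine: ({req.last_log_term}, {req.last_log_index}) "
                         f"vs ({state.last_log_term}, {state.last_log_index})")
    state.voted_for = req.candidate
    return VoteReply(True, state.current_term, "vote granted")


state = NodeState(current_term=3, voted_for=None, last_log_term=2, last_log_index=10)
reply = handle_vote_request(state, VoteRequest("a", term=4, last_log_term=2, last_log_index=5))
print(reply.reason)
```

The `reason` string is exactly the kind of thing that can end up in a log or an admin console on the candidate side, which is where the person debugging a stuck election will be looking.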
It may seem like a small thing, overall, but it is of crucial importance for operations. Those are hard enough when you have a single node. When you have a distributed system, you have to plan for that explicitly.
* I am using this terminology intentionally. Anyone who doesn’t consider production support and monitoring for their software from the get go has never had to support complex production systems, where every nugget of information can be crucial.
Comments
We had a project that relied on integration with a system like that. The idea was that if there is an error, we get a reply. If there is no error - we get nothing. If we don't get anything for three hours, we can assume that the message was processed successfully. Or not. Sometimes the errors are just late and you have to manually fix the issues.
For the first few days after deployment to production, we had to call the institution responsible for that system and ask if our packets were accepted successfully, since there was no confirmation from the system.
The funny thing is that we had to poll that system for replies, no more often than every 30 minutes, because of performance reasons on their side. I agree, that was a very big system indeed, but I think it would be much better from a performance standpoint to use some sort of notifications than to ask clients to poll for messages periodically.