Why you should avoid graceful error handling like the plague that it is
A while ago I was reviewing a pull request by a team member and I realized that I’m looking at an attempt to negotiate graceful termination of a connection between two nodes. In particular, the code in question was invoked when one node was shutting down or had to tear down the connection for whatever reason.
That code was thrown out, and it made a very graceful arc all the way to the recycle bin.
But why? The underlying reason for this was to avoid needless error messages in the logs, which can trigger support calls and cost time & effort to figure out what is going on. That is an admirable goal, but at the same time, it is a false hope and a dangerous one at that.
Let us consider what it means that a node is shutting down. It means that it now needs to notify all its peers about this. It is no longer enough to just tear down all connections, it need to talk to them, and that means that we introduced network delays into the shutdown procedure. It also means that we now have to deal with error handling when we are trying to notify a peer that this node is shutting down, and that way lead to madness.
On the other hand, we have the other node, which node needs to also handle its peer getting up in the middle of the conversation and saying “I’m going away now” mid sentence. For that matter, since the shutdown signal (which is the common case for this to be triggered) can happen at any time, now we need to have thread safety on shutdown so we can send a legible message to the other side, and the other side must be ready to accept the shutdown message at any time. (“Do you have any new documents for me” request that expects a “There are N messages for you” now also need to handle “G’dbye world” notification).
Doing this properly complicates the code at every level, and you still need to handle the rude shutdown scenario.
Furthermore, what is the other side is supposed to do with the information that this node is shutting down the connection voluntarily? It is supposed to not connect to it again? If so, what policy should it use to decided if the other side is down for valid reasons or actually unavailable?
Assuming that there is actually a reason why there is a TCP connection between the two nodes, any interruption in service, for whatever reason, is not a valid state.
And if we ensure that we are always ending the connection in the same rude manner, we also gain a very valuable feature. We make sure that the error handling portion of the code get exercised on a regular basis, so if there are any issues there, they will be discovered easily.
As for the original issue of reducing support calls because of transient / resolved errors. That can be solved by not logging the error immediately, but waiting a bit to verify that the situation actually warrants writing to the operations log (writing to the info log should obviously happen regardless).