Why you should avoid graceful error handling like the plague that it is
A while ago I was reviewing a pull request by a team member and I realized that I was looking at an attempt to negotiate graceful termination of a connection between two nodes. In particular, the code in question was invoked when one node was shutting down or had to tear down the connection for whatever reason.
That code was thrown out, and it made a very graceful arc all the way to the recycle bin.
But why? The underlying reason for this was to avoid needless error messages in the logs, which can trigger support calls and cost time & effort to figure out what is going on. That is an admirable goal, but at the same time, it is a false hope and a dangerous one at that.
Let us consider what it means that a node is shutting down. It means that it now needs to notify all its peers about this. It is no longer enough to just tear down all connections; it needs to talk to them, and that means that we have introduced network delays into the shutdown procedure. It also means that we now have to deal with error handling when we are trying to notify a peer that this node is shutting down, and that way lies madness.
On the other hand, we have the other node, which also needs to handle its peer getting up in the middle of the conversation and saying “I’m going away now” mid-sentence. For that matter, since the shutdown signal (which is the common trigger for this) can happen at any time, we now need thread safety on shutdown so we can send a legible message to the other side, and the other side must be ready to accept the shutdown message at any time. (A “Do you have any new documents for me” request that expects a “There are N messages for you” reply now also needs to handle a “G’dbye world” notification.)
Doing this properly complicates the code at every level, and you still need to handle the rude shutdown scenario.
Furthermore, what is the other side supposed to do with the information that this node is shutting down the connection voluntarily? Is it supposed to not connect to it again? If so, what policy should it use to decide whether the other side is down for valid reasons or actually unavailable?
Assuming that there is actually a reason why there is a TCP connection between the two nodes, any interruption in service, for whatever reason, is not a valid state.
And if we ensure that we are always ending the connection in the same rude manner, we also gain a very valuable feature. We make sure that the error handling portion of the code gets exercised on a regular basis, so if there are any issues there, they will be discovered easily.
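To make the "one rude path" idea concrete, here is a minimal sketch in Python (not the code from the pull request, and not how RavenDB does it). The names `keep_connected`, `connect` and `serve` are hypothetical, and the backoff numbers are arbitrary assumptions. The point is only that the surviving peer has a single recovery path, and it runs whether the other side crashed, was restarted, or shut down on purpose.

```python
import time


def keep_connected(connect, serve, max_failures, base_delay=0.5,
                   max_delay=30.0, sleep=time.sleep):
    """Run serve(connect()) and reconnect on any dropped connection.

    connect and serve are hypothetical callables supplied by the caller.
    There is deliberately no "peer said goodbye" branch: every disconnect,
    polite or not, lands in the same except block.
    Returns True if serve() finished, False after max_failures failures.
    """
    failures = 0
    while failures < max_failures:
        try:
            conn = connect()
            serve(conn)  # returns normally only when the work is done
            return True
        except (ConnectionError, EOFError, TimeoutError):
            # The one and only recovery path: back off, then try again.
            failures += 1
            delay = min(max_delay, base_delay * 2 ** (failures - 1))
            sleep(delay)
    return False
```

Because a normal shutdown of the peer goes through the same `except` block as a crash, any bug in the reconnect logic tends to show up during routine restarts rather than during a real outage.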
As for the original issue of reducing support calls because of transient / resolved errors, that can be solved by not logging the error immediately, but waiting a bit to verify that the situation actually warrants writing to the operations log (writing to the info log should obviously happen regardless).
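A rough sketch of that delayed escalation, in Python, might look like the following. Everything here is an assumption made for illustration: the class name `DeferredErrorReporter`, the `grace_seconds` parameter, and the two log callables are hypothetical, and the right grace period depends on your system.

```python
import time


class DeferredErrorReporter:
    """Write to the operations log only if a failure outlives a grace period.

    info_log and ops_log are hypothetical callables that take a message.
    clock is injectable so the behavior can be checked without waiting.
    """

    def __init__(self, grace_seconds, info_log, ops_log, clock=time.monotonic):
        self.grace_seconds = grace_seconds
        self.info_log = info_log
        self.ops_log = ops_log
        self.clock = clock
        self.first_failure_at = None  # start of the current outage, if any
        self.escalated = False        # True once this outage hit the ops log

    def on_failure(self, message):
        now = self.clock()
        self.info_log(message)  # the info log gets every failure, regardless
        if self.first_failure_at is None:
            self.first_failure_at = now
        outage = now - self.first_failure_at
        if not self.escalated and outage >= self.grace_seconds:
            self.ops_log(message)  # still failing after the grace period
            self.escalated = True

    def on_success(self):
        # The peer came back, so the outage was transient: forget it.
        self.first_failure_at = None
        self.escalated = False
```

With this shape, a peer that restarts and comes back within the grace period leaves a trace in the info log only, while a peer that stays down produces exactly one entry in the operations log per outage.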
Comments
Does the server shutdown wait for current messages to finish processing, or does it just terminate everything immediately? What happens to new requests made after the shutdown signal?
Just quote the legend: https://blogs.msdn.microsoft.com/ricom/2008/05/12/shutdown-is-no-time-for-spring-cleaning/
When your application is ordered to shutdown the last thing you should do is enumerate every piece of memory you have ever allocated and systematically give them back to the operating system. Your program has a death sentence, and soon your resources are going back to the operating system whether you like it or not: what you must do is look at the minimum possible amount of memory necessary to get to a nice safe stable state and then exit as quickly as possible. Abandoning your memory like this gives the operating system the best chance to get your process unloaded while swapping in the least amount of memory and causing the least impact to the systems disk and memory caches.
Yeah. Speaking of communication during shutdown, the four-way handshake to close a TCP/IP socket has always seemed excessive. I mean, it is technically necessary since it's full-duplex and some protocols may not have a notion of a "message"; and it's not horrible since the shutdown is actually handled by the OS if the app exits, but... still... four steps???
@Oleg Good article - I have wasted a lot of minutes while my IE was closing down. It just needed to save the settings and then a failfast.
If you have a stationary PC, you might want to receive updates when you go home, and spring cleaning during shutdown is OK. I hate receiving updates in the morning. On the other hand, if you have a laptop, it is a waste of time if it wants to update when you want to go home. Using a laptop, I would prefer updates before lunch.
In a test environment, it is OK to try to test that all the memory is released in order to avoid memory leaks. I guess that fail fast is bypassing the destructors (RAII).
Just a crazy idea for RavenDB. An option to do maintenance/time consuming tasks when shutting down. I already know you hate options!
Paul, Normally, the server will wait for requests to complete up to a point, and it will refuse new requests once the shutdown has begun.
Oleg, That assumes that you shut down the whole thing. In this case, we are talking about shutting down a single database in a server that can host many. And the actual error handling isn't on this server, it is on the other side, the server that it is talking to and it is going to still be alive.
Stephen, You really want that to happen, though. Otherwise you have connections in TIME_WAIT and have to resort to evil measures to avoid it.
This seems rather specific to stateful network connections -- where you have a state machine processing incoming bytes over a socket, parsing them into data structures you can act on, especially with recursive structures; or where you have nontrivial sequences of inputs and outputs. Databases and SMTP are good examples of this. (Application state that doesn't lead to this sort of caution with the connection should be easier to deal with.)
If you have a service with stateless connections, a message that says "omg unexpected shutdown bai" is cheap to add. It doesn't save you from ungraceful shutdown handling, granted, but you won't pay for it everywhere. A websocket over which you're accepting a series of remote calls is a good example -- you have one function that handles commands from the client, and one of those commands is to close the connection. Far from a plague. But the benefits would seem to be lacking.
For it to be useful, though, you need to do something different in the face of the "shutting down now sry" message than when you discover the connection was closed out from under you.