Cascading retries and the sulky applications
I recently ran into a bit of code that made me go: Stop! Don’t you dare go this way!
The reason I had such a strong reaction to the code in question is that I have seen where such code leads you, and it isn’t anywhere good. The code in question?
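The original snippet isn’t reproduced here, but the shape of it, with hypothetical names and a made-up schema, looks roughly like this: a data access helper that retries the query a fixed number of times, swallows every exception, and returns null when it gives up.

```csharp
using System;
using System.Data;
using System.Data.SqlClient;
using System.Threading;

public class OrdersRepository
{
    private readonly string _connectionString;

    public OrdersRepository(string connectionString)
    {
        _connectionString = connectionString;
    }

    public DataTable GetOrder(int orderId)
    {
        for (var attempt = 0; attempt < 3; attempt++)
        {
            try
            {
                using (var connection = new SqlConnection(_connectionString))
                using (var command = new SqlCommand("SELECT * FROM Orders WHERE Id = @id", connection))
                {
                    command.Parameters.AddWithValue("@id", orderId);
                    connection.Open();
                    var table = new DataTable();
                    using (var reader = command.ExecuteReader())
                    {
                        table.Load(reader);
                    }
                    return table;
                }
            }
            catch (Exception e)
            {
                // Any error at all is logged and retried, regardless of what it was.
                Console.Error.WriteLine($"Attempt {attempt + 1} failed: {e}");
                Thread.Sleep(500);
            }
        }

        return null; // out of retries, the caller gets a null back
    }
}
```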
This is a pretty horrible thing to do to your system. Let’s count the ways:
- Queries are happening fairly deep in your system, which means that you’re putting this sort of behavior in a place where it is generally invisible to the rest of the code.
- What happens if the calling code also has something similar? Now we have retries on retries.
- What happens if the code that you are calling has something similar? Now we have retries on retries on retries.
- You can absolutely rely on the code you are calling to already do retries of its own, if only because that is how TCP behaves, but also because there are usually other resiliency measures in place along the way.
- What happens if the error actually matters? No exception is ever thrown, which means that the only record of the problem is written to the log, which no one ever reads.
- There is no distinction between the types of errors where a retry may help and those where it won’t.
- What if the query has side effects? For example, you may end up calling a stored procedure multiple times.
- What happens when you run out of retries? The code will return null, which means that the calling code will likely fail with a NullReferenceException.
What is worse, by the way, is that this piece of code is attempting to fix a very specific issue: being unable to reach the relevant database. For example, if you are writing a service, you may run into this on reboot; your service may have started before the database did, so you need to retry a few times to let the database come up. A better option would be to specify the load order of the services.
Or maybe there was some network hiccup that you had to deal with? That is probably the one case where this would sort of work. But TCP already does that by resending packets; you are adding the same behavior again on top of it, and that builds up into a nasty case.
When there is an error, your application is going to sulk, throw strange errors and refuse to tell you what is going on. There are going to be a lot of symptoms that are hard to diagnose and debug.
To quote Release It!:
> Connection timeouts vary from one operating system to another, but they’re usually measured in minutes! The calling application’s thread could be blocked waiting for the remote server to respond for ten minutes!
You added a retry on top of that, and then the system just… stops.
Let’s take a look at the usage pattern, shall we?
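Again, the actual code isn’t shown here, but a sketch of the usage (building on the hypothetical repository above, with a made-up Status column) captures the relevant part: the result is dereferenced without a null check, inside yet another retry loop, “for resiliency”.

```csharp
public string GetOrderStatus(int orderId)
{
    for (var attempt = 0; attempt < 3; attempt++)
    {
        try
        {
            var order = _ordersRepository.GetOrder(orderId);
            // If GetOrder ran out of retries, order is null and this throws a
            // NullReferenceException, which the catch below happily swallows.
            return order.Rows[0]["Status"].ToString();
        }
        catch (Exception e)
        {
            Console.Error.WriteLine($"Attempt {attempt + 1} failed: {e}");
            Thread.Sleep(500);
        }
    }

    return null; // and so it goes, one level up
}
```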
That will fail pretty badly (and then cause a null reference exception). Let’s say that this is service code, which is called from a client that uses a similar pattern for “resiliency”.
Question – what do you think will happen the first time that there is an error? Cascading failures galore. If each of three layers retries three times, a single failing query turns into 27 attempts, each of them potentially waiting on its own timeouts.
In general, unknown errors shouldn’t be handled locally; you don’t have a way to meaningfully handle them here. You should let them propagate as far up as possible. And yes, showing the error to the user is generally better than just spinning in place without giving the user any feedback whatsoever.
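To make that concrete, here is a minimal sketch of the alternative, using the same hypothetical repository (the Render and ShowErrorToUser helpers are placeholders): the data access code just does its work and lets exceptions escape, and the single place at the top of the call stack decides what to show the user.

```csharp
// No catch, no retry, no null: errors escape with their full details intact.
public DataTable GetOrder(int orderId)
{
    using (var connection = new SqlConnection(_connectionString))
    using (var command = new SqlCommand("SELECT * FROM Orders WHERE Id = @id", connection))
    {
        command.Parameters.AddWithValue("@id", orderId);
        connection.Open();
        var table = new DataTable();
        using (var reader = command.ExecuteReader())
        {
            table.Load(reader);
        }
        return table;
    }
}

// The one place that decides what a failure means, at the top of the call stack.
public void ShowOrder(int orderId)
{
    try
    {
        Render(_ordersRepository.GetOrder(orderId));
    }
    catch (Exception e)
    {
        // Tell the user something went wrong instead of spinning in place,
        // and keep the full exception around for the operators.
        ShowErrorToUser("Could not load the order. Please try again later.", e);
    }
}
```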
Comments
I think a lot of developers are afraid of exceptions being thrown from their applications. To some extent this is warranted, but if you look at the list of things your code can do, they are, in order of preference:
1. Work correctly - hopefully this happens 99.99999% of the time
2. Throw an exception
3. Continue running in an unknown and unplanned-for state
There's a similar gap in consequences between 1 and 2 and between 2 and 3. Looking at this from another angle, an exception being thrown is the second best thing that can happen, and it should be rare if you get #1 to 5 nines.
(I thought I posted this earlier, so sorry if you are moderating and I double posted it.)
I'm not sure if the message translated by this post is complete or I'm not getting it. Entity Framework has retry functionality built into DbContext: Are they wrong by having this feature, as that allows to swallow some of the exceptions? Or that does not count to handling errors locally, as retry count is defined usually at IoC container configuration? Common case for this is how SQL Server HA or autoscale works - during failover or autoscale some of the connections are dropped and I'm not sure if user needs to see that.
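(For context, the DbContext feature being referred to is EF Core's connection resiliency, which is typically turned on along these lines; the context name, connection string, and retry numbers below are placeholders:)

```csharp
using System;
using Microsoft.EntityFrameworkCore;

public class OrdersContext : DbContext
{
    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
    {
        optionsBuilder.UseSqlServer(
            "<connection string>",
            sqlOptions => sqlOptions.EnableRetryOnFailure(
                maxRetryCount: 5,
                maxRetryDelay: TimeSpan.FromSeconds(30),
                errorNumbersToAdd: null));
    }
}
```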
Giedrius,
That is a great point, and shows the actual issue here. To start with, take a look at the following link, which details some of the issues with that feature from the horse's mouth, so to speak: https://docs.microsoft.com/en-us/ef/core/miscellaneous/connection-resiliency#transaction-commit-failure-and-the-idempotency-issue
And if the user writes this code and also uses the resiliency features? You get double work and no good results. Note that this feature also allows you to handle only specific errors, not all of them.
Chris B,
The problem is that the (rare) failure mode in the system needs to work in all cases, and error handling like this will cause cascading failures many times over.
@ayende - Absolutely. I was merely making the point that I've seen systems that do things similar to this, and the defense I've usually been given eventually boils down to "Exceptions are bad." Which has some truth to it, but usually isn't the worst option.
"Exceptions are bad" is really silly argument that can be rebuffed REALLY easily: exceptions let one know there is a problem that needs to be dealt with. Hence, they should happen often IF there is an issue, assuming they are detailed and informative enough. I suspect this idea is a reflection of corporate culture more than anything else.