Cascading retries and the sulky applications


I recently ran into a bit of code that made me go: Stop! Don’t you dare go down this path!

The reason I reacted so strongly is that I have seen where such code leads you, and it is nowhere good. The code in question?
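The original post showed the code as an image; a minimal sketch of the pattern, with hypothetical names (OrdersRepository, GetRecentOrderIds, and the connection string are all stand-ins), would look roughly like this:

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Threading;

public class OrdersRepository
{
    private readonly string _connectionString;

    public OrdersRepository(string connectionString) => _connectionString = connectionString;

    public List<string> GetRecentOrderIds(string customerId)
    {
        for (var i = 0; i < 3; i++)
        {
            try
            {
                using var connection = new SqlConnection(_connectionString);
                connection.Open();

                using var cmd = new SqlCommand(
                    "SELECT Id FROM Orders WHERE CustomerId = @customerId", connection);
                cmd.Parameters.AddWithValue("@customerId", customerId);

                var ids = new List<string>();
                using var reader = cmd.ExecuteReader();
                while (reader.Read())
                    ids.Add(reader.GetString(0));
                return ids;
            }
            catch (Exception e)
            {
                // The only trace of the failure is a log entry that no one reads.
                Console.Error.WriteLine(e);
                Thread.Sleep(1000); // wait a bit and hope the problem goes away
            }
        }

        return null; // out of retries: hand the caller a null and hope for the best
    }
}
```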

This is a pretty horrible thing to do to your system. Let’s count the ways:

  • Queries are happening fairly deep in your system, which means that you’re putting this sort of behavior in a place where it is invisible to the rest of the code.
  • What happens if the calling code also has something similar? Now we have retries on retries.
  • What happens if the code that you are calling has something similar? Now we have retries on retries on retries (see the sketch after this list).
    • You can absolutely rely on the code you are calling to do retries of its own, if only because that is how TCP behaves, but also because resiliency measures are usually implemented at the lower layers.
  • What happens if the error actually matters? No exception is ever thrown, which means that the important information is only written to the log, which no one ever reads.
  • There is no distinction between the types of errors where a retry may help and those where it won’t.
  • What if the query has side effects? For example, you may be calling a stored procedure, and now you are calling it multiple times.
  • What happens when you run out of retries? The code returns null, which means that the calling code will likely fail with a NullReferenceException.
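The retries-on-retries point is worth making concrete. Here is a minimal, self-contained sketch, with hypothetical names and a simulated outage, of three layers that each make three attempts:

```csharp
using System;

public static class RetryStacking
{
    private static int _databaseAttempts;

    // Innermost layer: the data access code swallows errors and retries.
    private static string QueryDatabase()
    {
        for (var i = 0; i < 3; i++)
        {
            _databaseAttempts++;
            try
            {
                throw new Exception("database is unreachable"); // simulated outage
            }
            catch
            {
                // swallow and retry
            }
        }
        return null;
    }

    // Middle layer: the service adds its own "resiliency".
    private static string CallService()
    {
        for (var i = 0; i < 3; i++)
        {
            var result = QueryDatabase();
            if (result != null)
                return result;
        }
        return null;
    }

    public static void Main()
    {
        // Outermost layer: the client retries the service call as well.
        for (var i = 0; i < 3; i++)
            CallService();

        // 3 client attempts x 3 service attempts x 3 query attempts = 27
        Console.WriteLine(_databaseAttempts);
    }
}
```

Add delays between attempts and minutes-long connection timeouts on top of that multiplication, and a single failing request turns into a very long stall.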

What is worse, by the way, is that this piece of code is attempting to fix a very specific issue: being unable to reach the relevant database. For example, if you are writing a service, you may run into that on reboot; your service may have started before the database, so you retry a few times to let the database finish loading. A better option is to specify the load order of the services, as shown below.
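On Windows, for instance, you can declare the dependency when you configure the service, so the Service Control Manager won’t start yours until the database service is running. A sketch, assuming a hypothetical service called MyService that depends on the default SQL Server instance (MSSQLSERVER):

```
rem The space after "depend=" is required by sc.exe.
sc config MyService depend= MSSQLSERVER
```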

Or maybe there was some network hiccup that you had to deal with? That is probably the one case where this would sort of work. But TCP already handles that by resending packets; you are layering the same behavior on top of it, and that builds up into a nasty situation.

When there is an error, your application is going to sulk, throw strange errors and refuse to tell you what is going on. There are going to be a lot of symptoms that are hard to diagnose and debug.

To quote Release It!:

Connection timeouts vary from one operating system to another, but they’re usually measured in minutes! The calling application’s thread could be blocked waiting for the remote server to respond for ten minutes!

You added a retry on top of that, and then the system just… stops. With a ten-minute connection timeout and three attempts at each of two layers, a single request can be stuck for 10 × 3 × 3 = 90 minutes.

Let’s take a look at the usage pattern, shall we?
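The usage pattern was also shown as an image in the original post; it was roughly of this shape (again a hypothetical sketch, reusing the retrying GetRecentOrderIds from the first snippet):

```csharp
using System;

public class OrdersClient
{
    private readonly OrdersRepository _repository; // the retrying repository from above

    public OrdersClient(OrdersRepository repository) => _repository = repository;

    public int CountRecentOrders(string customerId)
    {
        // Its own retry loop for "resiliency", and no null check in sight.
        for (var i = 0; i < 3; i++)
        {
            try
            {
                var ids = _repository.GetRecentOrderIds(customerId);
                return ids.Count; // NullReferenceException when ids is null
            }
            catch (Exception e)
            {
                Console.Error.WriteLine(e); // logged, swallowed, retried
            }
        }
        return 0; // silently pretend the customer has no orders
    }
}
```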

That will fail pretty badly (and then cause a null reference exception). Let’s say that this is service code, which is called from a client that uses a similar pattern for “resiliency”.

Question – what do you think will happen the first time that there is an error?  Cascading failures galore.

In general, unknown errors shouldn’t be handled locally; you don’t have a meaningful way to handle them there. You should raise them as far up as possible. And yes, showing the error to the user is generally better than just spinning in place without giving the user any feedback whatsoever.
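A sketch of what that can look like, assuming an ASP.NET Core controller as the outer boundary and a repository method rewritten to simply run the query and throw on failure (the names are mine, not from the original post):

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;
using Microsoft.AspNetCore.Mvc;

public class OrdersController : ControllerBase
{
    private readonly OrdersRepository _repository;

    public OrdersController(OrdersRepository repository) => _repository = repository;

    [HttpGet("orders/{customerId}")]
    public IActionResult GetOrders(string customerId)
    {
        try
        {
            // The repository just queries; no local retries, no swallowed errors.
            List<string> ids = _repository.GetRecentOrderIds(customerId);
            return Ok(ids);
        }
        catch (SqlException)
        {
            // One visible place that decides what the failure means,
            // and the user gets an answer instead of a spinning request.
            return StatusCode(503, "The orders database is currently unavailable.");
        }
    }
}
```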