Production ready code is much more than error handling

time to read 7 min | 1227 words

imageProduction ready code is a term that I don’t really like. I much prefer the term: Production Ready System. This is because production readiness isn’t really a property of a particular piece of code, but of the entire system.

The term is often thrown around, and usually it is referred to adding error handling and robustness to a piece of code. For example, let’s take an example from the Official Docs:

This kind of code is obviously not production ready, right? Asked to review it, most people would point out the lack of error handling if the request fails. I asked on twitter about this and got some good answers, see here.

In practice, to make this piece of code production worthy you’ll need a lot more code and infrastructure:

  • .NET specific - ConfigureAwait(false) to ensure this works properly with a SynchronizationContext
  • .NET specific – Http Client caches Proxy settings and DNS resolution, requiring you to replace it if there is a failure / on a timer.
  • .NET specific – Exceptions won’t be thrown from Http Client if the server sent an error back (including things like auth failures).
  • Input validation – especially if this is exposed to potentially malicious user input.
  • A retry mechanism (with back off strategy) is required to handle transient conditions, but need either idempotent requests or way to avoid duplicate actions.
  • Monitoring for errors, health checks, latencies, etc.
  • Metrics for performance, how long such operations take, how many ops / sec, how many failures, etc.
  • Metrics for the size of responses (which may surprise you).
  • Correlation id for end to end tracing.
  • Properly handling of errors – including reading the actual response from the server and surfacing it to the caller / logs.
  • Handling successful requests that don’t contain the data they are supposed to.

And these are just the stuff that pop to my head from looking at 10 lines of really simple code.

And after you have done all of that, you are still not really production ready. Mostly because if you implemented all of that in the GetProductAsync() function, you can’t really figure out what is actually going on.

These kind of operation is something that you want to have to implement once, via the infrastructure. There are quite a few libraries which does robust service handling that you can use, and using that will help, but it will only take you part way toward production ready system.

Let’s take cars and driving as an example of a system. If you’ll look at a car, you’ll find that quite a bit of the car design, constraints and feature set is driven directly by the need to handle the failure mode.

A modern car will have (just the stuff that is obvious and pops to mind):

  • Drivers – required explicit learning stage and passing competency test, limits on driving in impaired state, higher certification levels for more complex vehicles.
  • Accident prevention: ABS, driver assist and seat belt beeps.
  • Reduce injuries / death when accidents do happen – seat belts, air bags, crumple zones.
  • On the road – rumble strips, road fence, road maintenance, traffic laws, active and passive enforcement.

I’m pretty sure that anyone who actually understand cars will be shocked by how sparse my list is. It is clear, however, that accidents, their prevention and reducing their lethality and cost are a part and parcel of all design decisions on cars. In fact, there is a multi layered approach for increasing the safety of drivers and passengers. I’m not sure how comparable the safety of a car is to production readiness of a piece of software, though. One of the ways that cars compete with one another is on the safety features. So there is a strong incentive to improve there. That isn’t usually the case with software.

It usually take a few (costly) lessons about how much being unavailable costs you before you can really feel how much not being production ready costs you. And at this point, most people turn to error handling and recovery strategies. I think this is a mistake. A great read on the topic is How Complex System Fail, it is a great, short paper, highly readable and very relevant to the field of software development.

I consider a system to production ready when it has, not error handling inside a particular component, but actual dedicated components related to failure handling (note the difference from error handling), management of failures and its mitigations.

The end goal is that you’ll be able to continue execution and maintain semblance of normalcy to the outside world. That means having dedicated parts of the system that are just about handling (potentially very rare) failure modes as well as significant impact on your design. and architecture. That is not an inexpensive proposition. it takes quite a lot of time and effort to get there, and it is usually only worth it if you actually need the reliability this provides.

With cars, the issue is literally human lives, so we are willing to spend quite a lot of preventing accidents and reducing their impact. However, the level of robustness I expect from a toaster is quite different (don’t go on fire, pretty much) and most of that is already handled by the electrical system in the house.

Erlang is a good example of a language and environment that has always prioritized production availability. Erlang systems famously have 99.9999999% availability (that is nine nines). That is 32 milliseconds of downtime per year, which pretty much means less than the average GC pause in most systems. Erlang have a lot of infrastructure to support this kind of availability numbers, but that still require you to understand the whole system.

For example, if your Erlang service depends on a database, a restart of a database server (which takes 2 minutes to cycle) might very well means that your service processes will die, will be restarted by their supervisors only to die again and again. At this point, the supervisors itself give up and die, passing the buck up the chain. The usual response is to restart the supervisor again a few times, but the database is still down and we are in a cascading failure scenario. Just restarting is really effective in handling errors, but for certain failure scenarios, you need to consider how you’ll actually make it work. A database being unavailable can make your entire system cycle through its restarts options and die just as the database is back online. For that matter, what happens to all the requests that you tried to process at that time?

I have had a few conversations that went something like: “Oh, we use Erlang, that is handled”, but production readiness isn’t something that you can solve at the infrastructure level. It has a global impact on your architecture, design and the business itself. There are a lot of questions that you can’t answer from a technical point of view. “If I can’t validate the inventory status, should I accept an order or not?” is probably the most famous one, and that is something that the business itself need to answer.

Although, to be honest, the most important answer that you need from the business is a much more basic one: “Do we need to worry about production readiness, and if so, by how much?”