Make a distinction: Errors vs. Alerts

time to read 3 min | 474 words

At several customer visits recently, I encountered a common problem. Their production error handling strategy. In one customer, they capture all of the production errors, push them to System Center Operations Manager (SCOM) and then send those errors to the development team over email. On another customer, they didn’t really have a production error handling strategy.

The interesting thing about that is that in both cases, production errors were handled in exactly the same way. They were ignored until a user called and complained about that.

Huh?!

What do you mean, ignored? The first client obviously did the right thing, they had capture and monitored the production errors, notifying the development team on each and every one of them.

Well, it is actually very simple, at the first client, I asked everyone to raise their hands if they receive the production errors emails. About half the room raised their hands. Then I asked how many of them set up a rule to move those emails from their inbox.

Every single one of them had done that.

The major problem is that errors happen all the time. In the vast majority of cases, you don’t really care, and it will fix itself automatically. For example, a error such as Transaction Deadlock Exception might happen a few times a day. There really isn’t much you can do about those errors (well, re-architecting the app might do that, but that is out of scope for this post). Another might be a call to an external service that occasionally fails and already have a retry strategy in place.

Do you get the problem?

Getting notified about every single production error has immunized the team from them. Now going over the productions errors is just a chore, and a fairly unpleasant one.

That is a major difference between Errors and Alerts. An error is just an exception or a problem that happened in production. Usually, those aren’t really important. It will sort itself out. An ETL process that runs once an hour can fail a few times, and as long as it’ll complete within a reasonable time frame, you don’t care.

Did you notice how often that statement is repeated. You don’t care.

When do you care?

  • When the ETL process has been unable to complete for three consecutive times.
  • When the external service that you call has been unresponsive for over 8 hours.
  • When a specific error is happening over 50 times an hour.
  • When an unknown error showed up in the logs more than twice in the last hour.

Each of those cases requires a human intervention in needed. And in most cases, those are going to be rare.

Errors are common place, they happen all the time and no one really care. Alerts is what you wake up at 2 AM for.