Make a distinction: Errors vs. Alerts
At several customer visits recently, I encountered a common problem: their production error handling strategy. One customer captures all of the production errors, pushes them to System Center Operations Manager (SCOM), and then sends those errors to the development team over email. Another customer didn’t really have a production error handling strategy.
The interesting thing is that in both cases, production errors were handled in exactly the same way: they were ignored until a user called and complained.
Huh?!
What do you mean, ignored? The first client obviously did the right thing: they captured and monitored the production errors, notifying the development team about each and every one of them.
Well, it is actually very simple. At the first client, I asked everyone to raise their hands if they received the production error emails. About half the room raised their hands. Then I asked how many of them had set up a rule to move those emails out of their inbox.
Every single one of them had done that.
The major problem is that errors happen all the time. In the vast majority of cases, you don’t really care, and the problem will fix itself automatically. For example, an error such as a Transaction Deadlock Exception might happen a few times a day. There really isn’t much you can do about those errors (well, re-architecting the app might do that, but that is out of scope for this post). Another might be a call to an external service that occasionally fails and already has a retry strategy in place.
Do you get the problem?
Getting notified about every single production error has desensitized the team to them. Now going over the production errors is just a chore, and a fairly unpleasant one.
That is the major difference between Errors and Alerts. An error is just an exception or a problem that happened in production. Usually, those aren’t really important. They will sort themselves out. An ETL process that runs once an hour can fail a few times, and as long as it completes within a reasonable time frame, you don’t care.
Did you notice how often that statement is repeated? You don’t care.
When do you care?
- When the ETL process has been unable to complete three consecutive times.
- When the external service that you call has been unresponsive for over 8 hours.
- When a specific error is happening over 50 times an hour.
- When an unknown error showed up in the logs more than twice in the last hour.
Each of those cases requires human intervention. And in most cases, those are going to be rare.
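The four rules in the list above can be sketched as a small policy object that sees every error but only occasionally returns an alert. This is a minimal illustration in Python, not the setup used at either customer: the class name `AlertPolicy`, its method names, and the idea of passing timestamps in as plain seconds are all assumptions made for the example. Only the thresholds come from the list.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field

HOUR = 3600.0  # seconds


@dataclass
class AlertPolicy:
    """Hypothetical policy: every error is recorded, few become alerts."""
    known_errors: set                      # error names we already understand
    max_consecutive_failures: int = 3      # e.g. the ETL process
    max_outage_seconds: float = 8 * HOUR   # external service unresponsive
    known_error_limit: int = 50            # "over 50 times an hour"
    unknown_error_limit: int = 2           # "more than twice in the last hour"
    _failures: dict = field(default_factory=lambda: defaultdict(int))
    _down_since: dict = field(default_factory=dict)
    _seen: dict = field(default_factory=lambda: defaultdict(deque))

    def job_finished(self, job, succeeded):
        """Alert once a job has failed N times in a row; a success resets it."""
        if succeeded:
            self._failures[job] = 0
            return None
        self._failures[job] += 1
        if self._failures[job] >= self.max_consecutive_failures:
            return f"{job} failed {self._failures[job]} consecutive times"
        return None

    def service_checked(self, service, responsive, now):
        """Alert when a service has been down longer than the allowed outage."""
        if responsive:
            self._down_since.pop(service, None)
            return None
        down_since = self._down_since.setdefault(service, now)
        if now - down_since > self.max_outage_seconds:
            return f"{service} has been unresponsive for too long"
        return None

    def error_seen(self, name, now):
        """Alert when an error exceeds its hourly limit (lower for unknowns)."""
        times = self._seen[name]
        times.append(now)
        while now - times[0] > HOUR:       # drop occurrences older than an hour
            times.popleft()
        known = name in self.known_errors
        limit = self.known_error_limit if known else self.unknown_error_limit
        if len(times) > limit:
            return f"{name} occurred {len(times)} times in the last hour"
        return None
```

Every method returns `None` in the common case, which is the point: a deadlock that happens a few times a day is recorded and nothing else happens, while the fourth ETL failure in a row, or a third occurrence of an unfamiliar error within the hour, produces a message worth sending to a human.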
Errors are commonplace; they happen all the time and no one really cares. Alerts are what you wake up at 2 AM for.
Comments
The corollary to this is that if you want proper alerting, error handling cannot be an afterthought in your system. Alerting is on par with any other feature or user story that must be designed and tested. The only difference is that the "user" of this feature is typically a sys admin or a devops team member.
Shouldn't the use of error-severity categories solve a problem like the first of your customers had? Only log it as an error when the service was unresponsive for 8 times, else log as info or debug?
Strange coincidence. I'm trying to work out a better error handling strategy where I work right now, as we have a lot of error messages coming in that are just noise. Like your example, we've made a habit of ignoring the errors, often to our detriment (when the error is reported by a customer, it's now a marketing problem, not just a software problem). I've managed to get rid of a few of the big ones, but we're still getting far too many errors that are simply not useful - I don't know how we're going to fix this so that we are only notified when the error is worth being notified about.
Is there a package that makes the handling of errors cleaner and more policy-driven? That would be nice.
@Will Gant: I believe Microsoft's Enterprise Library has a block for this purpose.
Very often catching exceptions and logging them is not enough. Sometimes an alert should be raised if nothing happens for some time, for example when a service responsible for receiving messages from a queue dies quietly or gets stuck. Also, performance problems will not be detected by analyzing exceptions in the log file. IMHO the log files should be used to find the cause of a problem, but alerts should be raised based on other criteria, such as high-level application or system statistics and deviations from values considered normal. Examples: measuring the 'queue latency' (the time messages spend in a queue before being processed), web server request queue length, and unusual deviations in business process statistics such as the number of documents processed or tasks completed per minute. Usually you should identify the key indicators of system (mis)behavior and select the ones that are important to the users (they don't care about the server disk queue length, but they care a lot about GUI response time or the time it takes a document to travel between two systems). Sometimes it's good to implement checkpoints in the business process, for example making sure that all documents that arrive in the system are dealt with within 3 days (if not, there's an error somewhere).
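The "alert when nothing happens" case from the comment above can be sketched as a watchdog that the queue consumer pokes every time it processes a message. This is only an illustration in Python: `SilenceWatchdog` and its methods are hypothetical names, and the clock is injectable purely so the behavior can be exercised without waiting.

```python
import time


class SilenceWatchdog:
    """Hypothetical watchdog: reports when a component has gone quiet."""

    def __init__(self, max_silence_seconds, clock=time.monotonic):
        self.max_silence_seconds = max_silence_seconds
        self._clock = clock
        self._last_activity = clock()

    def record_activity(self):
        """Call this whenever the service does real work, e.g. per message."""
        self._last_activity = self._clock()

    def is_silent(self):
        """True when no activity has been recorded for longer than allowed."""
        return self._clock() - self._last_activity > self.max_silence_seconds
```

A separate monitoring loop would poll `is_silent()` and raise the alert. No exception is ever thrown or logged in this scenario, which is exactly why log analysis alone would miss it.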
I was kind of hoping there was a non-Microsoft open source package that handles that well. My experience with the Enterprise Library has been that it just requires so much configuration and tinkering to get working that it isn't worth the effort. I'll admit that this impression is probably a bit dated though - they may have improved since the last time I worked with their stuff.
@Will Gant I really like log4net. It's open source (not from Microsoft) and has different logging levels, which can be changed at run time. As a plus it only requires a single assembly reference.
@Simon You can use those strategies for some things.
But consider this pattern of behavior: you're talking to a service and it usually responds in 50ms, with 99.9% of calls finishing in 250ms. But now 50% of its calls are over 2 seconds.
If you have an alarm for a single call taking 2 seconds, you'll probably get pinged every ten or fifteen thousand calls. This isn't actionable, or even a problem.
You want alarming on aggregate behavior, not individual requests. Now you're adding a fair bit of complexity around this call.
For extra credit, what if you have half a dozen machines running behind a load balancer and want to alert based on the aggregate logs?
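One way to alarm on aggregate behavior rather than on individual requests is to keep a sliding window of recent call durations and look at the fraction that were slow. The sketch below is in Python and every name and default in it (`SlowCallMonitor`, the window of 1000 calls, the 5% threshold) is an assumption chosen for illustration. It also sidesteps the load-balancer question by assuming that durations from all machines are fed into one monitor.

```python
from collections import deque


class SlowCallMonitor:
    """Hypothetical monitor: alerts on the share of slow calls, not on one."""

    def __init__(self, window_size=1000, slow_seconds=2.0,
                 max_slow_fraction=0.05, min_samples=100):
        self.slow_seconds = slow_seconds
        self.max_slow_fraction = max_slow_fraction
        self.min_samples = min_samples
        # Each entry is True if that call was slow; old entries fall off.
        self._samples = deque(maxlen=window_size)

    def record(self, duration_seconds):
        """Record one call; a single slow call never alerts by itself."""
        self._samples.append(duration_seconds > self.slow_seconds)

    def should_alert(self):
        """True when too large a share of the recent calls were slow."""
        if len(self._samples) < self.min_samples:
            return False  # not enough data to judge
        slow_fraction = sum(self._samples) / len(self._samples)
        return slow_fraction > self.max_slow_fraction
```

With these defaults, the occasional 2-second call is absorbed by the window, while the shift described above, where half of all calls take over 2 seconds, trips the alert once enough slow calls have accumulated in the window.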
It would be nice if there was a package that let you fluently configure a policy for how your app handles errors based on type and contents. I'd love to be able to do something like:
For<System.Data.SqlClient.SqlException>()
    .InTimeSpan().Minutes(5)
    .Occurs(10)
    .CompareBy(CompareBy.StackTrace | CompareBy.Host)
    .Where(ex => ex.Message.Contains("Timeout"))
    .Act(ex => { SendPanicMessage(ex); });
That way, I could filter errors by type, contents, how close together they are, etc, and tell it what to do with them. I have no idea off the top of my head how one might implement this and make it perform well (especially across multiple machines), but something like this would be awful handy. The intent of the above is to send a panic message when 10 or more SqlExceptions with the word "Timeout" in their message occur in a five minute timespan, from the same device with the same stacktrace. (This is just a first brush - someone that is actually skilled at making fluent interfaces could make this a good deal cleaner and more expressive).
I think you'd almost have to chuck the exceptions off into a message queue or something though - you wouldn't want the logic to check all this stuff to be running inside your app. It would also probably need to be pushed to a central location to handle the load-balancing scenario. Further, if you were to chuck this into a database somewhere, you could report on the frequency of the errors. That might be handy for building a triage list for a development roadmap before the clients get involved.
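As a rough idea of what the central piece might look like, here is a Python sketch of a consumer that receives error records from all machines (for example, pulled off a message queue) and applies the policy described above: same error type, a predicate on the record, grouped by stack trace and host, N occurrences within a time window. Nothing here is an existing package. `CentralErrorPolicy`, the record fields, and the choice to reset a group after it fires are all assumptions.

```python
from collections import defaultdict, deque


class CentralErrorPolicy:
    """Hypothetical central consumer for error records from many machines."""

    def __init__(self, error_type, predicate, occurrences=10,
                 window_seconds=300.0):
        self.error_type = error_type
        self.predicate = predicate
        self.occurrences = occurrences
        self.window_seconds = window_seconds
        # One sliding window of timestamps per (stack trace, host) group.
        self._windows = defaultdict(deque)

    def handle(self, record):
        """Process one record (a dict); return True if it triggers an alert.

        Expected keys: 'type', 'message', 'stack_trace', 'host', 'timestamp'
        (timestamp in seconds).
        """
        if record["type"] != self.error_type or not self.predicate(record):
            return False
        window = self._windows[(record["stack_trace"], record["host"])]
        window.append(record["timestamp"])
        while record["timestamp"] - window[0] > self.window_seconds:
            window.popleft()  # forget occurrences outside the time window
        if len(window) >= self.occurrences:
            window.clear()  # start over so one burst yields one alert
            return True
        return False
```

The policy from the fluent example would then be built as `CentralErrorPolicy("SqlException", lambda r: "Timeout" in r["message"])`, with the caller sending the panic message whenever `handle` returns `True`. Storing the same records in a database would cover the frequency reporting mentioned above.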
I also hope that the code doesn't get turned into (worse) indecipherable gibberish in the act of posting it.
There should only be one period after the For(). Ayende's site handled the code just fine, but it figures I'd make at least one syntax error.
There should also be a SqlException inside angle brackets < > to the right of the For (it's intended to be a generic method). It might have gotten interpreted as HTML.
Will, couldn't you do something like that with Reactive Framework (Rx)? You know, with Throttle and such...
@Will, @Alwin yes, it would be nice, whether using IObservable or some other way to describe the requirements of an alert, but what about scaling such a solution? What if errors occur on different machines and are logged to different log files, databases, or whatever? @Ayende, where do you store such information? How do you want to filter the stream of events from multiple servers?
I have been using the Elmah error module for a long time, and it works really well, but only on ASP.NET and ASP.NET MVC. So I developed an extension for Elmah that can be used with any project: Silverlight, Console, WPF, WCF. Here is the source: https://github.com/vincoss/vinco-logging-toolkit.
A later update will be called Elmah.Everywhere.
Excellent post. One of the biggest problems I see with implementation of monitoring software, any monitoring software, is that people don't tune the alerts to maximize signal to noise. I wrote about it here: http://www.simple-talk.com/sql/database-administration/preventing-problems-in-sql-server/