Ayende @ Rahien

Grant Fritchey commented on Make a distinction: Errors vs. Alerts

Sat, 19 Nov 2011 13:21:22 GMT

Excellent post. One of the biggest problems I see with implementation of monitoring software, any monitoring software, is that people don't tune the alerts to maximize signal to noise. I wrote about it here: http://www.simple-talk.com/sql/database-administration/preventing-problems-in-sql-server/

Fero commented on Make a distinction: Errors vs. Alerts

Fri, 18 Nov 2011 09:29:43 GMT

I have been using the Elmah error module for a long time, and it works really well, but only on ASP.NET and ASP.NET MVC. So I developed an extension for Elmah that can be used with any project type: Silverlight, Console, WPF, WCF. The source is here: https://github.com/vincoss/vinco-logging-toolkit. A later update will be called Elmah.Everywhere.

Scooletz commented on Make a distinction: Errors vs. Alerts

Fri, 18 Nov 2011 07:00:57 GMT

@Will, @Alwin: yes, it would be nice, whether using IObservable or some other way to describe the requirements of an alert, but what about scaling such a solution? What if errors occur on different machines and are logged to different log files, databases, or whatever? @Ayende, where do you store such information? How do you want to filter the stream of events from multiple servers?

Alwin commented on Make a distinction: Errors vs. Alerts

Fri, 18 Nov 2011 02:01:02 GMT

Will, couldn't you do something like that with Reactive Extensions (Rx)? You know, with Throttle and such...

Will Gant commented on Make a distinction: Errors vs. Alerts

Thu, 17 Nov 2011 22:17:49 GMT

There should also be a SqlException inside angle brackets < > to the right of the For (it's intended to be a generic method). It might have gotten interpreted as HTML.

Will Gant commented on Make a distinction: Errors vs. Alerts

Thu, 17 Nov 2011 22:15:56 GMT

There should only be one period after the For(). Ayende's site handled the code just fine, but it figures I'd make at least one syntax error.

Will Gant commented on Make a distinction: Errors vs. Alerts

Thu, 17 Nov 2011 22:12:04 GMT

It would be nice if there was a package that let you fluently configure a policy for how your app handles errors based on type and contents. I'd love to be able to do something like:

    For().
    .InTimeSpan().Minutes(5)
    .Occurs(10)
    .CompareBy(CompareBy.StackTrace | CompareBy.Host)
    .Where(ex => ex.Message.Contains("Timeout"))
    .Act(ex => { SendPanicMessage(ex); });

That way, I could filter errors by type, contents, how close together they are, etc, and tell it what to do with them. I have no idea off the top of my head how one might implement this and make it perform well (especially across multiple machines), but something like this would be awful handy. The intent of the above is to send a panic message when 10 or more SqlExceptions with the word "Timeout" in their message occur in a five minute timespan, from the same host with the same stack trace. (This is just a first brush - someone that is actually skilled at making fluent interfaces could make this a good deal cleaner and more expressive.)

I think you'd almost have to chuck the exceptions off into a message queue or something though - you wouldn't want the logic to check all this stuff to be running inside your app. It would also probably need to be pushed to a central location to handle the load-balancing scenario. Further, if you were to chuck this into a database somewhere, you could report on the frequency of the errors. That might be handy for building a triage list for a development roadmap before the clients get involved. I also hope that the code doesn't get turned into (worse) indecipherable gibberish in the act of posting it.
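A rough sketch of the policy Will describes (10 or more matching errors within five minutes, grouped by host and stack trace) might look like the following. It is written in Python rather than C# purely for brevity, and every name in it (`ErrorEvent`, `AlertPolicy`, `observe`) is hypothetical, not taken from any existing package. It assumes events arrive in timestamp order and are checked in a single process, so it does not address the multi-machine question raised in the thread.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Callable, Hashable


@dataclass(frozen=True)
class ErrorEvent:
    """Hypothetical record for one logged error."""
    kind: str          # exception type name, e.g. "SqlException"
    message: str
    host: str
    stack_trace: str
    timestamp: float   # seconds; events are assumed to arrive in order


class AlertPolicy:
    """Calls `act` once `threshold` matching events fall within `window_seconds`."""

    def __init__(self, kind: str, window_seconds: float, threshold: int,
                 where: Callable[[ErrorEvent], bool],
                 key: Callable[[ErrorEvent], Hashable],
                 act: Callable[[ErrorEvent], None]) -> None:
        self.kind = kind
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.where = where
        self.key = key
        self.act = act
        # One queue of recent timestamps per grouping key (e.g. host + stack trace).
        self._recent = defaultdict(deque)

    def observe(self, event: ErrorEvent) -> bool:
        """Feed one event in; return True if it triggered the action."""
        if event.kind != self.kind or not self.where(event):
            return False
        bucket = self._recent[self.key(event)]
        bucket.append(event.timestamp)
        # Drop timestamps that have fallen out of the time window.
        while bucket and event.timestamp - bucket[0] > self.window_seconds:
            bucket.popleft()
        if len(bucket) >= self.threshold:
            bucket.clear()  # reset so one burst produces one alert
            self.act(event)
            return True
        return False
```

Constructed with `kind="SqlException"`, `window_seconds=300`, `threshold=10`, a `where` that checks for "Timeout" in the message, and a `key` of `(host, stack_trace)`, this mirrors the stated intent. Clearing the bucket after firing is one design choice among several; a cool-down period would be another.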

Chris Wright commented on Make a distinction: Errors vs. Alerts

Thu, 17 Nov 2011 21:43:44 GMT

@Simon You can use those strategies for some things. But consider this pattern of behavior: you're talking to a service and it usually responds in 50ms, with 99.9% of calls finishing in 250ms. But now 50% of its calls are over 2 seconds. If you have an alarm for a single call taking 2 seconds, you'll probably get pinged every ten or fifteen thousand calls. This isn't actionable, or even a problem. You want alarming on aggregate behavior, not individual requests. Now you're adding a fair bit of complexity around this call. For extra credit, what if you have half a dozen machines running behind a load balancer and want to alert based on the aggregate logs?
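To make the "aggregate, not individual" point concrete, here is a minimal Python sketch that tracks the fraction of slow calls over the most recent calls and only reports trouble when that fraction crosses a limit. The class name, the parameter names, and the default values are all assumptions chosen for illustration, and it covers a single process only; combining logs from several machines behind a load balancer is left open, as in the comment.

```python
from collections import deque


class SlowCallMonitor:
    """Flags a service when too large a share of recent calls is slow."""

    def __init__(self, slow_seconds: float = 2.0, window_size: int = 1000,
                 max_slow_fraction: float = 0.05, min_samples: int = 100) -> None:
        self.slow_seconds = slow_seconds
        self.max_slow_fraction = max_slow_fraction
        self.min_samples = min_samples
        # Each entry is True for a slow call, False otherwise.
        self._samples = deque(maxlen=window_size)
        self._slow_count = 0

    def record(self, duration_seconds: float) -> bool:
        """Record one call; return True when the window looks unhealthy."""
        is_slow = duration_seconds >= self.slow_seconds
        if len(self._samples) == self._samples.maxlen:
            # The oldest sample is about to be evicted; keep the count in sync.
            self._slow_count -= self._samples[0]
        self._samples.append(is_slow)
        self._slow_count += is_slow
        if len(self._samples) < self.min_samples:
            return False  # not enough data to judge yet
        return self._slow_count / len(self._samples) > self.max_slow_fraction
```

With this shape, the occasional 2-second call in ten thousand never trips the alarm, while a shift to 50% slow calls trips it within the first window.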

Phil commented on Make a distinction: Errors vs. Alerts

Thu, 17 Nov 2011 17:54:51 GMT

@Will Gant I really like log4net. It's open source (not from Microsoft) and has different logging levels, which can be changed at run time. As a plus it only requires a single assembly reference.

Will Gant commented on Make a distinction: Errors vs. Alerts

Thu, 17 Nov 2011 17:29:57 GMT

I was kind of hoping there was a non-Microsoft open source package that handles that well. My experience with the Enterprise Library has been that it just requires so much configuration and tinkering to get working that it isn't worth the effort. I'll admit that this impression is probably a bit dated though - they may have improved since the last time I worked with their stuff.

Rafal commented on Make a distinction: Errors vs. Alerts

Thu, 17 Nov 2011 16:39:38 GMT

Very often catching exceptions and logging them is not enough. Sometimes an alert should be raised if nothing happens for some time, for example when a service responsible for receiving messages from a queue dies quietly or gets stuck. Also, performance problems will not be detected by analyzing exceptions in the log file. IMHO the log files should be used to find the cause of a problem, but alerts should be raised based on other criteria, such as high-level application or system statistics and deviations from values considered normal. Examples: the 'queue latency' (the time messages spend in a queue before being processed), the web server request queue length, or unusual deviations in business process statistics such as the number of documents processed or tasks completed per minute. Usually you should identify the key indicators of system (mis)behavior and select the ones that matter to the users (they don't care about the server disk queue length, but they care a lot about GUI response time or the time it takes a document to travel between two systems). Sometimes it's good to implement checkpoints in the business process, for example making sure that all documents that arrive in the system are dealt with within 3 days (if not, there is an error somewhere).
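The "alert when nothing happens" case can be sketched as a simple watchdog: the worker reports a heartbeat each time it processes a message, and a separate check raises the alarm once the silence lasts too long. This is a minimal Python illustration; `SilenceWatchdog`, `beat`, and `is_silent` are hypothetical names, and the injectable `clock` parameter is an assumption added only to make the behaviour easy to test.

```python
import time
from typing import Callable


class SilenceWatchdog:
    """Reports when no heartbeat has been seen for too long."""

    def __init__(self, max_silence_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_silence_seconds = max_silence_seconds
        self._clock = clock
        self._last_beat = clock()  # creation counts as the first heartbeat

    def beat(self) -> None:
        """Call this whenever the monitored service does useful work."""
        self._last_beat = self._clock()

    def is_silent(self) -> bool:
        """True when the last heartbeat is older than the allowed silence."""
        return self._clock() - self._last_beat > self.max_silence_seconds
```

The queue consumer would call `beat()` after each message, and a scheduled job would poll `is_silent()` and raise the alert. The same shape could carry the 3-day document checkpoint, with the arrival time taking the place of the heartbeat.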

Daniel Lidström commented on Make a distinction: Errors vs. Alerts

Thu, 17 Nov 2011 16:07:32 GMT

@Will Gant: I believe Microsoft's Enterprise Library has a block for this purpose.

Will Gant commented on Make a distinction: Errors vs. Alerts

Thu, 17 Nov 2011 16:06:02 GMT

Strange coincidence. I'm trying to work out a better error handling strategy where I work right now, as we have a lot of error messages coming in that are just noise. Like your example, we've made a habit of ignoring the errors, often to our detriment (when the error is reported by a customer, it's now a marketing problem, not just a software problem). I've managed to get rid of a few of the big ones, but we're still getting far too many errors that are simply not useful - I don't know how we're going to fix this so that we are only notified when the error is worth being notified about. Is there a package that makes the handling of errors cleaner and more policy-driven? That would be nice.

Simon Skov Boisen commented on Make a distinction: Errors vs. Alerts

Thu, 17 Nov 2011 15:40:34 GMT

Shouldn't the use of error-severity categories solve a problem like the one your first customer had? Only log it as an error once the service has been unresponsive 8 times, and otherwise log it as info or debug?
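One reading of the escalation Simon suggests, sketched in Python rather than the .NET stack discussed in this thread: count consecutive failures and only return an error-level severity once the count reaches a threshold. `FailureEscalator` and its methods are hypothetical names, and treating "8 times" as 8 failures in a row is an assumption.

```python
import logging


class FailureEscalator:
    """Picks a log level based on how many failures happened in a row."""

    def __init__(self, threshold: int = 8) -> None:
        self.threshold = threshold
        self._consecutive = 0

    def record_failure(self) -> int:
        """Count one failure and return the level it should be logged at."""
        self._consecutive += 1
        if self._consecutive >= self.threshold:
            return logging.ERROR
        return logging.INFO

    def record_success(self) -> None:
        """Any successful call resets the streak."""
        self._consecutive = 0
```

A caller would pass the returned level to `logging.log(level, ...)`, so that only the eighth and later failures in a streak reach whatever watches the error level.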

Joseph Daigle commented on Make a distinction: Errors vs. Alerts

Thu, 17 Nov 2011 12:23:27 GMT

The corollary to this is that, if you want to do proper alerting, error handling cannot be an afterthought in your system. Alerting is typically on par with any other feature or user story that must be designed and tested. The only difference is that the "user" of this feature is typically a sys admin or a devops team member.