Ayende @ Rahien

Do you monitor negative events?

You are probably aware that you need to monitor your production systems for errors, and to add health monitoring for your servers.

But are you monitoring negative events? What is a negative event? Something that should have happened and didn’t.

For example, every week you have a process that runs to update the tax rates that apply to your customers. This is implemented as a scheduled process, but for some reason (the computer was being rebooted, the user’s password expired, etc.) that process didn’t run. There isn’t an error, per se. You won’t get an error because nothing had a chance to happen.

Another example would be getting a callback confirmation that an order payment has been correctly processed. That usually happens within 1 to 5 minutes, and you get an OK/Fail notification. But what happens if that notification just never comes?

This is a much more dangerous scenario, because you have to be prepared not only for handling errors, but also for… nothing to happen.

What it means is that you have to have some way to set up expectations in the system, and act on them when you don’t get a confirmation (negative or positive) within a given time frame.
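
To make the idea concrete, here is a minimal sketch of such an expectation registry, in Python. Everything in it (the `ExpectationRegistry` class, its method names, the expectation names) is hypothetical and only illustrates the idea; a real system would persist expectations and check them from a separate process.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Expectation:
    name: str           # hypothetical identifier, e.g. "weekly-tax-rate-update"
    deadline: datetime  # latest time by which a confirmation should arrive


class ExpectationRegistry:
    """Tracks things that should happen and reports the ones that did not."""

    def __init__(self):
        self._pending = {}

    def expect(self, name, within, now):
        # Register that `name` should be confirmed within `within` of `now`.
        self._pending[name] = Expectation(name, now + within)

    def confirm(self, name):
        # Any confirmation, OK or Fail, means something actually happened.
        self._pending.pop(name, None)

    def overdue(self, now):
        # Expectations whose deadline passed without any confirmation.
        return sorted(e.name for e in self._pending.values() if e.deadline < now)


registry = ExpectationRegistry()
start = datetime(2012, 2, 23, 10, 0)
registry.expect("payment:order-42", timedelta(minutes=5), now=start)
registry.expect("weekly-tax-rate-update", timedelta(days=7), now=start)

registry.confirm("payment:order-42")  # the payment callback arrived

# Eight days later the tax update still has not checked in.
print(registry.overdue(now=start + timedelta(days=8)))
```

Note that the recurring job has to register its expectation again for the following week each time it is confirmed; a one-off expectation only covers one run.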

Comments

ashic
02/23/2012 10:26 AM by
ashic

This is simple enough to achieve - "Send this message to me after x seconds". Simple case of a [wanna say saga but it's an overloaded term].

Matthijs ter Woord
02/23/2012 10:26 AM by
Matthijs ter Woord

You describe a very interesting problem (and a fairly common one I think). I personally don't solve this problem, as I still haven't figured out a generic way of doing this. Therefore I'm really interested in your approach!

Pascal Van Vlaenderen
02/23/2012 10:54 AM by
Pascal Van Vlaenderen

I know there are systems around that check stuff regularly. The problem with all those systems is that they're too general; they never check what's really important for you.

In a past project I tackled this problem by having scripts run every X time; they send me a mail if something's wrong and they also write back to the database. On a monitor page of the application, I then ran a query to check how long ago each check last passed successfully. If it was more than 1 day, I displayed it in red.

Scooletz
02/23/2012 11:01 AM by
Scooletz

In NServiceBus you've got timeouts and, as @ashic wrote, you can use a saga and WaitAny-style logic for a timeout, a completion, or an error. But yes, in the majority of cases people do not take it into consideration.

Ayende Rahien
02/23/2012 11:03 AM by
Ayende Rahien

Scooletz & Ashic, How do you detect the missing tax records update? Sure, for online stuff it is easy. It is the batch process that didn't run for three months that kills you.

ashic
02/23/2012 11:08 AM by
ashic

Send me a reminder after 7 days... After 7 days: Ohh...let's see if that came in. It has? Cool, process. It hasn't: alert, alarm, whatever.

NSB has such a timing mechanism (you can do yearly stuff even). Quartz.NET is another option.

PS: Oren..mind changing the captcha thing....Makes me feel stupid over and over again :)

Felice Pollano
02/23/2012 11:18 AM by
Felice Pollano

The old watchdog stuff... I did it not so long ago, when I needed to monitor a legacy service that sometimes hangs... the solution was to sniff the log for modifications and wake the service up if the log is not updated after some time. The watchdog service I created was actually more configurable, but... used just for that till now :)

Ayende Rahien
02/23/2012 11:18 AM by
Ayende Rahien

Ashic, After a few times when you succeed, the CAPTCHA should go away.

And yes, I am aware of the timeout mechanism in NSB. The key is that you need to do that. And re-do that for the next week. It is one of those things that people generally don't think about.

ashic
02/23/2012 11:22 AM by
ashic

Definitely agree. Technically it's not difficult. It's more a modelling problem with "rose tinted glasses" syndrome.

Felice Pollano
02/23/2012 11:25 AM by
Felice Pollano

As another thought: negative event tracking usually happens once the infrastructure has become really stable. Before that, there is too much interaction (debugging, monitoring, user attention and so on) that hides the need for such logging.

cbp
02/23/2012 12:02 PM by
cbp

We do this, but it's pretty much all custom built.

Most of the services log directly into a table on the central db whenever they run.

A central health monitoring service is configured so that it knows which services should be logging into the HealthMonitor table, and how often. If there is a problem, it reports it via SMS.

The whole thing is pretty straightforward and relatively easy to manage.

The health monitor is also quite useful in that it can check for 'usage levels' - i.e. we expect customers to be accessing our website pretty consistently throughout the day. Obviously it is easy to monitor whether the website is up or down. But we also make sure that people are actually using the site. For example, there could be a javascript error which is preventing people from submitting a particular form. Using the health monitor we can have alerts sent "If no one has submitted this particular form in the last two hours".
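
(Editorial sketch, not cbp's actual implementation.) The staleness check described above could look roughly like the following in Python, assuming the monitor can load each service's last check-in time and its expected interval. The function name, the dictionaries, and the service names are all hypothetical.

```python
from datetime import datetime, timedelta


def find_silent_services(last_seen, expected_every, now):
    """Return services that have not logged within their expected interval."""
    silent = []
    for service, interval in expected_every.items():
        seen = last_seen.get(service)  # None means it never checked in at all
        if seen is None or now - seen > interval:
            silent.append(service)
    return sorted(silent)


now = datetime(2012, 2, 23, 12, 0)

# How often each service (or user activity) is expected to show up.
expected_every = {
    "tax-rate-update": timedelta(days=7),
    "order-form-submitted": timedelta(hours=2),
    "nightly-export": timedelta(days=1),
}

# Last entries found in the (hypothetical) HealthMonitor table.
last_seen = {
    "tax-rate-update": now - timedelta(days=30),
    "order-form-submitted": now - timedelta(minutes=20),
}

for service in find_silent_services(last_seen, expected_every, now):
    print("ALERT: no sign of life from", service)
```

A service that never logged anything is treated the same as one that went quiet, which covers the case where a job was configured but never ran.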

Andy Pook
02/23/2012 01:26 PM by
Andy Pook

Storm (by Twitter) has a mechanism for tracking "jobs" (multi-step, distributed). If an item does not complete Storm will replay. https://github.com/nathanmarz/storm/wiki/Concepts (see "Reliability") http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html

Charlie Kilian
02/23/2012 01:47 PM by
Charlie Kilian

I don't track these yet, but I have plans to do so. Right now, I'm planning on building myself a tool that runs as a windows service and listens for regular heartbeats from expected processes. I'm planning to build a library that sends a heartbeat, something like:

heartbeat.Send( "process_id" );

If the heartbeat monitor service doesn't hear from the expected services within a configured amount of time, it will send me an email that something failed to run.

That's my idea, anyway. Do you perhaps have a better one?

Sean Gough
02/23/2012 01:59 PM by
Sean Gough

@cbp @Charlie Kilian — this is the same approach we've used in the past. In a new distributed system we're working on I was planning to do the same as well, but I always have the same nagging question. What monitors the central health monitoring service? What if it doesn't send the email/notification that service N hasn't reported in a while? It's kind of a self-perpetuating problem.

Currently we regularly check to make sure that service is running or hook it up to a local task tray utility that checks the central service every few minutes, but this still relies on us checking it.

I'd be interested to know what others are doing or how to handle this in a generic and reliable way.

cocowalla
02/23/2012 02:06 PM by
cocowalla

@Sean Gough

...and then what monitors the service that monitors the service that monitors the central health monitoring service, ad infinitum :)

Sean Gough
02/23/2012 02:16 PM by
Sean Gough

@cocowalla — exactly!

Darran
02/23/2012 03:30 PM by
Darran

Interesting post.

Have you looked at using something like NEsper - event stream processing can be used for looking at events and missing events over time in a variety of ways. You could probably build an event driven expectations engine on top of that.

It could be driven in two ways that I can think of right now (sitting by a beach in Thailand). On the one hand, check that events are received in certain order after certain periods of time, or alternatively, use it to drive reminders so that databases can be checked for last update times and userids.

Just a thought...

D

Shashi
02/23/2012 03:52 PM by
Shashi

Currently we do this for our batch jobs by having a notification process that runs independently of the jobs process and periodically (once a day) builds a report of what ran, succeeded, failed or was missing in the logs. That tells us if something didn't run.

I'd like to use Splunk on top of our logs to get real-time dashboards for this stuff so we can monitor our jobs, apps, databases, even http requests in one place.

Jarrett
02/23/2012 04:19 PM by
Jarrett

I almost always handle this type of thing with SQL Server Agent jobs. Some record A is created or modified. If time X has passed and some state is not met, then add a tickler/todo/notification something. Sure, it's probably clunky, but it works every time.

Daniel Lang
02/23/2012 07:12 PM by
Daniel Lang

@Sean Gough, You need to set up negation between the system being monitored and the monitoring system, so that each can send an alert if the other one doesn't show up or runs into an error. That way, you don't have the infinite monitor-the-monitor problem...

Charlie Kilian
02/23/2012 07:38 PM by
Charlie Kilian

@Sean Gough Actually, I was planning on running two instances of the heartbeat monitor on two separate servers, and then having them send a heartbeat to each other.

You could still have both instances die, and that'd be a problem. But it'd significantly cut down on the likelihood of it happening, I'd think.

Also, I'm planning to have one send out "Everything is okay" emails every couple of hours, so I at least have a shot of noticing if I stop getting them.

Rafal
02/23/2012 07:43 PM by
Rafal

a bunch of my observations from real life (I mean the life I spend with computers and programs, not the other life related to human beings):

Self-watching applications don't work. NServiceBus will not detect a failure of NServiceBus. You have to have an external, independent observer, and if you define some problem indicator, make sure it's meaningful and clearly understandable (?) to users / operations people. The system also doesn't have to self-diagnose the problem; it's enough if someone's alerted when anomalies are detected. Server log files are not helpful: applications today process so much data that it's impossible to spot the problem in gigabytes of log files, and usually the information you would need is not there.

What's working for me: periodic polling, time-series, RRD, Cacti

Abdu
02/23/2012 10:15 PM by
Abdu

I don't like the idea of continuously getting emails and checking if they stop. It's a bother to notice them, delete them and think. I'd rather have two monitors monitor each other while they monitor the system that needs to be monitored. Send an SMS, phone call and email if there's a problem. This way all bases are covered with high reliability. You can spend as much as matches how reliable you want the system to be. Don't bother me unless there's a problem.

jalchr
02/24/2012 08:05 AM by
jalchr

Interesting problem ... I think it's time for a new abstraction / library. I have not seen such a thing before, but what you need is to raise an Expectation ... rather than an event in your code

class Expectation
{
    public object Source { get; set; }
    public TimeSpan Timeout { get; set; }
}

In code:

public void ProcessOrder()
{
    // Process your order, then record that a callback is expected within 5 minutes
    Expectations.Raise(new Expectation { Source = this, Timeout = TimeSpan.FromMinutes(5) });
}

public void Callback(Order order)
{
    // In your callback: the expected confirmation arrived
    Expectations.Okay(order);
    // ... etc
}

// Somewhere else
public void SendAlerts()
{
    // Expectations that were not met within the last 5 hours
    var fails = Expectations.GetAllFailures(TimeSpan.FromHours(5));
    // Send email
}

Sean Gough
02/24/2012 01:22 PM by
Sean Gough

@Charlie we do the email thing too, but I don't really like relying on it. I like your double monitor approach, especially if they are in separate physical locations (in my case, datacenters).

I also like the symbiotic monitoring relationship @Daniel suggests. It will certainly work well for our service-to-service monitoring. Not sure it'll be useful for batch-job type operations, but I could always use @Shashi's daily report idea there (which is similar to our current "the job ran" emails).

So thanks to all of you. You've given me some good ideas so off I go to implement them!

Charlie Kilian
02/26/2012 12:28 AM by
Charlie Kilian

@Sean Gough I agree, relying on the email kind of sucks. That is my main way of monitoring batch jobs now, but there are so many of them I wouldn't notice if one went missing.

For batch jobs, I thought it would be useful to build a command line utility that just sent a heartbeat with a particular ID. Batch jobs could run the command line utility to check in with the heartbeat monitor at the end, once the batch job was successful. That way, batch jobs that don't otherwise have a way to check in with the heartbeat monitor can do so.

Gregg
03/04/2012 05:36 PM by
Gregg

Who watches the watchDLLs?
