Do you monitor negative events?
You are probably aware that you need to monitor your production systems for errors, and to add health monitoring for your servers.
But are you monitoring negative events? What is a negative event? It is something that should have happened and didn't.
For example, every week you have a process that runs to update the tax rates that apply to your customers. This is implemented as a scheduled process, but for some reason (the computer was being rebooted, the user's password expired, etc.) that process didn't run. There isn't an error, per se. You won't get an error because nothing ever had a chance to happen.
Another example would be getting a callback confirmation that an order payment has been correctly processed. That usually happens within 1 - 5 minutes, and you get an OK/Fail notification. But what happens if that notification just never comes?
This is a much more dangerous scenario, because you not only have to be prepared to handle errors, you have to be prepared for… nothing to happen.
What this means is that you have to have some way to set up expectations in the system, and act on them when you don't get a confirmation (negative or positive) within a given time frame.
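As a minimal sketch of that idea (all names here are invented for illustration): register an expectation with a deadline, clear it when the confirmation arrives, and periodically sweep for expectations whose deadlines passed in silence.

```python
import time


class ExpectationRegistry:
    """Tracks things that *should* happen and flags the ones that never did.

    Hypothetical sketch; `expect`, `confirm` and `overdue` are made-up names.
    """

    def __init__(self):
        self._pending = {}  # key -> deadline (epoch seconds)

    def expect(self, key, within_seconds, now=None):
        """Register that `key` must be confirmed within `within_seconds`."""
        now = time.time() if now is None else now
        self._pending[key] = now + within_seconds

    def confirm(self, key):
        """The expected event happened (OK or Fail - either way, it arrived)."""
        self._pending.pop(key, None)

    def overdue(self, now=None):
        """Return every expectation whose deadline passed with no confirmation."""
        now = time.time() if now is None else now
        return [key for key, deadline in self._pending.items() if now > deadline]


registry = ExpectationRegistry()
registry.expect("payment-callback:order-42", within_seconds=300, now=1000)
registry.expect("weekly-tax-update", within_seconds=60, now=1000)
registry.confirm("payment-callback:order-42")  # the callback arrived in time

# Later: the tax update never confirmed, so the sweep reports it
print(registry.overdue(now=1600))  # ['weekly-tax-update']
```

The point of the sweep is exactly the "negative event": nothing errored, nothing logged, yet the registry still surfaces the silence.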
Comments
This is simple enough to achieve - "Send this message to me after x seconds". Simple case of a [wanna say saga but it's an overloaded term].
You describe a very interesting problem (and a fairly common one I think). I personally don't solve this problem, as I still haven't figured out a generic way of doing this. Therefore I'm really interested in your approach!
I know there are systems around that check stuff regularly. The problem with all those systems is that they're too general; they never check what's really important for you.
In a past project I tackled this problem by having scripts run every X time; they sent me a mail if something was wrong and they also wrote back to the database. On a monitor page of the application, I then ran a query to check how long ago the last successful check passed. If it was more than 1 day, I displayed those in red.
In NServiceBus you've got timeouts and, as @ashic wrote, you can use a saga and WaitAny logic for a timeout, a completion, or an error. But yes, in the majority of cases people do not take it into consideration.
Scooletz & Ashic, How do you detect the missing tax records update? Sure, for online stuff it is easy. It is the batch process that didn't run for three months that kills you.
Send me a reminder after 7 days... After 7 days: Ohh...let's see if that came in. It has? Cool, process. It hasn't: alert, alarm, whatever.
NSB has such timing mechanism (you can do yearly stuff even). Quartz.NET is another option.
PS: Oren..mind changing the captcha thing....Makes me feel stupid over and over again :)
The old watchdog stuff... I did it not so long ago, when I needed to monitor a legacy service that sometimes hung. The solution was to sniff the log for modifications and wake the service up if the log didn't update after some time. The watchdog service I created was actually more configurable, but... it's been used just for that so far :)
Ashic, After a few times when you succeed, the CAPTCHA should go away.
And yes, I am aware of the timeout mechanism in NSB. The key is that you need to actually do that, and re-do it for the next week. It is one of those things that people generally don't think about.
Definitely agree. Technically it's not difficult. It's more a modelling problem with "rose tinted glasses" syndrome.
As another thought: negative event tracking usually happens only once the infrastructure becomes really stable. Before that, there is too much interaction - debugging, monitoring, user attention and so on - that hides the need for such logging.
We do this, but it's pretty much all custom built.
Most of the services log directly into a table on the central db whenever they run.
A central health monitoring service is configured so that it knows which services should be logging into the HealthMonitor table, and how often. If there is a problem, it reports it via SMS.
The whole thing is pretty straight forward and relatively easy to manage.
The health monitor is also quite useful in that it can check for 'usage levels' - i.e. we expect customers to be accessing our website pretty consistently throughout the day. Obviously it is easy to monitor whether the website is up or down. But we also make sure that people are actually using the site. For example, there could be a JavaScript error which is preventing people from submitting a particular form. Using the health monitor we can have alerts sent: "If no one has submitted this particular form in the last two hours".
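A usage-level check like the one described above boils down to comparing the last observed activity against a maximum quiet period. A toy sketch (the function name and message are invented, not the commenter's actual system):

```python
from datetime import datetime, timedelta


def usage_alert(last_submission, now, max_quiet=timedelta(hours=2)):
    """Return an alert message if the form has been quiet too long, else None.

    Hypothetical helper: `last_submission` is the timestamp of the most
    recent form submission, `max_quiet` is the longest silence we tolerate.
    """
    quiet_for = now - last_submission
    if quiet_for > max_quiet:
        return f"No submissions for {quiet_for}; check the form for errors"
    return None


now = datetime(2024, 1, 1, 12, 0)
print(usage_alert(datetime(2024, 1, 1, 11, 30), now))  # None - traffic looks normal
print(usage_alert(datetime(2024, 1, 1, 9, 0), now))    # alert - quiet for 3 hours
```

Note the inversion: the alert fires on the *absence* of submissions, not on any error the site itself reported.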
Storm (by Twitter) has a mechanism for tracking "jobs" (multi-step, distributed). If an item does not complete Storm will replay. https://github.com/nathanmarz/storm/wiki/Concepts (see "Reliability") http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html
I don't track these yet, but I have plans to do so. Right now, I'm planning on building myself a tool that runs as a windows service and listens for regular heartbeats from expected processes. I'm planning to build a library that sends a heartbeat, something like:
heartbeat.Send( "process_id" );
If the heartbeat monitor service doesn't hear from the expected services within a configured amount of time, it will send me an email that something failed to run.
That's my idea, anyway. Do you perhaps have a better one?
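The heartbeat monitor described above could look something like this sketch (a toy in-process version; a real one would run as a Windows service and send email, and all names here are invented):

```python
import time


class HeartbeatMonitor:
    """Tracks heartbeats from expected processes and reports the silent ones."""

    def __init__(self, expected):
        # expected: process_id -> max seconds allowed between heartbeats
        self._expected = dict(expected)
        self._last_seen = {}

    def send(self, process_id, now=None):
        """Record a heartbeat from `process_id` (what heartbeat.Send would do)."""
        self._last_seen[process_id] = time.time() if now is None else now

    def silent_processes(self, now=None):
        """Processes that never checked in, or whose heartbeat is stale."""
        now = time.time() if now is None else now
        missing = []
        for pid, interval in self._expected.items():
            last = self._last_seen.get(pid)
            if last is None or now - last > interval:
                missing.append(pid)
        return sorted(missing)


monitor = HeartbeatMonitor({"tax-update": 7 * 24 * 3600, "backup": 24 * 3600})
monitor.send("backup", now=0)

# Eight days later: the tax update never checked in, and the backup is stale
print(monitor.silent_processes(now=8 * 24 * 3600))  # ['backup', 'tax-update']
```

The monitor never hears about failures directly; it only notices that an expected check-in never arrived, which is exactly the negative-event case from the post.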
@cbp @Charlie Kilian — this is the same approach we've used in the past. In a new distributed system we're working on I was planning to do the same as well, but I always have the same nagging question. What monitors the central health monitoring service? What if it doesn't send the email/notification that service N hasn't reported in a while? It's kind of a self-perpetuating problem.
Currently we regularly check to make sure that service is running, or hook it up to a local task tray utility that checks the central service every few minutes, but this still relies on us checking it.
I'd be interested to know what others are doing or how to handle this in a generic and reliable way.
@Sean Gough
...and then what monitors the service that monitors the service that monitors the central health monitoring service, ad infinitum :)
@cocowalla — exactly!
Interesting post.
Have you looked at using something like NEsper - event stream processing can be used for looking at events and missing events over time in a variety of ways. You could probably build an event driven expectations engine on top of that.
It could be driven in two ways that I can think of right now (sitting by a beach in Thailand). On the one hand, check that events are received in certain order after certain periods of time, or alternatively, use it to drive reminders so that databases can be checked for last update times and userids.
Just a thought...
D
Currently we do this for our batch jobs by having a notification process that runs independently of the jobs process and periodically (once a day) builds a report of what ran, succeeded, failed, or was missing in the logs. That tells us if something didn't run.
I'd like to use Splunk on top of our logs to get real-time dashboards for this stuff so we can monitor our jobs, apps, databases, even http requests in one place.
I almost always handle this type of thing with SQL Server Agent jobs. Some record A is created or modified. If time X has passed and some state is not met, then add a tickler/todo/notification something. Sure, it's probably clunky, but it works every time.
@Sean Gough, You need to set up mutual monitoring between the system being monitored and the monitoring system, so that each can send an alert if the other one doesn't show up or errors out. That way, you don't have the infinite monitor-the-monitor problem...
@Sean Gough Actually, I was planning on running two instances of the heartbeat monitor on two separate servers, and then having them send a heartbeat to each other.
You could still have both instances die, and that'd be a problem. But it'd significantly cut down on the likelihood of it happening, I'd think.
Also, I'm planning to have one send out "Everything is okay" emails every couple of hours, so I at least have a shot of noticing if I stop getting them.
A bunch of my observations from real life (I mean the life I spend with computers and programs, not the other life, related to human beings):
Self-watching applications don't work. NServiceBus will not detect a failure of NServiceBus. You have to have an external, independent observer, and if you define some problem indicator, make sure it's meaningful and clearly understandable to users / operations people. The system doesn't have to self-diagnose the problem; it's enough if someone is alerted when anomalies are detected. Server log files are not helpful - applications today process so much data that it's impossible to spot the problem in gigabytes of log files, and usually the information you would need is not there.
What's working for me: periodic polling, time-series, RRD, Cacti
I don't like the idea of continuously getting emails and checking if they stop. It's a bother to notice them, delete them and think. I like to have two monitors monitor each other while they monitor the system that needs to be monitored. Send an SMS, phone call and email if there's a problem. This way all bases are covered with high reliability. Or you can spend as much as you want, depending on how reliable the system needs to be. Don't bother me unless there's a problem.
Interesting problem ... I think it's time for a new abstraction / library? I have not seen such a thing before, but what you need is to raise an Expectation ... rather than an event in your code:
class Expectation { object Source { get; set; } TimeSpan Timeout { get; set; } }
In code:
public void ProcessOrder() { // Process your order Expectations.Raise(new Expectation(this, TimeSpan.FromMinutes(5))); }
public void Callback(Order order) { // In your callback Expectations.Okay(order); // ... etc }
// Somewhere else public void SendAlerts() { var fails = Expectations.GetAllFailures(TimeSpan.FromHours(5)); // Send email }
@Charlie we do the email thing too, but I don't really like relying on it. I like your double monitor approach, especially if they are in separate physical locations (in my case, datacenters).
I also like the symbiotic monitoring relationship @Daniel suggests. It will certainly work well for our service-to-service monitoring. Not sure it'll be useful for batch-job type operations, but I could always use @Shashi's daily report idea there (which is similar to our current "the job ran" emails).
So thanks to all of you. You've given me some good ideas so off I go to implement them!
@Sean Gough I agree, relying on the email kind of sucks. That is my main way of monitoring batch jobs now, but there are so many of them I wouldn't notice if one went missing.
For batch jobs, I thought it would be useful to build a command line utility that just sent a heartbeat with a particular id. Batch jobs could run the command line utility to check in with the heartbeat monitor at the end, once the batch job was successful. That way, batch jobs that don't otherwise have a way to check in with the heartbeat monitor can do so.
Who watches the watchDLLs?