﻿<?xml version="1.0" encoding="utf-8"?><rss version="2.0"><channel><title>Ayende @ Rahien</title><link>http://ayende.com</link><description>Ayende @ Rahien</description><copyright>Copyright (C) Ayende Rahien  2004 - 2021 (c) 2026</copyright><ttl>60</ttl><item><title>Gregg commented on Do you monitor negative events?</title><description>Who watches the watchDLLs?</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment27</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment27</guid><pubDate>Sun, 04 Mar 2012 17:36:21 GMT</pubDate></item><item><title>Charlie Kilian commented on Do you monitor negative events?</title><description>@Sean Gough I agree, relying on the email kind of sucks. That is my main way of monitoring batch jobs now, but there are so many of them I wouldn't notice if one went missing.

For batch jobs, I thought it would be useful to build a command line utility that just sent a heartbeat with a particular ID. Batch jobs could run the command line utility to check in with the heartbeat monitor at the end, once the batch job was successful. That way, batch jobs that don't otherwise have a way to check in with the heartbeat monitor can do so.</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment26</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment26</guid><pubDate>Sun, 26 Feb 2012 00:28:50 GMT</pubDate></item><item><title>Sean Gough commented on Do you monitor negative events?</title><description>@Charlie we do the email thing too, but I don't really like relying on it.  I like your double monitor approach, especially if they are in separate physical locations (in my case, datacenters).  

I also like the symbiotic monitoring relationship @Daniel suggests.  It will certainly work well for our service-to-service monitoring.  Not sure it'll be useful for batch-job type operations, but I could always use @Shashi's daily report idea there (which is similar to our current "the job ran" emails).

So thanks to all of you. You've given me some good ideas so off I go to implement them!</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment25</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment25</guid><pubDate>Fri, 24 Feb 2012 13:22:04 GMT</pubDate></item><item><title>jalchr commented on Do you monitor negative events?</title><description>Interesting problem ... I think its time for a new abstraction ? library.
I have not seen such a thing before, but what you need is to raise an Expectation ... rather than an event in your code

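class Expectation
{
	public object Source { get; set; }
	public TimeSpan Timeout { get; set; }

	// Constructor matching the usage in ProcessOrder below
	public Expectation(object source, TimeSpan timeout)
	{
		Source = source;
		Timeout = timeout;
	}
}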


In code,
public void ProcessOrder()
{
	// Process your order
	Expectations.Raise(new Expectation(this, TimeSpan.FromMinutes(5)));

}

public void Callback(Order order)
{
 	// In your callback
	Expectations.Okay(order);
	// ... etc
}

// Somewhere else
public void SendAlerts()
{
	var fails = Expectations.GetAllFailures(TimeSpan.FromHours(5));
	// Send email 
}</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment24</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment24</guid><pubDate>Fri, 24 Feb 2012 08:05:51 GMT</pubDate></item><item><title>Abdu commented on Do you monitor negative events?</title><description>I don't like the idea of continuously getting emails and checking whether they stop. It's a bother to notice them, delete them and think. I'd rather have two monitors monitor each other while they monitor the system that needs to be monitored. Send an SMS, phone call and email if there's a problem. This way all bases are covered with high reliability. Or you can spend as much as matches how reliable you want the system to be. Don't bother me unless there's a problem.</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment23</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment23</guid><pubDate>Thu, 23 Feb 2012 22:15:59 GMT</pubDate></item><item><title>Rafal commented on Do you monitor negative events?</title><description>a bunch of my observations from real life (I mean the life I spend with computers and programs, not the other life related to human beings):

self-watching applications don't work. NServiceBus will not detect a failure of NServiceBus.
you have to have an external, independent observer
and if you define some problem indicator make sure it's meaningful and clearly understandable (?) to users / operations people. 
The system also doesn't have to self-diagnose the problem; it's enough if someone's alerted when anomalies are detected. 
Server log files are not helpful -  applications today process so much data that it's impossible to spot the problem in gigabytes of log files and usually the information you would need is not there

What's working for me: periodic polling, time-series, RRD, Cacti</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment22</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment22</guid><pubDate>Thu, 23 Feb 2012 19:43:56 GMT</pubDate></item><item><title>Charlie Kilian commented on Do you monitor negative events?</title><description>@Sean Gough  Actually, I was planning on running two instances of the heartbeat monitor on two separate servers, and then having them send a heartbeat to each other.

You could still have both instances die, and that'd be a problem. But it'd significantly cut down on the likelihood of it happening, I'd think.

Also, I'm planning to have one send out "Everything is okay" emails every couple of hours, so I at least have a shot of noticing if I stop getting them.</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment21</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment21</guid><pubDate>Thu, 23 Feb 2012 19:38:23 GMT</pubDate></item><item><title>Daniel Lang commented on Do you monitor negative events?</title><description>@Sean Gough,
You need to set up negation between both the system being monitored and the monitoring system, so that each can send an alert if the other one doesn't show up or runs into an error. That way, you don't have the infinite "monitor the monitor" problem...</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment20</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment20</guid><pubDate>Thu, 23 Feb 2012 19:12:11 GMT</pubDate></item><item><title>Jarrett commented on Do you monitor negative events?</title><description>I almost always handle this type of thing with SQL Server Agent jobs. Some record A is created or modified. If time X has passed and some state is not met, then add a tickler/todo/notification something. Sure, it's probably clunky, but it works every time.</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment19</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment19</guid><pubDate>Thu, 23 Feb 2012 16:19:02 GMT</pubDate></item><item><title>Shashi commented on Do you monitor negative events?</title><description>Currently we do this for our batch jobs by having a notification process that runs independently of the jobs process and periodically (once a day) builds a report of what ran, succeeded, failed or was missing in the logs. That tells us if something didn't run.

I'd like to use Splunk on top of our logs to get real-time dashboards for this stuff so we can monitor our jobs, apps, databases, even http requests in one place.</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment18</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment18</guid><pubDate>Thu, 23 Feb 2012 15:52:59 GMT</pubDate></item><item><title>Darran commented on Do you monitor negative events?</title><description>Interesting post.

Have you looked at using something like NEsper - event stream processing can be used for looking at events and missing events over time in a variety of ways. You could probably build an event driven expectations engine on top of that.

It could be driven in two ways that I can think of right now (sitting by a beach in Thailand). On the one hand, check that events are received in certain order after certain periods of time, or alternatively, use it to drive reminders so that databases can be checked for last update times and userids.

Just a thought...

D</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment17</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment17</guid><pubDate>Thu, 23 Feb 2012 15:30:31 GMT</pubDate></item><item><title>Sean Gough commented on Do you monitor negative events?</title><description>@cocowalla — exactly!</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment16</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment16</guid><pubDate>Thu, 23 Feb 2012 14:16:08 GMT</pubDate></item><item><title>cocowalla commented on Do you monitor negative events?</title><description>@Sean Gough

...and then what monitors the service that monitors the service that monitors the central health monitoring service, ad infinitum :)</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment15</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment15</guid><pubDate>Thu, 23 Feb 2012 14:06:34 GMT</pubDate></item><item><title>Sean Gough commented on Do you monitor negative events?</title><description>@cbp @Charlie Kilian — this is the same approach we've used in the past.  In a new distributed system we're working on I was planning to do the same as well, but I always have the same nagging question.  What monitors the central health monitoring service?  What if it doesn't send the email/notification that service N hasn't reported in a while?  It's kind of a self-perpetuating problem.

Currently we regularly check to make sure that service is running or hook it up to a local task tray utility that checks the central service every few minutes, but this still relies on us checking it.  

I'd be interested to know what others are doing or how to handle this in a generic and reliable way.</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment14</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment14</guid><pubDate>Thu, 23 Feb 2012 13:59:26 GMT</pubDate></item><item><title>Charlie Kilian commented on Do you monitor negative events?</title><description>I don't track these yet, but I have plans to do so. Right now, I'm planning on building myself a tool that runs as a windows service and listens for regular heartbeats from expected processes. I'm planning to build a library that sends a heartbeat, something like:

heartbeat.Send( "process_id" );

If the heartbeat monitor service doesn't hear from the expected services within a configured amount of time, it will send me an email that something failed to run.
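
A rough sketch of the monitor side, in Python purely for brevity. HeartbeatMonitor, expect, send and overdue are made-up names for illustration, not an existing library, and the process ids are invented. A real service would persist its state and send the email instead of returning a list:

```python
from datetime import datetime, timedelta


class HeartbeatMonitor:
    """Hypothetical in-memory monitor; all names here are illustrative only."""

    def __init__(self):
        self.expected = {}   # process id -> maximum allowed silence
        self.last_seen = {}  # process id -> time of the last heartbeat

    def expect(self, process_id, max_silence, now):
        # Register a process and start its clock at registration time.
        self.expected[process_id] = max_silence
        self.last_seen[process_id] = now

    def send(self, process_id, now):
        # Called by the monitored process (or by a batch job on success).
        self.last_seen[process_id] = now

    def overdue(self, now):
        # Processes that have been silent for longer than allowed.
        return sorted(
            pid for pid, max_silence in self.expected.items()
            if now - self.last_seen[pid] > max_silence
        )


start = datetime(2012, 2, 23, 13, 0)
monitor = HeartbeatMonitor()
monitor.expect("nightly_import", timedelta(hours=24), now=start)
monitor.expect("order_sync", timedelta(minutes=10), now=start)

# order_sync checks in once and then goes quiet.
monitor.send("order_sync", now=start + timedelta(minutes=5))
late = monitor.overdue(now=start + timedelta(hours=1))
print(late)
```

Passing the current time in explicitly keeps the check easy to test; the service itself would call overdue on a timer with the real clock.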

That's my idea, anyway. Do you perhaps have a better one?</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment13</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment13</guid><pubDate>Thu, 23 Feb 2012 13:47:05 GMT</pubDate></item><item><title>Andy Pook commented on Do you monitor negative events?</title><description>Storm (by Twitter) has a mechanism for tracking "jobs" (multi-step, distributed). If an item does not complete Storm will replay.
https://github.com/nathanmarz/storm/wiki/Concepts (see "Reliability")
http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment12</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment12</guid><pubDate>Thu, 23 Feb 2012 13:26:26 GMT</pubDate></item><item><title>cbp commented on Do you monitor negative events?</title><description>We do this, but it's pretty much all custom built.

Most of the services log directly into a table on the central db whenever they run.

A central health monitoring service is configured so that it knows which services should be logging into the HealthMonitor table, and how often. If there is a problem, it reports it via SMS.

The whole thing is pretty straightforward and relatively easy to manage.

The health monitor is also quite useful in that it can check for 'usage levels' - i.e. we expect customers to be accessing our website pretty consistently throughout the day. Obviously it is easy to monitor whether the website is up or down. But we also make sure that people are actually using the site. For example, there could be a JavaScript error which is preventing people from submitting a particular form. Using the health monitor we can have alerts sent "If no one has submitted this particular form in the last two hours".</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment11</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment11</guid><pubDate>Thu, 23 Feb 2012 12:02:08 GMT</pubDate></item><item><title>Felice Pollano commented on Do you monitor negative events?</title><description>As another thought: negative event tracking usually happens only once the infrastructure has become really stable. Before that, there is too much interaction (debugging, monitoring user attention and so on), which hides the need for such logging.</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment10</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment10</guid><pubDate>Thu, 23 Feb 2012 11:25:41 GMT</pubDate></item><item><title>ashic commented on Do you monitor negative events?</title><description>Definitely agree. Technically it's not difficult. It's more a modelling problem with "rose tinted glasses" syndrome.</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment9</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment9</guid><pubDate>Thu, 23 Feb 2012 11:22:01 GMT</pubDate></item><item><title>Ayende Rahien commented on Do you monitor negative events?</title><description>Ashic,
After a few times when you succeed, the CAPTCHA should go away.

And yes, I am aware of the timeout mechanism in NSB. The key is that you need to do that. And re-do that for the next week.
It is one of those things that people generally don't think about</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment8</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment8</guid><pubDate>Thu, 23 Feb 2012 11:18:20 GMT</pubDate></item><item><title>Felice Pollano commented on Do you monitor negative events?</title><description>The old watchdog stuff... I did it not so long ago, when I needed to monitor a legacy service that sometimes hung... the solution was to sniff the log for modifications and wake the service up if the log had not been updated for some time. The watchdog service I created was actually more configurable, but... it has been used just for that till now :)</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment7</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment7</guid><pubDate>Thu, 23 Feb 2012 11:18:10 GMT</pubDate></item><item><title>ashic commented on Do you monitor negative events?</title><description>Send me a reminder after 7 days...
After 7 days:
Ohh...let's see if that came in. It has? Cool, process. It hasn't: alert, alarm, whatever.
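
Roughly what I mean, sketched in plain Python rather than with NSB or Quartz.NET. The Reminders class, its method names and the two example keys are all made up for illustration; this is not either library's API, just the shape of the check:

```python
from datetime import datetime, timedelta


class Reminders:
    """Hypothetical one-shot reminders; not the NServiceBus or Quartz.NET API."""

    def __init__(self):
        self.pending = {}  # what we are waiting for -> deadline

    def remind_me(self, key, deadline):
        # Record that something is expected to arrive by the deadline.
        self.pending[key] = deadline

    def arrived(self, key):
        # The expected thing came in; nothing left to alert about.
        self.pending.pop(key, None)

    def missing(self, now):
        # Everything whose deadline has passed without arriving.
        return sorted(k for k, due in self.pending.items() if now >= due)


asked = datetime(2012, 2, 23)
reminders = Reminders()
reminders.remind_me("tax_records_update", asked + timedelta(days=7))
reminders.remind_me("supplier_price_file", asked + timedelta(days=7))

# Only one of the two actually shows up during the week.
reminders.arrived("supplier_price_file")
alerts = reminders.missing(now=asked + timedelta(days=7))
print(alerts)
```

Whatever is still pending when the reminder fires is what you alert on; the hard part is remembering to register the expectation in the first place.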

NSB has such timing mechanism (you can do yearly stuff even). Quartz.NET is another option.

PS: Oren..mind changing the captcha thing....Makes me feel stupid over and over again :)</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment6</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment6</guid><pubDate>Thu, 23 Feb 2012 11:08:57 GMT</pubDate></item><item><title>Ayende Rahien commented on Do you monitor negative events?</title><description>Scooletz &amp; Ashic,
How do you detect the missing tax records update? Sure, for online stuff it is easy. 
It is the batch process that didn't run for three months that kills you</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment5</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment5</guid><pubDate>Thu, 23 Feb 2012 11:03:38 GMT</pubDate></item><item><title>Scooletz commented on Do you monitor negative events?</title><description>In NServiceBus you've got timeouts and, as @ashic wrote, you can use a saga with WaitAny logic for a timeout, a completion or an error. But yes, in the majority of cases people do not take it into consideration.</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment4</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment4</guid><pubDate>Thu, 23 Feb 2012 11:01:49 GMT</pubDate></item><item><title>Pascal Van Vlaenderen commented on Do you monitor negative events?</title><description>I know there are systems around that check stuff regularly.
The problem with all those systems is that they're too general; they never check what's really important for you. 

In a past project I tackled this problem by having scripts run at a fixed interval; they send me a mail if something's wrong and they also write back to the database. 
On a monitor page of the application, I then did a query to check how long ago each check last passed successfully. If it was more than 1 day ago, I displayed that check in red.</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment3</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment3</guid><pubDate>Thu, 23 Feb 2012 10:54:03 GMT</pubDate></item><item><title>Matthijs ter Woord commented on Do you monitor negative events?</title><description>You describe a very interesting problem (and a fairly common one I think). I personally don't solve this problem, as I still haven't figured out a generic way of doing this. Therefore I'm really interested in your approach!</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment2</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment2</guid><pubDate>Thu, 23 Feb 2012 10:26:36 GMT</pubDate></item><item><title>ashic commented on Do you monitor negative events?</title><description>This is simple enough to achieve - "Send this message to me after x seconds". Simple case of a [wanna say saga but it's an overloaded term].</description><link>http://ayende.com/153409/do-you-monitor-negative-events#comment1</link><guid>http://ayende.com/153409/do-you-monitor-negative-events#comment1</guid><pubDate>Thu, 23 Feb 2012 10:26:08 GMT</pubDate></item></channel></rss>