Continuous Environment Validation
For some reason, it seems like I am talking about Release It a lot lately, to a lot of people. As I said when I reviewed it, that book literally changed the way that I approach problems. It also made me much more aware of the failure scenarios that I need to deal with.
A while ago I sat down in one of Jeremy Miller's talks and he mentioned that he had added the ability to do Environment Validation to StructureMap, so when the application is starting up, it can verify that all its dependencies are in a valid state. That made so much sense to me that I immediately added this facility to Windsor.
What I am going to talk about today is taking this approach one step further. Instead of running those tests only at application startup, they should be run every day, or every hour.
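To make the idea concrete, here is a minimal sketch of the shape such a facility might take. This is illustrative Python, not Windsor's or StructureMap's actual API; the validator names and the hourly interval are my own assumptions.

```python
import time

# Illustrative registry of environment checks. Each check returns an
# error string on failure, or None when the environment is healthy.
validators = []

def validator(fn):
    """Register an environment check to be run at startup and on a schedule."""
    validators.append(fn)
    return fn

def run_validations():
    """Run every registered check and collect the failures."""
    failures = []
    for check in validators:
        error = check()
        if error is not None:
            failures.append((check.__name__, error))
    return failures

@validator
def database_reachable():
    # A real check would open a connection; hardcoded to pass in this sketch.
    return None

@validator
def license_file_present():
    # Simulate a failing check so the sketch has something to report.
    return "license file missing"

def validate_forever(interval_seconds=3600):
    """Re-run the checks every hour, not just at application startup."""
    while True:
        for name, error in run_validations():
            print(f"FAIL {name}: {error}")
        time.sleep(interval_seconds)
```

The point of `validate_forever` is the shift this post argues for: the same checks that gate startup keep running for the life of the application.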
Yes, the operations team is supposed to have monitoring on the application, but unless they were part of the development process (or are a dedicated ops team), that still leaves you as the principal owner of knowledge about the environment your application needs. Even if you have a capable operations team with a very good understanding of your application, it is often best to support them by providing this functionality. It is very likely that you can get more information from your application than the operations team can.
And if you don't have an operation team, you really want to be able to do that.
Now that we have taken care of the motivation for this approach, let us see what exactly we are talking about.
Environment validation means verifying that your entire environment is in a state that allows your application to run at full capacity. I am going to list a few things that I think are essential for many applications. I am sure that I am going to miss some, so feel free to add more items to the list.
- Certificates are valid and expire in more than a month.
- Domain registration expires in more than one month.
- For each server in the application (web, database, cache, application):
  - Server is alive and responding (within a specified time).
  - Server's disk has more than 10% free space.
  - Server CPU usage is less than 80%.
- Associated 3rd party servers are responding within their SLA.
- Sample execution of common scenarios finish successfully in a specified time frame.
- Number of non-critical faults in the application is below the threshold.
- No critical fault (critical defined as taking the entire system down).
- Current traffic / work on the system is within the expected range (too low, and we may have an external network issue; too high, and we need to increase our capacity).
- Application audit trail is updated. (Can do the same for log, if required).
- System backup was performed and completed successfully.
- All batch jobs have been run and completed successfully.
- Previously generated faults have been dealt with.
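A few of the checks in the list above are easy to sketch in code. The thresholds (10% free disk, one month for certificates) come from the list itself; the paths and traffic bounds are made-up examples.

```python
import shutil
from datetime import datetime, timedelta

def disk_has_free_space(path="/", minimum_fraction=0.10):
    """Server's disk has more than 10% free space."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total > minimum_fraction

def certificate_not_expiring(expiry, window_days=30):
    """Certificate expires in more than a month from now."""
    return expiry - datetime.now() > timedelta(days=window_days)

def traffic_in_expected_range(requests_per_minute, low=100, high=10_000):
    """Too low may mean a network issue; too high means we need more capacity."""
    return low <= requests_per_minute <= high
```

Each check is a plain boolean function, which makes it trivial to plug into whatever scheduler or alerting layer you already have.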
Those are the generalities; I am pretty sure that you can think of a lot more that fit your own systems.
The important thing to remember here is that you should treat this piece as a core part of the application infrastructure. In many production environments, you simply cannot get access. This is part of the application, and should be deployed with the application. At any rate, it should be made clear that this is part of the deployment, not just a useless appendix.
My preference would be to have a Windows service that monitors my systems and alerts when there are failures.
This is another important consideration: how do you send alerts? And when? You should have at least three levels of warning: Warning, Error, and Fatal. You send them according to the severity of the problem.
In all cases, I would log them to the event log at a minimum, and probably send mail as well. For the Error and Fatal levels, I would use SMS / generate an alert to the operations monitoring systems. If there is a monitoring system in place that the operations team is using, it is best to route things through it. They probably have the ability to wake someone up at 3 AM already. If you don't have that, then an SMS is at least near instantaneous, and you can more or less rely on it being read.
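The routing policy described above can be sketched as one small function. The channel names here are placeholders for the real event log, SMTP, and SMS integrations, which is where the actual work would go.

```python
def route_alert(severity):
    """Pick delivery channels by severity.

    severity is one of "warning", "error", "fatal".
    """
    channels = ["event_log", "email"]   # always log, and probably mail too
    if severity in ("error", "fatal"):
        channels.append("sms")          # near-instantaneous escalation
    return channels
```

Keeping the policy in one place means that swapping SMS for a pager gateway, or routing through the ops team's monitoring system, is a one-line change.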
That is long enough, and I have to do some work today, so I'll just stop here, I think.
Comments
I'll add a few items:
- The alert system itself is functioning.
- Windows services are all running (e.g. SharePoint has 3 services running at all times, minimum. Sometimes they're not running.)
- Service accounts are functioning.
- Servers can communicate with each other (this has changed on me, in production! OOPS).
- Servers can query AD.
- Servers can send email via SMTP (again, this has burned me in the past: firewall/antivirus updates).
I particularly want to emphasize the "make sure alert system works" because I've seen our "emergency pager system" go down, regularly, for hours at a time, without notice from the pager company.
Ayende, don't reinvent the wheel: http://www.nagios.org/ :-)
We use ManageEngine® Applications Manager (http://manageengine.adventnet.com/products/applications_manager/), which has worked very well for us.
Oren,
I believe you are mixing together the logically different concepts of Environment Validation and Monitoring/Notifications. EV could be a part of the system (it has to be) that just reports the state, while the monitoring/notification/continuous thing is better off as a completely external system that actually decides if the state is invalid, suspicious, good or whatever, and then takes the actions.
I've expanded on that a little bit in my blog (http://rabdullin.com), since that is close to the concept of CI/automation engines for business solutions. And we had to implement that recently.
Onur,
Nagios is a specific piece of software that deals with network problems, while Environment Validation is a broader concept that can deal with issues ranging from IoC configuration up to business validation.
Rinat
Thanks for the book tip.
Have you started implementing these "environment validators" yet? I can totally see the value in them, and I'm thinking about how I'd go about it.
I've got various programs, written in various languages, across various servers.
So, I'm not sure if a single service could do this?
I'd probably want to write custom monitors that are scheduled to run frequently on each server, and which do a particular job. Each monitor could report back to one single place, using a well known format. As a metaphor, this is a bit like having various "security sensors" distributed around a building complex.
This one place could even warn if reports aren't received when expected (i.e. one of the sensors isn't responding). Results would be aggregated and published via RSS/Email.
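That central place can be sketched as a small aggregator that remembers when each sensor last phoned home, so a missing report is itself an alert. The class name, the well-known report format, and the five-minute window are all assumptions for illustration, not part of the commenter's design.

```python
from datetime import datetime, timedelta

class Aggregator:
    """Collects reports from distributed monitors ("security sensors")."""

    def __init__(self, expected_interval=timedelta(minutes=5)):
        self.expected_interval = expected_interval
        self.last_report = {}

    def report(self, sensor, status, now=None):
        """A monitor on some server phones home with its latest result."""
        self.last_report[sensor] = (now or datetime.now(), status)

    def silent_sensors(self, now=None):
        """Sensors that missed their reporting window: each one is a warning."""
        now = now or datetime.now()
        return [sensor for sensor, (seen, _) in self.last_report.items()
                if now - seen > self.expected_interval]
```

Aggregated results from `last_report`, plus the output of `silent_sensors`, are what you would publish via RSS or email.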