For some reason, it seems like I am talking about Release It! a lot lately, to a lot of people. As I said when I reviewed it, that book literally changed the way that I approach problems. It also made me much more aware of the failure scenarios that I need to deal with.
A while ago I sat down in one of Jeremy Miller's talks and he mentioned that he had added the ability to do Environment Validation to StructureMap, so when the application is starting up, it can verify that all its dependencies are in a valid state. That made so much sense to me that I immediately added this facility to Windsor.
What I want to talk about today is taking this approach one step further. Instead of running those tests only at application startup, they should be run every day, or every hour.
Yes, the operations team is supposed to have monitoring on the application, but unless they were part of the development process (or are a dedicated ops team), that still leaves you as the principal owner of knowledge about the environment your application needs. Even if you have a capable operations team with a very good understanding of your application, it is often best to support them by providing this functionality. It is very likely that you can get more information out of your application than the operations team can.
And if you don't have an operations team, you really want to be able to do that.
Now that we have taken care of the motivation for this approach, let us see what exactly we are talking about.
Environment validation means validating that your entire environment is in a state that allows your application to run at full capacity. I am going to list a few things that I think are essential for many applications. I am sure that I am going to miss some, so feel free to add more items to the list.
- Certificates are valid and expire in more than a month.
- Domain registration expires in more than one month.
- For each server in the application (web, database, cache, application):
  - Server is alive and responding (within a specified time).
  - Server's HD has more than 10% free space.
  - Server CPU usage is less than 80%.
- Associated 3rd party servers are responding within their SLA.
- Sample executions of common scenarios finish successfully within a specified time frame.
- Number of faults (non critical ones) in the application is below the threshold.
- No critical fault (critical defined as taking the entire system down).
- Current traffic / work on the system is within the expected range (too low, and we may have an external network issue; too high, and we need to up our capacity).
- Application audit trail is updated. (The same can be done for the log, if required.)
- System backup was performed and completed successfully.
- All batch jobs have been run and completed successfully.
- Previously generated faults have been dealt with.
Those are the generalities; I am pretty sure that you can think of a lot more that fit your own systems.
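To make a couple of the items above concrete, here is a minimal sketch in Python of what such checks might look like. It is only an illustration under assumptions: the names `CheckResult`, `check_disk_free`, `check_expiry` and `run_checks` are hypothetical, the thresholds are the 10% and one-month figures from the list, and the expiry date is assumed to be obtained elsewhere (from the certificate or the registrar).

```python
import shutil
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, List, Optional


@dataclass
class CheckResult:
    name: str
    ok: bool
    detail: str


def check_disk_free(path: str = "/", min_free_ratio: float = 0.10) -> CheckResult:
    """Pass when the volume holding `path` has more than `min_free_ratio` free."""
    usage = shutil.disk_usage(path)
    free_ratio = usage.free / usage.total if usage.total else 0.0
    return CheckResult("disk free space", free_ratio > min_free_ratio,
                       f"{free_ratio:.0%} free on {path}")


def check_expiry(name: str, expires_at: datetime,
                 now: Optional[datetime] = None,
                 min_remaining: timedelta = timedelta(days=30)) -> CheckResult:
    """Pass when `expires_at` is more than `min_remaining` away.

    Works for certificates and domain registrations alike; fetching the
    actual expiry date is left out of this sketch.
    """
    now = now or datetime.now(timezone.utc)
    remaining = expires_at - now
    return CheckResult(name, remaining > min_remaining,
                       f"expires in {remaining.days} days")


def run_checks(checks: List[Callable[[], CheckResult]]) -> List[CheckResult]:
    """Run every check; a check that raises is reported as a failure."""
    results = []
    for check in checks:
        try:
            results.append(check())
        except Exception as exc:
            name = getattr(check, "__name__", "check")
            results.append(CheckResult(name, False, f"check raised: {exc!r}"))
    return results
```

Treating an exception inside a check as a failed check, rather than letting it kill the run, means one unreachable server does not hide the state of everything else.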
The important thing to remember here is that you should treat this piece as a core part of the application infrastructure. In many production environments, you simply cannot get access. This is part of the application, and should be deployed with the application. At any rate, it should be made clear that this is part of the deployment program, not just a useless appendix.
My preference would be to have a Windows service to monitor my systems and alert when there are failures.
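Whatever hosts it, the core of such a monitor is a loop that runs the checks on a schedule and reports failures. A rough sketch in Python follows, assuming each check is a callable returning an `(ok, detail)` pair; `run_once`, `monitor`, and the `max_runs` and `sleep` parameters are hypothetical names, the latter two existing only so the loop can be exercised without waiting an hour.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

Check = Callable[[], Tuple[bool, str]]  # returns (ok, detail)


def run_once(checks: Dict[str, Check]) -> List[Tuple[str, str]]:
    """Run all checks once and return (name, detail) for each failure."""
    failures = []
    for name, check in checks.items():
        try:
            ok, detail = check()
        except Exception as exc:
            # A check that blows up counts as a failed check.
            ok, detail = False, f"check raised: {exc!r}"
        if not ok:
            failures.append((name, detail))
    return failures


def monitor(checks: Dict[str, Check],
            on_failure: Callable[[str, str], None],
            interval_seconds: float = 3600.0,
            max_runs: Optional[int] = None,
            sleep: Callable[[float], None] = time.sleep) -> None:
    """Run the checks every `interval_seconds`, reporting each failure.

    With max_runs=None the loop never ends, which is what a service wants.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        for name, detail in run_once(checks):
            on_failure(name, detail)
        runs += 1
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
```

A service host would call `monitor(checks, on_failure=alert)` from its start routine, with `alert` being whatever alerting function the application uses.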
This raises another important consideration: how do you send alerts, and when? You should have at least three levels of alerts: Warning, Error and Fatal. You send them according to the severity of the problem.
In all cases, I would log them to the event log at a minimum, and probably send mail as well. For the Error and Fatal levels, I would use SMS or generate an alert to the operations monitoring systems. If there is a monitoring system in place that the operations team is using, it is best to route things through it. They probably have the ability to wake someone up at 3 AM already. If you don't have that, then an SMS is at least near instantaneous, and you can more or less rely on it being read.
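The routing rules above are simple enough to sketch as code. This is only an illustration in Python under assumptions: the channel names (`event_log`, `mail`, `sms`, `ops_monitoring`) and the functions `channels_for` and `send_alert` are hypothetical, and the actual senders are passed in by the caller rather than implemented here.

```python
from enum import IntEnum
from typing import Callable, Dict, List


class Severity(IntEnum):
    WARNING = 1
    ERROR = 2
    FATAL = 3


Sender = Callable[[Severity, str], None]


def channels_for(severity: Severity, has_ops_monitoring: bool) -> List[str]:
    """Decide which channels an alert of this severity should go to."""
    # Every alert is logged and mailed.
    channels = ["event_log", "mail"]
    # Error and Fatal also need to reach a human quickly: prefer the ops
    # team's monitoring system if there is one, otherwise fall back to SMS.
    if severity >= Severity.ERROR:
        channels.append("ops_monitoring" if has_ops_monitoring else "sms")
    return channels


def send_alert(severity: Severity, message: str,
               senders: Dict[str, Sender],
               has_ops_monitoring: bool = False) -> List[str]:
    """Send the alert through every selected channel; return those used."""
    used = []
    for channel in channels_for(severity, has_ops_monitoring):
        senders[channel](severity, message)
        used.append(channel)
    return used
```

Keeping the routing decision in `channels_for`, separate from the sending, makes the policy easy to test and to change without touching the mail or SMS plumbing.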
That is long enough, and I have to do some work today, so I'll just stop here, I think.