Sites outage
We had an outage that appears to have lasted roughly 12 hours.
The reason it took so long to fix is that it happened after business hours, and while we have production support for our clients, we never hooked up our own websites to our own system. A typical case of the barefoot shoemaker.
The reason for the outage? Also pretty typical:
The reason for that? We had a remote backup process that wrote some temp files and didn’t clean them up properly. The growth rate was only about 3-6 MB a day, so no one really noticed.
The fix:
All is working now, and I'm sorry for the delay in fixing this. We’ll be having some discussions here to see how we can avoid repeating issues like that.
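For illustration, here is a minimal sketch of the kind of free-space check that could run on a schedule and catch slow growth like this before it becomes an outage. It is not what we actually run: the function names and the 10% threshold are assumptions made up for the example, and it only uses `shutil.disk_usage` from the Python standard library.

```python
import shutil


def free_space_fraction(path):
    """Return the fraction (0.0 to 1.0) of the volume holding `path` that is free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Defensive: treat an unreadable or zero-sized volume as full.
        return 0.0
    return usage.free / usage.total


def is_low_on_space(path, min_free_fraction=0.10):
    """True when the free fraction drops below the (hypothetical) threshold."""
    return free_space_fraction(path) < min_free_fraction


if __name__ == "__main__":
    # Check the volume that holds the current directory.
    fraction = free_space_fraction(".")
    status = "LOW" if is_low_on_space(".") else "ok"
    print(f"free space: {fraction:.1%} ({status})")
```

A check like this is only useful if something acts on the result, for example a scheduled task that sends an alert when `is_low_on_space` returns true.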
Comments
I would suggest a tool like nagios or one of its derivatives to monitor your hard disks.
My experience with Icinga and NSClient++ for Windows has been very good.
We use a very simple PowerShell script to check server disk drive free space.
+1 for Nagios or Zabbix... you get lots of other built in metrics like cpu load, etc as well
Heheh, seen this plenty of times. Usually it's the IIS logs that hurt me.
Protip: if you are doing anything on your OS volume you are probably doing it wrong on a server. 1st setup step here is to move everything IIS related to D.
Ayende I am just happy you have 149GB of HibernatingRhinos.Orders :)
You should also probably set customErrors to On and set defaultRedirect to a nice error page that doesn't leak your stack trace...
Robert, I don't do that on purpose.
When setting up servers I like to allocate a large file of several gigabytes that can be deleted when this situation occurs. This has saved my bacon a few times when running out of space on source control repositories.
Ayende I would like to introduce you to Oren Eini. He is the smart man behind RavenDB. In situations like that I always like to quote his excellence from his workshops: "disk space is cheap" :)