Ayende @ Rahien

Sites outage

We had an outage that appears to have lasted roughly 12 hours.

The reason it took so long to fix is that it happened after business hours, and while we have production support for our clients, we never hooked up our own websites to our own system. A typical case of the barefoot shoemaker.

The reason for the outage? Also pretty typical:

image

The reason for that? We had a remote backup process that wrote some temp files and didn’t clean them up properly. The growth rate was about 3-6 MB a day, so no one really noticed.
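
To make the failure mode concrete, here is a minimal sketch of the kind of cleanup step such a process would need. It is not the actual backup code: the function name `remove_stale_files`, the directory argument, and the age threshold are all assumptions made for illustration.

```python
import time
from pathlib import Path


def remove_stale_files(directory, max_age_days, now=None):
    """Delete regular files directly inside `directory` that were last
    modified more than `max_age_days` ago. Returns the removed paths.

    Hypothetical helper: the real backup process and its temp
    directory are not described in the post.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    removed = []
    for path in Path(directory).iterdir():
        # Only touch plain files; leave subdirectories and symlinks alone.
        if path.is_symlink() or not path.is_file():
            continue
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

Run on a schedule against the backup's temp directory, something like this would keep slow growth of a few MB a day from ever accumulating.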

The fix:

image

All is working now. I'm sorry for the delay in fixing this. We’ll be having some discussions here to see how we can avoid a repeat of issues like this.

Posted By: Ayende Rahien

Comments

10/29/2013 06:12 AM by Christian Seitzer

I would suggest a tool like nagios or one of its derivatives to monitor your hard disks.

My experience with Icinga and NSClient++ for Windows has been very good.

10/29/2013 11:48 AM by Jiří Nouza

We use a very simple PowerShell script to check free space on our server disk drives.
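
The script mentioned above is not shown, so the following is only a rough sketch of the same idea, written in Python rather than PowerShell. The function name `low_disk_space` and the 10% threshold are assumptions; `shutil.disk_usage` is part of the Python standard library.

```python
import shutil


def low_disk_space(path, min_free_fraction=0.10):
    """Return (is_low, free_fraction) for the volume containing `path`.

    The 10% default threshold is an arbitrary illustrative choice.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    free_fraction = usage.free / usage.total
    return free_fraction < min_free_fraction, free_fraction


if __name__ == "__main__":
    is_low, free = low_disk_space(".")
    status = "LOW" if is_low else "ok"
    print(f"{status}: {free:.1%} free")
```

A scheduled task could run a check like this per drive and send an alert whenever `is_low` is true.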

10/29/2013 01:50 PM by Jim Geurts

+1 for Nagios or Zabbix... you get lots of other built-in metrics like CPU load, etc. as well

10/29/2013 02:56 PM by Judah Gabriel Himango

Heheh, seen this plenty of times. Usually it's the IIS logs that hurt me.

10/29/2013 03:45 PM by Wyatt Barnett

Protip: if you are doing anything on your OS volume on a server, you are probably doing it wrong. The first setup step here is to move everything IIS-related to D.

10/29/2013 04:23 PM by Ajai

Ayende, I am just happy you have 149GB of HibernatingRhinos.Orders :)

10/29/2013 05:31 PM by Robert

You should also probably set customErrors to On and set defaultRedirect to a nice error page that doesn't leak your stack trace...

10/30/2013 02:54 AM by Ayende Rahien

Robert, I don't do that on purpose.

10/30/2013 09:11 PM by Dave

When setting up servers, I like to allocate a large file of several gigabytes that can be deleted when this situation occurs. This has saved my bacon a few times when running out of space on source control repositories.
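
The ballast-file trick described above can be sketched roughly as follows. This is an illustration only: the helper names `create_ballast` and `release_ballast` are made up here, and the sketch assumes a filesystem without transparent compression or deduplication, where writing real zero bytes actually reserves the space.

```python
import os


def create_ballast(path, size_bytes, chunk_size=1024 * 1024):
    """Write `size_bytes` of zeros to `path`.

    Data is written explicitly (instead of seeking or truncating)
    so the result is not a sparse file on filesystems that support them.
    """
    chunk = b"\0" * chunk_size
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(chunk[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # push the data to disk before returning


def release_ballast(path):
    """Delete the ballast file to free its space in an emergency."""
    os.remove(path)
```

For example, `create_ballast("ballast.bin", 4 * 1024**3)` would reserve 4 GB at setup time, and `release_ballast("ballast.bin")` would hand it back when the disk fills up.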

11/04/2013 05:29 PM by Daniel Marbach

Ayende, I would like to introduce you to Oren Eini. He is the smart man behind RavenDB. In situations like that I always like to quote his excellence from his workshops: "disk space is cheap" :)

Comments have been closed on this topic.