Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969




Sites outage

time to read 1 min | 182 words

We had an outage that appears to have lasted roughly 12 hours.

The reason it took so long to fix is that it happened after business hours, and while we have production support for our clients, we never hooked up our own websites to our own system. A typical case of the barefoot shoemaker.

The reason for the outage? Also pretty typical:

image

The reason for that? We had a remote backup process that wrote some temp files and didn’t clean them up properly. The growth rate was only about 3-6 MB a day, so no one really noticed.
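One common way to keep a job like this from leaving temp files behind is to do the intermediate work in a scratch directory that is removed automatically, even when the job fails. The following is only a minimal Python sketch of that idea, not the backup process from this outage (its details are not given here); `run_backup`, `source`, and `destination` are hypothetical names.

```python
import shutil
import tempfile
from pathlib import Path


def run_backup(source: Path, destination: Path) -> Path:
    """Stage a copy of `source` in a scratch directory, then move it to `destination`.

    Hypothetical example: the real backup process is not described in the post.
    """
    destination.mkdir(parents=True, exist_ok=True)
    # TemporaryDirectory deletes the directory and everything in it when the
    # `with` block exits, including when the body raises an exception.
    with tempfile.TemporaryDirectory(prefix="backup-") as scratch:
        staged = Path(scratch) / source.name
        shutil.copy2(source, staged)  # intermediate (temp) copy
        final = destination / source.name
        shutil.move(str(staged), str(final))  # publish the finished file
    return final
```

Anything still sitting in the scratch directory when the block exits is deleted along with it, so a failed run does not accumulate leftovers.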

The fix:

image

All is working now. I'm sorry for the delay in fixing this. We’ll be having some discussions here to see how we can avoid repeating issues like that.


Comments

Christian Seitzer

I would suggest a tool like Nagios or one of its derivatives to monitor your hard disks.

My experience with Icinga and NSClient++ for Windows has been very good.

Jiří Nouza

We use a very simple PowerShell script to check free disk space on the server drives.
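The script mentioned here is PowerShell and is not shown. As an illustration only, a check of this kind could look like the Python sketch below, which uses the standard library's `shutil.disk_usage`. The function name `low_space_drives`, the 10% threshold, and the path list are assumptions made for the example.

```python
import shutil


def low_space_drives(paths, min_free_fraction=0.10):
    """Return (path, free_fraction) for every path whose free space is below the threshold."""
    low = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
        free_fraction = usage.free / usage.total
        if free_fraction < min_free_fraction:
            low.append((path, free_fraction))
    return low


if __name__ == "__main__":
    # Hypothetical path list; on a Windows server this might be ["C:\\", "D:\\"].
    for path, fraction in low_space_drives(["."]):
        print(f"WARNING: {path} has only {fraction:.1%} free")
```

Run from a scheduler and wired to an email or chat alert, a check like this would flag slow growth long before the disk fills up.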

Jim Geurts

+1 for Nagios or Zabbix... you get lots of other built-in metrics like CPU load, etc. as well

Judah Gabriel Himango

Heheh, seen this plenty of times. Usually it's the IIS logs that hurt me.

Wyatt Barnett

Protip: if you are doing anything on your OS volume you are probably doing it wrong on a server. 1st setup step here is to move everything IIS related to D.

Ajai

Ayende I am just happy you have 149GB of HibernatingRhinos.Orders :)

Robert

You should also probably set customErrors to On and set defaultRedirect to a nice error page that doesn't leak your stack trace...

Ayende Rahien

Robert, I don't do that on purpose.

Dave

When setting up servers I like to allocate a large file of several gigabytes that can be deleted when this situation occurs. This has saved my bacon a few times when running out of space on source control repositories.
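A minimal Python sketch of this ballast-file trick follows, with hypothetical helper names `create_ballast` and `release_ballast`. It writes real zero bytes rather than seeking to the end of the file, so the file is not sparse. On volumes with transparent compression the zeros may still take little real space, so treat the approach as an assumption to verify on your own setup.

```python
import os


def create_ballast(path, size_bytes, chunk_size=1024 * 1024):
    """Write `size_bytes` of zeros to `path` so the space is actually allocated."""
    chunk = b"\0" * chunk_size
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(chunk[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # push the data to disk before returning


def release_ballast(path):
    """Delete the ballast file in an emergency and return the number of bytes freed."""
    size = os.path.getsize(path)
    os.remove(path)
    return size
```

For example, `create_ballast("ballast.bin", 4 * 1024**3)` would reserve 4 GiB at setup time, and `release_ballast("ballast.bin")` would give it back when the disk fills up.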

Daniel Marbach

Ayende I would like to introduce you to Oren Eini. He is the smart man behind RavenDB. In situations like that I always like to quote his excellence from his workshops: "disk space is cheap" :)
