Things we learned from production, part I–shutting down is hard to do

time to read 3 min | 507 words

This series of posts is going to talk about the things that we have learned ourselves and via our customers about running RavenDB in production. Those customers include people running on a single database on a Celeron 600 Mhz with 512 MB all the way to monsters like what RavenHQ is doing.

This particular story is about the effect of shutdown on RavenDB in production environments. Before we can do that, I have to explain the sequence of operations when RavenDB shuts down:

  • Stop accepting new connections
  • Abort all existing connections
  • For each loaded database:
    • Shut down indexing
    • For each index:
      • Wait for current indexing batch to complete
      • Flush the index
      • Close the index
    • Close database
  • Repeat the same sequence for the system database
  • Done

I am skipping a lot of details, but that is the gist of it.

In this case, however, you might have noticed something interesting. What happen if we have a large number of active databases, with a large number of actual indexes?

In that case, we have to wait for the current indexing batch to complete, then shut down each of the indexes, then move to the next db, and do the same.

In some cases, that can take a while. In particular, long enough while that we would get killed. Either by automated systems that decided we passed our threshold (in particular, iisreset gives you mere 20 seconds to restart, which tend to be not enough) or by an out of patience admin.

That sucks, because if you get killed, you don’t have the time to do a proper shutdown. You crashed & burned and died and now you have to deal with all the details of proper resurrection. Now, RavenDB prides itself on actually being a regular in this matters. You can yank the power cord out and once everything is back up, RavenDB will recover gracefully and with no data loss.

But, recovering from such scenarios can take precious time. Especially if, as is frequently the case in such scenarios, we have a lot of databases and indexes to recover.

Because of that, we actually had to spend quite a bit of time on optimizing the shut down sequence. It sounds funny, isn’t it? Very few people actually care about the time it takes them to shut down. But as it turned out, we have a fairly limited budget for that.  In particular, we parallelized the process of shutting down all of the databases together, and all of their indexes together as well.

That means more IO contention than before, but at least we could usually meet the deadline.  Speaking of which, we also added additional logging and configuration that told common hosts (such as IIS) that we really would like some more time before we would be hang out to dry.

On my next post, I’ll discuss the other side, how hard it is to actually wake up in the morning Smile.