Things we learned from production, part I–shutting down is hard to do
This series of posts is going to talk about the things that we have learned, ourselves and via our customers, about running RavenDB in production. Those customers range from people running a single database on a 600 MHz Celeron with 512 MB of RAM all the way to monsters like what RavenHQ is doing.
This particular story is about the effect of shutdown on RavenDB in production environments. Before we can do that, I have to explain the sequence of operations when RavenDB shuts down:
- Stop accepting new connections
- Abort all existing connections
- For each loaded database:
  - Shut down indexing
  - For each index:
    - Wait for current indexing batch to complete
    - Flush the index
    - Close the index
  - Close database
- Repeat the same sequence for the system database
- Done
I am skipping a lot of details, but that is the gist of it.
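To make the cost of that sequence concrete, here is a minimal sketch of it in Python. It is not RavenDB's code (RavenDB is written in C#), and the `Index`, `Database`, and `shutdown_server` names are hypothetical stand-ins that only record the order of operations. The point to notice is that everything runs one step after another.

```python
from dataclasses import dataclass


@dataclass
class Index:
    # Hypothetical stand-in for an index; it only records what happened.
    name: str
    log: list

    def wait_for_current_batch(self):
        self.log.append(f"{self.name}: batch complete")

    def flush(self):
        self.log.append(f"{self.name}: flushed")

    def close(self):
        self.log.append(f"{self.name}: closed")


@dataclass
class Database:
    # Hypothetical stand-in for a loaded database and its indexes.
    name: str
    indexes: list
    log: list

    def shutdown(self):
        self.log.append(f"{self.name}: indexing stopped")
        # Each index is handled in turn, so the total time is the sum
        # of the time every index needs to finish, flush, and close.
        for index in self.indexes:
            index.wait_for_current_batch()
            index.flush()
            index.close()
        self.log.append(f"{self.name}: closed")


def shutdown_server(databases, system_database, log):
    log.append("server: stopped accepting connections")
    log.append("server: aborted existing connections")
    # Databases are also handled one after another.
    for db in databases:
        db.shutdown()
    # The system database goes through the same sequence, last.
    system_database.shutdown()
    log.append("server: done")
```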
In this case, however, you might have noticed something interesting. What happens if we have a large number of active databases, with a large number of indexes?
In that case, we have to wait for the current indexing batch to complete, then shut down each of the indexes, then move to the next db, and do the same.
In some cases, that can take a while. In particular, long enough that we would get killed, either by automated systems that decide we have passed our threshold (in particular, iisreset gives you a mere 20 seconds to restart, which tends not to be enough) or by an admin who has run out of patience.
That sucks, because if you get killed, you don’t have the time to do a proper shutdown. You crashed and burned and died, and now you have to deal with all the details of a proper resurrection. Now, RavenDB prides itself on actually being robust in these matters. You can yank the power cord out and once everything is back up, RavenDB will recover gracefully and with no data loss.
But, recovering from such scenarios can take precious time. Especially if, as is frequently the case in such scenarios, we have a lot of databases and indexes to recover.
Because of that, we actually had to spend quite a bit of time optimizing the shutdown sequence. It sounds funny, doesn’t it? Very few people actually care about the time it takes them to shut down. But as it turned out, we have a fairly limited budget for that. In particular, we parallelized the process, shutting down all of the databases together, and all of their indexes together as well.
That means more IO contention than before, but at least we could usually meet the deadline. Speaking of which, we also added additional logging and configuration that told common hosts (such as IIS) that we really would like some more time before we would be hung out to dry.
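The idea of closing everything at once against a deadline can be sketched as follows. This is an illustration in Python using the standard `concurrent.futures` module, not RavenDB's implementation. `close_index` and `shutdown_parallel` are hypothetical names, and the sleep stands in for waiting on the current batch, flushing, and closing.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def close_index(name, seconds):
    # Hypothetical stand-in for: wait for the current batch, flush, close.
    time.sleep(seconds)
    return name


def shutdown_parallel(indexes, deadline_seconds):
    """Close every index concurrently and report what met the deadline.

    `indexes` is a list of (name, seconds_needed) pairs covering the
    indexes of all databases. Returns (finished, still_running) names.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(indexes)))
    futures = {pool.submit(close_index, name, secs): name
               for name, secs in indexes}
    # Block until everything is closed or the deadline passes,
    # whichever comes first.
    done, not_done = wait(list(futures), timeout=deadline_seconds)
    # Do not block on stragglers; a real host would kill us at this point.
    pool.shutdown(wait=False)
    return (sorted(futures[f] for f in done),
            sorted(futures[f] for f in not_done))
```

With this shape the total time is roughly that of the slowest index rather than the sum of all of them, at the price of all the flushes hitting the disk at the same time. Anything left in the second list is what would be cut off by the host and have to go through recovery on the next start.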
In my next post, I’ll discuss the other side: how hard it is to actually wake up in the morning.
Comments
When ASP.NET notifies you that it wants to unload the appdomain you can stay alive indefinitely by just not returning from your callback. Only after all callbacks have been called the countdown begins.
Certainly not what the architects intended but it works.
Tobi, that is ASP.NET, that isn't IIS. IIS gives you severe time limits.
In an azure 'worker role' environment would you be able to request more time if the server is self hosted (Raven.Database.Server.HttpServer)?
As an aside - would you recommend using the HttpServer instead of hosting via IIS (I wasn't sure how to use IIS hosting)
Andrew, I am not familiar enough with the way worker roles work. We run RavenDB in production inside IIS. Self hosted, in service mode, also has some limit on shutdown, in the sense that the service manager will give an error if you take too long, but it won't kill you. IIS will kill you if you take too long.
Ayende, out of curiosity, could you please expand a bit on the RavenHQ Monster ;)
Louis, RavenHQ is hosting a LOT of databases, and it is one of the heaviest loaded RavenDB setups. That is what I meant, it has brought to our attention a lot of interesting issues like that.
Is there a way for a custom bundle to participate in the shutdown process?
I'm developing the update cascade bundle. This bundle might have several tasks in progress that probably need to be canceled.
Jesus, We will call the IStartupTask dispose method, if it implements IDisposable, which is the hook you get when we shut down.
RavenHQ is insanely expensive. I just can't understand anyone paying that much. If I needed that capacity, I would go with MySQL, jeez.