Things we learned from production, part II–wake up or I kill you dead
Getting started is probably easier than shutting down, I mean, no one is going to begrudge us some time to get our feet from under us, right?
As it turned out, this assumption is wrong on quite a few levels.
To start with, hosts such as IIS / Windows Service Manager will give you a certain time to start before they decide that you are hang and ruthlessly execute you without even thinking twice about it. This doesn’t even include the issue of admins with people breathing down their necks who assume that a taste of mortality must convince RavenDB to try even harder then next time it is started after then 7th time it was killed for not starting fast enough.
Because killing us during startup is pretty much the same as a standard crash, it means that we need to run recovery after this happened, which means that the next time is going to take longer, and then…
I think you can get the picture, right?
But the issue here is actually much more complex.
It is actually easier to recover from a real crash (something like a process termination or kill –9). It is harder when it isn’t a real crash, but something like IIS just recycling the AppDomain. The reason it is harder is that anything that is scoped to the OS, like file handles, unmanaged resources, etc, are actually still alive. It means that during the crash, you have to be very careful about detecting that you are crashing and cleaning up after you properly.
Moving back to the actual startup issue, so we have to startup fairly quickly, even if we just crashed. That makes sense, I guess. Now, that is fine and dandy, but that is just for the system database, what happens when you want to access a non system database (for example, the Northwind database)?
In RavenDB, we load those databases lazily, so on the first request to that particular database, we will load it.
As it turned out, this simple and fairly obvious decision has caused a no end of problems.
Starting up a database may take a while, in bad cases, that while may be long enough that the request time out. Now, what does it means, request time out? You might get a 408 Request Timeout from the server, but that is the client perspective.
What happens on the server? Well, IIS handed over control of the request to RavenDB, and as far as IIS is concerned, RavenDB is sitting there doing nothing, well above its time limit. Now, IIS doesn’t have a way to tell RavenDB, stop processing this request. So what do you think it does?
Welcome to the nice land of Thread.Abort().
Now, if you have ever read about Thread.Abort(), you probably know that every single reference to that is filled with warnings about the need to be very careful about what you are doing, that it is a very bad idea in general and that you should take care to never use it. The reason it is such a bad idea is that you basically cut the thread at mid execution, leaving it no chance at all to actually handle things. It is an easy way to violate invariants.
In particular, it is a good way for your cleanup to never happen. Think about it, we are in the middle of our constructor, opening files, settings things up, and suddenly the floor is yanked right out from under us.
As it turned out, in those cases, we would leak some stuff out. The next time that you tried to access the database, you would get an error that said that the files were already opened by someone else. (To make things worse, those were unmanaged resources, they wouldn’t get cleaned up by the system when GC is run.
That led to errors that were extremely hard to figure out. Because they would only occur when running at a high load, with a db that crashed and was now recovering, and with a few other databases waiting as well. And going over the code, thinking multi threading thoughts, none of that works. At some point, I put so many locks there, just to figure out what is going on, that the code looked like this:
But the actual problem wasn’t another thread corrupting state, the problem was that the current thread was ruthless killed in mid operation.
Once we figured that one out, it was straightforward, but in no way easy, to device a solution. We made sure that our db init code was robust for thread aborts, and then we moved the actual db initialization to a separate thread, one that wasn’t controlled by IIS, so we could actually get things done without having a hard time limit.
In my next post, I’ll discuss the fallacy of the singleton and how much pain it caused us.