Ayende @ Rahien

Oct 01 2012

Things we learned from production, part V–Is that a wrench in your pocket or are you happy to see me?

time to read 1 min | 179 words


This post is actually going to be a short one, because it should have been pretty obvious.

An admin in production doesn’t have the same toolset that developers have on their dev environment. The ability to break in and inspect the state of the system at any time is something that we developers usually take for granted, but it is something that is quite impossible to do in production.

That means that we have to provide the admin with the ability to inspect our current state. We also need to give the admin the ability to make some changes. The most common example is changing system settings without taking the entire system down, but there are other things. Being able to force the system to do something, such as unload a database, force a full GC cycle or tweak the perf hints, without getting seven people to sign off on taking the system down (and without having to do so at 3 AM) is a really good thing, as far as pretty much any admin is concerned.
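To make the idea concrete, here is a minimal sketch of a runtime admin surface. It is written in Python for brevity and is not RavenDB's actual API; `AdminConsole`, `force_full_gc` and the command name are hypothetical names invented for this illustration.

```python
import gc
from typing import Callable, Dict, Optional


class AdminConsole:
    """Hypothetical registry of things an admin can do to a live process."""

    def __init__(self) -> None:
        self._settings: Dict[str, str] = {}
        self._commands: Dict[str, Callable[[], str]] = {}

    def set_setting(self, key: str, value: str) -> None:
        # Settings change in place; no restart is involved.
        self._settings[key] = value

    def get_setting(self, key: str, default: Optional[str] = None) -> Optional[str]:
        return self._settings.get(key, default)

    def register(self, name: str, command: Callable[[], str]) -> None:
        self._commands[name] = command

    def run(self, name: str) -> str:
        # Unknown commands fail loudly rather than silently doing nothing.
        if name not in self._commands:
            raise KeyError(f"unknown admin command: {name}")
        return self._commands[name]()


def force_full_gc() -> str:
    """Run a full collection and report how many objects were found unreachable."""
    collected = gc.collect()
    return f"collected {collected} objects"
```

An admin-facing endpoint would then call something like `console.run("force-gc")` or `console.set_setting(...)` while the server keeps running.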

Sep 28 2012

Things we learned from production, part IV–is your paperwork in order?

time to read 2 min | 362 words

10 comments


One of the major points that we worked on in the 1.2 release was making the ops team’s work easier. That included additional logging, like we have previously discussed, making RavenDB play nicer with other parts of the system, adding performance counters, etc.

But those are the obvious things, and this series isn’t about the obvious things. One of the problems that we ran into is that we already had a moderately good porthole into how RavenDB works.

The problem was that this porthole gave you access to the state of a single database, which was great…

Except that in order to get a database’s statistics, you had to actually load that database. Imagine a system under load, where the admin needs to check what is causing the load. The act of checking a database’s statistics will actually force that database to load, generating even more load. This is especially dangerous when we are talking about automated health monitoring tools. The fact that we monitor the health of our software shouldn’t cause it to do additional work.

In RavenDB 1.2 we have taken steps to make sure that we can report on all the active databases without having to guess which ones are active and which aren’t. We have also taken additional steps to make sure that we give the admin even more information about what is going on.
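The principle of monitoring that never triggers a load can be sketched like this. This is an illustrative Python sketch, not RavenDB's code; `DatabaseHost`, `active_stats` and the `documents` field are assumptions made up for the example, and the sketch ignores thread safety.

```python
from typing import Callable, Dict, List


class DatabaseHost:
    """Hypothetical host that loads databases lazily on first request."""

    def __init__(self, loader: Callable[[str], dict]) -> None:
        self._loader = loader
        self._loaded: Dict[str, dict] = {}
        self.load_count = 0  # how many (expensive) loads have happened

    def get(self, name: str) -> dict:
        # Normal request path: load on first use.
        if name not in self._loaded:
            self.load_count += 1
            self._loaded[name] = self._loader(name)
        return self._loaded[name]

    def active_stats(self) -> List[dict]:
        # Monitoring path: only looks at what is already in memory,
        # so asking for stats never causes a database to load.
        return [
            {"name": name, "documents": db["documents"]}
            for name, db in sorted(self._loaded.items())
        ]
```

A health check that polls `active_stats()` every few seconds sees every loaded database and adds no load of its own, whereas one that calls `get()` for each known database would wake all of them up.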

You can see this pattern pretty much everywhere, in indexes, in operations, in database and server stats. There are a lot more places where we explicitly built the hooks to make it possible for the admin to figure out what is going on.

The lesson from that is that you have to provide a lot of information for the administrators, so they can figure out what is going on (and that administrator may very well be you, at 2 AM, trying to diagnose a problem). At the same time, you have to be sure to provide those hooks in a way that has minimal impact on the system. Having admin hooks in place that put undue burden on the application is seriously not a cool thing to do.

Sep 27 2012

Things we learned from production, part III–singleton thinking makes long queues

time to read 3 min | 471 words

3 comments


One of the more interesting things that we had to learn in production was that we aren’t an only child. It is a bit more complex than that, and I am not explaining well, let me start at the beginning.

Usually, when we work on RavenDB, we work within the scope of a single database, all of our efforts are usually scoped to that. That means that when we worked on the multi database feature for RavenDB, we actually focused on the process of loading a single database up in the air. We considered how multiple databases will interact, and we made sure that they are isolated from one another, but that was about it.

In particular, as mentioned in the previous post, starting up and shutting down were done sequentially, on a per database basis. In order to prevent issues, we had a lock on the initialize database part of the process, so two requests to the same database will not result in the same database being loaded twice.

I mentioned that we were thinking in a single database mindset, right?

Can you guess what happened?

Request for DB #1 – lock acquired, starting up

Request for DB #1 – waiting for lock to release
Request for DB #1 – waiting for lock to release
Request for DB #1 – waiting for lock to release

DB initialized, lock released
All requests are now freed and can be processed.

What happens when we have multiple databases, however?

Request for DB #1 – lock acquired, starting up

Request for DB #1 – waiting for lock to release
Request for DB #2 – waiting for lock to release
Request for DB #3 – waiting for lock to release

DB initialized, lock released
Request for DB #2 – lock released, lock acquired, starting up

Request for DB #3 – waiting for lock to release

You guessed it, we actually had a global lock for starting (or disposing, for that matter) databases. That meant that a single db that took time to start would impact other databases.

More importantly, it meant that other requests, which were waiting for that database to load and then had to load their own database, had far less time to actually do the processing they needed. Which meant that they were far more likely to run into the request time limit and be aborted by IIS. Which left them in an inconsistent state. Which was a nightmare to figure out.

We resolved this issue by making sure that the lock is now handled only on the same database, and that we won’t wait on it forever. If after a while we still don’t have the db, we will error early and give you a 503 Service Unavailable error until the db is ready to rock.
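The shape of that fix, one lock per database plus a bounded wait, can be sketched as follows. This is a Python illustration under assumptions, not the RavenDB implementation: `DatabaseLoader`, `ServiceUnavailable` and the `open_db` callback are hypothetical names.

```python
import threading
from typing import Callable, Dict


class ServiceUnavailable(Exception):
    """Stands in for answering the request with HTTP 503."""


class DatabaseLoader:
    def __init__(self, open_db: Callable[[str], object],
                 timeout_seconds: float = 5.0) -> None:
        self._open_db = open_db
        self._timeout = timeout_seconds
        # Guards only the dictionary of locks; never held during a load.
        self._registry_lock = threading.Lock()
        self._locks: Dict[str, threading.Lock] = {}
        self._databases: Dict[str, object] = {}

    def _lock_for(self, name: str) -> threading.Lock:
        with self._registry_lock:
            return self._locks.setdefault(name, threading.Lock())

    def get(self, name: str) -> object:
        db = self._databases.get(name)
        if db is not None:
            return db
        lock = self._lock_for(name)
        # Wait only for *this* database, and only for a bounded time.
        if not lock.acquire(timeout=self._timeout):
            raise ServiceUnavailable(f"{name} is still loading")
        try:
            # Someone else may have finished the load while we waited.
            if name not in self._databases:
                self._databases[name] = self._open_db(name)
            return self._databases[name]
        finally:
            lock.release()
```

With this layout, a slow start of DB #1 delays only requests for DB #1, and even those give up after `timeout_seconds` instead of sitting in the queue until the host aborts them.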

Sep 25 2012

Things we learned from production, part II–wake up or I kill you dead

time to read 5 min | 826 words

11 comments


Getting started is probably easier than shutting down. I mean, no one is going to begrudge us some time to get our feet under us, right?

As it turned out, this assumption is wrong on quite a few levels.

To start with, hosts such as IIS / Windows Service Manager will give you a certain time to start before they decide that you are hung and ruthlessly execute you without even thinking twice about it. This doesn’t even include the issue of admins with people breathing down their necks who assume that a taste of mortality must convince RavenDB to try even harder the next time it is started, after the 7th time it was killed for not starting fast enough.

Because killing us during startup is pretty much the same as a standard crash, it means that we need to run recovery after this happened, which means that the next time is going to take longer, and then…

I think you can get the picture, right?

But the issue here is actually much more complex.

It is actually easier to recover from a real crash (something like a process termination or kill -9). It is harder when it isn’t a real crash, but something like IIS just recycling the AppDomain. The reason it is harder is that anything that is scoped to the OS, like file handles, unmanaged resources, etc, is actually still alive. It means that during the crash, you have to be very careful about detecting that you are crashing and cleaning up after yourself properly.

Moving back to the actual startup issue, we have to start up fairly quickly, even if we just crashed. That makes sense, I guess. Now, that is fine and dandy, but that is just for the system database. What happens when you want to access a non system database (for example, the Northwind database)?

In RavenDB, we load those databases lazily, so on the first request to that particular database, we will load it.

As it turned out, this simple and fairly obvious decision has caused no end of problems.

Starting up a database may take a while. In bad cases, that while may be long enough that the request times out. Now, what does it mean, request time out? You might get a 408 Request Timeout from the server, but that is the client perspective.

What happens on the server? Well, IIS handed over control of the request to RavenDB, and as far as IIS is concerned, RavenDB is sitting there doing nothing, well above its time limit. Now, IIS doesn’t have a way to tell RavenDB, stop processing this request. So what do you think it does?

Welcome to the nice land of Thread.Abort().

Now, if you have ever read about Thread.Abort(), you probably know that every single reference to that is filled with warnings about the need to be very careful about what you are doing, that it is a very bad idea in general and that you should take care to never use it. The reason it is such a bad idea is that you basically cut the thread at mid execution, leaving it no chance at all to actually handle things. It is an easy way to violate invariants.

In particular, it is a good way for your cleanup to never happen. Think about it, we are in the middle of our constructor, opening files, setting things up, and suddenly the floor is yanked right out from under us.

As it turned out, in those cases, we would leak some stuff out. The next time that you tried to access the database, you would get an error that said that the files were already opened by someone else. (To make things worse, those were unmanaged resources, so they wouldn’t get cleaned up by the system when GC is run.)

That led to errors that were extremely hard to figure out. Because they would only occur when running at a high load, with a db that crashed and was now recovering, and with a few other databases waiting as well. And going over the code, thinking multi threading thoughts, none of that works. At some point, I put so many locks there, just to figure out what is going on, that the code looked like this:

But the actual problem wasn’t another thread corrupting state, the problem was that the current thread was ruthlessly killed in mid operation.

Once we figured that one out, it was straightforward, but in no way easy, to devise a solution. We made sure that our db init code was robust for thread aborts, and then we moved the actual db initialization to a separate thread, one that wasn’t controlled by IIS, so we could actually get things done without having a hard time limit.
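The second half of that solution, running initialization on a thread the host does not own while the request waits with a deadline, has roughly the following shape. This is a Python sketch under assumptions: Python has no equivalent of Thread.Abort(), so it only shows how a request can give up without interrupting the work, and `BackgroundInit` is a hypothetical name.

```python
import threading
from typing import Callable, Optional


class BackgroundInit:
    """Runs `init` once on its own thread; callers wait with a deadline."""

    def __init__(self, init: Callable[[], object]) -> None:
        self._init = init
        self._done = threading.Event()
        self._result: object = None
        self._error: Optional[Exception] = None
        self._start_lock = threading.Lock()
        self._started = False

    def _run(self) -> None:
        try:
            self._result = self._init()
        except Exception as exc:
            # Remember the failure so waiting callers can see it.
            self._error = exc
        finally:
            self._done.set()

    def wait(self, timeout: float) -> object:
        with self._start_lock:
            if not self._started:
                self._started = True
                threading.Thread(target=self._run, daemon=True).start()
        # The request thread only waits; if it gives up (or is killed),
        # the initialization thread keeps going undisturbed.
        if not self._done.wait(timeout):
            raise TimeoutError("database is still initializing")
        if self._error is not None:
            raise self._error
        return self._result
```

A request that times out can be answered with an error right away, and a later request calling `wait()` again simply picks up the result once the background thread has finished.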

In my next post, I’ll discuss the fallacy of the singleton and how much pain it caused us.

Sep 20 2012

Things we learned from production, part I–shutting down is hard to do

time to read 3 min | 507 words

9 comments


This series of posts is going to talk about the things that we have learned ourselves and via our customers about running RavenDB in production. Those customers include people running a single database on a Celeron 600 MHz with 512 MB all the way to monsters like what RavenHQ is doing.

This particular story is about the effect of shutdown on RavenDB in production environments. Before we can do that, I have to explain the sequence of operations when RavenDB shuts down:

Stop accepting new connections
Abort all existing connections
For each loaded database:
  Shut down indexing
  For each index:
    Wait for current indexing batch to complete
    Flush the index
    Close the index
  Close database
Repeat the same sequence for the system database
Done

I am skipping a lot of details, but that is the gist of it.

In this case, however, you might have noticed something interesting. What happens if we have a large number of active databases, with a large number of actual indexes?

In that case, we have to wait for the current indexing batch to complete, then shut down each of the indexes, then move to the next db, and do the same.

In some cases, that can take a while. In particular, a long enough while that we would get killed, either by automated systems that decided we passed our threshold (in particular, iisreset gives you a mere 20 seconds to restart, which tends to be not enough) or by an out of patience admin.

That sucks, because if you get killed, you don’t have the time to do a proper shutdown. You crashed & burned and died and now you have to deal with all the details of proper resurrection. Now, RavenDB prides itself on actually being a regular in these matters. You can yank the power cord out and once everything is back up, RavenDB will recover gracefully and with no data loss.

But, recovering from such scenarios can take precious time. Especially if, as is frequently the case in such scenarios, we have a lot of databases and indexes to recover.

Because of that, we actually had to spend quite a bit of time on optimizing the shut down sequence. It sounds funny, doesn’t it? Very few people actually care about the time it takes them to shut down. But as it turned out, we have a fairly limited budget for that. In particular, we parallelized the process of shutting down all of the databases together, and all of their indexes together as well.

That means more IO contention than before, but at least we could usually meet the deadline. Speaking of which, we also added additional logging and configuration that told common hosts (such as IIS) that we really would like some more time before we would be hung out to dry.

In my next post, I’ll discuss the other side, how hard it is to actually wake up in the morning.
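A parallel shutdown of that kind can be sketched as follows. This is illustrative Python, not RavenDB's code; `Index`, `Database` and `shutdown_all` are hypothetical stand-ins, and `Index.close` stands for the whole "wait for the batch, flush, close" step.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


class Index:
    def __init__(self, name: str) -> None:
        self.name = name
        self.closed = False

    def close(self) -> None:
        # Stand-in for: wait for the current batch, flush, close.
        self.closed = True


class Database:
    def __init__(self, name: str, indexes: List[Index]) -> None:
        self.name = name
        self.indexes = indexes
        self.closed = False


def shutdown_all(databases: List[Database], max_workers: int = 8) -> None:
    # Two separate pools: database tasks block on index tasks, so sharing
    # one bounded pool between them could deadlock when it fills up.
    with ThreadPoolExecutor(max_workers) as index_pool, \
            ThreadPoolExecutor(max_workers) as db_pool:

        def close_db(db: Database) -> None:
            # All indexes of this database close concurrently.
            list(index_pool.map(Index.close, db.indexes))
            db.closed = True

        # All databases shut down concurrently; list() waits for them
        # and re-raises the first failure, if any.
        list(db_pool.map(close_db, databases))
```

The total time is then closer to the slowest single database than to the sum of all of them, at the cost of every index flushing to disk at roughly the same moment.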

Oren Eini

CEO of RavenDB
