Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 5,972 | Comments: 44,523

filter by tags archive

Things we learned from production, part IV–is your paperwork in order?


One of the major points that we worked on in the 1.2 release was making the ops team work easier. That included additional logging, like we have previously discussed, making RavenDB plays nicer with other parts of the system, adding performance counters, etc.

But those are the obvious things, and this series isn’t about the obvious things. One of the problems that we run into is that we already had a moderately good porthole into how RavenDB works.

The problem was that this porthole gave you access to the state of a single database ,which was great…

Except that in order to get a database statistics, you had to actually load that database. Imagine a system under load, and the admin need to check what is causing the load. The act of checking a database statistics will actually force that database to load, generating even more load. This is especially dangerous when we are talking about automated health monitoring tools, the fact that we monitor the health of our software shouldn’t cause it to do additional work.

In RavenDB 1.2 we have taken steps to make sure that we can report on all the active database without having to guess which ones are active and which aren’t. We have also taken additional steps to make sure that we give the admin even more information about what is going on.

You can see this pattern pretty much everywhere, in indexes, in operations, in database and server stats. There are a lot more places where we explicitly built the hooks to make it possible for the admin to figure out what is going on.

The lesson from that is that you have to provide a lot of information for the administrators, so they can figure out what is going on (and that administrator may very well be you, at 2 AM, trying to diagnose a problem). At the same time, you have to be sure to provide those hooks in a way that have minimal impact on the system. Having admin hooks in place that will put undue burden on the application is seriously not a cool thing to do.

Things we learned from production, part III–singleton thinking makes long queues


One of the more interesting things that we had to learn in production was that we aren’t an only child. It is a bit more complex than that, and I am not explaining well, let me start at the beginning.

Usually, when we work on RavenDB, we work within the scope of a single database, all of our efforts are usually scoped to that. That means that when we worked on the multi database feature for RavenDB, we actually focused on the process of loading a single database up in the air. We considered how multiple databases will interact, and we made sure that they are isolated from one another, but that was about it.

In particular, as mentioned in the previous post, starting up and shutting down were done sequentially, on a per database basis. In order to prevent issues, we had a lock on the initialize database part of the process, so two requests to the same database will not result in the same database being loaded twice.

I mentioned that we were thinking on a single database mindset, right?

Can you guess what happened?

  • Request for DB #1 – lock acquired, starting up
    • Request for DB #1 – waiting for lock to release
    • Request for DB #1 – waiting for lock to release
    • Request for DB #1 – waiting for lock to release
  • DB initialized, lock released
  • All requests are now freed and can be processed.

What happen when we have multiple databases, however?

  • Request for DB #1 – lock acquired, starting up
    • Request for DB #1 – waiting for lock to release
    • Request for DB #2 – waiting for lock to release
    • Request for DB #3 – waiting for lock to release
  • DB initialized, lock released
  • Request for DB #2 – lock released, lock acquired, starting up
    • Request for DB #3 – waiting for lock to release

You guessed it, we actually had a global lock for starting (or disposing, for that matter) databases. That meant that a single db that took time to start would impact other databases.

More importantly, it would means that other requests, which were waiting for that database to load and then had to load their own database, had far less time to actually do the processing they needed. Which meant that they were far more likely to run into the request time limit and be aborted by IIS. Which left them in an inconsistent state. Which was a nightmare to figure out.

We resolved this issue by making sure that the lock is now handled only on the same database, and that we won’t lock forever, if after a while we still don’t have the db, we will error early and give you a 503 Service Unavailable error until the db is ready to rock.

Things we learned from production, part II–wake up or I kill you dead


Getting started is probably easier than shutting down, I mean, no one is going to begrudge us some time to get our feet from under us, right?

As it turned out, this assumption is wrong on quite a few levels.

To start with, hosts such as IIS / Windows Service Manager will give you a certain time to start before they decide that you are hang and ruthlessly execute you without even thinking twice about it. This doesn’t even include the issue of admins with people breathing down their necks who assume that a taste of mortality must convince RavenDB to try even harder then next time it is started after then 7th time it was killed for not starting fast enough.

Because killing us during startup is pretty much the same as a standard crash, it means that we need to run recovery after this happened, which means that the next time is going to take longer, and then…

I think you can get the picture, right?

But the issue here is actually much more complex.

It is actually easier to recover from a real crash (something like a process termination or kill –9). It is harder when it isn’t a real crash, but something like IIS just recycling the AppDomain. The reason it is harder is that anything that is scoped to the OS, like file handles, unmanaged resources, etc, are actually still alive. It means that during the crash, you have to be very careful about detecting that you are crashing and cleaning up after you properly.

Moving back to the actual startup issue, so we have to startup fairly quickly, even if we just crashed. That makes sense, I guess. Now, that is fine and dandy, but that is just for the system database, what happens when you want to access a non system database (for example, the Northwind database)?

In RavenDB, we load those databases lazily, so on the first request to that particular database, we will load it.

As it turned out, this simple and fairly obvious decision has caused a no end of problems.

Starting up a database may take a while, in bad cases, that while may be long enough that the request time out. Now, what does it means, request time out? You might get a 408 Request Timeout from the server, but that is the client perspective.

What happens on the server? Well, IIS handed over control of the request to RavenDB, and as far as IIS is concerned, RavenDB is sitting there doing nothing, well above its time limit. Now, IIS doesn’t have a way to tell RavenDB, stop processing this request. So what do you think it does?

Welcome to the nice land of Thread.Abort().

Now, if you have ever read about Thread.Abort(), you probably know that every single reference to that is filled with warnings about the need to be very careful about what you are doing, that it is a very bad idea in general and that you should take care to never use it. The reason it is such a bad idea is that you basically cut the thread at mid execution, leaving it no chance at all to actually handle things. It is an easy way to violate invariants.

In particular, it is a good way for your cleanup to never happen. Think about it, we are in the middle of our constructor, opening files, settings things up, and suddenly the floor is yanked right out from under us.

As it turned out, in those cases, we would leak some stuff out. The next time that you tried to access the database, you would get an error that said that the files were already opened by someone else. (To make things worse, those were unmanaged resources, they wouldn’t get cleaned up by the system when GC is run.

That led to errors that were extremely hard to figure out. Because they would only occur when running at a high load, with a db that crashed and was now recovering, and with a few other databases waiting as well. And going over the code, thinking multi threading thoughts, none of that works. At some point, I put so many locks there, just to figure out what is going on, that the code looked like this:

But the actual problem wasn’t another thread corrupting state, the problem was that the current thread was ruthless killed in mid operation.

Once we figured that one out, it was straightforward, but in no way easy, to device a solution. We made sure that our db init code was robust for thread aborts, and then we moved the actual db initialization to a separate thread, one that wasn’t controlled by IIS, so we could actually get things done without having a hard time limit.

In my next post, I’ll discuss the fallacy of the singleton and how much pain it caused us.

Things we learned from production, part I–shutting down is hard to do


This series of posts is going to talk about the things that we have learned ourselves and via our customers about running RavenDB in production. Those customers include people running on a single database on a Celeron 600 Mhz with 512 MB all the way to monsters like what RavenHQ is doing.

This particular story is about the effect of shutdown on RavenDB in production environments. Before we can do that, I have to explain the sequence of operations when RavenDB shuts down:

  • Stop accepting new connections
  • Abort all existing connections
  • For each loaded database:
    • Shut down indexing
    • For each index:
      • Wait for current indexing batch to complete
      • Flush the index
      • Close the index
    • Close database
  • Repeat the same sequence for the system database
  • Done

I am skipping a lot of details, but that is the gist of it.

In this case, however, you might have noticed something interesting. What happen if we have a large number of active databases, with a large number of actual indexes?

In that case, we have to wait for the current indexing batch to complete, then shut down each of the indexes, then move to the next db, and do the same.

In some cases, that can take a while. In particular, long enough while that we would get killed. Either by automated systems that decided we passed our threshold (in particular, iisreset gives you mere 20 seconds to restart, which tend to be not enough) or by an out of patience admin.

That sucks, because if you get killed, you don’t have the time to do a proper shutdown. You crashed & burned and died and now you have to deal with all the details of proper resurrection. Now, RavenDB prides itself on actually being a regular in this matters. You can yank the power cord out and once everything is back up, RavenDB will recover gracefully and with no data loss.

But, recovering from such scenarios can take precious time. Especially if, as is frequently the case in such scenarios, we have a lot of databases and indexes to recover.

Because of that, we actually had to spend quite a bit of time on optimizing the shut down sequence. It sounds funny, isn’t it? Very few people actually care about the time it takes them to shut down. But as it turned out, we have a fairly limited budget for that.  In particular, we parallelized the process of shutting down all of the databases together, and all of their indexes together as well.

That means more IO contention than before, but at least we could usually meet the deadline.  Speaking of which, we also added additional logging and configuration that told common hosts (such as IIS) that we really would like some more time before we would be hang out to dry.

On my next post, I’ll discuss the other side, how hard it is to actually wake up in the morning Smile.

The subtle distinction between snapshot isolation and read committed


I am using db transaction isolation levels for a reason here, they make it easier to reason about what is going on.

In short, RavenDB currently supports two storage engine options, Esent and Munin. Esent is what we usually use for production, and Munin is usually used for testing. We wrote Munin as a transactional, fully managed, storage engine a while ago. And it has mostly served us well, but Esent is what we usually aim for. That is the production use case.

We recently made a few changes that resulted in test failures on Munin, only in one run out of two dozens or so, but always worked with Esent.

Naturally, because of the random nature of the problem, I suspected the issue being a race condition in Munin. That happened in the past ,and obviously they are very hard to root out completely. But after finally isolating everything down to a simple test case (writing to two “tables” with associated information), I finally figured it out.

Munin is working just fine, it hasn’t got a spec of a problem. It is just that, when we built it, I built it to support Read Commited Isolation Level. While Esent is providing Snapshot isolation level. The code assumes snapshot isolation level at some pretty level. Obviously, this sort of thing shows up as a race condition, and it is extremely hard to debug, as anyone who ever dealt with those issues in RDBMS can testify.

So my task now is not to fix a bug in Munin, but to actually implement snapshot isolation. As it turned out, actually moving Munin from read committed isolation to snapshot isolation was a lot easier than finding the problem.

I am torn between being pleased that I found the issue, happy that Munin doesn’t have a bug and pissed that it took me that long.

NuGet Perf, The Final Part – Load Testing – Source Code


This is just some logistical cleanups.

The code for the entire series can be found here: https://github.com/ayende/nuget.perf

No, I’ll not do a similar SQL version, if you want to, I would be very interested in seeing one, but that isn’t something that I intend to do.

Yes, it is a simple and trivial implementation, but that was pretty much the whole point. Being able to get to that scale without actually doing anything special is what we strive for in RavenDB.

NuGet Perf, The Final Part – Load Testing – Results ^ 2


After seeing how well RavenDB does in perf testing, I decided to take it up a notch.

  • Starting from 10 users, with a  step duration of 1 sec, add 50 users for each step, all the way to 3,000.
  • Start with a warm up period of 20 seconds, then run the test for 10 minutes.

Let us see what happens, okay?

Just to be clear, this is a RavenDB application running with three thousands concurrent users, on an off the shelve laptop while I was busy doing other stuff.

One word of warning before hand, because I run everything on a single machine, just running so many users on the machine significantly slowed down how RavenDB is reacting. Basically, the code for managing the perf test took so many resources that RavenDB had to fight to get some to actually answer the queries.

Scared yet, because here are the results in graph form.

image

Now you can actually see that we have some fluctuations in the graphs, the number of users grows and grows until we get to 3,000 and we have 0.37 seconds response times.

Again, I remind you, we have done zero optimizations and this is idiomatic RavenDB code. And we were able to serve requests at a frankly pretty amazing rate of speed.

And here are they in their full details:

 

Load Test Summary
Test Run Information
Load test name LoadTest1
Description  
Start time 04/09/12 15:28:48
End time 04/09/12 15:38:48
Warm-up duration 00:00:20
Duration 00:10:00
Controller Local run
Number of agents 1
Run settings used Load
Overall Results
Max User Load 3,000
Tests/Sec 196
Tests Failed 0
Avg. Test Time (sec) 14.3
Transactions/Sec 0
Avg. Transaction Time (sec) 0
Pages/Sec 741
Avg. Page Time (sec) 0.37
Requests/Sec 741
Requests Failed 0
Requests Cached Percentage 0
Avg. Response Time (sec) 0.37
Avg. Content Length (bytes) 3,080
Key Statistic: Top 5 Slowest Pages
URL (Link to More Details) 95% Page Time (sec)
Page 1 0.83
Page 0 0.82
Page 2 0.82
Page 1 0.82
http://localhost:52688/api/search 0.81
Key Statistic: Top 5 Slowest Tests
Name 95% Test Time (sec)
Browsing 20.8
BrowseAndSearch 19.8
Searching 12.9
Test Results
Name Scenario Total Tests Failed Tests (% of total) Avg. Test Time (sec)
Browsing Load 31,843 0 (0) 17.4
BrowseAndSearch Load 33,989 0 (0) 16.8
Searching Load 51,650 0 (0) 10.8
Page Results
URL (Link to More Details) Scenario Test Avg. Page Time (sec) Count
Page 2 Load Browsing 0.40 32,338
Search yui Load Searching 0.39 52,597
Page 1 Load Browsing 0.39 32,627
http://localhost:52688/api/search Load BrowseAndSearch 0.39 68,576
Page 0 Load Browsing 0.38 32,803
Search grid Load Searching 0.38 52,283
Page 1 Load BrowseAndSearch 0.37 34,766
Page 0 Load BrowseAndSearch 0.36 34,982
Search debug Load Searching 0.35 51,991
Search ravendb Load Searching 0.33 51,846
Transaction Results
Name Scenario Test Response Time (sec) Elapsed Time (sec) Count
System Under Test Resources
Machine Name % Processor Time Available Memory at Test Completion (Mb)
Controller and Agents Resources
Machine Name % Processor Time Available Memory at Test Completion (Mb)
RAVEN 85.4 1,203
Errors
Type Subtype Count Last Message

Note that the reason fro the high CPU usage is that the tests and RavenDB were running on the same machine.

NuGet Perf, The Final Part – Loading Testing – Results


The test was run locally (no network involved ) on a Lenovo W520 laptop with 8 cores & 8 GB RAM with an SSD card. The storage engine we used was Esent, Safe Transactions. Default RavenDB configuration, running in console, with logging disabled.

We took the most obvious approach both in the code we wrote and the test approach. I am pretty sure that I’ll get a lot of helpful suggestions about the load testing. The code is available here, and you are more than welcome to take it for a spin and get your own results. What is important for me to note is that we have done exactly zero performance tuning. That is relevant to both the index we use, to the code that we wrote, everything. I just wrote things down, and didn’t worry about performance, even though this code is going to go through a load test.

Why don’t I worry about it? Because RavenDB is setup to do the Right Thing. It will self optimize itself without you need to take care of that.

With that said, here are the test results:

image

You can see that the red line is the number of users we have, and we have this worrying green line that seems to go crazy…

Except that this is actually the number of page served. The part that we care about is actually the Avg. Page Time, and that is the blue line.

This line, however, is basically flat no matter the load.

Here are the test results in details

 
Load Test Summary
Test Run Information
Load test name LoadTest1
Description  
Start time 04/09/12 14:16:38
End time 04/09/12 14:21:38
Warm-up duration 00:00:20
Duration 00:05:00
Controller Local run
Number of agents 1
Run settings used Run Settings1
Overall Results
Max User Load 300
Tests/Sec 20.0
Tests Failed 0
Avg. Test Time (sec) 12.5
Transactions/Sec 0
Avg. Transaction Time (sec) 0
Pages/Sec 77.1
Avg. Page Time (sec) 0.0062
Requests/Sec 77.1
Requests Failed 0
Requests Cached Percentage 0
Avg. Response Time (sec) 0.0062
Avg. Content Length (bytes) 3,042
Key Statistic: Top 5 Slowest Pages
URL (Link to More Details) 95% Page Time (sec)
Page 0 0.018
Page 0 0.018
Page 2 0.014
http://localhost:52688/api/search 0.014
Search ravendb 0.014
Key Statistic: Top 5 Slowest Tests
Name 95% Test Time (sec)
Browsing 19.3
BrowseAndSearch 17.6
Searching 10.6
Test Results
Name Scenario Total Tests Failed Tests (% of total) Avg. Test Time (sec)
Browsing Load 1,533 0 (0) 16.0
BrowseAndSearch Load 1,685 0 (0) 15.0
Searching Load 2,770 0 (0) 9.00
Page Results
URL (Link to More Details) Scenario Test Avg. Page Time (sec) Count
Page 0 Load Browsing 0.0072 1,629
Page 0 Load BrowseAndSearch 0.0071 1,783
http://localhost:52688/api/search Load BrowseAndSearch 0.0064 3,443
Search ravendb Load Searching 0.0064 2,798
Page 1 Load Browsing 0.0063 1,617
Page 2 Load Browsing 0.0063 1,580
Page 1 Load BrowseAndSearch 0.0063 1,760
Search debug Load Searching 0.0055 2,810
Search grid Load Searching 0.0055 2,839
Search yui Load Searching 0.0054 2,866
Transaction Results
Name Scenario Test Response Time (sec) Elapsed Time (sec) Count
System Under Test Resources
Machine Name % Processor Time Available Memory at Test Completion (Mb)
Controller and Agents Resources
Machine Name % Processor Time Available Memory at Test Completion (Mb)
RAVEN 13.0 1,356
Errors
Type Subtype Count Last Message

 

You can dig in and look at the data, it is quite interesting. Under the load of 300 users, the average page response time was… 0.0062 seconds.

And RavenDB was using just 13% of the CPU, and that include running the agents running the tests.

In my next post, we will go totally crazy…

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Production postmortem (5):
    29 Jul 2015 - The evil licensing code
  2. Career planning (6):
    24 Jul 2015 - The immortal choices aren't
  3. API Design (7):
    20 Jul 2015 - We’ll let the users sort it out
  4. What is new in RavenDB 3.5 (3):
    15 Jul 2015 - Exploring data in the dark
  5. The RavenDB Comic Strip (3):
    28 May 2015 - Part III – High availability & sleeping soundly
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats