Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 10 | Comments: 37

filter by tags archive

Production postmortemThe case of the man in the middle

time to read 3 min | 553 words

One of the most frustrating things when you dealing with production issues is when the problem is not in our product, but elsewhere. In particular, this post is dedicated to the hard work done by many anti virus products, in particular, to make our life harder.

Let us take a look at the following quote, taken from the ESET NOD32 Anti Virus knowledge base (emphasis mine):

By default, your ESET product automatically detects programs that are used as web browsers and email clients, and adds them to the list of programs that the internal proxy scans. This can cause loss of internet connectivity or other undesired results with applications that use network features but are not web browsers/email clients.

Yes, it can. In fact, it very often does.

Previously, we looked at a similar issue with Anti Virus slowing down I/O enough to cause us to slowly die. But in this case, the issue is a lot more subtle.

Because it is doing content filtering, it tends to put a much higher overhead on the system resources, which means that as far as the user is concerned, RavenDB is slow. We actually developed features specifically to handle this scenario. The traffic watch mode will tell you how much time you spend on the server side, and we have added a feature that will make RavenDB account for the internal work each query is doing, so we can tell where the actual cost is.

You can enable that by issuing:

GET databases/Northwind/debug/enable-query-timing

And one that is setup, you can get a good idea about what is costly in the query, as far as RavenDB is concerned. Here is an example of a very slow query:

image

You can see that the issue is that we are issuing a very wide range query, so most of the time is spent in inside Lucene. Other examples might be ridicilously complex queries, which result in high parsing time (we have seen queries in the hundreds of KB range). Or loading a lot of big documents, or… you get the drift. If we see that the server thinks that a query is fast, but the overall time is slow, we know to blame the network.

But an even more insidious issue is that this would drop requests,  consistently and randomly (and yes, I know that those are contradictions, it was consistently dropping requests in a random pattern that seemed explicitly designed to thwart figuring out what is going on). Leading to things breaking, and escalated support calls. “RavenDB is broken” leads to a lot of headache, and a burning desire to hit something when you figure out that not only isn’t it your fault, but the underlying reason is actively trying to prevent you from figuring it out (I assume it is to deal with viruses that try to shut it off), which lead to really complex find facting sessions.

That is more annoying because it seems that the issue there was a bug in respecting keep alive sessions for authenticated requests under some scenarios, in the AV product in question! Absolutely not fun!

Pointer arithmetic and dynamic HTML generation FTW–at 2 AM

time to read 2 min | 388 words

I just finished writing the following code, the time is 2 AM local time, and I am chasing a conceptual issue deep in the guts of Voron (RavenDB’s internal storage system).

image

By any measure of the word, this code should frighten anyone who reads it. For what I think can we can all safely assume are self evident reasons.

So why am I so very happy about this kind of code? Because the output of this code is this:

image

This is pretty trivial tree view for internal data structures inside of Voron.

But the point here isn’t that this code is probably going to be a no-go zone by tomorrow morning. This is code explicitly designed for that. In fact, the calling method for this is:

image

So this is a piece of code that we are trying very hard to ensure that will never be called unless we are actively requesting it in a debug session. In other words, this isn’t production code. It is scaffolding code. The code that you’ll call when you lost your way, and can no longer figure out what is going on using the standard tools.

This kind of visualization is cheap & dirty trick, but it means that instead of having to reason about this kind of thing, I can immediately see what is going on.

For that matter, if you look closely at the visualization, you might be able to see the bug in question. It is pretty obvious, once we are able to look at the entire data set in one go. The more complex the algorithm, the harder it is to conceptualize it. But in most cases, if we could but look at the data in a way that would make sense to us, we will be able to immediately see what we did wrong.

Production postmortemThe case of the intransigent new database

time to read 3 min | 480 words

A customer called us to tell that they had a problem with RavenDB. As part of their process for handling new customers, they would create a new database, setup indexes, and then direct all the queries for that customer to that database.

Unfortunately, this system that has worked so well in development died a horrible death in production. But, and this was strange, only for new customers, and only in the create new customer stage. The problem was:

  • The user would create a new database in RavenDB. This just create a db record, and its location on disk. It doesn’t actually initialize a database.
  • On the first request, we initialize the db, creating it if needed. The first request will wait until this happens, then proceed.
  • On their production systems, that first request (which they used to create the indexes they require) would time out with an error.

Somehow, the creation of a new database would take way too long.

The first thought we had was they are creating the database on a path of an already existing database, maybe a big one that had a long initialization period, or maybe one that required recovery. But the customer validated that they were creating the database on an empty folder.

We looked at the logs, and the logs just showed a bunch of time were there was no activity. In fact, we had a single method call to open the database that took over 15 seconds to run. Except that on a new database, this method just create a bunch of files to start things out and is ready really quickly.

That is the point that led us to suspect that the issue was environmental. Luckily, as the result of many such calls, RavenDB comes with a pretty basic I/O Test tool. I asked the customer to run this on their production system, and I got the following:image

And now everything was clear. They were running on an I/O constrained system (a cloud machine), and they were running into an interesting problem. When RavenDB creates a database, it pre-allocate some files for its transactional journal.

Those files are 64MB in size, and the total write for a new Esent RavenDB database with default configuration is just over 65MB. If your write throughput is less than 1MB/sec sustained, that will be problematic.

I let the customer know about the configuration option to take less space at startup (Esent RavenDB databases can go as low as 5MB, Voron RavenDB starts at 256Kb), but I also gave them a hearty recommendation to make sure that their I/O rates improved, because this isn’t going to be the only case where slow I/O will kill them.

Production postmortemThe case of the infected cluster

time to read 8 min | 1592 words

RavenDB is a critical system for many of our clients, that means that we get our share of “the sky is falling” stories, and we have a team of people we train to be able to diagnose and troubleshoot issues in production. This series of post is about the kind of issues that we run into, how they were resolved, and what the underlying issue was.

The scenario, a customer had a total system meltdown because RavenDB would stop responding to requests. Restarting the server would result in everything working, but in about 10 minutes, they would run into the same issue.

Complicating matters was the fact that the customer was running in a high availability cluster, with two quite powerful nodes that had failover setup. However, the problem would typically impact a single server, eventually leading to it stopping processing requests, while the second server would be humming along beautifully. When the first server would stop, the second server would continue for a few more minutes, then die in the same manner.

The issue was clearly related to the load on those server. The website is pretty busy, with millions of queries to RavenDB per hour, but we had no reason to think that the system was over capacity. Why is that?

  • The system has been running around for over a month on the same configuration, with moderate memory consumption and low CPU usage throughout that time.
  • The last upgrade to the RavenDB servers happened about a week ago, and no new traffic patterns should cause that load.
  • The last RavenDB client upgrade happened that morning, but it was a minor change and should reduce load on the server. But the problem would show up within 10 – 20 minutes, and the client upgrade happened several hours before it started happening.
  • The operations team swore that there were no changes made to either server recently.

The behavior was puzzling because we didn’t have high CPU, high memory or paging. Typical indications that the system is under too much of a load. Furthermore, I/O rates were quite good, and testing the disk speed and throughput showed that everything is all right.

We tried to revert all recent changes, to both clients and servers. We monitored the servers to see what happens when the problem recurred, and eventually we figured out what is the probable issue. The affected server would slowly show a growth pattern in the number of threads being used. Slowly going to upward of 700 threads. The non impacted server would sit on around 100 threads or so, until the first server would fail, in which case all load would hit the okay server, and it would be fine, for a while, then the thread count would start climbing up, the request timing would increase, and eventually things would stop.

One thing that was very interesting was that the problem was that RavenDB would still be responding (albeit a bit badly) under those conditions, and there was no indication that there are any issues with it as far as it was concerned. But under this failure condition we did see that we start getting a rather strange error.

The server is too busy, could not acquire transactional access

This rather poorly worded message is an indication that we hit an internal limit inside RavenDB. By default, we have a limit of 512 concurrent requests inside the server. That is done to make sure that we aren’t swamping the server with too many requests that would take it down. The good news was that this is a configurable value, and we were able to set Raven/MaxConcurrentServerRequests to a higher value to get a little bit more breathing space. Once we have done that, the system would still become slower, but it would stabilize at some point, even though it would still be much slower than what we come to expect of it.

Intermediary conclusions were:

  1. Something is causing requests to take a lot longer than they should.
  2. Because they take longer than they should, we see a convoy of requests.
    • We can’t handle requests as fast as they come, so they start to pile up.
    • The thread pool (which dispatches requests) is handling this by slowly increasing the number of threads.
    • But we still aren’t able to catch up with the number of requests coming in.
    • At some point, the number of concurrent requests hits the limit, and we start rejecting some requests
  3. We would see major slow downs because the thread pool would increase the number of threads in the pool only on a slow and measured pace.
  4. We don’t have any idea what is causing us to slow down.

Remember, this is a system that has been in production for a while now. In fact, the current traffic numbers are actually only about one third of the peek that was hit shortly after the system was released, and we handled that load with no issue.

We came back to the drawing board, and started look at pretty much anything suspicious. Memory was reasonable, and we have few dozens GB still free. CPU utilization was moderate, with over half of the system capacity unused. I/O rates from the disk according to performance monitor was excellent. Everything was good. So we took a dump of the process and proceeded to try to figure out what is going on from that.

We saw something like this (spread over 150+ threads and in various modes, but this gets the point):

image

95% of the threads were stuck in various I/O related methods. Like NtCreateFile, NtCloseFile, etc. Other threads were hanging in .NET related code that was also doing file I/O. So we can pretty conclusively state that the problem is I/O related.

But what? We tested the disk I/O, it was good. We run I/O tests and they were fine. The measurements we got from the monitoring tools all said that the I/O is fine.

And yet we could see that I/O was the culprit. What the hell?!

At this point, we start suspecting that the issue was with something in the system. Now, each of the servers was capable to handling the entire load of the server (and then some) on its own. But both machine exhibit the same signs. That led us to believe that this isn’t some sort of hardware issue. It is something that both machine had.

A common trouble maker in this area is any kind of tool that hooks into sys calls. For example, anti virus software. In fact, that was one of the very first questions that we asked. Does this machine have anti virus on it? We got a negative answer early on. But after seeing those results, we went in and double checked. The machine did have an anti virus installed, and that is a huge red flag in those circumstances.

Anti virus software like to hook into all kind of system calls, and it is commonly associated with problems because of that. We asked the operations team to disable the anti virus on those servers to see if it would have an impact on the issue. Indeed, they stopped the anti virus and restarted the RavenDB servers, and then we waited. 10 minutes later, we saw no increase in the number of threads, and 25 minutes later, we were holding steady on the same performance that we came to expect.

Today, the customer sent us their monitoring system reports and there is a huge drop from 30% – 50% CPU usage to single digits on the day in question, and then it is flat graph for all the rest of the week.

Our hypothesis is that the anti virus either auto updated itself, or was actually updated by ops as part of routine maintenance task, and for some reason it has taken a liking to the kind of I/O RavenDB was generating. So it inspected us. And as it did so, it added overhead for each and every I/O call we made. And considering that we are a database, I/O is kinda of important for us. That additional overhead would cause us to respond much more slowly, which would mean that we would have additional request queue up, all the way until we start rejecting them. And we didn’t see those at the I/O metrics for the system because the overhead happen after the I/O is already completed.

I also suspect that this runs for all I/O operations, including things that should be buffered (therefor fast), so it has a big impact on our assumptions (this I/O call should be buffered, so it is fast, but it is checked by the AV, so it is not, leading to the convoy).

Because the Anti Virus was installed on both servers, they were both impacted, and the more load we put on them, the more the AV had to do, the more it slowed us down, the more requests we had queued, the bigger the trouble.

In other words, be very aware what is actually going on your system. As part of the post mortem analysis, we realized that we should have probably run fltmc.exe tool to check what are the file system filters that are installed and probably could have figured out that this is an Anti Virus issue much faster.

Timeouts, TCP and streaming operations

time to read 3 min | 499 words

We got a bug report in the RavenDB mailing list that was interesting to figure out.  The code in question was:

foreach(var product in GetAllProducts(session)) // GetAllProducts is implemented using streaming
{
  ++i;
  if (i > 1000)
  {
    i = 0;
    Thread.Sleep(1000);
  }
}

This code would cause a timeout error to occur after a while. The question is why? We can assume that this code is running in a console application, and it can take as long as it wants to process things.

And the server is not impacted from what the client is doing, so why do we have a timeout error here? The quick answer is that we are filling in the buffers.

GetAllProducts is using the RavenDB streaming API, which push the results of the query to the client as soon as we have anything. That lets us parallelize work on both server and client, and avoid having to hold everything in memory.

However, if the client isn’t processing things fast enough, we run into an interesting problem. The server is sending the data to the client over TCP. The client machine will get the results, buffer them and send them to the client. The client will read them from the TCP buffers, then do some work (in this case, just sleeping). Because the rate in which the client is processing items is much smaller than the rate in which we are sending them, the TCP buffers become full.

At this point, the client machine is going to start dropping TCP packets. It doesn’t have any more room to put the data in, and the server will send it again, anyway. And that is what the server is doing, assuming that we have a packet loss over the network. However, that will only hold up for a while, because if the client isn’t going to recover quickly, the server will decide that it is down, and close the TCP connection.

At this point, there isn’t any more data from the server, so the client will catch up with the buffered data, and then wait for the server to send more data. That isn’t going to happen, because the server already consider the connection lost. And eventually the client will time out with an error.

A streaming operation require us to process the results quickly enough to not jam the network.

RavenDB also have the notion of subscriptions. With those, we require explicit client confirmation from the client to send the next batch, so a a slow client isn’t going to cause issues.

Taking full dumps for big IIS apps

time to read 2 min | 313 words

If your application is running on IIS, you are getting quite a lot for free. To start with, monitoring and management tools are right there out of the box. You are also getting some… other effects.

In particular, we had RavenDB running inside IIS that exhibit a set of performance problems in a couple of nodes (and just on those nodes). We suspected that this might be related to memory usage, and we wanted to take a full process dump so we can analyze this offline.

Whenever we tried doing that, however, the process would just restart. The problem was that to reproduce this we had to wait for a particular load pattern to happen after the database was live for about 24 hours. So taking the dump at the right time was crucial. Initially we thought we used the wrong command, or something like that. The maddening this was, when we tried it on the same machine, using the same command, without the performance issue present, it just worked (and told us nothing).

Eventually we figured out that the problem was in IIS. Or, to be rather more exact, IIS was doing its job.

When the performance problem happened, the memory usage was high. We then needed to take a full process dump, which meant that we had to write a lot. IIS didn’t hear from the worker process during that time (since it was currently being dumped), and it killed it, creating a new one.

The solution was to ask IIS to not do that, the configuration is available in the advanced settings for application pool. Note that just changing that would force IIS to restart the process, which was another major annoyance.

image

Lambda methods and implicit context

time to read 3 min | 524 words

The C# compiler is lazy, which is usually a very good thing, but that can also give you some issues. We recently tracked down a memory usage issue to code that looked roughly like this.

var stats = new PerformatStats
{
    Size = largeData.Length
};
stats.OnCompletion += () => this.RecordCompletion(stats);

Write(stats, o =>
{
    var sp = new Stopwatch();
    foreach (var item in largeData)
    {
        sp.Restart();
        // do something with this
        stats.RecordOperationDuration(sp);
    }
});

On the surface, this looks good. We are only using largeData for a short while, right?

But behind the scene, something evil lurks. Here is what this actually is translated to by the compiler:

__DisplayClass3 cDisplayClass3_1 = new __DisplayClass3
{
    __this = this,
    largeData = largeData
};
cDisplayClass3_1.stats = new PerformatStats { Size = cDisplayClass3_1.largeData.Length };

cDisplayClass3_1.stats.OnCompletion += new Action(cDisplayClass3_1, __LambdaMethod1__);

Write(cDisplayClass3_1.stats, new Action(cDisplayClass3_1, __LambdaMethod2__));

You need to pay special attention to what is going on. We need to maintain the local state of the variables. So the compiler lift the local parameters into an object. (Called __DisplayClass3).

Creating spurious objects is something that we want to avoid, so the C# compiler says: “Oh, I’ve two lambdas in this call that need to get access to the local variables. Instead of creating two objects, I can create just a single one, and share it among both calls, thereby saving some space”.

Unfortunately for us, there is a slight issue here. The lifetime of the stats object is pretty long (we use it to report stats). But we also hold a reference to the completion delegate (we use that to report on stuff later on). Because the completion delegate holds the same lifted parameters object, and because that holds the large data object. It means that we ended up holding a lot of stuff in memory far beyond the time they were useful.

The annoying thing is that it was pretty hard to figure out, because we were looking at the source code, and the lambda that we know is likely to be long running doesn’t look like it is going to hold a referece to the largeData object.

Ouch.

Buffer Managers, production code and alternative implementations

time to read 2 min | 329 words

We are porting RavenDB to Linux, and as such, we run into a lot of… interesting issues. Today we run into a really annoying one.

We make use of the BufferManager class inside RavenDB to reduce memory allocations. On the .Net side of things, everything works just fine, and we never really had any issues with it.

On the Mono side of things, we started getting all sort of weird errors. From ArgumentOutOfRangeException to NullReferenceException to just plain weird stuff. That was the time to dig in and look into what is going on.

On the .NET side of things, BufferManager implementation is based on a selection criteria between large (more than 85Kb) and small buffers. For large buffers, there is a single large pool that is shared among all the users of the pool. For small buffers, the BufferManager uses a pool per active thread as well as a global pool, etc. In fact, looking at the code we see that it is really nice, and a lot of effort has been made to harden it and make it work nicely for many scenarios.

The Mono implementation, on the other hand, decides to blithely discard the API contract by ignoring the maximum buffer pool size. It seems because “no user code is designed to cope with this”. Considering the fact that RavenDB is certainly dealing with that, I’m somewhat insulted, but it seems par the course for Linux, where “memory is infinite until we kill you”* is the way to go.

But what is far worse is that this class is absolutely not thread safe. That was a lot of fun to discover. Considering that this piece of code is pretty central for the entire WCF stack, I’m not really sure how that worked. We ended up writing our own BufferManager impl for Mono, to avoid those issues.

* Yes, somewhat bitter here, I’ll admit. The next post will discuss this in detail.

Long running async and memory fragmentation

time to read 2 min | 259 words

We are working on performance a lot lately, but performance isn’t just an issue of how fast you can do something, it is also an issue of how many resources we use while doing that. One of the things we noticed was that we are using more memory than we would like to, and even after we were freeing the memory we were using. Digging into the memory usage, we found that the problem was that we were suffering from fragmentation inside the managed heap.

More to the point, this isn’t a large object heap fragmentation, but actually fragmentation in the standard heap. The underlying reason is that we are issuing a lot of outstanding async I/O requests, especially to serve things like the Changes() API, wait for incoming HTTP requests, etc.

Here is what this looks like inside dotProfiler.

image

As you can see, we are actually using almost no memory, but heap fragmentation is killing us in terms of memory usage.

Looking deeper, we see:

image

We suspect that the issue is that we have pinned instances that we sent to async I/O, and that match what we have found elsewhere about this issue, but we aren’t really sure how to deal with it.

Ideas are more than welcome.

This exception never happened

time to read 1 min | 86 words

I’m not sure what is going on, but it is strange.

image

This seems to be related to having multiple catch statements with the same exception name, but I can’t reproduce this in isolation.

And it seems that other people have run into it as well: http://stackoverflow.com/questions/5650868/net-exception-caught-is-unexpectedly-null

Worked around it by changing the parameter name… but still, really strange.

FUTURE POSTS

  1. Production postmortem: The case of the memory eater and high load - about one day from now
  2. Production postmortem: The case of the lying configuration file - 3 days from now
  3. Production postmortem: The industry at large - 4 days from now
  4. The insidious cost of allocations - 5 days from now
  5. Find the bug: The concurrent memory buster - 6 days from now

And 4 more posts are pending...

There are posts all the way to Sep 10, 2015

RECENT SERIES

  1. Find the bug (5):
    20 Apr 2011 - Why do I get a Null Reference Exception?
  2. Production postmortem (10):
    14 Aug 2015 - The case of the man in the middle
  3. What is new in RavenDB 3.5 (7):
    12 Aug 2015 - Monitoring support
  4. Career planning (6):
    24 Jul 2015 - The immortal choices aren't
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats