Ayende @ Rahien


Production postmortem: The case of the native memory leak


This one is a pretty recent one. A customer complained about high memory usage in RavenDB under moderate usage. That was a cause for concern, since we care a lot about our memory utilization.

So we started investigating, and it turned out that we were wrong: the problem wasn’t with RavenDB, it was with the RavenDB Client Library. The customer had a scenario where, 100% of the time, after issuing a small number of requests (fewer than ten), the client process would be using hundreds of MB for no apparent purpose. The customer had already turned off caching, profiling and pretty much anything else that either they or we could think of.

We got a process dump from them and looked at that, and everything seemed to be fine. The size of the heap was good, and there didn’t appear to be any memory being leaked. Our assumption at that point was that there was some sort of native memory leak in their application.

To continue the investigation further, NDAs were required, but we managed to get through that, and we finally had a small repro that we could look at ourselves. The fact that the customer was able to create such a thing is really appreciated, because very often we have to work with a lot of missing information. Of course, when we ran this on our own systems, everything was just fine & dandy. There was no issue. We got back to the customer, and they told us that the problem would only reproduce in IIS.

And indeed, once we tested inside IIS (vs. using IIS Express), the problem was immediately obvious. That was encouraging (at least from my perspective). RavenDB doesn’t contain anything that would differentiate between IIS and IIS Express. So if the problem was only in IIS, it is not likely to be something that we did.

Once I had a repro of the problem, I sat down to observe what was actually going on. The customer was reading 1,024 documents from the server, and then sending them to the browser. That meant that each of the requests we tested was roughly 5MB in size. Note that this means that we have to:

  • Read the response from the server
  • Decompress the data
  • Materialize it to objects
  • Serialize the data back to send it over the network

That is a lot of memory being used here. And we had never run into any such issues before.

I had a hunch, and I created the following:

[RoutePrefix("api/gc")]
public class GcController : ApiController
{
    [Route]
    public HttpResponseMessage Get()
    {
        GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect(2);
        return Request.CreateResponse(HttpStatusCode.OK);
    }
}

This allows me to manually invoke a rude GC, and indeed, when running this, memory utilization dropped quite nicely. Of course, that isn’t something that you want to run in your production systems, but it is a very important diagnostic tool.

Next, I tried to do the following:

[Route]
public HttpResponseMessage Get()
{
    // Read a ~5MB JSON document and materialize it into objects, mimicking the
    // memory pressure of the customer's requests (no RavenDB code involved).
    var file = File.ReadAllText(@"D:\5mb.json");
    var deserializeObject = JsonConvert.DeserializeObject<Item>(file);

    // Serialize the data back out to the client, just like the customer's endpoint did.
    return this.Request.CreateResponse(HttpStatusCode.OK, new { deserializeObject.item, deserializeObject.children });
}

This involves no RavenDB code, but it has roughly the same memory pressure semantics as the customer code. Indeed, the problem reproduced quite nicely there as well.

So the issue is about a request that uses a lot of memory (including at least one big buffer), likely causing fragmentation in the heap (most likely the Large Object Heap, given the big buffers) that keeps memory utilization high. When the system has had enough, it will reclaim all of that, but unless there is active memory pressure, I’m guessing that the GC would rather leave things as they are than pay that price.
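
For the curious, here is a small standalone sketch (not from the customer’s code) of the mechanism: buffers over roughly 85,000 bytes land on the Large Object Heap, which is normally not compacted, so freeing some of them leaves holes that keep process memory high until a CompactOnce collection is requested. The sizes and counts are arbitrary, and how much the private bytes actually drop depends on the GC handing segments back to the OS.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Runtime;

class LohFragmentationDemo
{
    static long PrivateBytes()
    {
        var p = Process.GetCurrentProcess();
        p.Refresh();
        return p.PrivateMemorySize64;
    }

    static void Main()
    {
        // Buffers over ~85,000 bytes go to the Large Object Heap.
        var buffers = new List<byte[]>();
        for (int i = 0; i < 2000; i++)
            buffers.Add(new byte[100 * 1024]);

        // Drop every other buffer, leaving holes behind on the (normally non-compacted) LOH.
        for (int i = 0; i < buffers.Count; i += 2)
            buffers[i] = null;

        GC.Collect(2);
        Console.WriteLine("After normal full GC:     {0:N0} private bytes", PrivateBytes());

        // The "rude" GC from the controller above: compact the LOH once, then collect.
        GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect(2);
        Console.WriteLine("After compacting full GC: {0:N0} private bytes", PrivateBytes());

        GC.KeepAlive(buffers);
    }
}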

Production postmortem: The case of the intransigent new database


A customer called us to say that they had a problem with RavenDB. As part of their process for handling new customers, they would create a new database, set up indexes, and then direct all the queries for that customer to that database.

Unfortunately, this system that had worked so well in development died a horrible death in production. But, and this was strange, only for new customers, and only in the “create new customer” stage. The problem was:

  • The user would create a new database in RavenDB. This just creates a db record and its location on disk. It doesn’t actually initialize the database.
  • On the first request, we initialize the db, creating it if needed. The first request will wait until this happens, then proceed.
  • On their production systems, that first request (which they used to create the indexes they require) would time out with an error.

Somehow, the creation of a new database would take way too long.

The first thought we had was that they were creating the database on the path of an already existing database, maybe a big one with a long initialization period, or maybe one that required recovery. But the customer validated that they were creating the database in an empty folder.

We looked at the logs, and they just showed a stretch of time where there was no activity. In fact, we had a single method call to open the database that took over 15 seconds to run. Except that on a new database, this method just creates a bunch of files to start things out and is ready really quickly.

That is the point that led us to suspect that the issue was environmental. Luckily, as a result of many such calls, RavenDB comes with a pretty basic I/O Test tool. I asked the customer to run this on their production system, and I got the following:

[image: the I/O test results from the customer’s production machine]

And now everything was clear. They were running on an I/O constrained system (a cloud machine), and they were running into an interesting problem. When RavenDB creates a database, it pre-allocates some files for its transactional journal.

Those files are 64MB in size, and the total write for a new Esent RavenDB database with default configuration is just over 65MB. If your sustained write throughput is less than 1MB/sec, that is a problem: just pre-allocating the journals takes over a minute, which is more than enough to time out that first request.
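
If you want a quick sanity check of this kind of number without RavenDB’s built-in I/O test, something along these lines gives a rough figure for sustained sequential write throughput. This is only a sketch: the temp path, file size, and buffer size are arbitrary, and a real benchmark would do a lot more.

using System;
using System.Diagnostics;
using System.IO;

class WriteThroughputCheck
{
    static void Main()
    {
        const long totalBytes = 64L * 1024 * 1024;   // roughly one journal file
        var buffer = new byte[4 * 1024];
        new Random().NextBytes(buffer);

        var path = Path.Combine(Path.GetTempPath(), "io-test.tmp");
        var sw = Stopwatch.StartNew();

        // WriteThrough asks the OS to push writes to the device instead of just caching them.
        using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write,
                                       FileShare.None, buffer.Length, FileOptions.WriteThrough))
        {
            for (long written = 0; written < totalBytes; written += buffer.Length)
                fs.Write(buffer, 0, buffer.Length);
            fs.Flush(true);
        }

        sw.Stop();
        File.Delete(path);
        Console.WriteLine("{0:N1} MB/sec sustained", totalBytes / (1024.0 * 1024.0) / sw.Elapsed.TotalSeconds);
    }
}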

I let the customer know about the configuration option to take less space at startup (Esent RavenDB databases can go as low as 5MB, Voron RavenDB starts at 256KB), but I also gave them a hearty recommendation to make sure that their I/O rates improved, because this isn’t going to be the only case where slow I/O will kill them.

What is new in RavenDB 3.5: Exploring data in the dark


One of the things that tends to happen a lot when we are developing with a database is that we need to peek at the data, and a lot of the time, looking at it one document at a time isn’t good enough.

We noticed that a lot of users will create temporary indexes (usually map/reduce ones) to get some idea about what is actually going on in the database, or for one-off reporting. That is pretty inefficient, and in order to handle it, we added the Data Exploration feature.

[image: the Data Exploration page in the RavenDB Studio]

This feature gives you the option of exploring the data in detail. You can even run the query over large data sets, doing some complex aggregations and looking at the results.

Note that this is an admin / developer only feature; we provide no API for this, because the idea is that we have a human sitting in front of the DB going… “Hmm… I wonder about…”

What is new in RavenDB 3.5: My thread pool is smarter


The .NET thread pool is a really amazing piece of technology, and it is suitable for a wide range of usages. RavenDB has been making use of it for almost all of its concurrent work since the very beginning.

In RavenDB 3.5, we have decided to change that. RavenDB has a lot of parallel execution requirements, but most of them have unique characteristics that we can express better with our own thread pool.

To start with, unlike the normal thread pool, we aren’t registering just a delegate and some state for it to execute; we are always registering a list of items to process, and a delegate that takes either a single item from that list or a section of it. This lets us do a much better job at work stealing, because we have a lot more context about the actual operation. We know that when we are done executing a particular delegate, we get to run the same delegate on the next available item in the list it was passed. That gives us higher locality of code, because we keep executing the same task for as long as there are tasks for it in the pool. A rough sketch of that registration model follows.
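
This is not RavenDB’s actual implementation, just an illustrative sketch of the “list of items plus a delegate” idea; all the names are made up, and the real pool reuses long-lived threads and steals work across batches, which this version does not.

using System;
using System.Collections.Generic;
using System.Threading;

// Illustrative only: every worker keeps pulling items from the same batch and
// running the same delegate, which is where the code locality comes from.
public static class BatchPool
{
    public static void ExecuteBatch<T>(IReadOnlyList<T> items, Action<T> process, int workers)
    {
        int next = -1;
        var threads = new List<Thread>();

        for (int i = 0; i < workers; i++)
        {
            var thread = new Thread(() =>
            {
                while (true)
                {
                    // Grab the next unclaimed index in the batch (a poor man's work stealing).
                    int index = Interlocked.Increment(ref next);
                    if (index >= items.Count)
                        return;
                    process(items[index]);
                }
            }) { IsBackground = true };
            threads.Add(thread);
            thread.Start();
        }

        foreach (var thread in threads)
            thread.Join();
    }
}

A caller would do something like BatchPool.ExecuteBatch(documents, IndexDocument, workers: 4), and every worker stays on the same delegate until the batch runs out of unclaimed items.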

We often have nested operations: a parallel task (execute indexing work) that spawns additional parallel work (index the following documents). By basing all of this on our custom thread pool, we can perform those operations in a way that doesn’t involve blocking while waiting for that work to be done. Instead, the thread pool thread that we run on is able to “wait” by executing the work that it is waiting for. We have no blocked threads, and in many cases we can avoid context switches entirely. The sketch below shows the general idea.
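
Again only as a sketch, “waiting by executing the work we are waiting for” amounts to something like this: instead of blocking, the pool thread drains pending items until the nested operation completes. The names here are illustrative, not RavenDB’s actual API.

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

static class CooperativeWait
{
    // Instead of blocking the pool thread, keep pulling queued work items
    // until the nested operation we are "waiting" for has completed.
    public static void WaitByWorking(Task nested, ConcurrentQueue<Action> pendingWork)
    {
        while (!nested.IsCompleted)
        {
            Action work;
            if (pendingWork.TryDequeue(out work))
                work();          // make progress on behalf of the pool
            else
                Thread.Yield();  // nothing to do right now, let someone else run
        }
    }
}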

Under load, that means that threads won’t put a lot of work on the thread pool and then have to fight with each other over who will finish its work first; it means that we get to run our own tasks, and only when there are enough threads available for other work will we spread to additional threads.

Speaking of load, the new thread pool also has a dynamic load balancing feature. Because we know that RavenDB will use this thread pool for background work only, we can prioritize things accordingly. RavenDB tries to keep the CPU usage in the 60% – 80% range by default, and if we detect a higher CPU usage, we’ll start decreasing the background work we are doing, to make sure that we aren’t impacting front row work (like serving requests). We’ll start doing that by changing the priority of the background threads, and eventually just stop processing work in most of the background threads (we always keep a minimum number of threads working, of course). A rough sketch of that throttling loop is shown below.
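
This is not RavenDB’s code, just a sketch of the idea, assuming a total-CPU performance counter and a known set of background threads; the thresholds mirror the 60% – 80% band above, everything else is made up (and the real pool also stops handing work to most threads, rather than only lowering their priority).

using System;
using System.Diagnostics;
using System.Threading;

// Illustrative only: keep background threads from pushing total CPU out of the 60%-80% band.
class BackgroundThrottler
{
    private readonly PerformanceCounter _cpu =
        new PerformanceCounter("Processor", "% Processor Time", "_Total");
    private readonly Thread[] _backgroundThreads;

    public BackgroundThrottler(Thread[] backgroundThreads)
    {
        _backgroundThreads = backgroundThreads;
    }

    public void Run(CancellationToken token)
    {
        _cpu.NextValue(); // the first sample is always zero, prime the counter
        while (!token.IsCancellationRequested)
        {
            Thread.Sleep(TimeSpan.FromSeconds(1));
            float usage = _cpu.NextValue();

            // Too hot: drop background priority so request-serving threads win the CPU.
            // Cool enough: restore priority so indexing and other background work catches up.
            var priority = usage > 80 ? ThreadPriority.Lowest
                         : usage < 60 ? ThreadPriority.Normal
                         : ThreadPriority.BelowNormal;

            foreach (var thread in _backgroundThreads)
                if (thread.IsAlive)
                    thread.Priority = priority;
        }
    }
}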

Another fun thing that the thread pool can do is to detect and handle slowpokes. A common example is an index that is taking a long time to run. Significantly more than all the other indexes. The thread pool can release all the other indexes, and let the calling code know that this particular task has been left to run on its own. RavenDB will then split the indexing work so the slow index will not slow all of the rest of the indexing.

And having split the thread pools between front row work (the standard .NET thread pool) doing request processing and the background pool (which is our own custom implementation), we get a lot more predictability in the environment. We don’t have to worry about indexing jobs taking over the threads required to serve requests, or about requests on the server impacting the loading of a new database, etc.

And finally, like every other feature in RavenDB nowadays, we have a rich set of debug endpoints that can tell us in details exactly what is going on. That is crucial when we are talking about systems that run for months and years or when we are trying to troubleshoot a problematic server.

Production postmortem: The case of the hung over server


This was an interesting support call, about six months ago, if I recall correctly. It started with a customer calling because his database would sometimes stop responding for minutes (up to 30 minutes in some cases), and only when there was a high load on the system.

We started looking at this, and it was really strange. They could reliably reproduce this on their production system, but only there, and only under pretty high load. Simulating the same load on a non production machine would have no impact whatsoever. Tripling the load on a non production machine had no impact either.

We asked the customer to send us the debug logs, and we found an interesting tidbit of information in there. The issue always happened when the server was loading a database. The sequence of events went something like this:

  • Lots of users on the site, using the (multiple) databases on the server.
  • For some reason (lunch, meeting, whatever) there is a period of an hour or so when there is no activity on one part of the system, which meant that there are no requests for a particular database.
  • RavenDB would unload the database and release all resources associated with it.
  • After that period of time, and while the server is still in high load, requests would resume for that database. Usually in a big burst of requests for that database.
  • RavenDB would start loading the database.
  • All requests would start hanging.
  • About 15 – 25 minutes later, RavenDB would start the database.
  • All timing information in the logs indicated that there was no long operation.
  • Everything would return to normal.

To be honest, this looked a lot like someone had suspended the entire process somehow. CPU during this time was effectively zero, memory usage was fine and there was no I/O.

The quick workaround was to increase the database idle time to account for the breaks in activity. With the database no longer loading and unloading under load, there was no issue.

Now it was time to try to figure out what the hell was going on. Going over the logs, we could see that there were some pretty big gaps in the initialization of the database. To be exact, the initialization of the indexes took almost all of the time. But why? Looking at each individual index, it took very little time to initialize, and even all of them together didn’t take a long time. But the time from starting the index initialization until the first index actually started to initialize was very high.

A few years ago we sped up the initialization times for large RavenDB databases by parallelizing the initialization of all indexes. That was the first clue. Something in this parallel initialization wasn’t working.

And there was also the load factor to consider. The problem only surfaced under load, and to be more precise, under load when we had multiple databases involved, but the issue affected the entire server.

When RavenDB gets a request for an unloaded database, it starts loading it in a background thread and the request waits until the database is loaded.

Combine all of those together, and you might start seeing the answer.

RavenDB has an unloaded database, and we suddenly get a high number of requests for that database, all at the same time. RavenDB will start a background task loading the database, and wait until it is loaded to complete the request. However, because we have a lot of requests to this database, we have a lot of threads that are now waiting for it to complete.

Meanwhile, the database is loading in a background thread, and needs to initialize its indexes. It does this in parallel, using the thread pool…

The same thread pool that we serve requests from…

The same thread pool where we have a lot of threads that are currently waiting for the database to load…

The same database that is now waiting for tasks to complete on that busy thread pool…

If you had things going just right, you would get into a situation where the database couldn’t load, because all the thread pool threads it needed to start were busy waiting for the database to load.

Now, the thread pool would slowly increase the number of threads to take care of the amount of work it had. But those threads would usually be kept busy with more requests to the loading database.

Eventually, the index initialization would go through, the database would complete loading, all the database requests would complete, and the system would go back to normal. But in the meantime… ouch!

The temporary workaround, as I mentioned, was increasing the database idle time to prevent this situation from happening. The long term solution was to limit the amount of time that a request would wait for a database to load, as well as the number of requests that could wait for it at all. The first five requests for an unloaded database will wait up to ten seconds for that database to load; any additional requests return with a Service Unavailable code immediately. A sketch of that gate is shown below.
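
This is not the actual RavenDB code, but the “at most five waiters, ten seconds each” policy can be sketched with a semaphore and an asynchronous wait, roughly like this:

using System;
using System.Net;
using System.Threading;
using System.Threading.Tasks;

// Illustrative only: at most 5 requests may wait (up to 10 seconds each) for the
// database load task; everyone else gets 503 Service Unavailable immediately.
class DatabaseLoadGate
{
    private readonly SemaphoreSlim _waitSlots = new SemaphoreSlim(5, 5);

    public async Task<HttpStatusCode> WaitForLoadAsync(Task databaseLoadTask)
    {
        if (!_waitSlots.Wait(0))
            return HttpStatusCode.ServiceUnavailable; // too many requests are already waiting

        try
        {
            var finished = await Task.WhenAny(databaseLoadTask, Task.Delay(TimeSpan.FromSeconds(10)));
            return finished == databaseLoadTask
                ? HttpStatusCode.OK                  // the database loaded in time, let the request proceed
                : HttpStatusCode.ServiceUnavailable; // still loading after ten seconds, give up
        }
        finally
        {
            _waitSlots.Release();
        }
    }
}

Because the wait here is asynchronous, a request waiting for the database no longer holds a thread pool thread hostage, which is exactly what made the original deadlock possible.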

We also made sure that waiting for a database to load can be done in an asynchronous manner and in 3.5 we have a dedicated thread pool for internal RavenDB tasks. But I’ll have another post about that.

What is new in RavenDB 3.5: Smuggling data across servers


This feature is meant primarily for users that work with multiple instances of RavenDB. The common scenario is production/staging/development and needing to move data around between them. Previously, you would have to move data manually using import and export, or set up a bunch of scripts to run the smuggler with the right commands.

In RavenDB 3.5, we have turned that into a full fledged feature.

[image: the server smuggling configuration page in the RavenDB Studio]

The nice thing about this is that you can just set it up the way you want, with the right databases and configuration that you need to send. By default we use the incremental option, so you can run this over time and only get the new stuff.

The idea is that you can use this to move the data from one environment to the next. You can also click on the Show JSON button and get the data you need to script this (across all dbs) in an automated (and possibly scheduled) manner.

Production postmortem: The case of the infected cluster


RavenDB is a critical system for many of our clients, which means that we get our share of “the sky is falling” stories, and we have a team of people we train to diagnose and troubleshoot issues in production. This series of posts is about the kind of issues that we run into, how they were resolved, and what the underlying cause was.

The scenario: a customer had a total system meltdown because RavenDB would stop responding to requests. Restarting the server would get everything working again, but within about 10 minutes they would run into the same issue.

Complicating matters was the fact that the customer was running a high availability cluster, with two quite powerful nodes and failover set up. However, the problem would typically impact a single server, eventually causing it to stop processing requests, while the second server would be humming along beautifully. When the first server stopped, the second server would continue for a few more minutes, then die in the same manner.

The issue was clearly related to the load on those servers. The website is pretty busy, with millions of queries to RavenDB per hour, but we had no reason to think that the system was over capacity. Why is that?

  • The system had been running for over a month on the same configuration, with moderate memory consumption and low CPU usage throughout that time.
  • The last upgrade to the RavenDB servers happened about a week ago, and no new traffic patterns should cause that load.
  • The last RavenDB client upgrade happened that morning, but it was a minor change and should reduce load on the server. But the problem would show up within 10 – 20 minutes, and the client upgrade happened several hours before it started happening.
  • The operations team swore that there were no changes made to either server recently.

The behavior was puzzling because we didn’t have high CPU, high memory, or paging, the typical indications that the system is under too much load. Furthermore, I/O rates were quite good, and testing the disk speed and throughput showed that everything was all right.

We tried to revert all recent changes, to both clients and servers. We monitored the servers to see what happened when the problem recurred, and eventually we figured out the probable issue. The affected server would slowly show a growth pattern in the number of threads being used, climbing upward of 700 threads. The non-impacted server would sit at around 100 threads or so, until the first server failed, at which point all the load would hit the healthy server. It would be fine for a while, then its thread count would start climbing, request timings would increase, and eventually things would stop.

One thing that was very interesting was that RavenDB would still be responding (albeit a bit badly) under those conditions, and as far as it was concerned there was no indication of any issue. But under this failure condition we did see that we started getting a rather strange error.

The server is too busy, could not acquire transactional access

This rather poorly worded message is an indication that we hit an internal limit inside RavenDB. By default, we have a limit of 512 concurrent requests inside the server. That is done to make sure that we aren’t swamping the server with too many requests that would take it down. The good news was that this is a configurable value, and we were able to set Raven/MaxConcurrentServerRequests to a higher value to get a little bit more breathing space (see the configuration snippet below). Once we had done that, the system would still become slower, but it would stabilize at some point, even though it would still be much slower than what we had come to expect of it.
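
For reference, RavenDB 3.x settings of this kind normally go into the appSettings section of the server configuration file (Raven.Server.exe.config, or web.config when hosted in IIS); the value below is just an example, not a recommendation:

<appSettings>
  <!-- default is 512; raising it only buys breathing room, it does not fix the underlying slowdown -->
  <add key="Raven/MaxConcurrentServerRequests" value="1024" />
</appSettings>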

Intermediary conclusions were:

  1. Something is causing requests to take a lot longer than they should.
  2. Because they take longer than they should, we see a convoy of requests.
    • We can’t handle requests as fast as they come, so they start to pile up.
    • The thread pool (which dispatches requests) is handling this by slowly increasing the number of threads.
    • But we still aren’t able to catch up with the number of requests coming in.
    • At some point, the number of concurrent requests hits the limit, and we start rejecting some requests.
  3. We would see major slowdowns because the thread pool increases the number of threads in the pool only at a slow and measured pace.
  4. We don’t have any idea what is causing us to slow down.

Remember, this is a system that has been in production for a while now. In fact, the current traffic numbers are actually only about one third of the peak that was hit shortly after the system was released, and we handled that load with no issue.

We came back to the drawing board, and started looking at pretty much anything suspicious. Memory was reasonable, with a few dozen GB still free. CPU utilization was moderate, with over half of the system capacity unused. I/O rates from the disk, according to Performance Monitor, were excellent. Everything was good. So we took a dump of the process and proceeded to try to figure out what was going on from that.

We saw something like this (spread over 150+ threads and in various modes, but this gets the point):

[image: stack traces from the process dump, with most threads stuck inside file I/O calls]

95% of the threads were stuck in various I/O related methods. Like NtCreateFile, NtCloseFile, etc. Other threads were hanging in .NET related code that was also doing file I/O. So we can pretty conclusively state that the problem is I/O related.

But what? We tested the disk I/O, it was good. We ran I/O tests and they were fine. The measurements we got from the monitoring tools all said that the I/O was fine.

And yet we could see that I/O was the culprit. What the hell?!

At this point, we started suspecting that the issue was with something else in the system. Now, each of the servers was capable of handling the entire load (and then some) on its own. But both machines exhibited the same signs. That led us to believe that this wasn’t some sort of hardware issue; it was something that both machines had.

A common trouble maker in this area is any kind of tool that hooks into sys calls. For example, anti virus software. In fact, that was one of the very first questions that we asked. Does this machine have anti virus on it? We got a negative answer early on. But after seeing those results, we went in and double checked. The machine did have an anti virus installed, and that is a huge red flag in those circumstances.

Anti virus software likes to hook into all kinds of system calls, and it is commonly associated with problems because of that. We asked the operations team to disable the anti virus on those servers to see if it would have an impact on the issue. Indeed, they stopped the anti virus and restarted the RavenDB servers, and then we waited. 10 minutes later, we saw no increase in the number of threads, and 25 minutes later, we were holding steady at the same performance that we had come to expect.

Today, the customer sent us their monitoring system reports, and there is a huge drop from 30% – 50% CPU usage to single digits on the day in question, followed by a flat graph for the rest of the week.

Our hypothesis is that the anti virus either auto updated itself, or was actually updated by ops as part of a routine maintenance task, and for some reason it took a liking to the kind of I/O RavenDB was generating. So it inspected us. And as it did so, it added overhead to each and every I/O call we made. Considering that we are a database, I/O is kind of important for us. That additional overhead caused us to respond much more slowly, which meant that additional requests queued up, all the way until we started rejecting them. And we didn’t see this in the I/O metrics for the system because the overhead happens after the I/O itself has already completed.

I also suspect that this runs for all I/O operations, including things that should be buffered (and therefore fast), so it has a big impact on our assumptions (this I/O call should be buffered, so it is fast, but it is checked by the AV, so it is not, leading to the convoy).

Because the Anti Virus was installed on both servers, they were both impacted, and the more load we put on them, the more the AV had to do, the more it slowed us down, the more requests we had queued, the bigger the trouble.

In other words, be very aware of what is actually going on in your system. As part of the post mortem analysis, we realized that we should probably have run the fltmc.exe tool to check which file system filters are installed; that probably would have pointed us at the Anti Virus issue much faster.

What is new in RavenDB 3.5: Intro


We released RavenDB 3.0 about 10 months ago; since then we have had a few interim releases with minor fixes and features, but now we are gearing up for the RavenDB 3.5 release, and there are quite a few things there that we are really excited about.

I’ll have detailed posts about each of the major new things that we are doing there, but I wanted to take this post to talk about the general direction that we are going toward in 3.5.

In 3.5 we did a lot of under-the-hood work: cleanup, performance work, and many optimizations. We dealt primarily with smoothing things over, exposing more internal behavior and configurability, and in general making sure that you have smooth sailing. For example, we have much more predictable performance due to much better control over threading, fine grained control over replication, additional endpoints that expose the internal state of the server for operations, and a big surprise for the operations team in general.

We have another two weeks before the code freeze for 3.5, and I wanted to ask what kind of features you would still want to slip into this version, before I show what new stuff we have.
