Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 18 | Comments: 72

filter by tags archive

Production postmortemThe case of the lying configuration file

time to read 4 min | 609 words

While studying an issue using customer data, I noticed that indexing speed wasn’t up to what I expected it to be. In fact, the size of the indexing batch remained roughly constant (and small), and didn’t exhibit the usual increases as RavenDB notices that the server has a lot of work to do and the resources to do it. This was while investigating something else, but since I had to re-index that database quite a few time, I decided to investigate what was going on.

The underlying issue turned out to be a configuration setup. An index was specified with a MaxNumberOfOutputsPerDocument of 55. We use this value for a few things, among them to ensure to manage the memory resulting from indexing operations. In particular, we have had some issues with indexes that output a large number of index entries per documents using more then the quota allocated to the index and generating (sometime severe) memory pressure.

Unfortunately, in this case, we had the other option. The index was configured properly, but the index didn’t actually output multiple entries. So we ended up assuming that the index would generate a lot more memory than it would actually really use. That meant that we couldn’t feed it larger batches, because we feared it would use too much memory…

The fix was to make effectively ignore this value. Instead of using the value assuming that the user knows what is going it, we’ll use this value as the maximum value only, and use heuristics to figure out how much memory we should reserve for the index in question.

This is a smaller example of a wider issue. The values that the system gets are user input. And they should be treated as such. This means that you need to validate them in the same sense you would validate any other input from users.

In the case outlined above, the only implication was that we would index a more slowly, and weren’t able to take full advantage of the machine resources we had available. There have been other such issues.

An administrator has setup a set of range values so the min value was larger than the max value. This turned out to cause us to have no effective limit, short of an integer overflow. The end result was that we would use unbounded memory for our needs, which would work, most of the time, except when you had a big set of changes to apply and we would try to do it all at once…

Another case was an ops guy that wanted to reduce RavenDB CPU usage, so he set the number of threads available for background work to 0. That meant that work that was queued on background threads would never complete, and the system would effectively hang.

You get the drift, I assume.

Your configuration file (or however you are actually configuring things) is yet another way for your user to communicate with your application. Sure, the tone is much stricter (commanding, rather than asking), but you still need to make sure that what the user is asking you to do actually make sense. And if it doesn’t, you need to decide what to do about it.

  • You can exit the process. The “I can’t work with you people” mode of error handling.
  • You can emit a warning and proceed. The “I told you so, buster!” motto of blame shifting.
  • You can ignore the set value. The “I know better than you” school of thought.

There aren’t really good choices, especially for server applications that can’t just show an error to the user.

Production postmortemThe case of the memory eater and high load

time to read 9 min | 1628 words

This is a recent case. One of our customers complained that every now and then they started to see very high memory utilization, escalating quickly until the system would bog down and die. They were able to deploy a mitigation strategy of a sort, when they detected this behavioral pattern, they would force RavenDB to reject client requests for a short while, which would fix this issue.

imageThis went on for  a while, because the behavior was utterly random. It didn’t seem to relate to load, peek usage time on the system didn’t correlate to this in any way. Eventually the customer raised another issue, that a certain spatial query was behaving very slowly in the logs.

We tested that, and we found that the customer was correct. More properly, the query executed just fine when run independently. But when we run this query tens or hundreds of times concurrently, we will see very high response times (and getting worse), and we would see the server memory just blowing up very quickly. So we have a memory leak, we figured out, let us see what is going on… We dumped the data, and tried to figure out what it was exactly that we were leaking.

But there wasn’t anything there!

In fact, looking at the problem, it became even curiouser.  Take a look at what we saw during one of the test runs:

image

Note that this is all running with no other work, just a lot of queries hitting the server.

Somehow, we had a lot of data going into the Gen2 heap. But when we checked the Gen2, it was pretty much empty. In fact, we had a 100% fragmentation. Something was very strange here. We enabled memory allocation tracking and started to look into what was going on. We found this very suspicious (note that this is from a different run from the one above):

image

So FileStream.Read is allocating GBs over GBs of memory? What is going on?! It took a while to figure out what was going on. The underlying issue was within Lucene. Actually, an intersection of a few things inside Lucene.

Here is how Lucene reads from a file on disk:

image

What you’ll notice is that Lucene is holding a lock on the file, and then issuing I/O. In other words, it is going to hold that lock for a while. This is a very strange thing to do, why is Lucene doing it?

It does this because of a strange decision on how to do concurrent I/O.

image

Basically, whenever Lucene needs to do concurrent I/O, it will clone the relevant input object, and then use it concurrently. The idea, I gather, is that Lucene didn’t want to have a separate file handle for each multi threaded operation, instead it created one file handle, and used it concurrently. Since concurrent I/O takes careful usage, they slapped a lock on it and call it a day. I’m being unfair, I know, this is explicitly called out in the docs:

image

And in the Java version, there is an NIOFSDirectory that is presumably much better. Such doesn’t exist in the Lucene.Net version we are using. In fact, I was curious and I checked the upcoming version, they do have a NIOFSDirectory implementation, which had the following code in it:

image

This is a single global lock for everything. Thank you, I’ll take the lock per file. Now, to be fair again, work in progress, etc.

We noticed this issue a long while ago, and we solved it by using multiple FileStreams. It used more resources, but it meant that we were far more concurrent. Note that all of this actually happened years ago, and we had no problems in this area. Note that Linux program typically worry a lot more about the number of open file handles than Windows programs do.

The problem was definitely related to the use of multiple FileStream. But it didn’t have anything to do with holding multiple handles to the same file. Instead, the issue was in the usage pattern that the query exhibited.

In particular, and I’m going to get deep into Lucene here, the problem was inside the SegmentReader.Terms() method:

image

This seem innocuous, right? Here is how this is implemented:

image

And all the way down until we gets to the input.Clone() method. Now, in a standard Lucene system, using concurrent queries, this would result in a fair amount of locking. In RavenDB, this just meant that we were creating new FileStream objects.

The problem was that this particular query had a list of terms that it needed to check, and it called the Terms() method many times. How much is many times? 12,000 times!

Still not a problem, except that it called FileStream.Read on each and every one of those. And FileStream.Read does the following:

image

And the _bufferSize is set to the default of 4KB.

In other words, processing a single instance of this particular query will result in the system allocating about 48MB of memory!

And when we have concurrent queries of this type? Each of them is allocating 48 MB of memory, and because they allocate so much, we have GC runs, which cause the memory (which is still in use) to be sent to Gen 1, and eventually park in Gen 2. There is languish (because it is temporary memory, but we don’t clear Gen 2 very often).

We changed the implementation to use overlapped I/O and tested that, and the memory consumption dropped by a significant number. But we still saw more allocations than we liked. We ended up tracking that down the this call (in BufferedIndexInput  in the Lucene codebase):

image

The buffer size in this case is 1KB. So allocation for this query was actually 60 MB(!), and we only managed to drop it by 48MB.

After fighting with this for a long while, we ended scratch the whole buffered index input idea. It is just not sustainable in terms of allocations. Instead, we created a memory map input class, that map the input data once, and doesn’t use a buffer (so no allocations). With that option, our cost to process this query was drastically lower.

We profile the change, to see whatever there are additional issues, and we found that the newly optimized code was much better, but still had an issue. The memory map code used the UnmanagedMemoryStream class to expose the file to the rest of the application. Unfortunately, this class appears to be intended for concurrent usage, which is a rarity for Streams. Here is ReadByte method from that class:image

As you can see, this method is doing quite a lot. And it showed up as a hot spot in our profiling. The rest of the class is pretty complex as well, and does significantly more than what we actually need it to do.

We replaced this with a MmapStream class, and here is the comparable implementation.

image

You can safely assume that this is much faster Smile.

We have tested this using 5000 concurrent requests, without caching. Memory consumption is steady, and doesn’t increase. We show marked improvement across the board, in memory utilization, CPU usage and I/O rates. Note that while this issue was caused by a particular query whose pattern of operation caused tremendous number of allocations, this change has wider reaching implications.

We now allocate less memory for all queries, and previous experience has shown us that reducing a single 4Kb allocation in query processing can improve overall performance but 30%. We haven’t run those tests yet, but I’ll be surprised if we’ll see negative results.

Concurrent max value

time to read 2 min | 324 words

For a feature in RavenDB, I need to figure out the maximum number of outputs per document an index has. Now, indexing runs in multiple threads at the same time, so we ended up with the following code:

var actualIndexOutput = maxActualIndexOutput;
if (actualIndexOutput > numberOfAlreadyProducedOutputs)
{
    // okay, now let verify that this is indeed the case, in thread safe manner,
    // this way, most of the time we don't need volatile reads, and other sync operations
    // in the code ensure we don't have too stale a view on the data (beside, stale view have
    // to mean a smaller number, which we then verify).
    actualIndexOutput = Thread.VolatileRead(ref maxActualIndexOutput);
    while (actualIndexOutput > numberOfAlreadyProducedOutputs)
    {
        // if it changed, we don't care, it is just another max, and another thread probably
        // set it for us, so we only retry if this is still smaller
        actualIndexOutput = Interlocked.CompareExchange(
            ref maxActualIndexOutput, 
            numberOfAlreadyProducedOutputs,
            actualIndexOutput);
    }
}

The basic idea is that this code path is hit a lot, once per document indexed per index. And it also needs to be thread safe, so we first do an unsafe operation, then a thread safe operations.

The idea is that we’ll quickly arrive at the actual max number of index inputs,  but we don’t have to pay the price of volatile reads or thread synchronization.

Optimizing I/O throughput

time to read 7 min | 1277 words

We got a customer request about performance issues they were seeing on startup on a particular set of machines.

Those machine run in a cloud environment, and they have… peculiar, one might say deviant, I/O characteristics. In particular, the I/O pipeline on those machines is wide, but very slow. What do I mean by that? I meant that any particular I/O operation on those is likely to be slow, but the idea is that you can get much better performance if you issue concurrent I/O. The system is supposed to be able to handle that much better, and overall you’ll see the same relative performance as elsewhere.

This is pretty big issue for us, because for many things, we really do care about serial I/O performance. For example, if we are committing a transaction, we really have no other way to handle it except to wait until the I/O is fully completed.

That said, the particular scenario where we had the problem was startup. If the database was under heavy load at the time it shut down, the recovery logs would be full, and the database would need to replay the recent actions that happened. Note that shutdown performance is important, because it many cases we are running in an environment where shutdown comes with a ticking clock (in IIS or as a Windows Service).

At startup, we usually have more time, and it is expected that we’ll take a while to get up to speed. If nothing else, just bringing enough of the database to memory is going to take time, so on large databases, startup time is expected to be non trivial.

That said, the startup time on those set of machines was utterly atrocious. To figure out what is going on, I pulled out Process Monitor and looked at the File I/O. We go this:

image

We are reading from a journal, and that is pretty serial I/O (in the image, I’m running of a remote network drive, to simulate slow responses). Note that we need to read the log in a serial fashion, and the way the OS reads things, we read 32Kb at a time.

Remember, we are reading things in a serial fashion, and that means that we have a lot of page faults, and we have a slow I/O system, and we execute them serially.

Yes, that is a killer for perf. By the way, when I’m talking about slow I/O system, I’m talking about > 0.5 MS per disk read for most requests (ideally, we would have latency of 0.05 – 0.15). And we have quite a few of those, as you can imagine.

Since I know that we are going to be reading the whole journal, I used the PrefetchVirtualMemory() method and passed it the entire file (it is a maximum of 64MB, and we are going to need to read it all anyway). This let the OS have the maximum amount of freedom when reading the data, and it generate big, concurrent I/O. Here is how this looks like:

image

This also give the wide I/O bandwidth a chance to play. We load the I/O subsystem with a lot of stuff that it can try to do in an optimized fashion.

The next part that was expensive was that we need to apply the data from the journal files to the data file, and sync it.

The performance of syncing a file is related to the size of the file, unfortunately. And the file in question was large, over 45GB. Especially on such a system, we saw a lot of latency here, as in multiple minutes. One obvious optimization was to not sync per journal file, but sync once per the whole recovery process. That helped, but it was still too expensive.

Next, we tried pretty much everything we could think about.

  • Switching to WriteFile (from using mmap and then calling FlushViewOfFile)
  • Using async I/O (WriteFileEx)
  • Using scatter / gather I/O with no buffering (saves the need to do sync in the end)
  • Completion ports
  • Asking a 4 months old baby girl what she think about it (she threw up on the keyboard, which is what I wanted to do at the time, then she cried, and I joined her)

Nothing seems to have worked. The major issue was that in this workload, we have a large file (45GB, as I said) and we are writing 4KB pages into it in effectively random places. In the workload we were trying to work with, there were roughly 256,000 individual 4KB writes (most of them weren’t consecutive, so we couldn’t get the benefit of that). That is about 1 GB of writing to do.

And nothing we could do would get us beyond 3MB/sec or so. Saturating the I/O subsystem with hundreds of thousands of small writes wouldn’t work, and we were at a loss. Note that a small test we made, just copying data around manually has resulted in roughly 10MS/sec peek performance on those machines. This is a very lame number, so there isn’t much that we can do.

Then I thought to ask, why are we seeing this only during startup? Surely this happens also on a regular basis. Why didn’t we notice?

The reason for that is pretty simple, we didn’t notice because we amortize the cost. Only on startup did we had to actually sit and wait for it to complete. So we dropped that requirement. We used to read all the journals, apply them to the data file, sync the data files and then delete the journals. Now we read the journals, apply them (via a memory map) to the data file, and only remember what is the last journal file we applied in memory.

There is a background process running that will take care of syncing the data file (and deleting the old journals). If we crash again, we’ll just have to replay the logs that we aren’t sure were synced before. This saves even more time.

But we still have another issue. Writing to memory mapped file require the OS to page the relevant pages into memory. And again, we are on slow I/O, and the OS will only page the stuff that we touch, so this is again a serial process that this time require us to load to memory about 1GB of data at 3MB/sec. That is… not a good place to be at. So the next step was to figure out all the addresses we’ll be writing to, and letting the OS know that we’ll be fetching them. We do some work to make sure that we load those values (and neighboring pages) to memory, then we can write to them without paging for each page individually.

A nice side effect of this is that because this is running on the latest changes in the system, this has the effect of preloading to memory the pages that are likely to be in used after the database has started.

That is a lot of work, but to be perfectly frank, this is mostly optimizing in a bad environment. The customer can’t walk away form their current machine easily, but the I/O rates those machines have would make any database sit in a corner and cry.

Unsafe operations are required in the real world

time to read 2 min | 363 words

There is a pretty interesting discussion in the Raft mailing list, about clarifying some aspects of the Raft protocol. This led to some in depth discussion on the difference between algorithms in their raw state and the actual practice that you need in the real world.

In case you aren’t aware, Raft is a distributed consensus protocol. It allows a group of machines to reach a decision together (a gross over simplification, but good enough for this).

In a recent post, I spoke about dedicated operations bypasses. This discussion surfaced another one. One of the things that make Raft much simpler to work with is that it explicitly handles topology changes (adding / removing nodes). Since that is part of the same distributed consensus algorithm, it means that it is safe to add or remove a node at runtime.

Usually when you build a consensus algorithm, you are very much focused on safety. So you make sure that all your operations are actually safe (otherwise, why bother?). Except that you must, in your design, explicitly allow the administrator to make inherently unsafe operations. Why?

Consider the case of a three node cluster, that has been running along for a while now. A disaster strikes, and two of the nodes die horriblyh. This puts our cluster in a situation where it cannot get a majority, and nothing happen until at least one machine is back online. But those machines aren’t coming back. So our plucky admin wants to remove the two dead servers from the cluster, so it will have one node only, and resume normal operations (then the admin can add additional nodes at leisure). However, the remaining node will refuse to remove any node from the cluster. It can’t, it doesn’t have a majority.

If sounds surprisingly silly, but you actually have to build into the system the ability to make those changes with explicit admin consent as unsafe operations. Otherwise, you might end up with a perfectly correct system, that breaks down horribly in failure conditions. Not because it buggy, but because it didn’t take into account that the real world sometimes requires you to break the rules.

The “you broke the build!” game

time to read 2 min | 344 words

Recently I pulled some code from a colleague, and tried to test it. It worked, which was fine, so I let it run the tests, and went out to lunch.

When I came back, I was surprised to discover that the build has failed, not because of some test failing, but because it couldn’t compile. To be rather more exact, we go the following error:

 [optimized-build] Using type script compiler: C:\Program Files (x86)\Microsoft SDKs\TypeScript\1.4\tsc.exe
 [optimized-build]
 [optimized-build] System.ComponentModel.Win32Exception thrown:
 [optimized-build] --------------------------
 [optimized-build] The filename or extension is too long
 [optimized-build] --------------------------

That was strange. I checked several times, and we had no such thing. No one had a veryVeryLongFileNameThatNeededToBeVeryExplicitAboutWhatItWasDoingAndDidNotCareAboutLength.ts.

And the tsc.exe location was in its normal place. This is from a part in our build process that gather all the TypeScript files and merge them into a single optimized bundle. And it suddenly failed. Now, on the colleague machine, it worked. The previous commit before I merged it, it worked. The merge was a clean one, and very obvious that nothing was going on there.

It took me a while, but I finally figured out that the error occurred because my colleague has added a new TypeScript file.

How can adding a file break the build?

As it turns out, the code we were calling did something like this:

tsc.exe file1.ts file2.ts file3.ts

And at some point, the size of the command line we were passing to the process has exceeded 32KB. And that is when everything broke loose.

Luckily, we were able to write those things to a file and ask the compiler to load them directly, instead of passing them via the command line:

tsc.exe @arguments.txt

And everything was okay again…

Dedicated operations road bypasses

time to read 2 min | 398 words

We care very deeply about the operations side of RavenDB. Support calls are almost never about “where are you? I want to send you some wine & roses”, and they tend to come at unpleasant timing. One of the things that we had learnt was that when stuff breaks, it tend to do so in ways that are… interesting.

Let me tell you a story…

A long time ago, a customer was using an index definition that relied on the presence of a custom assembly to enrich the API available for indexing. During the upgrade process from one major version of RavenDB to the next, they didn’t take into account that they need to also update the customer assembly.

When they tried to start RavenDB, it failed because of the version mismatch, since they weren’t actually using that index anyway. The customer then removed the assembly, and started RavenDB again. At this point, the following sequence of events happened:

  • The database started, saw that it is using an old version of the storage format, and converted to the current version of the storage format.
  • The database started to load the indexes, but the index definition was invalid without the customer assembly, so it failed. (Index definitions are validated at save time, so the code didn’t double check that at the time).

The customer was now stuck, the database format was already converted, so in order to rollback, they would need to restore from backup. They could also not remove the index from the database, because the database wouldn’t start to let them do so. Catch 22.

At this point, the admin went into the IndexDefinitions directory, and deleted the BadIndex.index-definition file, and restarted RavenDB again. The database then recognized that the index definition is missing, but the index exists, deleted the index from the server, and run happily ever after.

Operations road bypass is our terminology for giving administrators a path to changing internal state in our system using standard tools, without requiring the system to be up and functioning. The example with the index definition is a good one, because the sole reason we keep the index definition on disk is to allow administrators the ability to touch them without needing RavenDB in a healthy state.

What do you do in your system to make it possible for the admin to recover from impossible situations?

Repeatable random tests

time to read 3 min | 460 words

Testing our software is something that we take very serious. And in some cases, we want to go beyond testing stuff that we know. We want to test random stuff. For example, if we add 10,000 documents, then remove every 17th, what happens? Is there any differences in behavior, performance, etc?

It is easy to do random stuff, of course. But that leads to an interesting case. As long as the tests are passing, you can pat yourself on the back: “We have done good, and everything works as it should”.

But when something fails… well, the only thing that you know is that something did fail. You don’t have a way to reproduce this. Because the test is… random.

In order to handle that, we wrote the following code:

 [AttributeUsage(AttributeTargets.Method, AllowMultiple = true)]
 public class InlineDataWithRandomSeed : DataAttribute
 {
 
     public InlineDataWithRandomSeed(params object[] dataValues)
     {
         this.DataValues = dataValues ?? new object[] {null};
     }
 
     public object[] DataValues { get; set; }

     public override IEnumerable<object[]> GetData(MethodInfo methodUnderTest, Type[] parameterTypes)
     {
         var objects = new object[DataValues.Length+1];
         Array.Copy(DataValues,0,objects,0, DataValues.Length);
         objects[DataValues.Length] = Environment.TickCount;
         yield return objects;
         
     }
 }

This is using XUnit, which gives us the ability to add a seed to the test. Let us see how the test looks:

image

And this is what this looks like when it runs:

image

When we have a failure, we know what the seed is, and we can run the test with that seed, and see what exactly happened there.

Presenting, Highly Available & Scalable Solutions at GOTO Copenhagen

time to read 1 min | 168 words

I’ll be presenting at the GOTO Copenhagen conference in Oct 7 – 8 this year. The full session summary is:

Presentation: Highly Available & Scalable Solutions with RavenDB

Track: Solutions Track 1 / Time: Monday 13:20 - 14:10 / Location: Rosenborg

RavenDB is a 2nd generation document database, with built-in load distribution, seamless replication, disaster recovery and data-driven sharding.

In this session, we are going to explore how RavenDB deals with scaling under load and remain highly available even under failure conditions.

We'll see how RavenDB's data-driven sharding allows to increase the amount of the data in our cluster without giving up the benefits of data locality.

We are are going to execute complex distributed map-reduce queries on a sharded cluster, giving you lightning-fast responses over very large data volumes.

Hibernating Rhinos will also be presenting at a booth, and we’ll have a few members of the core team there to talk about RavenDB and the cool things that you can do with it.

FUTURE POSTS

  1. RavenDB 3.0 New Stable Release - 30 minutes from now
  2. Production postmortem: The industry at large - about one day from now
  3. The insidious cost of allocations - 2 days from now
  4. Buffer allocation strategies: A possible solution - 5 days from now
  5. Buffer allocation strategies: Explaining the solution - 6 days from now

And 3 more posts are pending...

There are posts all the way to Sep 11, 2015

RECENT SERIES

  1. Find the bug (5):
    20 Apr 2011 - Why do I get a Null Reference Exception?
  2. Production postmortem (10):
    01 Sep 2015 - The case of the lying configuration file
  3. What is new in RavenDB 3.5 (7):
    12 Aug 2015 - Monitoring support
  4. Career planning (6):
    24 Jul 2015 - The immortal choices aren't
View all series

RECENT COMMENTS

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats