Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:


+972 52-548-6969

, @ Q j

Posts: 6,570 | Comments: 48,197

filter by tags archive

Production postmortemThe unavailable Linux server

time to read 6 min | 1019 words

imageAs part of our 4.0 deployment, we have a cluster made of mixed machines, some running Windows, some running Linux. In one particular configuration, we have 2 Windows services and a single Linux machine. We setup the cluster and started directing production traffic to it, and all was good. We left for the weekend and came back the following week, to see what the results were.  Our Linux machine was effectively down. It was not responding to queries and it seemed like it was non responsive.

That was good and bad. It was good because the rest of the cluster just worked, and there was no interruption in service. This is as designed, but it is always nice to see this in real life. It is bad because it isn’t supposed to be happening. What was really frustrating was that we were able to log into the machine and there was nothing wrong there.

No high CPU or memory, no outstanding I/O or anything of the kind that we are used to.

The first clue for us was trying to diagnose the problem from inside the machine, where we able to reproduce it by trying to access the server from the local machine, resulting in the exact same issue as observed externally. Namely, the server would accept the connection and just leave it hanging, eventually timing out.

That was a interesting discovery, since it meant that we can rule out anything in the middle. This is a problem in this machine. But the problem remained quite hard to figure out. We deployed to production in a manner similar to what we expect our users will do, so we used Let’s Encrypt as the certificate authority with auto generated certificates.

So we started by seeing where the problem is, whatever this is on the TCP side or the SSL side, we issued the following command:

openssl s_client -connect b.cluster-name.dbs.local.ravendb.net:443

This command showed immediate connection to the server and the client sending the ClientHello properly, but then just hanging there. What was really interesting is that if we waited about 2 minutes, that SSL connection would complete successfully. But we couldn’t figure out any reason why this would be the case. It occurred to me that it might be related to the system handling of reverse DNS lookup. The two minutes timeout was very suspicious, and I assumed that it might be trying to lookup the client certificate and somehow resolve that. That isn’t how it works in general, although the fact that some SSH (and not SSL/TLS) configuration directly relate to this has led us in a merry chase.

Eventually we pulled strace and looked into what is actually going on. We focused on the following calls:

sudo strace -fp 1017 -s 128 -Tfe open,socket,connect,sendto,recvfrom,write

The interesting bits from there are shown here:

As you can see, we are looking at some DNS resolution, as we can tell from the /etc/resolv.conf and /etc/hosts open() calls. Then we have a connect() to which is an AWS name server. Note that this is done over UDP, as you can see from the SOCK_DGRAM option in the preceding socket() call.

We are getting some data back, and we can see identrust there. And then we see something really suspicious. We have a TCP socket call that goes to on port 80. In other words, this is an HTTP call. What does an HTTP call is doing in the middle of an SSL handshake?

As it turned out, our firewall configuration was blocking outbound connections to port 80. We tested removing that rule and everything came back online and the server was running just fine.

Further inspection revealed that we were calling to: http://apps.identrust.com/roots/dstrootcax3.p7c

And this is where things started to jell together. We are using Let’s Encrypt certificates, and in order to ensure trust, we need to send the full chain to the user. SSL Certificates has the notion of Authority Information Access, which is basically a URL that is registered in the certificate that points to where you can find the certificate that signed this one.

Why is this using HTTP? Because the data that will be fetched is already signed, and it is not a secret. And trying to use HTTPS to fetch it might get us into a loop.

So whenever we had a new SSL connection, we’ll try to connect to IdenTrust to get the full chain to send to the client. The killer here is that if we fail to do so, we’ll send the certificate chain we have (without the missing root), but it will work, since the other side already have this root installed (usually). On Windows, this certificate is installed, so we didn’t see it. On Linux, we didn’t have that certificate installed, so we had to look it up every single time.

The gory details, including dives into the source code are in the GitHub issue. And I do think they are gory. In this case, once we realized what was going on we were able to take steps to handle this. We needed to pre-register the entire chain on the local machine, so any lookup will be able to find it locally, and not do a network call per each SSL connection.

But beyond mutating the global certificate store, there is no real way to prevent that remote call.

Note that this is also true for Windows, although that seems to be implemented much deeper in the stack, and not in managed code, so I wasn’t able to trace where this is actually happening. The really bad thing here is that from the outside, there is no way for us to control or disable this, so this is just something that you have to remember to do when you use certificates, make sure that the entire chain is registered on each machine, otherwise you might have a remote call per connection, or a very long (and synchronous!) hang until the request times out if you are blocking outgoing access.

Getting fatal out of memory errors because we are managing memory too well

time to read 4 min | 621 words

imageWe got a serious situation on one of our test cases. We put the system through a lot, pushing it to the breaking point and beyond. And it worked, in fact, it worked beautifully. Up until the point that we started to use too many resources and crashed. While normally that would be expected, it really bugged us, we had provisions in place to protect us against that. Bulkheads were supposed to be blocked, operations rolled back, etc. We were supposed to react properly, reduce costs of operations, prefer being up to being fast, the works.

That did not happen. From the outside, what happened is that we go to the point where we would trigger the “sky about the fall, let’s conserve everything we can”, but we didn’t see the reaction that we expected from the system. Oh, we were started to use a lot less resources, but the resources that we weren’t using? They weren’t going back to the OS, they were still held.

It’s easiest to talk about memory in this regard. We hold buffers in place to handle requests, and in order to avoid fragmentation, we typically make them large buffers, that are resident on the large object heap.

When RavenDB detects that there is a low memory situation, it starts to scale back. It releases any held buffers, completes ongoing works and starts working on much smaller batches, etc. We saw that behavior, and we certainly saw the slow down as RavenDB was willing to take less upon itself. But what we didn’t see is the actual release of resources as a result of this behavior.

And as it turned out, that was because we were too good about managing ourselves. A large part of the design of RavenDB 4.0 was around reducing the cost of garbage collections by reducing allocations as much as possible. This means that we are running very few GCs. In fact, GC Gen 2 collections are rare on our environment. However, we need these Gen 2 collections to be able to clean up stuff that is in the finalizer queue. In fact, we typically need two such runs before the GC can be certain that the memory is not in use and actually collect it.

In this particular situation, we were careful to code so we will get very few GC collections running, and that led us to crash because we would run out of resources  before the GC could realize that we are actually not really using them at this point.

The solution, by the way, was to change the way we respond to low memory conditions. We’ll be less good about keeping all the memory around and if it isn’t being used, we’ll start discarding it a lot sooner, so the GC has better chance to actually realize that is isn’t being used and recover the memory. An instead of throwing the buffers away all at once when we have low memory and hope that the GC will be fast enough in collecting them, we’ll keep them around and reuse them, avoiding the additional allocations that processing more requests would have required.

Since the GC isn’t likely to be able to actually free them in time, we aren’t affecting the total memory consumed in this scenario but are able to reduce allocations by serving the buffers that are already allocated. This two actions, being less rigorous about policing our memory and not freeing things when we get low memory are confusingly enough to get both reduce the chance of getting into low memory and reduce the chance of actually using too much memory in such a case.

Production crash root cause analysis step by step

time to read 3 min | 470 words

One of our production server crashed. They do that a lot, because we make them do evil things in harsh environment so they would crash on our end and not when running on your systems. This one is an interesting crash.

We are running the system with auto dump, so a crash will trigger a mini dump that we can run. Here is the root cause of the error, and IOException here:

    SP               IP               Function
    00000024382FF0D0 00007FFD8313108A System_Private_CoreLib!System.IO.FileStream.WriteCore(Byte[], Int32, Int32)+0x3cf7ea
    00000024382FF140 00007FFD82D611A2 System_Private_CoreLib!System.IO.FileStream.FlushWriteBuffer(Boolean)+0x62
    00000024382FF180 00007FFD82D60FA1 System_Private_CoreLib!System.IO.FileStream.Dispose(Boolean)+0x41
    00000024382FF1D0 00007FFD82D60631 System_Private_CoreLib!System.IO.FileStream.Finalize()+0x11

And with this, we are almost done analyzing the problem, right? This is a crash because a finalizer has thrown, so we now know that we are leaking a file stream and that it throws on the finalizer.  We are almost, but not quite, done. We are pretty careful about such things, and most to all of our usages are wrapped in a using statement.

The reason for the error was quite apparent when we checked, we run out of disk space. But that is something that we extensively tested and we are supposed to work fine with that. What is going on?

No disk space is an interesting IO error, because it is:

  • transient, in the sense that the admin can clear some space, and does not expect you to crash.
  • persistent, in the sense that the application can’t really do something to recover from it.

As it turns out, we have been doing everything properly. Here is the sample code that will reproduce this error:

For the moment, ignore things like file metadata, etc. We are writing 12 bytes, and we have only place for 10. Note that calling WriteByte does not go to the disk, extend the file, etc. Instead, it will just write to the buffer and the buffer will be flushed when you call Flush or Dispose.

Pretty simple, right?

Now let us analyze what will happen in this case, shall we?

We call Dispose on the FileStream, and that ends up throwing, as expected. But let us dig a little further, shall we?

Here is the actual Dispose code:

This is in Stream.Dispose. Now, consider the case of a class that:

  • Implements a finalizer, such as FileStream.
  • Had a failure during the Dispose(true); call.
  • Will have the same failure the next time you call Dispose(), even if with Dispose(false);

In this case, the Dispose fails, but the SuppressFinalize is never called, so this will be called from the finalizer, which will also fail, bringing our entire system down.

The issue was reported here and I hope that they will be fixed by the time you read it.

Transactions, request processing and convoys, oh my!

time to read 5 min | 805 words

imageYou might have noticed that we have been testing, quite heavily, what RavenDB can do. Part of that testing focused on running on barely adequate at ridiculous loads.

Because we setup the production cluster like a drunk monkeys, a single high load database was able to affect all other databases.  To be fair, I explicitly gave the booze to the monkeys and then didn’t look when they made a hash of it, because we wanted to see how RavenDB does when setup by someone who has as little interest in how to properly setup things and just want to Get Things Done.

The most interesting part of  this experiment was that we had a wide variety of mixed workload on the system. Some databases are primarily read heavy, some are used infrequently, some databases are write heavy and some are both write heavy and indexing heavy. Oh, and all the nodes in the cluster were setup identically, with each write and index going to all the nodes.

That was really interesting when we started very heavy indexing loads and then pushed a lot of writes. Remember, we intentionally under provisioned, and these machines are 2 cores with 4GB RAM and they were doing heavy indexing, processing a lot of reads and writes, the words.

Here is the traffic summary from one of the instances:


A very interesting wrinkle that I didn’t mention is that this setup has a really interesting property that we never tested. It has fast I/O, but the number of tasks that are waiting for the two cores is high enough that they don’t get a lot of CPU time on an individual basis. What that means is that it looks like we have I/O that is faster than the CPU.

A core concept of RavenDB performance is that we can send work to the disk to run and in that timeframe we will be able to complete more operations in memory, then send a batch of them to disk. Rinse, repeat, etc.

This wasn’t the case here. By the time we finished a single operation we’ll already have the previous operation completed, and so we’ll proceed with a single operation every time. That killed our performance, and it meant that the transactions merger queue would grow and grow.

We fixed things so we’ll take into account the load on the system when this happens, and we gave a higher priority to the transaction merging thread over normal request processing or indexing. This is because write requests can’t complete before the transaction has been committed, so obviously we don’t want to process further requests at the expense of processing the current requests.

This problem can only occur when we are competing heavily for CPU time, something that we don’t typically see. We are usually constrained a lot more by network or disk. With enough CPU capacity, there is never an issue of the requests and the transaction merger competing for the same core and interrupting each other, so we didn’t consider this scenario.

Another problem we had was the kind of assumptions we made with regards to the processing power. Because we tested on higher end machines, we tested with some ridiculous performance numbers, including hundreds of thousands of writes per second, indexing, mixed read / write load, etc. But we tested that on high end hardware, which means that we got requests that completed fairly quickly. And that led to a particular pattern of resource utilization. Because we reuse buffers between requests, it is typically better to grab a single big buffer and keep using it, rather than having to enlarge it between requests.

If your requests are short, the number of requests you have in flight is small, so you get great locality of reference and don’t have to allocate memory from the OS so often. But if that isn’t the case…

We had a lot of requests in flight. A lot of them because they were waiting for the transaction merger to complete its work, and it was having to fight the incoming requests for some CPU time. So we have a lot of inflight requests, and they intentionally got a bigger buffer than they actually needed (pre-allocating). You can continue down this line of thought, but I’ll give you a hint, it ends with KABOOM.

All in all, I think that it was a very profitable experiment. This is going to be very relevant for users on the low end, especially those running Docker instances, etc. But it should also help if you are running production worthy systems and can benefit from higher utilization of the system resources.

The cost of finalizers

time to read 2 min | 231 words

A common pattern we use is to wrap objects that require disposing of resources with a dispose / finalizer. Here is a simple example:

This is a pretty simple case, and obviously we try to avoid these allocations, so we try to reuse the SecretItem class.

However, when peeking into high memory utilization issue, we run into a really strange situation. We had a lot of buffers in memory, and most of them were held by items that were held in turn by the Finalization Queue.

I started to write up what I figured out about finalizer in .NET, but this post does a great job of explaining everything. The issue here is that we retain a reference to the byte array after the dispose. Because the object is finalizable, the GC isn’t going to be able to remove it on the spot. That means that it is reachable, and anything that it holds is also reachable.

Now, we are trying very hard to be nice to the GC and not allocate too much, which means that the finalizer thread will only wake up occasionally, but that means that the buffer will be held up in memory for a much longer duration.

The fix was the set the Buffer to null during the Dispose call, which means that the GC can pick up that it isn’t actually being used much faster.

ChallengeFind the bug in the fix–answer

time to read 2 min | 347 words

I am writing this answer before people had a chance to answer the actual challenge, so I hope people caught it. This was neither easy nor obvious to catch, because it was hiding with a pile of other stuff and the bug is a monster to figure out.

In case you need a reminder, here is the before & after code:

Look at line 18 in the second part. If we tried to allocate native memory and failed, we would try again, this this with the requested amount.

The logic here is that we typically want to request memory in power of 2 increments. So if asked for 17MB, we’ll allocate 32MB. This code is actually part of our memory allocator, which request memory from the operating system, so it is fine if we allocate more, we’ll just use that in a bit. However, if we don’t have enough memory to allocate 32MB, maybe we do have enough to allocate 17MB. And in many cases, we do, which allow the system to carry on operating.

Everyone is happy, right? Look at line 21 in the second code snippet. We set the allocated size to the size we wanted to allocate, not the actual size we allocated.

We allocated 17MB, we think we allocated 32MB, and now everything can happen.

This is a nasty thing to figure out. If you are lucky, this will generate an access violation when trying to get to that memory you think you own. If you are not lucky, this memory was actually allocated to your process, which means that you are now corrupting some totally random part of memory in funny ways. And that means that in some other time you’ll be start seeing funny behaviors and impossible results and tear your hair out trying to figure it out.

To make things worse, this is something that only happens when you run out of memory, so you are already suspicious about pretty much everything that is going on there. Nasty, nasty, nasty.

I might need a new category of bugs: “Stuff that makes you want to go ARGH!”

ChallengeFind the bug in the fix

time to read 1 min | 72 words

We are working on improving the reliability of RavenDB under a host of scenarios This week it is low memory conditions. We made some fixes, and introduced a horrible bug.

Here is the code, can you see what the error is? Here are the first and second versions of the code. The second version is meant to be more robust to running under low memory conditions, but it is actually much worse.

Production Test RunThe self flagellating server

time to read 2 min | 354 words

imageSometimes you see the impossible. In one of our scenarios, we saw a cluster that had such a bad case of split brain that it came near to fracturing the very boundaries of space & time.

In a three node cluster, we have one node that looked to be fine. It connected to all the other nodes and was the cluster leader. The other two nodes, however, were not in the cluster and in fact, they were showing signs that they never were in the cluster.

What was really strange was that we took the other two machines down and the first node was still showing a successful cluster. We looked deeper and realized that it wasn’t actually a healthy situation, in fact, this node was very rapidly switching between leader and follower mode.

It took a bit of time to figure out what was going on, but the root cause was DNS. We had the three nodes on separate DNS (a.oren.development.run, b.oren.development.run, c.oren.development.run) and they were setup to point to the three machines. However, we have previously used the same domain names to run a cluster on the first machine only. Because of the way DNS updates, whenever the machine at a.oren.development.run would try to connect to b.oren.development.run it would actually connect to itself.

At this point, A would tell B that it is the leader. But A is B, so A would respond by becoming a follower (because it was told it should, by itself). Because it became a follower, it disconnected from itself. After a timeout, it would become leader again, and the cycle would continue.

Every time that the server would get up, it would whip itself down again. “I’m a leader”, “No, I’m a leader”, etc.

This is a fun thing to discover. We had to trace pretty deep to figure out that the problem was in the DNS cache (since the DNS itself was properly updated).

We fixed things so we now recognize if we are talking to ourselves and error properly.

Production Test RunRowhammer in Voron

time to read 3 min | 599 words

imageRowhammer is a type of attack on the way DRAM is built. A specific write pattern to a specific memory cell can cause “leakage” to nearby memory cells, causing bit flips. The issue we were facing in production ended up being very similar.

The initial concern was that a database on disk size was very large. Enough that we started a serious investigating into what exactly is taking all this space. Our internal reporting said, nothing. And that was very strange. Looking at the actual file usage, we had a lot of transaction journals taking a lot of space there. By a lot I mean that the size of the data file in question was 32 MB and the journals were taking a total of over 15GB. To be fair, some of them were kept around to be reused, but that was bad.

It took a while to figure out what was going on. This particular Voron instance was used for a map/reduce index on a database that had high write throughput. Because if that, the indexing was always active. So far, so good, we have several other such instances, and they don’t exhibit this behavior. What was different about this index is that due to the shape of the index and the nature of the data, what ended up happening is that we’ll always modify the same (small) set of the data.

This index sums up a number of events and aggregate them based on when they happened. This particular system handle about a hundred updates a second on average, and can peak to about five to seven times that. The index gives us things such as “how many events of each type happened today” and things like that. This means that there is a lot of locality in the updates. And that was the key.

Even though this index (and the Voron storage that backed it) was experienced a lot of writes, these writes almost always happened to the same set of data (basically updating the counters). That means that there wasn’t actually just a very small set of pages in the data that were modified. And that set off a chain of behaviors that results in a lot of disk space being used.

  • A lot of data is modified, meaning that we need to write a lot to the journal on each transaction.
  • Because the same data is constant modified, the total amount of modified bytes is smaller than a certain threshold.
  • Writes are constants.
  • Flushing the data to disk is expensive, so we try to avoid it.
  • We can only delete journals after we flushed the data.

Because we try to avoid flushing to disk if we can, we only do that when there is enough idle time or when enough data has been modified. In this case, there was no idle time, and the amount of data that was modified was too small to hit the limit.

The system would actually balance itself out eventually (which is why it stopped at around ~15GB or so of journals). At some point we would either hit an idle spot or the buffer will hit the limit and we’ll flush to disk, which allow us to free the journals, but that happened only after we had quite a few. The fix was to add a limit to how long we’ll delay flushing to disk in such a case and was quite trivial once we figure out what exactly all the different parts were doing.

Production Test RunToo much of a good thing isn’t so good for you

time to read 2 min | 316 words

imageNot all of our testing happened in a production settings. One of our test clusters was simply running a pretty simple loop of writes, reads and queries on all the nodes in the cluster while intentionally destabilizing the system.

After about a week of this we learned that this worked, there were no memory leaks or increased resource usage and also that the size of the data on disk was about three orders of magnitude too much.

Investigating this we discovered that the test process introduced conflicts because it wrote the same set of documents to each of the nodes, repeatedly. We are resolving this automatically but are also keeping the conflicted copies around so users can figure out what happened to their system. In this particular scenario, we had a lot of conflicted revisions, and it was hard initially to figure out what took that space.

In our production system, we also discovered that we log too much. One of the interesting feedback items we were looking for in this production test run is to see what kind of information we can get from the logs and make sure that the details there are actionable. A part of that was to see if we could troubleshoot something simply using the logs, and add missing details if there were stuff that we couldn’t figure out from them.

We also discovered that under load, we would log a lot. In particular, we had logs detailed every indexed document and replicated item. These are almost never useful, but they generate a lot of noise when we lowered the log settings. So that went away as well. We are very focused on logs usability, it should be possible to understand what is going on and why without drowning in minutia.


  1. Patterns for high-performance C# - 7 hours from now

There are posts all the way to Feb 26, 2018


  1. Production postmortem (22):
    22 Feb 2018 - The unavailable Linux server
  2. Challenge (50):
    31 Jan 2018 - Find the bug in the fix–answer
  3. Production Test Run (6):
    26 Jan 2018 - Overburdened and under provisioned
  4. Inside RavenDB 4.0 (3):
    22 Jan 2018 - Chapter 10 is done
  5. Reminder (8):
    09 Jan 2018 - Early bird pricing for RavenDB workshops about to close
View all series



Main feed Feed Stats
Comments feed   Comments Feed Stats