Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969

Posts: 6,903 | Comments: 49,346

filter by tags archive
time to read 4 min | 602 words

imageI spoke about this in the video, and it seems to have caught a lot of people yes, so I thought that I would talk here a bit how we trace the root cause of a RavenDB critical issue to a printer being out of paper.

What is the relationship, I can hear you ask, between RavenDB, a document database, and a printer being out of paper? That is a good question. The answer is pretty much none. There is no DocPrint module inside of RavenDB and the last time yours truly wrote printer code was over fifteen years ago. But the story started, like all good tech stories, with a phone call. An inconveniently timed phone call, I might add.

Imagine, the year is 2013, and I’m enjoying the best part of the year, December. I love December for a few reasons. I was born in it, and it is a time where pretty much all our customers are busy doing other stuff and we can focus on pure development. So imagine my surprise when, on the other side of the line, I got a pretty upset admin that had to troubleshoot a RavenDB instance that would refuse to start. To make things interesting, this admin had drawn the short straw, I assume, and had to man the IT department while everyone else was on vacation. He had no relation to RavenDB during normal operations and only had the bare minimum information about it.

The symptoms were clear, luckily. RavenDB would simply refuse to start, hanging almost immediately, it seemed. No network access, nothing in the logs to indicate a problem. Just… hanging…

Here is the stack trace that we captured:

And that was… it. We started to look into it, and we run into this StackOverflow question, which was awesome. Indeed, restarting the print spooler would fix the problem and RavenDB would immediately start running. But I couldn’t leave it like this, and I guess the admin on the other side was kinda bored, because he went along with my investigation.

We now have access to the code (we didn’t in 2013) and can look at exactly what is going on. This comment had me… upset:

image

The service manager in Windows will consider any service that didn’t finish initialization in 30 seconds to be failed and kill it.  You might be putting the things together at this point?

After restarting the print spooler, we were able to start RavenDB, but restarting it cause another failure. Eventually we tracked it down to a network printer that was out of paper (presumably no one in the office to notice / fill it up). My assumption is that the print spooler was holding a register key hostage while making a network call to the printer that would time out because it didn’t have any paper. If at that time you would attempt to use a performance counter, you would hang, and if you are running as service, would be killed.

I’m left with nothing to do but quote Leslie Lamport:

A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.

Oh so true.

We ended up ditching performance counters entirely after running into so many issues around them. As of 2017, people were still running into this issue, so I think that was a great decision.

time to read 10 min | 1813 words

imageThis is a sordid tale of chance and mystery and the nasty tricks that Murphy can play on you.

A few customers reported an error similar to the following one:

Invalid checksum for page 1040, data file Raven.voron might be corrupted, expected hash to be 0 but was 16099259854332889469

One such case might be a disk corruption, but multiple customers reporting it is an indication of a much bigger problem. That was a trigger for a STOP SHIP reaction. We consider data safety a paramount goal of RavenDB (part of the reason why I’m doing this Production Postmortem series), and we put some of our most experienced people on it.

The problem was, we couldn’t find it. Having access to the corrupted databases showed that the problem occurred on random. We use Voron in many different capacities (indexing, document storage, configuration store, distributed log, etc) and these incidents happened across the board. That narrowed the problem to Voron specifically, and not bad usage of Voron. This reduced the problem space considerably, but not enough for us to be able to tell what is going on.

Given that we didn’t have a lead, we started  by recognizing what the issue was and added additional guards against it. In fact, the error itself was a guard we added, validating that the data on disk is the same data that we have written to it. The error above indicates that there has been a corruption in the data because the expected checksum doesn’t match the actual checksum from the data. This give us an early warning system for data errors and prevent us from proceeding on erroneous data. We have added this primarily because we were worried from physical disk corruption of data, but it turns out that this is also a great early warning system for when we mess up.

The additional guards were primarily additional checks for the safety of the data in various locations on the pipeline. Given that we couldn’t reproduce the issue ourselves, and none of the customers affected were able to reproduce this, we had no idea how to go from there. Therefor, we had a team that kept on trying different steps to reproduce this issue and another team that added additional safety measures for the system to catch any such issue as early as possible.

The additional safety measures went into the codebase for testing, but we still didn’t have any luck in figuring out what we going on. We went from trying to reproduce this by running various scenarios to analyzing the code and trying to figure out what was going on. Everything pointed to it being completely impossible for this to happen, obviously.

We got a big break when the repro team managed to reproduce this error when running a set of heavy tests on 32 bits machines. That was really strange, because all the reproductions to date didn’t run on 32 bits.

It turns out that this was a really lucky break, because the problem wasn’t related to 32 bits at all. What was going on there is that under 32 bits, we run in heavily constrained address space, which under load, can cause us to fail to allocate memory. If this happens at certain locations, this is considered to be a catastrophic error and requires us to close the database and restart it to recover. So far, this is pretty standard and both expected and desired reaction. However, it looked like sometimes, this caused an issue. This also tied to some observations from customers about the state of the system when this happened (low memory warnings, etc).

The very first thing we did was to test the same scenario on the codebase with the new checks added. So far, the repro team worked on top of the version that failed at the customers’ sites, to prevent any other code change from masking the problem. With the new checks, we were able to confirm that they actually triggered and caught the situation early. That was a great confirmation, but we still didn’t know what was going on. Luckily, we were able to add more and more checks to the system and run the scenario. The idea was to trip over a guard rail as early as possible, to allow us to inspect what actually caused it.

Even with a reproducible scenario, that was quite hard. We didn’t have a reliable method of reproducing it, we had to run the same set of operations for a while to hopefully reproduce this scenario. That took quite a bit of time and effort. Eventually, we figured out what was the root cause of the issue.

In order to explain that, I need to give you a refresher on how Voron is handling I/O and persistent data.

Voron is using MVCC model, in which any change to the data is actually done on a scratch buffer, this allow us to have snapshot isolation at very little cost and give us a drastically simplified model for working with Voron. Other important factors include the need to be transactional, which means that we have to make durable writes to disk. In order to avoid doing random writes, we use a Write Ahead Journal. For these reasons, I/O inside Voron is basically composed of the following operations:

  • Scratch (MEM) – copy on write data for pages that are going to be changed in the transaction. Usually purely in memory. This is how we maintain the Isolated and Atomic aspects on ACID.
  • Journal (WAL) – sequential, unbuffered, writes that include all the modifications to the transaction. This is how we maintain the Atomic and Durability aspects in ACID.
  • Flush (MMAP)– copy data from the scratch buffers to the data file, which allow us to reuse space in the scratch file.
  • Sync – (FSYNC) – ensure that the data from a previous flush is stored in durable medium, allow us to delete old journal files.

In Voron 3.5, we had Journal writes (which happen on each transaction commit) at one side of the I/O behavior and flush & sync as the other side. In Voron 4.0, we actually split it even further, meaning that journal writes, data flush and file sync are all independent operations which can happen independently.

Transactions are written to the journal file one at a time, until it reach a certain size (usually about 256MB), at which point we’ll create a new journal file. Flush will move data from the scratch buffers to the data file and sync will ensure that the data that was moved to the data file is durably stored on disk, at which point you can safely delete the old journals.

In order to trigger this bug, you needed to have the following sequence of events:

  • Have enough transactions happen quickly enough that the flush / sync operations are lagging by more than a single file behind the transaction rate.
  • Have a transaction start a new journal file while the flush operation was in progress.
  • Have, concurrently, the sync operation complete an operation that include that last journal file. Sync can take a lot of time.
  • Have another flush operation go on while the sync is in progress, which will move the flush target to the new journal file.
  • Have the sync operation complete, which only synced some of the changes that came from that journal, but because the new flush (which we didn’t sync yet) already moved on from that journal, mistakenly believe that this journal file is completed done and delete it.

All of these steps, that is just the setup for the actual problem, mind you.

In this case, we are prepared to have to this issue, but we aren’t yet to actually experience it. This is because what happened is that the persistent state (on disk) of the database is now suspect, if a crash happens, we will miss the oldest journal that still have transactions that haven’t been properly persisted to the data file.

Once you have setup the system properly, you aren’t done, in terms of reproducing this issue. We now have a race, the next flush / sync cycle is going to fix this issue. So you need to have a restart of the database within a very short period of time.

For additional complexity, the series of steps above will cause a problem, but even if you crash in just the right location, there are still some mitigating circumstances. In many cases, you are modifying the same set of pages in multiple transactions, and if the transactions that were lost because of the early deletion of the journal file had pages that were modified in future transactions, these transactions will fill up the missing details and there will be no issue. That was one of the issues that made it so hard to figure out what was going on. We needed to have a very specific set of timing between three separate threads (journal, flush, sync) that create the whole, then another race to restart the database at this point before Voron will fix itself in the next cycle, all happening just at the stage that Voron moves between journal files (typically every 256MB of compressed transactions, so not very often at all) and with just the right mix of writes to different pages on transactions that span multiple journal files.

These are some pretty crazy requirements for reproducing such an issue, but as the saying goes: One in a million is next Tuesday.

What made this bug even nastier was that we didn’t caught it earlier already. We take the consistency guarantees of Voron pretty seriously and we most certainly have code to check if we are missing transactions during recovery. However, we had a bug in this case. Because obviously there couldn’t be a transaction previous to Tx #1, we aren’t checking for a missing transaction at that point. At least, that was the intention of the code. What was actually executing was a check for missing transactions on every transaction except for the first transaction on the first journal file during recovery. So instead of ignoring just the the check on Tx #1, we ignored it on the first tx on all recoveries.

Of course, this is the exact state that we have caused in this bug.

Sigh.

We added all the relevant checks, tightened the guard rails a few more times to ensure that a repeat of this issue will be caught very early and provided a lot more information in case of an error.

Then we fixed the actual problems and subject the database to what in humans would be called enhanced interrogation techniques. Hammers were involved, as well as an irate developer with penchant to pulling the power cord at various stages just to see what will happen.

We have released the fix in RavenDB 4.1.4 stable release and we encourage all users to upgrade as soon as possible.

time to read 6 min | 1079 words

I’m certain that the archangel that is responsible for statistical grouping has a strange fondness for jokes. If that isn’t the case, I can’t really explain how we routinely get clear clustering of bug reports on specific issues. We have noticed that several years ago, when we suddenly started to get a lot of reports of a particular issue. There was no recent release and no reason for all of these correlated errors to happen all at once. But they did. Luckily, the critical mass of reports gave us more than enough information to figure out what the bug was and fix it. When we checked, the bug was there, dormant, for 20 months.

The current issue we are handling is a spate of reports about hard errors in RavenDB. The kind of errors that use the terms invalid data, in particular. These kind of errors bring development to a screeching halt, as we direct all available resources to figure out what is going on. We didn’t have a reproduction, but we did have a few cases of people reporting it and the stack trace we got told us that the problem wasn’t isolated to a single location but seemed to be fairly wide spread. This is a Good News / Bad News kind of situation. It is good because it means that the problem is in a low level component, it is bad because it is going to be hard to narrow it down exactly.

The common theme around this issue popped up when the system was already in a bad state. We were able to see that when we run out of disk space or when the memory on the system was very low. The common thread to these was that in both cases these are expected and handled scenarios. What is more, these are the kind of errors that we want to keep on shouldering through.

In general, the error strategy we use in most of RavenDB is fairly simple. Abort the transaction and give the user an error. If the error is considered catastrophic (I/O failure when writing to disk, for example), we also force a restart of the database (which does full recovery). As it turned out, there were three completely separate issues that we found when investigating this issue.

First, and the most critical one was cleanup code that wasn’t aware of the state of the system. For example, consider:

Let’s assume that we have an error during the WriteData. For example, we run out of space on disk. We abort the transaction and throw an exception. The FlushIndex() method, however, will go into the catch clause and try to cleanup its resources. When doing so, it will access the transaction (that was just aborted) and try to use it, still.

The problem is that at this point the state of the transaction is known bad, the only thing you are allowed to do is to dispose it, you cannot access anything on it. The code in question, however, is meant to be used on non transactional resource and require compensating actions to revert to previous state.

For that matter, the code above has another problem, however. Do you see the using statement here? It seemed like we have quite a bit of code that does stuff in the Dispose method. And if that stuff also touch the transaction, you may get what is called in the profession: “hilarity ensued”.

The problem was that our transaction code was too trusting and assumed that the calling code will not call it if it is in invalid state. Indeed, the cases where we did such a thing were rare, but when they happened.

The second problem was when we had code like this:

Except that the real code is a lot more complex and it isn’t as each to see what is going on.

What you see here is that we run commands, and as we process them, we may get (expected) exceptions, which we should report to the caller. The problem is that we mixed expected exceptions with unexpected ones. And these unexpected exceptions could leave us in… a funny state. At which point we would continue executing future commands and even commit the transaction. As you can imagine, that isn’t a good place to be at.

We have gone over our error handling and injected faults and errors at any later of the stack that we could conceive of. We have been able to fix a lot of issues, most of them have probably never been triggered, but it was a sobering moment to look at some of those changes. The most likely cause for the errors was a change that was made (by me, as it turns out) over two years ago. And in all that time, we never seen neither hide nor hair of it. Until suddenly we got several simultaneous cases.

The third and final problem, by the way, was related to not crashing. By default, any severe enough error should cause us to shut down the database and restart it. In the process, we re-apply the journal and ensure that we are in a consistent state. The problem is that we don’t want to do that for certain expected errors. And the issue with staying up was that while Voron (at the storage layer) handled the error properly, the higher level code did not. At that point we had a divergence of the in memory state vs. the on disk state. That usually led to either NRE that would remain until the server was restarted or really scary messages that would typically go away when we recovered the database and reloaded the in memory state from the on disk state.

In short, handling errors in the error handling code is tough.

The primary changes we made were on the transaction side, we made it validate its own state when called, so code that erroneously try to use a transaction that has already failed will error early. We have also added additional validation of operations in several key points. That would allow us to catch errors much more quickly and allow us to pinpoint exactly where things are breaking apart sooner.

The current state is that we pushed these changes to our production system and are running the usual battery of tests to prove that the changes are valid. We’ll also be adding more faults into the process to ensure that we exercise our error handling a lot more often.

time to read 9 min | 1610 words

image“If a tree falls in a forest and no one is around to hear it, does it make a sound?” is a well known philosophical statement. The technological equivalent of this is this story. We got a report that RavenDB was failing in the field. But the details around the failure were critical.

The failure happened on the field, literally. This is a system that is running an industrial robot using a custom ARM board. The failure would only happen on the robot on the field, would not reproduce on the user’s test environment or on our own systems. Initially, that was all the information that we had: “This particular robot works fine for a while, but as soon as there is a break, RavenDB dies and needs to be restarted”. That was the first time I run into a system that would crash when it went idle, instead of dying under load, I have to say.

My recommendation that they would just keep the robot busy at all time was shot down, but for a while, we were in the dark. It didn’t help that this was literally a custom ARM machine that we had no access to. We finally managed to figure out that the crash was some variant of SIGSEGV or SIGABRT. That was concerning. The ARM machine in question is running on 32 bits, and the worry was that our 32 bits code was somehow doing a read out of bound. This is a crash in production, so we allocated a couple of people to investigate and try to figure out what was going on.

We started by doing a review of all our 32 bits memory management code and in parallel attempted to reproduce this issue on a Raspberry Pi (the nearest machine we had to what was actually going on). We got a lucky break when we someone did manage to kill the RavenDB process on our own lab somehow. The exit code was 139 (Segmentation fault), but we weren’t sure what was actually going on. We were trying all sort of stuff on the machine, seeing what would cause this. We basically fed it all sorts of data that we had laying around and saw if it would choke on that. One particular data export would sometimes cause a crash. Sometimes. I really really hate this word. That meant that we were stuck with trying to figure out something by repeatedly trying and relying on the law of averages.

It took several more days, but we figured out that a certain sequence of operations would reliably cause a crash within 5 – 30 minutes. As you can imagine, this made debugging pretty hard. The same sequence of operations on Intel machines, either 32 bits or 64 bits worked without issue, regardless of many times we have repeated them.

We followed several false trails with our investigation into RavenDB’s memory management’s code in 32 bits. We had a few cases where we thought that we had something, but nothing popped up. We have instrumented the code and verified that everything seemed kosher, and it certainly did, but the system still crashed on occasion.

RavenDB usually relies on mmap() to access the data on disk, but on 32 bits, we couldn’t do that. With an addressable memory of just 2 GB, we cannot map the whole file to memory if it is too large. Because of that, we map portions of the file to memory as needed for each transaction. That led us to suspect that we were somehow unmapping memory while it was still in use or something like that. But we have gone through the code with a fine tooth comb and got nothing. We used strace to try to help point out what is going on and we could see that there were no surprise calls to unmap() that shouldn’t be there.

What was really nasty was the fact that when we failed with SIGSEGV, the error was always on an address just past the area of memory that we mapped. This lead us to suspect that we had an out of boundary write and led to a chase for that rouge pointer operation. We instrumented our code ever more heavily, but weren’t able to find any such operation. All our reads and writes were in bound, and that was incredibly frustrating. RavenDB is a CoreCLR application. As such, debugging it on an ARM device is… challenging. We tried lldb and gdb. Both allow unmanaged debugging, but even with lldb, we couldn’t debug managed code or even just pull the managed stack properly from ARM. Eventually we found this extension which allow to do SSH debugging on the Raspberry PI from a Windows machine.

That helped, and we finally figured out where in our managed code the error happened. This always happened during a copy of memory from a document write to a scratch buffer in a memory mapped file. The entire thing was wrapped in boundary checks and everything was good.

We went back to the drawing board and attempted to set it on fire, because it was no good for us. Once we put the fire out, we looked at what remained and had an Eureka! moment. Once of the differences between ARM and x86/x64 machines is in how they treat alignment. In x64/x86 alignment is pretty much a non issue for most operations. On ARM, however, an unaligned operation will cause a CPU fault. That led us to suspect that the SIGABRT error we got was indeed an alignment issue. Most of our code is already aligned on memory because while it isn’t mandatory on x64/x86, it can still get better perf in certain cases, but it is certainly possible that we missed it.

We discovered a horrifying problem:

image

We were using the CopyBlock method, and obviously that was the issue, right? We wrote a small test program that simulated what we were doing and used unaligned CopyBlock and it just worked. But maybe our situation is different?

Using CopyBlockUnaligned on x86 led to a 40% performance drop (we call this method a lot) and initially it looked like it fixed the problem on ARM. Except that on the third or forth attempts to reproduce the problem, we run into our good old SIGSEGV again, so that wasn’t it. This time we went to the drawing board and broke it.

During this time, we have managed to capture the error inside the debugger several times, here is what this looked like:

image

Reading ARM assembly is not something that I’m used to do, so I looked at the manual, and it looks like this instruction is to store multiple registers in descending order and… no clue beyond that. It didn’t make any sort of sense to us.

At this point, we were several weeks and four or five people into this investigation (we consider such issues serious). We have instrumented our code to the point where it barely run, we could manage to reproduce the error in a relatively short time and we were fairly convinced that we were doing things properly. Going over the kernel code for memory mapping and unmapping several times, stracing, debugging, everything. We were stumped. But we also had enough data at this point to be able to point a fairly clear picture of what was going on. So we opened an issue for the CoreCLR about this, suspecting that the issue is in the implementation of this CopyBlockUnaligned.

We got a strange response, though: “This assembly code doesn’t make any sense”. I did mention that I have no idea about ARM assembly, right?  We tried reproducing the same thing in gdb, instead of lldb and got the following assembly code:

image

This looked a lot more readable, to be sure. And it was extremely suspicious. Let me explain why:

image

The faulting instruction is: ldr r3, [r0, #0]

What this says is basically, read a word from the address pointed to by r0 (with 0 offset) into r3.

Now, r0 in this case has this value: 0x523b3ffd. Note the last three characters, ffd.

We are running this on a 32 bits machine, so a word is 4 bytes in side. FFD+4 = 1001

In other words, we had a read beyond the current page boundary. In most cases, the next page is mapped, so everything goes smoothly. In some cases, the next page is not mapped, so you are going to get an access violation trying to read a byte from the next page.

The fix for this is here:

image

This is literally a single character change. And probably the worst lines of codes / time invested that I have ever seen for any bug. Actually, there wasn’t even any code change in RavenDB’s codebase, so that is 0 lines of code / 4 people x 4 weeks.

The good thing is that at least we have proven that the 32 bits memory code is rock solid, and we have a lot better understanding on how to resolve the next issue.

time to read 6 min | 1019 words

imageAs part of our 4.0 deployment, we have a cluster made of mixed machines, some running Windows, some running Linux. In one particular configuration, we have 2 Windows services and a single Linux machine. We setup the cluster and started directing production traffic to it, and all was good. We left for the weekend and came back the following week, to see what the results were.  Our Linux machine was effectively down. It was not responding to queries and it seemed like it was non responsive.

That was good and bad. It was good because the rest of the cluster just worked, and there was no interruption in service. This is as designed, but it is always nice to see this in real life. It is bad because it isn’t supposed to be happening. What was really frustrating was that we were able to log into the machine and there was nothing wrong there.

No high CPU or memory, no outstanding I/O or anything of the kind that we are used to.

The first clue for us was trying to diagnose the problem from inside the machine, where we able to reproduce it by trying to access the server from the local machine, resulting in the exact same issue as observed externally. Namely, the server would accept the connection and just leave it hanging, eventually timing out.

That was a interesting discovery, since it meant that we can rule out anything in the middle. This is a problem in this machine. But the problem remained quite hard to figure out. We deployed to production in a manner similar to what we expect our users will do, so we used Let’s Encrypt as the certificate authority with auto generated certificates.

So we started by seeing where the problem is, whatever this is on the TCP side or the SSL side, we issued the following command:

openssl s_client -connect b.cluster-name.dbs.local.ravendb.net:443

This command showed immediate connection to the server and the client sending the ClientHello properly, but then just hanging there. What was really interesting is that if we waited about 2 minutes, that SSL connection would complete successfully. But we couldn’t figure out any reason why this would be the case. It occurred to me that it might be related to the system handling of reverse DNS lookup. The two minutes timeout was very suspicious, and I assumed that it might be trying to lookup the client certificate and somehow resolve that. That isn’t how it works in general, although the fact that some SSH (and not SSL/TLS) configuration directly relate to this has led us in a merry chase.

Eventually we pulled strace and looked into what is actually going on. We focused on the following calls:

sudo strace -fp 1017 -s 128 -Tfe open,socket,connect,sendto,recvfrom,write

The interesting bits from there are shown here:

As you can see, we are looking at some DNS resolution, as we can tell from the /etc/resolv.conf and /etc/hosts open() calls. Then we have a connect() to 172.31.0.2 which is an AWS name server. Note that this is done over UDP, as you can see from the SOCK_DGRAM option in the preceding socket() call.

We are getting some data back, and we can see identrust there. And then we see something really suspicious. We have a TCP socket call that goes to 192.35.177.64 on port 80. In other words, this is an HTTP call. What does an HTTP call is doing in the middle of an SSL handshake?

As it turned out, our firewall configuration was blocking outbound connections to port 80. We tested removing that rule and everything came back online and the server was running just fine.

Further inspection revealed that we were calling to: http://apps.identrust.com/roots/dstrootcax3.p7c

And this is where things started to jell together. We are using Let’s Encrypt certificates, and in order to ensure trust, we need to send the full chain to the user. SSL Certificates has the notion of Authority Information Access, which is basically a URL that is registered in the certificate that points to where you can find the certificate that signed this one.

Why is this using HTTP? Because the data that will be fetched is already signed, and it is not a secret. And trying to use HTTPS to fetch it might get us into a loop.

So whenever we had a new SSL connection, we’ll try to connect to IdenTrust to get the full chain to send to the client. The killer here is that if we fail to do so, we’ll send the certificate chain we have (without the missing root), but it will work, since the other side already have this root installed (usually). On Windows, this certificate is installed, so we didn’t see it. On Linux, we didn’t have that certificate installed, so we had to look it up every single time.

The gory details, including dives into the source code are in the GitHub issue. And I do think they are gory. In this case, once we realized what was going on we were able to take steps to handle this. We needed to pre-register the entire chain on the local machine, so any lookup will be able to find it locally, and not do a network call per each SSL connection.

But beyond mutating the global certificate store, there is no real way to prevent that remote call.

Note that this is also true for Windows, although that seems to be implemented much deeper in the stack, and not in managed code, so I wasn’t able to trace where this is actually happening. The really bad thing here is that from the outside, there is no way for us to control or disable this, so this is just something that you have to remember to do when you use certificates, make sure that the entire chain is registered on each machine, otherwise you might have a remote call per connection, or a very long (and synchronous!) hang until the request times out if you are blocking outgoing access.

time to read 10 min | 1883 words

In terms of severity, there are very few things that we take more seriously than data integrity. In fact, the only thing that pops to mind as higher priority are security issues. A user reported an error when using a pre-release 4.0 database that certainly looked like data corruption, so we were very concerned when we go the report, and quite happy about the actual error. If this is strange, let me explain.

Storage bugs are nasty. I suggest reading this article to understand how tricky these can be. The article talks about memory allocators (even though it calls them storage) but the same rules apply. And the most important rule from this article?

WHEN A MEMORY DAMAGE BUG IS OBSERVED, IT TAKES PRIORITY OVER ALL OTHER BUG FIXES, ENHANCEMENTS, OR ANY OTHER DEVELOPMENT ACTIVITY.  ALL DEVELOPMENT CEASES UNTIL IT IS FOUND.

You can read the article for the full reasoning why, but basically is about being able to  reproduce and fix the bug and not make it “go away” with a hammer approach. We do the same with data corruption. One of our developers stops doing anything else and investigate just that, as a top priority issue. Because we take this so seriously, we have built several layers of defense in depth into RavenDB.

All the data is signed and we compare hashed when reading from disk to validate that it hasn’t been modified. This also help us catch an enormous amount of problems with storage devices and react to them early. There are other checks that are being run to verify the integrity of the system, from debug asserts to walking the structure of the data and verifying its correctness.

In this case, analysis of the data the user provided showed that we were failing the hash validation, which should usually only happen if there is a physical file corruption. While we were rooting for that (since this would mean no issues with our code), we also looked into the error in detail. What we found was that we were somehow starting to read a document from the middle, instead of the beginning. Somehow we managed to mess up the document offset and that caused us to think that the document was corrupted.

At this point, we had a confirmed data corruption issue, since obviously we shouldn’t lose track of where we put the documents. We pulled another developer into this, to try to reproduce the behavior independently while checking if would salvage the user’s data from the corrupted files. This deserve some explanation. We don’t assume that our software is perfect, so we took steps in advanced. The hashing the data and validating it is one such step, but another is build, upfront, the recovery tools for when the inevitable happens. That meant that the way we lay out the data on disk was designed, upfront and deliberately, to allow us to recover the data in the case of corruption.

Admittedly, I was mostly thinking about corruption of the data as a result of physical failure, but the way we lay out the data on disk also protect us from errors in record keeping such as this one. This meant that we were able to extract the data out and recover everything for the user.

At this time, we had a few people trying to analyze the issue and attempting to reproduce it. The problem with trying to figure out this sort of issue from the resulting file is that by the time you have found the error, this is too late, the data is already corrupted and you have been operating in a silent bad state for a while, until it finally got to the point this become visible.

8011098[3]We had the first break in the investigation when we managed to reproduce this issue locally on a new database. That was great, because it allowed us to rule out some possible issues related to upgrading from an earlier version, which was one of the directions we looked at. The bad part was that this was reproduced mostly by the developer in question repeatedly hitting the keyboard with his head in frustration. So we didn’t have a known way to reproduce this.

Yes, I know that animated GIFs are annoying, so was this bug, I need a way to share the pain. At one point we got something that could reliably generate an error, it was on the 213th write to the system. Didn’t matter what write, but the 213th write will always produce an error. There is nothing magical about 213, by the way, I remember this value because we tried so very hard to figure out what was magical about it.

At this point we had four or five developers working on this (we needed a lot of heads banging on keyboards to reproduce this error). The code has been analyzed over and over. We found a few places where we could have detected the data corruption earlier, because it violated invariants and we didn’t check for that. That was the first real break we had. Because that allowed us to catch the error earlier, which let to less head banging before the problem could be reproduced. The problem was that we always caught it too late, we kept going backward in the code, each time really excited that we are going to be able to figure out what was going on there and realizing that the invariants this code relied on were already broken.

Because these are invariants, we didn’t check them, they couldn’t possibly be broken. That sounds bad, because obviously you need to validate your input and output, right? Allow me to demonstrate a sample of a very simple storage system:

 

There isn’t anything wrong with the code here at first glance, but look at the Remove method, and now at this code that uses the storage:

The problem we have here is not  with the code in the Remove or the GetUserEmail method, instead, the problem is that the caller did something that it wasn’t supposed to, and we proceeded on the assumption that everything is okay.

The end result is that the _byName index contained a reference to a deleted document, and calling GetUserEmail will throw a null reference exception. The user visible problem is the exception, but the problem was actually caused much earlier. The invariant that we violating could have been caught in the Remove method, though, if we did something like this:

These sort of changes allow us to get earlier and earlier to the original location where the problem first occurred. Eventually we were able to figure out that a particular pattern of writes would put the internal index inside RavenDB into a funny state, in particular, here is how this looks like from the inside.

image

What you see here is the internal structure of the tree inside RavenDB used to map between documents etags and their location on the disk. In this case, we managed to get into a case where we would be deleting the last item from a page that is the leftmost page in a tree that has 3 or more levels and whose parent is the rightmost page in the grandparent and is less than 25% full while the sibling to its left is completely full.

In this case, during rebalancing operation, we were forgetting to reset the downward references and ended up messing up the sort order of the tree. That worked fine, most of the time, but it would slowly poison our behavior, as we made binary searches on data that was supposed to be sorted but wasn’t.

Timeline (note, despite the title, this is pre released software and this is not a production system, the timeline reflects this):

  • T-9 day, first notice of this issue in the mailing list. Database size exceed 400GB. Back and forth with the user on figuring out exactly what is going on, validating the issue is indeed corruption and getting the data.
  • T-6 days, we start detailed analysis of the data in parallel to verifying that we can recover the data.
  • T-5 days, user has the data back and can resume working normally, investigation proceeds.
  • T-4 days, we have managed to reproduced this on our own system, no idea how yet.
  • T-3 days, head banging on keyboards, adding invariants validations and checks everywhere we can think of.
  • T-2 days, managed to trap the root cause of the issue, tests added, pruning investigation code for inclusion in product for earlier detection of faults.
  • Issue fixed
  • T – this blog post is written Smile.
  • T + 3 days, code for detecting this error and automatically resolving this is added to the next release.

For reference, here is the fix:

image

The last change in the area in question happened two years ago, by your truly, so this is a pretty stable part of the code.

In retrospect, there are few really good things that we learned from this.

  • In a real world situation, we were able to use the recovery tools we built and get the user back up in a short amount of time. We also found several issues with the recovery tool itself, mostly the fact that its default logging format was verbose, which on a 400GB database means an enormous amount of logs that slowed down the process.
  • No data was lost, and these kinds of issues wouldn’t be able to cross a machine boundary so a second replica would have been able to proceed.
  • Early error detection was able to find the issue, investment with hashing and validating the data paid off handsomely here. More work was done around making the code more paranoid, not for the things that it is supposed to be responsible for but to ensure that other pieces of the code are not violating invariants.
  • The use of internal debug and visualization tools (such as the one above, showing the structure of the internal low level tree) was really helpful with resolving the issue.
  • We focused too much on the actual error that we got from the system (the hash check that failed), one of the things we should have done is to verify the integrity of the whole database at the start, which would have led us to figure out what the problem was much earlier. Instead, we suspected the wrong root cause all along all the way to the end. We assumed that the issue was because of modifications to the size of the documents, increasing and decreasing them in a particular pattern to cause a specific fragmentation issue that was the root cause of the failure. It wasn’t, but we were misled about it for a while because that was the way we were able to reproduce this eventually. It turned out that the pattern of writes (to which documents) was critical here, not the size of the documents.

Overall, we spent over a lot of time on figuring out what the problem was and the fix was two lines of code. I wrote this post independently of this investigation, but it hit the nail straight on.

time to read 2 min | 253 words

A customer complained that every now and then RavenDB is hitting 100% CPU and stays there. They were kind enough to provide a minidump, and I started the investigation.

I loaded the minidump to WinDB and started debugging. The first thing you do with high CPU is rung the “!runaway” command, which sorts the threads by how busy they are:

image

I switched to the first thread (39) and asked for its stack, I highlighted the interesting parts:

image

This is enough to have a strong suspicion on what is going on. I checked some of the other high CPU threads and my suspicion was confirmed, but even from this single stack trace it is enough.

Pretty much whenever you see a thread doing high CPU within the Dictionary class it means that you are accessing it in a concurrent manner. This is unsafe, and may lead to strange effects. One of them being an infinite loop.

In this case, several threads were caught in this infinite loop. The stack trace also told us where in RavenDB we are doing this, and from there we could confirm that indeed, there is a rare set of circumstances that can cause a timer to fire fast enough that the previous timer didn’t have a chance to complete, and both of these timers will modify the same dictionary, causing the issue.

time to read 7 min | 1231 words

johnny-automatic-lame-fox-300pxThis isn’t actually about a production system, per se. This is about a performance regression we noticed in a scenario for 4.0. We recently started doing structured performance testing as part of our continuous integration system.

There goes the sad trombone sound, because what that told us is that at some point, we had a major performance regression. To give you that in numbers, we used to be around 50,000 writes per second (transactional, fully ACID, hit the disk, data is safe) and we dropped to around 35,000 writes per second. Fluctuations in performance tests are always annoying, but that big a drop couldn’t be explained by just random load pattern causing some variance in the results.

This was consistent and major difference between the previous record. Now, to explain things, the last firm numbers we had about this performance test was from Dec 2016. Part of the release process for 4.0 calls for a structured and repeatable performance testing, and this is exactly the kind of thing that it is meant to catch. So I’m happy we caught it, not so happy that we had to go through about 7 months of changes to figure out what exactly drove our performance down.

The really annoying thing is that we put a lot of effort into improving performance, and we couldn’t figure out what went wrong there. Luckily, we could scan the history and track our performance over time. I’m extremely over simplifying here, of course. In practice this meant running bisect on all the changes in the past 7 – 8  months and running benchmarks on each point. And performance is usually not something as cut and dried as a pass / fail test.

Eventually we found the commit, and it was small, self contained and had a huge impact on our performance. The code in question?

ThreadPool.SetMinThreads(MinThreads, MinThreads);

This change was part of ironically, a performance run that focused on increasing the responsiveness of the server under a specific load. More specifically, it handled the case where under a harsh load we would run into thread starvation. The problem with this load is that this specific load can only ever happen in our test suite. At some point we had run up to 80 parallel tests, each of which might spawn multiple servers. The problem is that we run them all a single process, so we run into a lot of sharing between the different tests.

That was actually quite intentional, because it allowed us to see how the code behaves in a bad environment. It exposed a lot of subtle issues, but eventually it got very tiring to have tests fail because there just weren’t enough threads to run them fast enough. So we set this value to a high enough number to not have to worry about this.

The problem is that we set it in the server, not in the tests. And that led to the performance issue. But the question is why would this setting cause such a big drop? Well, the actual difference in performance went from 20 μs to taking 28.5 μs, which isn’t that much when you think about it. The reason why this happened is a bit complex, and require understanding several different aspect of the scenario under test.

In this particular performance scenario, we are testing what happens when a lot of clients are writing to the database all at once. Basically, this is a 100% small writes scenario. Out aim here is to get sustained throughput and latency throughout the scenario. Given that we ensure that every write to RavenDB actually hit the disk and is transactional, that means that every single request needs to physically hit the disk. If we tried doing that for each individual request, that would lead to about 100 requests a second on a good disk, if that.

In order to avoid the cost of having to go to the physical disk, we do a lot of batching. Basically, we read the request from the network, do some basic processing and then send it to a queue where it will hit the disk along with all the other concurrent requests. We make heavy use of async to allow us to consume as little system resources as possible. This means that at any given point, we have a lot of requests that are in flight, waiting for the disk to acknowledge the write. This is pretty much the perfect system for high throughput server, because we don’t have to hold up a thread.

So why would increasing the amount of minimum threads hurt us so much?

To answer that we need to understand what it is exactly that SetMinThreads control. The .NET thread pool is a lazy one, even if you told it that the minimum number of threads is 1000, it will do nothing with this value. Instead, it will just accept work and distribute it among its threads. So far, so good. The problem starts when we have more work at hand then threads to process it. At this point, the thread pool need to make a decision, it can either wait for one of the existing threads to become available of it can create a new thread to  handle the new load.

This is when SetMinThreads come into play. Up until the point the thread pool created enough threads to exceed the min count, whenever there is more work in the pool then there are threads, a new thread will be created. If the minimum number of threads has been reached, the thread pool will wait a bit before adding new threads. This allow the existing threads time to finish whatever they were doing and take work from the thread pool. The idea is that given a fairly constant workload, the thread pool will eventually adjust to the right amount of threads to ensure that there isn’t any work that is sitting in the queue for too long.

With the SetMinThreads change, however, we increased the minimum number of threads significantly, and that caused the thread pool to create a lot more threads then it would otherwise create. Because our workload is perfect for the kind of tasks that the thread pool is meant to handle (short, do something and schedule more work then yield) it is actually very rare that the thread pool would need to create an additional thread, since a thread will be available soon enough.

This is not an accident, we very careful set out to build it in this fashion, of course.

The problem is that with the high number of minimum threads, it was very easy to get the thread pool into a situation where it had more work then threads. Even though we would have a free thread in just a moment, the thread pool would create a new thread at this point (because we told it to do so).

That lead to a lot of threads being created, and that lead to a lot of context switches and thread contentions as the threads are fighting over processing the incoming requests. In effect, that would take a significant amount of the actual compute time that we have for processing the request. By removing the SetMinThreads calls, we reduced the number of threads, and were able to get much higher performance.

time to read 2 min | 326 words

imageA customer called us with a problem, for the most part, RavenDB was very well behaved, but they were worried about the 99.99% percentile of request duration. While the 99.9% was excellent (around a few milliseconds), the 99.99% was measured in seconds.

Usually, this is a particularly slow request that happens, and we can handle that by figuring out what is slow on that particular request and fix it. Common causes for such an issue is a request that returns a lot of unnecessary data, such as large full documents, when it needs just a few fields.

In this case, however, our metrics told us that the problem was pretty widespread. There wasn’t a particular slow request, rather, at what appeared to be random times, certain requests would be slow. We also realized that it wasn’t a particular request that was slow, but all requests within a given time period.

That hinted quite strongly at the problem, it was very likely that this is a GC issue that caused a long pause. We still had to do some work to rule out other factors, such as I/O / noisy neighbors, but we narrowed down on GC as being the root cause.

The problem in this case was that the customer was running multiple databases on the same RavenDB process. And each of them was fairly large. The total amount of memory that the RavenDB process was using was around 60GB of managed memory. At that level, anything that would cause a GC can cause a significant pause and impact operations.

The solution in this case was to break that into multiple separate processes, one for each database. In this manner, we didn’t have a single managed heap that the GC had to traverse, each heap was much smaller, and the GC pause times were both greatly reduced and spaced much further apart.

time to read 3 min | 522 words

1329757286-300pxWe got a puzzle. A particular customer was seeing very high latency in certain operations on a fairly frequent basis. The problem is that when this happened, the server was practically idle, serving around 5% of the usual requests/sec. What was even stranger was that during the times when we had reached the peek of requests/seconds, we’ll see no such slowdowns. The behavior was annoyingly consistent, we’ll see no slowdown at all during high load, but after a period of relatively light load, the server would appear to choke.

That one took a lot of time to figure out, because it was so strange. The immediate cause was pretty simple to figure out, the server was busy paging into memory a lot of data, but why would it need to do this? The server was just sitting there doing nothing much, but it was thrashing like crazy, and that was affecting the entire system.

I’ll spare you the investigation, because it was mostly grunting and frustration, but the sequence of events as we pieced it together was something like this:

  • The system is making heavy use of caching, with a cache duration set to 15 minutes or so. Most pages would hit the cache first, and if there was a miss, generate it and save it back. The cached documents was setup with the RavenDB expiration bundle.
  • During periods of high activity, we’ll typically have very few cache expiration (because we kept using the cached data) and we’ll fill up the cache quite heavily (the cache db was around 100GB or so).
  • That would work just fine and rapidly be able to serve a high number of requests.
  • And then came the idle period…
  • During that time, we had other work (by a different process) going on in the server, which we believe would give the OS reason to page the now unused memory to disk.

So far, everything goes on as predicted, but then something happens. The expiration timer is hit, and we now have a lot of items that need to be expired. RavenDB expiration is coarse, and it runs every few minutes, so each run we had a lot of stuff to delete. Most of it was on disk, and we needed to access all of it so we can delete it. And that caused us to trash, affecting the overall server performance.

As long as we were active, we wouldn’t expire so much at once, and we had a lot more of the db in memory, so the problem wasn’t apparent.

The solution was to remove the expiration usage and handle the cache invalidation in the client, when you fetched a cached value, you checked its age, and then you can apply a policy decision if you wanted to update it or not. This actually turned out to be a great feature in general for that particular customer, since they had a lot of data that can effectively be cached for much longer periods, and that gave them the ability to express that policy.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Searching through text (3):
    17 Oct 2019 - Part III, Managing posting lists
  2. re (22):
    19 Aug 2019 - The Order of the JSON, AKA–irresponsible assumptions and blind spots
  3. Design exercise (6):
    01 Aug 2019 - Complex data aggregation with RavenDB
  4. Reviewing mimalloc (2):
    22 Jul 2019 - Part II
  5. Production postmortem (26):
    07 Jun 2019 - Printer out of paper and the RavenDB hang
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats