In terms of severity, there are very few things that we take more seriously than data integrity. In fact, the only thing that pops to mind as higher priority are security issues. A user reported an error when using a pre-release 4.0 database that certainly looked like data corruption, so we were very concerned when we go the report, and quite happy about the actual error. If this is strange, let me explain.
Storage bugs are nasty. I suggest reading this article to understand how tricky these can be. The article talks about memory allocators (even though it calls them storage) but the same rules apply. And the most important rule from this article?
WHEN A MEMORY DAMAGE BUG IS OBSERVED, IT TAKES PRIORITY OVER ALL OTHER BUG FIXES, ENHANCEMENTS, OR ANY OTHER DEVELOPMENT ACTIVITY. ALL DEVELOPMENT CEASES UNTIL IT IS FOUND.
You can read the article for the full reasoning why, but basically is about being able to reproduce and fix the bug and not make it “go away” with a hammer approach. We do the same with data corruption. One of our developers stops doing anything else and investigate just that, as a top priority issue. Because we take this so seriously, we have built several layers of defense in depth into RavenDB.
All the data is signed and we compare hashed when reading from disk to validate that it hasn’t been modified. This also help us catch an enormous amount of problems with storage devices and react to them early. There are other checks that are being run to verify the integrity of the system, from debug asserts to walking the structure of the data and verifying its correctness.
In this case, analysis of the data the user provided showed that we were failing the hash validation, which should usually only happen if there is a physical file corruption. While we were rooting for that (since this would mean no issues with our code), we also looked into the error in detail. What we found was that we were somehow starting to read a document from the middle, instead of the beginning. Somehow we managed to mess up the document offset and that caused us to think that the document was corrupted.
At this point, we had a confirmed data corruption issue, since obviously we shouldn’t lose track of where we put the documents. We pulled another developer into this, to try to reproduce the behavior independently while checking if would salvage the user’s data from the corrupted files. This deserve some explanation. We don’t assume that our software is perfect, so we took steps in advanced. The hashing the data and validating it is one such step, but another is build, upfront, the recovery tools for when the inevitable happens. That meant that the way we lay out the data on disk was designed, upfront and deliberately, to allow us to recover the data in the case of corruption.
Admittedly, I was mostly thinking about corruption of the data as a result of physical failure, but the way we lay out the data on disk also protect us from errors in record keeping such as this one. This meant that we were able to extract the data out and recover everything for the user.
At this time, we had a few people trying to analyze the issue and attempting to reproduce it. The problem with trying to figure out this sort of issue from the resulting file is that by the time you have found the error, this is too late, the data is already corrupted and you have been operating in a silent bad state for a while, until it finally got to the point this become visible.
We had the first break in the investigation when we managed to reproduce this issue locally on a new database. That was great, because it allowed us to rule out some possible issues related to upgrading from an earlier version, which was one of the directions we looked at. The bad part was that this was reproduced mostly by the developer in question repeatedly hitting the keyboard with his head in frustration. So we didn’t have a known way to reproduce this.
Yes, I know that animated GIFs are annoying, so was this bug, I need a way to share the pain. At one point we got something that could reliably generate an error, it was on the 213th write to the system. Didn’t matter what write, but the 213th write will always produce an error. There is nothing magical about 213, by the way, I remember this value because we tried so very hard to figure out what was magical about it.
At this point we had four or five developers working on this (we needed a lot of heads banging on keyboards to reproduce this error). The code has been analyzed over and over. We found a few places where we could have detected the data corruption earlier, because it violated invariants and we didn’t check for that. That was the first real break we had. Because that allowed us to catch the error earlier, which let to less head banging before the problem could be reproduced. The problem was that we always caught it too late, we kept going backward in the code, each time really excited that we are going to be able to figure out what was going on there and realizing that the invariants this code relied on were already broken.
Because these are invariants, we didn’t check them, they couldn’t possibly be broken. That sounds bad, because obviously you need to validate your input and output, right? Allow me to demonstrate a sample of a very simple storage system:
There isn’t anything wrong with the code here at first glance, but look at the Remove method, and now at this code that uses the storage:
The problem we have here is not with the code in the Remove or the GetUserEmail method, instead, the problem is that the caller did something that it wasn’t supposed to, and we proceeded on the assumption that everything is okay.
The end result is that the _byName index contained a reference to a deleted document, and calling GetUserEmail will throw a null reference exception. The user visible problem is the exception, but the problem was actually caused much earlier. The invariant that we violating could have been caught in the Remove method, though, if we did something like this:
These sort of changes allow us to get earlier and earlier to the original location where the problem first occurred. Eventually we were able to figure out that a particular pattern of writes would put the internal index inside RavenDB into a funny state, in particular, here is how this looks like from the inside.
What you see here is the internal structure of the tree inside RavenDB used to map between documents etags and their location on the disk. In this case, we managed to get into a case where we would be deleting the last item from a page that is the leftmost page in a tree that has 3 or more levels and whose parent is the rightmost page in the grandparent and is less than 25% full while the sibling to its left is completely full.
In this case, during rebalancing operation, we were forgetting to reset the downward references and ended up messing up the sort order of the tree. That worked fine, most of the time, but it would slowly poison our behavior, as we made binary searches on data that was supposed to be sorted but wasn’t.
Timeline (note, despite the title, this is pre released software and this is not a production system, the timeline reflects this):
- T-9 day, first notice of this issue in the mailing list. Database size exceed 400GB. Back and forth with the user on figuring out exactly what is going on, validating the issue is indeed corruption and getting the data.
- T-6 days, we start detailed analysis of the data in parallel to verifying that we can recover the data.
- T-5 days, user has the data back and can resume working normally, investigation proceeds.
- T-4 days, we have managed to reproduced this on our own system, no idea how yet.
- T-3 days, head banging on keyboards, adding invariants validations and checks everywhere we can think of.
- T-2 days, managed to trap the root cause of the issue, tests added, pruning investigation code for inclusion in product for earlier detection of faults.
- Issue fixed
- T – this blog post is written .
- T + 3 days, code for detecting this error and automatically resolving this is added to the next release.
For reference, here is the fix:
The last change in the area in question happened two years ago, by your truly, so this is a pretty stable part of the code.
In retrospect, there are few really good things that we learned from this.
- In a real world situation, we were able to use the recovery tools we built and get the user back up in a short amount of time. We also found several issues with the recovery tool itself, mostly the fact that its default logging format was verbose, which on a 400GB database means an enormous amount of logs that slowed down the process.
- No data was lost, and these kinds of issues wouldn’t be able to cross a machine boundary so a second replica would have been able to proceed.
- Early error detection was able to find the issue, investment with hashing and validating the data paid off handsomely here. More work was done around making the code more paranoid, not for the things that it is supposed to be responsible for but to ensure that other pieces of the code are not violating invariants.
- The use of internal debug and visualization tools (such as the one above, showing the structure of the internal low level tree) was really helpful with resolving the issue.
- We focused too much on the actual error that we got from the system (the hash check that failed), one of the things we should have done is to verify the integrity of the whole database at the start, which would have led us to figure out what the problem was much earlier. Instead, we suspected the wrong root cause all along all the way to the end. We assumed that the issue was because of modifications to the size of the documents, increasing and decreasing them in a particular pattern to cause a specific fragmentation issue that was the root cause of the failure. It wasn’t, but we were misled about it for a while because that was the way we were able to reproduce this eventually. It turned out that the pattern of writes (to which documents) was critical here, not the size of the documents.
Overall, we spent over a lot of time on figuring out what the problem was and the fix was two lines of code. I wrote this post independently of this investigation, but it hit the nail straight on.
More posts in "Production postmortem" series:
- (23 Nov 2018) The ARM is killing me
- (22 Feb 2018) The unavailable Linux server
- (06 Dec 2017) data corruption, a view from INSIDE the sausage
- (01 Dec 2017) The random high CPU
- (07 Aug 2017) 30% boost with a single line change
- (04 Aug 2017) The case of 99.99% percentile
- (02 Aug 2017) The lightly loaded trashing server
- (23 Aug 2016) The insidious cost of managed memory
- (05 Feb 2016) A null reference in our abstraction
- (27 Jan 2016) The Razor Suicide
- (13 Nov 2015) The case of the “it is slow on that machine (only)”
- (21 Oct 2015) The case of the slow index rebuild
- (22 Sep 2015) The case of the Unicode Poo
- (03 Sep 2015) The industry at large
- (01 Sep 2015) The case of the lying configuration file
- (31 Aug 2015) The case of the memory eater and high load
- (14 Aug 2015) The case of the man in the middle
- (05 Aug 2015) Reading the errors
- (29 Jul 2015) The evil licensing code
- (23 Jul 2015) The case of the native memory leak
- (16 Jul 2015) The case of the intransigent new database
- (13 Jul 2015) The case of the hung over server
- (09 Jul 2015) The case of the infected cluster