Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969


Production postmortem: data corruption, a view from INSIDE the sausage

time to read 10 min | 1883 words

In terms of severity, there are very few things that we take more seriously than data integrity. In fact, the only thing that comes to mind as a higher priority is security issues. A user reported an error when using a pre-release 4.0 database that certainly looked like data corruption, so we were very concerned when we got the report, and quite happy about the actual error. If this sounds strange, let me explain.

Storage bugs are nasty. I suggest reading this article to understand how tricky these can be. The article talks about memory allocators (even though it calls them storage) but the same rules apply. And the most important rule from this article?

WHEN A MEMORY DAMAGE BUG IS OBSERVED, IT TAKES PRIORITY OVER ALL OTHER BUG FIXES, ENHANCEMENTS, OR ANY OTHER DEVELOPMENT ACTIVITY.  ALL DEVELOPMENT CEASES UNTIL IT IS FOUND.

You can read the article for the full reasoning, but basically it is about being able to reproduce and fix the bug, not make it “go away” with a hammer approach. We do the same with data corruption. One of our developers stops doing anything else and investigates just that, as a top-priority issue. Because we take this so seriously, we have built several layers of defense in depth into RavenDB.

All the data is signed, and we compare hashes when reading from disk to validate that it hasn’t been modified. This also helps us catch an enormous number of problems with storage devices and react to them early. There are other checks that run to verify the integrity of the system, from debug asserts to walking the structure of the data and verifying its correctness.
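The post doesn’t show RavenDB’s actual on-disk format or hash function, so the following is only a minimal sketch of the validate-on-read idea; SHA-256 and every name here are illustrative, not RavenDB’s real scheme:

    using System;
    using System.IO;
    using System.Linq;
    using System.Security.Cryptography;

    public static class VerifiedBlocks
    {
        // Prepend a hash when writing; recompute and compare when reading.
        public static byte[] Pack(byte[] data)
        {
            using (var sha = SHA256.Create())
            {
                var hash = sha.ComputeHash(data);               // 32 bytes
                var block = new byte[hash.Length + data.Length];
                Buffer.BlockCopy(hash, 0, block, 0, hash.Length);
                Buffer.BlockCopy(data, 0, block, hash.Length, data.Length);
                return block;                                   // this is what goes to disk
            }
        }

        public static byte[] Unpack(byte[] block)
        {
            using (var sha = SHA256.Create())
            {
                var expected = block.Take(32).ToArray();
                var data = block.Skip(32).ToArray();
                if (sha.ComputeHash(data).SequenceEqual(expected) == false)
                    throw new InvalidDataException("Hash mismatch - block was modified or corrupted on disk");
                return data;
            }
        }
    }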

In this case, analysis of the data the user provided showed that we were failing the hash validation, which should usually only happen if there is a physical file corruption. While we were rooting for that (since this would mean no issues with our code), we also looked into the error in detail. What we found was that we were somehow starting to read a document from the middle, instead of the beginning. Somehow we managed to mess up the document offset and that caused us to think that the document was corrupted.

At this point, we had a confirmed data corruption issue, since obviously we shouldn’t lose track of where we put the documents. We pulled another developer into this, to try to reproduce the behavior independently while checking whether we could salvage the user’s data from the corrupted files. This deserves some explanation. We don’t assume that our software is perfect, so we took steps in advance. Hashing the data and validating it is one such step, but another is building, upfront, the recovery tools for when the inevitable happens. That meant that the way we lay out the data on disk was designed, upfront and deliberately, to allow us to recover the data in the case of corruption.

Admittedly, I was mostly thinking about corruption of the data as a result of physical failure, but the way we lay out the data on disk also protects us from errors in record keeping such as this one. This meant that we were able to extract the data and recover everything for the user.

At this time, we had a few people trying to analyze the issue and attempting to reproduce it. The problem with trying to figure out this sort of issue from the resulting file is that by the time you have found the error, it is too late: the data is already corrupted, and you have been operating in a silently bad state for a while, until it finally got to the point where this became visible.

[animated GIF]

We had the first break in the investigation when we managed to reproduce this issue locally, on a new database. That was great, because it allowed us to rule out some possible issues related to upgrading from an earlier version, which was one of the directions we looked at. The bad part was that this was reproduced mostly by the developer in question repeatedly hitting the keyboard with his head in frustration. So we didn’t have a known way to reproduce this.

Yes, I know that animated GIFs are annoying, but so was this bug; I needed a way to share the pain. At one point we got something that could reliably generate an error: it was on the 213th write to the system. It didn’t matter what the write was, but the 213th write would always produce an error. There is nothing magical about 213, by the way; I remember the value because we tried so very hard to figure out what was magical about it.

At this point we had four or five developers working on this (we needed a lot of heads banging on keyboards to reproduce this error). The code was analyzed over and over. We found a few places where we could have detected the data corruption earlier, because it violated invariants that we didn’t check for. That was the first real break we had, because it allowed us to catch the error earlier, which led to less head banging before the problem could be reproduced. The problem was that we always caught it too late; we kept going backward in the code, each time really excited that we were going to figure out what was going on, only to realize that the invariants this code relied on were already broken.

Because these are invariants, we didn’t check them; they couldn’t possibly be broken. That sounds bad, because obviously you need to validate your input and output, right? Allow me to demonstrate a sample of a very simple storage system:
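The code from the original post was embedded as an image and isn’t preserved here; a reconstruction of the kind of toy storage system being described might look like this (every name is a guess at the shape of the example, not the original code):

    using System;
    using System.Collections.Generic;

    public class User
    {
        public long Id;
        public string Name;
        public string Email;
    }

    public class SimpleStorage
    {
        private readonly Dictionary<long, User> _documents = new Dictionary<long, User>();
        private readonly Dictionary<string, long> _byName = new Dictionary<string, long>();

        public void Put(User user)
        {
            _documents[user.Id] = user;
            _byName[user.Name] = user.Id;
        }

        // Implicit contract: callers clear any index entries before removing
        // the document itself. Nothing here verifies that they actually did.
        public void Remove(long id)
        {
            _documents.Remove(id);
        }

        public void RemoveFromNameIndex(string name)
        {
            _byName.Remove(name);
        }

        public string GetUserEmail(string name)
        {
            if (_byName.TryGetValue(name, out var id) == false)
                return null;

            _documents.TryGetValue(id, out var user);
            return user.Email; // NullReferenceException if the document is gone
        }
    }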

 

There isn’t anything wrong with the code here at first glance, but look at the Remove method, and now at this code that uses the storage:

The problem here is not with the code in the Remove or GetUserEmail methods; instead, the problem is that the caller did something it wasn’t supposed to, and we proceeded on the assumption that everything was okay.

The end result is that the _byName index contained a reference to a deleted document, and calling GetUserEmail will throw a null reference exception. The user-visible problem is the exception, but the problem was actually caused much earlier. The invariant that was violated could have been caught in the Remove method, though, if we did something like this:
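Again as a sketch against the hypothetical storage above (not the original code), the change is simply to assert the invariant at the point where it can still be checked cheaply:

    public void Remove(long id)
    {
        if (_documents.TryGetValue(id, out var user) &&
            _byName.TryGetValue(user.Name, out var indexedId) &&
            indexedId == id)
        {
            // The caller was supposed to clear the index entry first.
            throw new InvalidOperationException(
                "Invariant violation: document " + id +
                " is being removed while the _byName index still references it");
        }

        _documents.Remove(id);
    }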

These sorts of changes allowed us to get closer and closer to the original location where the problem first occurred. Eventually we were able to figure out that a particular pattern of writes would put the internal index inside RavenDB into a funny state. In particular, here is how it looks from the inside.

[image: visualization of the internal tree structure]

What you see here is the internal structure of the tree inside RavenDB used to map between document etags and their locations on disk. In this case, we managed to get into a situation where we would be deleting the last item from a page that is the leftmost page in a tree with 3 or more levels, whose parent is the rightmost page in the grandparent and is less than 25% full, while the sibling to its left is completely full.

In this case, during the rebalancing operation, we were forgetting to reset the downward references and ended up messing up the sort order of the tree. That worked fine most of the time, but it would slowly poison our behavior, as we ran binary searches on data that was supposed to be sorted but wasn’t.

Timeline (note: despite the title, this is pre-release software and not a production system; the timeline reflects this):

  • T-9 days: first notice of this issue on the mailing list. Database size exceeds 400GB. Back and forth with the user to figure out exactly what is going on, validate that the issue is indeed corruption, and get the data.
  • T-6 days: we start detailed analysis of the data, in parallel with verifying that we can recover it.
  • T-5 days: the user has the data back and can resume working normally; the investigation proceeds.
  • T-4 days: we have managed to reproduce this on our own systems, with no idea how yet.
  • T-3 days: head banging on keyboards, adding invariant validations and checks everywhere we can think of.
  • T-2 days: managed to trap the root cause of the issue; tests added; pruning the investigation code for inclusion in the product, for earlier detection of faults.
  • Issue fixed.
  • T: this blog post is written.
  • T+3 days: code for detecting this error and automatically resolving it is added to the next release.

For reference, here is the fix:

[image: the two-line fix in the rebalancing code]

The last change in the area in question happened two years ago, by yours truly, so this is a pretty stable part of the code.

In retrospect, there are a few really good things that we learned from this.

  • In a real-world situation, we were able to use the recovery tools we had built and get the user back up in a short amount of time. We also found several issues with the recovery tool itself, mostly the fact that its default logging format was verbose, which on a 400GB database meant an enormous amount of logs that slowed down the process.
  • No data was lost, and these kinds of issues wouldn’t be able to cross a machine boundary, so a second replica would have been able to proceed.
  • Early error detection was able to find the issue; the investment in hashing and validating the data paid off handsomely here. More work was done around making the code more paranoid, not about the things it is directly responsible for, but to ensure that other pieces of the code are not violating invariants.
  • The use of internal debug and visualization tools (such as the one above, showing the structure of the internal low-level tree) was really helpful in resolving the issue.
  • We focused too much on the actual error we got from the system (the hash check that failed). One of the things we should have done is verify the integrity of the whole database at the start, which would have led us to figure out what the problem was much earlier. Instead, we suspected the wrong root cause all along, all the way to the end. We assumed that the issue was caused by modifications to the size of the documents, increasing and decreasing them in a particular pattern to cause a specific fragmentation issue that was the root cause of the failure. It wasn’t, but we were misled for a while because that was how we were eventually able to reproduce this. It turned out that the pattern of writes (to which documents) was what mattered here, not the size of the documents.

Overall, we spent a lot of time figuring out what the problem was, and the fix was two lines of code. I wrote this post independently of this investigation, but it hit the nail right on the head.

Production postmortem: The random high CPU

time to read 2 min | 253 words

A customer complained that every now and then RavenDB is hitting 100% CPU and stays there. They were kind enough to provide a minidump, and I started the investigation.

I loaded the minidump into WinDbg and started debugging. The first thing you do with high CPU is run the “!runaway” command, which sorts the threads by how busy they are:

[image: !runaway output showing the busiest threads]

I switched to the first thread (39) and asked for its stack; I highlighted the interesting parts:

[image: stack trace of thread 39, with the interesting frames highlighted]

This is enough to form a strong suspicion about what is going on. I checked some of the other high-CPU threads and my suspicion was confirmed, but even this single stack trace is enough.

Pretty much whenever you see a thread burning CPU inside the Dictionary class, it means that you are accessing it concurrently. This is unsafe and may lead to strange effects, one of them being an infinite loop.

In this case, several threads were caught in this infinite loop. The stack trace also told us where in RavenDB we were doing this, and from there we could confirm that, indeed, there is a rare set of circumstances that can cause a timer to fire before the previous timer callback has had a chance to complete, with both callbacks modifying the same dictionary and causing the issue.
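The RavenDB code involved isn’t shown in the post; as a hypothetical repro of the failure mode, a System.Threading.Timer whose callback can overlap with itself while writing to a shared Dictionary looks roughly like this:

    using System;
    using System.Collections.Generic;
    using System.Threading;

    public static class OverlappingTimers
    {
        private static readonly Dictionary<int, DateTime> _lastSeen = new Dictionary<int, DateTime>();

        public static void Main()
        {
            // If a callback takes longer than the 10ms period, the next invocation
            // starts before the previous one finishes, and both mutate _lastSeen
            // with no synchronization - exactly the kind of concurrent Dictionary
            // access that can leave threads spinning at 100% CPU.
            var timer = new Timer(_ =>
            {
                for (var i = 0; i < 100000; i++)
                    _lastSeen[i % 1024] = DateTime.UtcNow;
            }, null, dueTime: 0, period: 10);

            Console.ReadLine();
            timer.Dispose();
        }
    }

In general, either making sure the callbacks cannot overlap (for example, rescheduling the timer only after the previous run completes) or moving the shared state to a thread-safe collection removes this kind of race.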

You are doing it REALLY wrong, the shortest code review ever

time to read 2 min | 394 words

We got a question from a customer that included the following code:

[image: the customer’s code snippet]
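The snippet itself isn’t preserved; based on the description that follows, its shape was roughly this (the method name and query are made up, the lock on the session is the point):

    public List<User> GetActiveUsers()
    {
        lock (_session)
        {
            return _session.Query<User>()
                .Where(u => u.IsActive)
                .ToList();
        }
    }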

This was quite unrelated to the question being asked, but it was enough to trigger all sorts of alarms. Just based on these two lines, and the fact that we can assume _session is a RavenDB session, we can tell quite a lot about the codebase.

To start with, it is broken, probably badly so.

Let us try to break down why. The fact that you are locking on the session means that you access it concurrently. And that is utterly forbidden. The session is not meant to be used concurrently, and Strange Stuff will happen if you try to.

Now, the session doesn’t have thread affinity, which means that if you are calling it serially, you can certainly move it between threads. In fact, with async sessions, that happens quite often. So one would expect that locking on the session would provide sufficient protection for the session from concurrent use.

However, nothing could be further from the truth. Look at the output of this method. It returns User instances, which are used outside the lock. If two threads call this method concurrently, they will each run in turn, taking and releasing the lock as required. Which is great, but it is very likely that they will get the same User instance back (remember, the session implements the Identity Map pattern and ensures that within the scope of the session, the same document is always represented by the same object instance). That is when things start going boom, boom, BOOM!

Even if you are locking on the session, you aren’t locking on your entities, and that means that you have concurrent access to them. I haven’t actually reviewed the entity in question, but I’m 99.999999% certain that no attempt was made to make them safe for multi-threading, and 100% certain that if there was such an attempt, it would fail horribly in some manner quite easily.

Entities are not meant for concurrent usage, and even trying to reason about them effectively is going to be near impossible.

A session is meant to be a scope of work; in fact, look up the Unit of Work pattern, which is exactly how you are expected to use the session.

Public Service Announcement: ConcurrentDictionary.Count is locking

time to read 2 min | 233 words

During a performance problem investigation, we discovered that the following innocent-looking code was killing our performance.

This is part of the code that allows users to subscribe to changes in the database using a WebSocket. Subscriptions are pretty rare, so we check that there aren’t any and skip all the work.

We had a bunch of code that ran on many threads and ended up calling this method. Since there are no subscribers, this should be very cheap, but it wasn’t. The problem was that the call to Count was locking, and that created a convoy that killed our performance.
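The exact snippet isn’t reproduced above; the pattern was essentially an early-exit check on a ConcurrentDictionary of subscribers, something like this (names are illustrative):

    using System.Collections.Concurrent;

    public class ChangesNotifier
    {
        private readonly ConcurrentDictionary<string, object> _subscribers =
            new ConcurrentDictionary<string, object>();

        public void NotifyDocumentChanged(string docId)
        {
            // Looks like a free early exit, but ConcurrentDictionary.Count takes
            // every internal lock. With many threads calling this concurrently,
            // that turns into a lock convoy.
            if (_subscribers.Count == 0)
                return;

            // ... push the change to each WebSocket subscriber ...
        }
    }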

We did some trawling through our code base and through the framework code and came up with the following conclusions.

ConcurrentQueue:

  • Clear locks
  • CopyTo, GetEnumerator, ToArray, and Count create a snapshot (consuming more memory)
  • TryPeek and IsEmpty are cheap

Here is a piece of problematic code; we are trying to keep the last ~25 items that we looked at:
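That snippet isn’t shown either; the shape of the problematic code is roughly this (hypothetical names):

    using System.Collections.Concurrent;

    public class RecentlyLookedAt
    {
        private readonly ConcurrentQueue<string> _recent = new ConcurrentQueue<string>();

        public void Record(string item)
        {
            _recent.Enqueue(item);

            // Per the breakdown above, every call to Count creates a snapshot of
            // the queue, so trimming like this on every write generates a steady
            // stream of snapshots and extra memory pressure.
            while (_recent.Count > 25)
            {
                _recent.TryDequeue(out _);
            }
        }
    }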

The problem is that this kind of code ensures that there will be a lot of snapshots and increases memory utilization.

ConcurrentDictionary:

  • Count, Clear, IsEmpty, and ToArray lock the entire thing
  • TryAdd, TryUpdate, and TryRemove lock the bucket for this entry
  • GetEnumerator does not lock
  • Keys and Values both lock the table and force the creation of a temporary collection

If you need to iterate over the keys of a concurrent dictionary, there are two options:
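The two snippets aren’t preserved; the comparison being drawn is essentially between these two loops:

    using System.Collections.Concurrent;

    public static class KeyIteration
    {
        // Option 1: the Keys property locks the whole table and materializes a
        // temporary collection of keys before you ever see the first one.
        public static void ViaKeysProperty(ConcurrentDictionary<string, int> dict)
        {
            foreach (var key in dict.Keys)
            {
                // ... use key ...
            }
        }

        // Option 2: enumerating the dictionary itself takes no locks; just
        // ignore the values.
        public static void ViaEnumerator(ConcurrentDictionary<string, int> dict)
        {
            foreach (var kvp in dict)
            {
                var key = kvp.Key;
                // ... use key ...
            }
        }
    }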

Iterating over the entire dictionary is much better than iterating over just the keys.

The innocuous code that tripped me

time to read 1 min | 155 words

When building a cache, I needed a way to generate a hash code from a query. A query is a complex object that has many properties. My first attempt to do so looked like this:
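The original code is an image that isn’t preserved; reconstructing the pattern from the description below (only WaitForNonStaleResultsTimeout comes from the post, the other members are made up), the first attempt may have looked something like this:

    using System;

    public class CachedQuery
    {
        public string Query;
        public int PageSize;
        public TimeSpan? WaitForNonStaleResultsTimeout;

        public override int GetHashCode()
        {
            unchecked
            {
                var hashCode = PageSize;
                hashCode = (hashCode * 397) ^ (Query != null ? Query.GetHashCode() : 0);
                // The line that looks fine but isn't:
                hashCode = hashCode * 397 ^ WaitForNonStaleResultsTimeout?.GetHashCode() ?? 0;
                return hashCode;
            }
        }
    }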

And it failed, disastrously. All queries always had the same hash code, and I couldn’t figure out why until I actually debugged through the code.

Here is a hint: this only happened if WaitForNonStaleResultsTimeout was null. It might be better to also look at the way the compiler sees this:
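With the parentheses the compiler actually infers made explicit (still using the reconstructed line from above):

    // ?? binds more loosely than ^, so the line really means:
    hashCode = (hashCode * 397 ^ WaitForNonStaleResultsTimeout?.GetHashCode()) ?? 0;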

Basically, if WaitForNonStaleResultsTimeout is null, the entire expression evaluates to null, so we return zero, losing all the previous hash values that we put in.

The solution was to parenthesize the expression, but this is something that passed a couple of rounds of scrutiny and didn’t show up at all, because it looked fine, and I don’t keep C#’s operator precedence rules in my head.
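For completeness, the parenthesized version (again using the reconstructed names from above):

    hashCode = (hashCode * 397) ^ (WaitForNonStaleResultsTimeout?.GetHashCode() ?? 0);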

Production postmortem: The case of 99.99% percentile

time to read 2 min | 326 words

A customer called us with a problem. For the most part, RavenDB was very well behaved, but they were worried about the 99.99% percentile of request duration. While the 99.9% was excellent (around a few milliseconds), the 99.99% was measured in seconds.

Usually this is caused by a particularly slow request, and we can handle it by figuring out what is slow in that particular request and fixing it. A common cause for such an issue is a request that returns a lot of unnecessary data, such as large full documents, when it needs just a few fields.

In this case, however, our metrics told us that the problem was pretty widespread. There wasn’t one particular slow request; rather, at what appeared to be random times, certain requests would be slow. We also realized that it wasn’t a particular request that was slow, but all requests within a given time period.

That hinted quite strongly at the problem: it was very likely a GC issue that caused a long pause. We still had to do some work to rule out other factors, such as I/O and noisy neighbors, but we narrowed it down to GC as the root cause.

The problem in this case was that the customer was running multiple databases in the same RavenDB process, and each of them was fairly large. The total amount of memory that the RavenDB process was using was around 60GB of managed memory. At that level, anything that triggers a GC can cause a significant pause and impact operations.

The solution in this case was to break that into multiple separate processes, one for each database. In this manner, we didn’t have a single managed heap that the GC had to traverse, each heap was much smaller, and the GC pause times were both greatly reduced and spaced much further apart.

Production postmortem: The lightly loaded thrashing server

time to read 3 min | 522 words

We got a puzzle. A particular customer was seeing very high latency in certain operations on a fairly frequent basis. The problem was that when this happened, the server was practically idle, serving around 5% of the usual requests/sec. What was even stranger was that during the times when we reached the peak requests/sec, we would see no such slowdowns. The behavior was annoyingly consistent: no slowdown at all during high load, but after a period of relatively light load, the server would appear to choke.

That one took a lot of time to figure out, because it was so strange. The immediate cause was pretty simple to find: the server was busy paging a lot of data into memory, but why would it need to do that? The server was just sitting there doing nothing much, but it was thrashing like crazy, and that was affecting the entire system.

I’ll spare you the investigation, because it was mostly grunting and frustration, but the sequence of events as we pieced it together was something like this:

  • The system is making heavy use of caching, with a cache duration set to 15 minutes or so. Most pages would hit the cache first and, on a miss, generate the page and save it back. The cached documents were set up with the RavenDB expiration bundle.
  • During periods of high activity, we would typically have very few cache expirations (because we kept using the cached data), and the cache would fill up quite heavily (the cache database was around 100GB or so).
  • That worked just fine, and the system could rapidly serve a high number of requests.
  • And then came the idle period…
  • During that time, other work (by a different process) was going on on the server, which we believe gave the OS reason to page the now-unused memory to disk.

So far, everything goes as predicted, but then something happens. The expiration timer is hit, and we now have a lot of items that need to be expired. RavenDB expiration is coarse, and it runs every few minutes, so each run had a lot of items to delete. Most of them were on disk, and we needed to access all of them in order to delete them. That caused us to thrash, affecting the overall server performance.

As long as we were active, we wouldn’t expire so much at once, and we had a lot more of the db in memory, so the problem wasn’t apparent.

The solution was to remove the expiration usage and handle cache invalidation in the client: when you fetch a cached value, you check its age and then apply a policy decision on whether to update it. This actually turned out to be a great feature in general for that particular customer, since they had a lot of data that could effectively be cached for much longer periods, and this gave them the ability to express that policy.
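A sketch of the kind of client-side policy being described (all names are hypothetical, and the real policy obviously depends on the data):

    using System;

    public class CachedEntry<T>
    {
        public T Value;
        public DateTime StoredAtUtc;
    }

    public static class CachePolicy
    {
        // Instead of server-side expiration, the client decides on every read
        // whether the cached value is still fresh enough for its purpose.
        public static bool ShouldRefresh<T>(CachedEntry<T> entry, TimeSpan maxAge)
        {
            if (entry == null)
                return true;
            return DateTime.UtcNow - entry.StoredAtUtc > maxAge;
        }
    }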

Bad bugs make for self-assigning issues

time to read 1 min | 144 words

One of our developers just added the following bug:

[image: the bug report as submitted]

This is in an area of the code that this particular developer does not regularly traverse*. The image above includes the full contents of the bug. And that caused me to immediately assign it back to its opener.

The problem? If you say that you got an error, include the error. In many cases, this can save a lot of time and guessing.

For an internal bug, where the person who opened it is available, we have a much lower bar for bug report quality. Most bugs are closed relatively quickly anyway. But a lower bar for bug report quality still means there is a bar.

* I started to say “not responsible for,” but we don’t have code ownership, so that wouldn’t have been right.

We won’t be fixing this race condition

time to read 2 min | 345 words

During the work on restoring backups, the developer in charge came up with the following problematic scenario:

  • Start restoring a backup of the Northwind database on node A, which can take quite some time for a large database.
  • Create a database named Northwind on node B while the restore is taking place.

The problem is that during the restore, the database doesn’t exist in a proper form in the cluster until the restore is done. During that time, if an administrator attempts to create a database with the same name, it will look like it is working, but it will actually create a new database on all the other nodes and fail on the node where the restore is going on.

When the restore completes, it will either remove the previously created database or join it and replicate the restored data to the rest of the nodes, depending on exactly when the restore and the new database creation happened.

Trying to resolve this issue would involve coordinating the restore process across the cluster. However, that also means that we would need to send heartbeats during the restore process (to the entire cluster), handle timeouts and recovery, and effectively take on a pretty big burden of complicated code. Indeed, the first draft of the fix for this issue suffered from the weakness that it would only work when running on a single node, and only worked in cluster mode in very specific cases.

In this case, it is a very rare scenario that requires an admin (not just a standard user) to do two things that you wouldn’t usually expect together, and the outcome is a bit confusing even if you manage it, but there isn’t any data loss.

The solution was to document that during the restore process you shouldn’t create a database with the same name, but should instead let RavenDB complete the restore and then let the database span additional nodes. That is a much simpler alternative than going into distributed-mode reasoning just for something that is an operator error in the first place.

Bug stories: The memory ownership in the timeout

time to read 3 min | 442 words

We are running a lot of tests right now on RavenDB, in all sorts of interesting configurations. Some of the more interesting results came from testing wildly heterogeneous systems. Put a node on a fast Windows machine, connect it to a couple of Raspberry Pis, a cheap Windows tablet over WiFi, and a slow Linux machine, and see how that kind of cluster handles high load.

This has turned up a number of bugs; the issue with the TCP read buffer corruption is one such example, but another is the reason for this post. In one of our test runs, the RavenDB process crashed with an invalid memory access. That was interesting to see. Tracking down the issue led us to the piece of code that handles incoming replication. In particular, the issue was possible if the following happened:

  • Node A is significantly slower than node B, primarily with regards to disk I/O.
  • Node B is being hit with enough load that it sends large requests to node A.
  • There is a node C that is also trying to replicate the same information (because it noticed that node A isn’t catching up fast enough and is trying to help).

The root cause was that we had a bit of code that looked like this:
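The actual code isn’t shown in the post; a sketch of the shape described below, with the transaction merger stood in for by a delegate and ArrayPool used as a generic buffer pool, might look like this:

    using System;
    using System.Buffers;
    using System.IO;
    using System.Threading;
    using System.Threading.Tasks;

    public static class IncomingReplicationSketch
    {
        // 'processBatch' stands in for handing the buffer to the transaction merger.
        public static async Task HandleBatchAsync(
            Stream network, Func<ArraySegment<byte>, Task> processBatch, CancellationToken token)
        {
            var buffer = ArrayPool<byte>.Shared.Rent(64 * 1024);
            try
            {
                var read = await network.ReadAsync(buffer, 0, buffer.Length, token);

                var merge = processBatch(new ArraySegment<byte>(buffer, 0, read));

                // If the merger is too loaded to answer in time, give up on the connection.
                if (await Task.WhenAny(merge, Task.Delay(TimeSpan.FromSeconds(30), token)) != merge)
                    throw new TimeoutException("Transaction merger did not respond in time");

                await merge; // surface any failure from the merger
            }
            finally
            {
                // The bug described below: on the timeout path the merger may still
                // be reading from 'buffer', yet it goes straight back to the pool,
                // where another connection can reuse and overwrite it.
                ArrayPool<byte>.Shared.Return(buffer);
            }
        }
    }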

Basically, we read the data from the network into a buffer, and then we hand it off to the transaction merger to run. However, if there is a lot of load on the server, it is possible that the transaction merger will not have a chance to return in time. We try to abort the connection here, since something is obviously wrong, and we do just that. The problem is that we sent a buffer to the transaction merger, and while it might not have gotten to processing our request yet (or hasn’t completed it, at least), there is no way for us to actually pull the request out (it might have already started executing, after all).

The code didn’t consider that, and what happened when we did get a timeout is that the buffer was returned to the pool, and if it was freed in time, we would get an access violation exception if we were lucky, or just garbage in the buffer (that we previously validated, so we didn’t check again) that would likely also cause a crash.

The solution was to wait for the task to complete, but ping the other host to let it know that we are still alive, and that the connection shouldn’t be aborted.
