Bug chasing, narrowing down the scope

time to read 9 min | 1666 words

I just completed a major refactoring of a piece of code inside RavenDB that is responsible for how we manage sorted queries. The first two tiers of tests all passed, which is great. Now was the time to test how this change performed. I threw 50M records into RavenDB and indexed them. I did… not like the numbers I got back. It makes sense, since I was heavily refactoring to get a particular structure, I could think of a few ways to improve performance, but I like doing this based on profiler output.

When running the same scenario under the profiler, the process crashed. That is… quite annoying, as you can imagine. In fact, I discovered a really startling issue.

If I index the data and query on it, I get the results I expect. If I restart the process and run the same query, I get an ExecutionEngineException. Trying to debug those is a PITA. In this case, I’m 100% at fault, we are doing a lot of unsafe things to get better performance, and it appears that I messed up something along the way. But my only reproduction is a 50M records dataset. To give some context, this means 51GB of documents to be indexed and 18 GB of indexing. Indexing this in release mode takes about 20 minutes. In debug mode, it takes a lot longer.

Trying to find an error there, especially one that can only happen after you restart the process is going to be a challenging task. But this isn’t my first rodeo. Part of good system design is knowing how to address just these sorts of issues.

The indexing process inside RavenDB is single-threaded per index. That means that we can rule out a huge chunk of issues around race conditions. It also means that we can play certain tricks. Allow me to present you with the nicest tool for debugging that you can imagine: repeatable traces.

Here is what this looks like in terms of code:

In this case, you can see that this is a development only feature, so it is really bare-bones one. What it does is capture the indexing and commit operations on the system and write them to a file. I have another piece of similarly trivial code that reads and applies them, as shown below. Don’t bother to dig into that, the code itself isn’t really that interesting. What is important is that I have captured the behavior of the system and can now replay it at will.

The code itself isn’t much, but it does the job. What is more important, note that we have calls to StopDatabase() and StartDatabase(), I was able to reproduce the crash using this code.

That was a massive win, since it dropped my search area from 50M documents to merely 1.2 million.

The key aspect of this is that I now have a way to play around with things. In particular, instead of using the commit points in the trace, I can force a commit (and start / stop the database) every 10,000 items (by calling FlushIndexAndRenewWriteTransaction). When using that, I can reproduce this far faster. Here is the output when I run this in release mode:

1 With 0
2 With 10000
3 With 10000
4 With 10000
5 With 10000
6 With 10000
7 With 10000
8 With 10000
9 With 10000
10 With 10000
11 With 10000
Fatal error. Internal CLR error. (0x80131506)

So now I dropped the search area to 120,000 items, which is pretty awesome. Even more important, when I run this in debug mode, I get this:

1 With 0
2 With 10000
Process terminated. Assertion failed.
   at Voron.Data.Containers.Container.Get(Low...

So now I have a repro in 30,000 items, what is even better, a debug assertion was fired, so I have a really good lead into what is going on.

The key challenge in this bug is that it is probably triggered as a result of a commit and an index of the next batch. There is a bunch of work that we do around batch optimizations that likely cause this sort of behavior. By being able to capture the input to the process and play with the batch size, we were able to reduce the amount of work required to generate a reproduction from 50M records to 30,000 and have a lead into what is going on.

With that, I can now start applying more techniques to narrow down what is going on. But by far the most important aspect as far as I’m concerned is the feedback cycle. I can now hit F5 to run the code and encounter the problem in a few seconds.

image

It looks like we hit a debug assertion because we keep a reference to an item that was already freed. That is really interesting, and now I can find out which item and then figure out why this is the case. And at each point, I can simply go one step back in the investigation, and reproduce the state, I have to hit F5 and wait a bit. This means that I can be far more liberal in how I figure out this bug.

This is triggered by a query on the indexed data, and if I follow up the stack, I have:

image

This is really interesting, I wonder… what happens if I query before I restart the database? With this structure, this is easy to do.

image

This is actually a big relief. I had no idea why restarting the database would cause us to expose this bug.

Another thing to note is that when I ran into the problem, I reproduced this on a query that sorted on a single field. In the test code, I’m testing on all fields, so that may be an asset in exposing this faster.

Right now the problem reproduces on the id field, which is unique. That helps, because it removes a large swath of code that deals with multiple terms for an entry. The current stage is that I can now reproduce this issue without running the queries, and I know exactly where it goes wrong.

image

And I can put a breakpoint on the exact location where this entry is created:

image

By the way, note that I’m modifying the code instead of using a conditional breakpoint. This is because of the performance difference. For a conditional breakpoint, the debugger has to stop execution, evaluate the condition and decide what to do. If this is run a lot, it can have a huge impact on performance. Easier to modify the code. The fact that I can do that and hit F5 and get to the same state allows me to have a lot more freedom in the ergonomics of how I work.

This allows me to discover that the entry in question was created during the second transaction. But the failure happens during the third, which is really interesting. More to the point, it means that I can now do this:

image

With the idea that this will trigger the assert on the exact entry that cause the problem. This is a good idea, and I wish that it worked, but we are actually doing a non-trivial amount of work during the commit process, so now we have a negative feedback and another clue. This is happening in the commit phase of the indexing process. It’s not a big loss, I can do the same in the commit process as well. I have done just that and now I know that I have a problem when indexing the term: “tweets/1212163952102137856”. Which leads to this code:

image

And at this point, I can now single step through this and figure out what is going on, I hope.

When working on complex data structures, one of the things that you need to do is to allow to visualize them. Being able to manually inspect the internal structure of your data structures can save you a lot of debugging. As I mentioned, this isn’t my first rodeo. So when I narrowed it down to a specific location, I started looking into exactly what is going on.

Beforehand, I need to explain a couple of terms (pun intended):

  • tweets/1212163952102137856 – this is the entry that triggers the error.
  • tweets/1212163846623727616 – this is the term that should be returned for 1679560

Here is what the structure looks like at the time of the insert:

image

You can notice that the value here for the last page is the same as the one that we are checking for 1679560.

To explain what is going on will take us down a pretty complex path that you probably don’t care about, but the situation is that we are keeping track of the id in two locations. Making sure to add and remove it in both locations as appropriate. However, at certain points, we may decide to shuffle things around inside the tree, and we didn’t sync that up properly with the rest of the system, leading to a dangling reference.

Now that I know what is going on, I can figure out how to fix it. But the story of this post was mostly about how I figured it out, not the bug itself.

The key aspect was to get to the point where I can reproduce this easily, so I can repeat it as many times that is needed to slowly inch closer to the solution.