The saga of the race condition re-fix

This is a story about a bug, RavenDB-1280, to be exact. It is really bad when we start with a race condition bug, and it only gets worse from there.

The story begins with the following index:

public class EmailIndex : AbstractIndexCreationTask<EmailDocument, EmailIndexDoc>
{
    public EmailIndex()
    {
        Map =
            emails => from email in emails
                    let text = LoadDocument<EmailText>(email.Id + "/text")                 
                    select new
                            {
                                email.To,
                                email.From,
                                email.Subject,
                                Body = text == null ? null : text.Body
                            };
    }
}

So far, this is a pretty standard index using LoadDocument. The LoadDocument feature allows us to also index data from related documents, and the database will ensure that when the loaded document is changed, the requesting document will also be re-indexed.
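For example, with this index in place we can find emails by their body text, even though the body lives in a separate document. Something along these lines (a sketch; the ids, values, and exact query shape are incidental and depend on the client version):

using (var session = documentStore.OpenSession())
{
    // Body was indexed from the related emails/{id}/text document,
    // but the query still returns the email documents themselves.
    var emails = session.Query<EmailIndexDoc, EmailIndex>()
                        .Where(x => x.Body == "the text we expect in the body")
                        .As<EmailDocument>()
                        .ToList();
}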

What does that re-indexing promise mean? It means that when we change the text of an email, we will also re-index that email’s document. Internally, we store the relationship in something that looks like this (using a relational model because it is probably easier to follow for most of you):

CREATE TABLE DocumentReferences
(
   Source,
   Reference
)

So in the case above, we will have (Source: emails/1 and Reference: emails/1/text).

Whenever we modify a document, we check the stored references for matches. If we were using a relational database, the code would be something like:

foreach ( var src in (SELECT Source FROM DocumentReferences WHERE Reference = @ref))
{
   TouchDocument(src);
}

So far, so good. This ensures that we always keep the index up to date. Obviously, this also means that the indexing process needs to update the references as well, which brings us to the problematic part:

Parallel.For(0, iterations, i =>
{
    using (var session = documentStore.OpenSession())
    {
        session.Store(new EmailDocument { Id = "Emails/" + i, To = "root@localhost", From = "nobody@localhost", Subject = "Subject" + i });
        session.SaveChanges();
    }

    using (var session = documentStore.OpenSession())
    {
        session.Store(new EmailText { Id = "Emails/" + i + "/text", Body = "MessageBody" + i });
        session.SaveChanges();
    }
});

This code will result in an interesting edge case. Because we are using two different sessions (and thus two different transactions), it is possible for the index to pick up the emails/1 document and start indexing it, but not pick up the emails/1/text document (it hasn’t been saved yet).

However, during the save of the emails/1/text document, there wouldn’t be anything in the references storage, so we won’t know that we need to re-index emails/1. The result is that we violated our promise to re-index emails/1 when its related document changed.
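To make the race concrete, this is roughly the interleaving that bites us (simplified, the actual indexing pipeline has more moving parts):

// Client tx A:   PUT emails/1                                   -- commits
// Indexing tx:   reads emails/1,
//                LoadDocument("emails/1/text") returns null
// Client tx B:   PUT emails/1/text,
//                SELECT Source FROM DocumentReferences
//                   WHERE Reference = 'emails/1/text'           -- no rows, nothing is touched
// Indexing tx:   commits; the index entry for emails/1 has Body = null
//
// Nothing is left that will ever cause emails/1 to be re-indexed, so Body stays null.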

As I said, just setting up the problem requires parallel thinking and a severe headache. Luckily, we had a great bug report from Barry Hagan, which included an occasionally failing test. (Even just writing “occasionally failing” causes me to throw up a little bit).

After we identified the problem, we tried to solve it by holding an in-memory list of documents that were modified, and that were also required (and missing) during indexing via LoadDocument. Don’t bother trying to follow that statement; it is complex, and it is hard, and we did it. And it worked, except that it didn’t. It just moved the race condition from the entire process to the edges of the process. We had two guys sitting on this for a couple of days, in danger of tearing their hair out, with no luck narrowing it down.

Eventually I sat down and tried to figure out how to deal with it. I was able to build on their work and narrow down exactly where the race condition now lived. Just thinking about solving it there was way too hard: it would require multiple locks and pretty strange behavior overall. I wasn’t happy with the implications, and that code had already created quite a lot of complications as it was.

So I started by reverting all of that code and had a clean slate to work with. Then I sat down and thought about it, and finally figured out a better way to do it. The problem was only relevant when we had a missing reference (an existing reference would properly trigger re-indexing under all conditions). So I started with that: I gathered up all of the missing references and stored them in a db task. A db task is RavenDB’s notion of a transactionally safe way of registering work to run in another transaction. So after the indexing transaction was done, we would then go and execute that task.
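In code, the shape of this looks roughly like the sketch below; every name in it is invented for illustration, and the real RavenDB internals are organized differently:

public class MissingReferenceCollector
{
    // Key: the document id that LoadDocument failed to find.
    // Value: the source documents that asked for it (and therefore indexed a null).
    private readonly Dictionary<string, HashSet<string>> missing =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

    // Called whenever a map function's LoadDocument comes back empty.
    public void RecordMissing(string sourceId, string referencedId)
    {
        HashSet<string> sources;
        if (missing.TryGetValue(referencedId, out sources) == false)
            missing[referencedId] = sources = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        sources.Add(sourceId);
    }

    // Called at the end of the indexing batch, still inside the indexing transaction,
    // so the task commits (or rolls back) together with the index results.
    public void RegisterTask(ITaskStorage tasks)
    {
        if (missing.Count == 0)
            return;
        tasks.Add(new TouchMissingReferenceDocumentsTask { Missing = missing });
    }
}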

The idea here is that doing it this way prevents us from having to deal with any issues around “what if the server crashes midway”, etc. It also means that this task runs in a separate transaction, after the indexing transaction is done. The only thing the task does is force re-indexing of all the documents that tried to call LoadDocument but got null as a result. Because we are now in a separate transaction, there are only two options (sketched in code after the list):

  • The related document is still not there, in which case we are merely doing some extra work.
  • The related document is there, and will be indexed.
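Continuing the same invented-names sketch, the task itself is just a touch loop that runs in its own transaction, after the indexing transaction has committed:

public class TouchMissingReferenceDocumentsTask
{
    public Dictionary<string, HashSet<string>> Missing;

    public void Execute(DocumentDatabase database)
    {
        foreach (var sources in Missing.Values)
        {
            foreach (var sourceId in sources)
            {
                // Touching bumps the document so the indexer will pick it up again.
                // If the related document still isn't there, this is merely wasted work;
                // if it showed up in the meantime, the re-index will now see it.
                database.TouchDocument(sourceId);
            }
        }
    }
}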

The first case deserves some more attention: even if the document that was missing shows up while we are re-indexing, we don’t care; we already set up the references and committed them, so its arrival would force yet another re-indexing, and in the end, everyone will be happy.

I wrote this blog post while running a stress test on this feature, and it took long enough that I am sure it works properly. So I’ll call it a day and go do something more fun, maybe 3.0 work.