Feature intersection bugs are the hardest to predict
This post is the story of RavenDB-6230, or as it is more commonly known as: “Creating auto-index on non-existent field breaks querying via Id”. It isn’t a big or important bug, and it has very little real world impact. But it is an interesting story because it shows one of the hardest things that we deal with, not an issue with a specific feature, but the behavior of the system as a whole, especially when we have multiple things that may affect the end result.
Let us see what the bug actually is. We have the following query:
And this works and gets you the right results. Under the cover, it will select the appropriate index (and create one if it isn’t there) to query and get the right results.
Now, issue the following query on a non existent field:
And issue the first query again. You’ll get no results.
That is a bit surprising, I’ll admit, but it makes absolute sense when you break it up to component parts.
First, a query by id can be answered by any index, but if you don’t have an index, one will be created for you. In this case, it will be on the “__document_id” field, since we explicitly queried on that. This isn’t typically done explicitly, which is important to understand the bug.
Then, we have another query, on another field, so we generate an index on that as well. We do that while taking into account the historical behavior of the queries on the server. However, we ignore the “__document_id” field because all indexes already contain it, so it is superfluous. That means that we have an index with one explicit field (Nice_Doggy_Nice, in this case) and an implicit one (__document_Id). Which works great, even though in one case it is implicit and the other explicit, there is no actual difference in how we treat them.
So what is the problem? The problem is that the field Nice_Doggy_Nice doesn’t actually exists. So when it comes the time to actually index documents using this index, we read the document and index that, but find that we have nothing to index. At this point, we have only a single field to index, just the document id, but as it is an implicit field, and we have nothing else, we skip indexing that document entirely. The example I used in the office is that you can’t get an answer for “when was the last time you had given birth” if you ask a male (except Schwarzenegger, in which case the answer is 1994).
So far, it all makes sense. But we need to introduce another feature into the mix. The RavenDB query optimizer.
That component is responsible for routing dynamic queries to the most relevant index, and it is doing that with the idea that we should direct work to the biggest index around, because that would make it the most active, at which point we can retire the smaller indexes (which are superfluous once the new wider index is up to date).
All features are working as intended so far. The problem is that the query optimizer indeed selects the wider index, and it is an index that has filtered all the results, so the query by id returns nothing.
Everything works as designed, and yet the user is surprised.
In practice, there are several mitigating factors here. The only way you can get this issue is if you never made any queries on any other (valid) fields on the documents in question. If you have made even a single such query, you’ll not be able to reproduce it. So you have to really work hard at it to get it to fail. But the point isn’t so much the actual bug, but pointing how how multiple unrelated behaviors can combine to cause a bit of a problem.
Highly recommended reading: How Complex Systems Fail – Cook 2000
Comments
"Everything works as designed, and yet the user is surprised."
Was the design: "creating an index on an non existing field should make an query by Id return no results" ? I sure hope not. It is definitely a bug, even if it an obscure one, it's by the book definition of a bug. "All features are working as intended so far" Except querying by ID (when previously an index on a non existing field was created). So no, not all features are working as intended, and any user would be right to be surprised.
Pop Catalin, All the features that are actually features are working as intended, there is no single feature here that is implemented wrong. It is the confluence of multiple stuff, which weren't foreseen), that caused the problem.
"At this point, we have only a single field to index, just the document id, but as it is an implicit field, and we have nothing else, we skip indexing that document entirely" Why does the Optimizer choose an index that skipped documents during indexing, when you're selecting by ID? I think this is a conceptual bug.
Pop Catalin, The optimizer has no idea what the index is actually doing. The skipping is done by the index itself and handled as an internal matter.
Does this mean I could easily get wrong results when selecting by ID due to the optimizer selecting and Index that skipped documents 2, 6 ,8 or another index that skipped documents 1, 3, 8 when selecting by ID?
Pop Catalin, No. Queries by id are rare and very strange. And indexes only skip the document if all the fields are not there, it is enough that a single field will be present for this to work.
And, we fixed the issue.
Comment preview