When the error is byzantine

time to read 3 min | 478 words

In distributed systems, the term Byzantine fault tolerance refers to working in an environment where the other nodes in the system are going to violate the invariants held by the system. Sometimes, that is because of a bug, sometimes because of a hardware issue (Figure 11) and sometimes that is a malicious action.

A user called us to let us know about a serious issue, they have a 100 documents in their database, but the index reports that there are 105 documents indexed. That was… puzzling. None of the avenues of investigation we tried help us. There wasn’t any fanout on the index, for example.

That was… strange.

We looked at the index in more details and we noticed something really strange. There were documents ids that were in the index that weren’t in the database, and they were all at the end. So we had something like:

  • users/1 – users/100 – in the index
  • users/101 – users/105 – not in the index

What was even stranger was that the values for the documents that were already in the index didn’t match the values from the documents. For that matter, the last indexed etag, which is how RavenDB knows what documents to index.

Overall, this is a really strange thing, and none of that is expected to happen. We asked the user for more details, and it turns out that they don’t have much. The error was reported in the field, and as they described their deployment scenario, we were able to figure out what was going on.

On certain cases, their end users will want to “reset” the system. The manner in which they do that is to shut down their application and then delete the RavenDB folder. Since all their state is in RavenDB, that brings them back up at a clean state. Everything works, and this is a documented manner in which they are working.

However, the manner in which the end user will delete is done as: “Delete the database files”, the actual task is done by the end user.

A RavenDB database internally is composed of a few directories, for data and indexes. What would happen in the case where you deleted just the database file, but kept the index files? In that case, RavenDB will just accept that this is the case (soft delete is a a thing, so we have to accept that) and open the index file. That index file, however, came from a different database, so a lot of invariants are broken. I’m actually surprised that it took this long to find out that this is problematic, to be honest.

We explained the issue, and then we spent some time ensuring that this will never be an invisible error. Instead, we will validate that the index is coming from the right source, or error explicitly. This is one bug that we won’t have to hunt ever again.