Dedicated operations road bypasses

time to read 2 min | 398 words

We care very deeply about the operations side of RavenDB. Support calls are almost never about “where are you? I want to send you some wine & roses”, and they tend to come at unpleasant timing. One of the things that we had learnt was that when stuff breaks, it tend to do so in ways that are… interesting.

Let me tell you a story…

A long time ago, a customer was using an index definition that relied on the presence of a custom assembly to enrich the API available for indexing. During the upgrade process from one major version of RavenDB to the next, they didn’t take into account that they need to also update the customer assembly.

When they tried to start RavenDB, it failed because of the version mismatch, since they weren’t actually using that index anyway. The customer then removed the assembly, and started RavenDB again. At this point, the following sequence of events happened:

  • The database started, saw that it is using an old version of the storage format, and converted to the current version of the storage format.
  • The database started to load the indexes, but the index definition was invalid without the customer assembly, so it failed. (Index definitions are validated at save time, so the code didn’t double check that at the time).

The customer was now stuck, the database format was already converted, so in order to rollback, they would need to restore from backup. They could also not remove the index from the database, because the database wouldn’t start to let them do so. Catch 22.

At this point, the admin went into the IndexDefinitions directory, and deleted the BadIndex.index-definition file, and restarted RavenDB again. The database then recognized that the index definition is missing, but the index exists, deleted the index from the server, and run happily ever after.

Operations road bypass is our terminology for giving administrators a path to changing internal state in our system using standard tools, without requiring the system to be up and functioning. The example with the index definition is a good one, because the sole reason we keep the index definition on disk is to allow administrators the ability to touch them without needing RavenDB in a healthy state.

What do you do in your system to make it possible for the admin to recover from impossible situations?