On fixing a bug (and all its siblings) with a forward-looking view

We ran into a strange situation deep in the guts of RavenDB. A cluster command (the backbone of how RavenDB coordinates actions in a distributed cluster) failed because of an allocation failure. That is something we are ready for, since RavenDB is a robust system that handles such memory allocation failures. The problem was that this was a persistent allocation failure. Looking at the actual error explained what was going on. We allocate memory in units that are powers of two, and we had an allocation request that would overflow a 32-bit integer.
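To make the arithmetic concrete, here is a minimal sketch of how rounding a request up to a power of two can push it past a signed 32-bit integer. This is not RavenDB's allocator. The function names are made up for illustration, and I'm assuming the size is rounded up to the next power of two and stored in a signed 32-bit value.

```python
# Illustrative only: not RavenDB code. Assumes sizes are rounded up to the
# next power of two and held in a signed 32-bit integer.
INT32_MAX = 2**31 - 1


def next_power_of_two(size: int) -> int:
    """Round size up to the nearest power of two (minimum 1)."""
    if size <= 1:
        return 1
    return 1 << (size - 1).bit_length()


def fits_in_int32(size: int) -> bool:
    """True if the rounded-up allocation can be expressed as a signed int32."""
    return next_power_of_two(size) <= INT32_MAX


one_gib = 2**30
print(fits_in_int32(one_gib))      # exactly 1 GiB stays at 2**30
print(fits_in_int32(one_gib + 1))  # one byte more rounds up to 2**31
```

Under these assumptions, anything even slightly above 1 GiB rounds up to 2 GiB, which is one more than the largest signed 32-bit value. Retrying the same request can never succeed, which fits the failure being persistent.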

Let me reiterate that: we had a single cluster command that would need more memory than can fit in 32 bits. A cluster command isn’t strictly limited, but a 1MB cluster command is huge, as far as we are concerned. Seeing something that exceeds the GB mark was horrifying. The actual issue here was somewhere completely different: there was a bug that caused quadratic growth in the size of a database record. This post isn’t about that problem, it is about the fix.

We believe in defense in depth for such issues. So aside from fixing the actual cause of this problem, the question was how we could prevent similar issues in the future. We decided to place a reasonable size limit on cluster commands, and we chose 128MB as the limit (this is far higher than any expected value, mind). We chose that value because it is big enough to be outside anyone's actual usage, yet small enough that we can still increase it if we need to. That means this needs to be a configuration value, so the user can modify it in place if needed. The idea is that we’ll stop the generation of a command of this size before it hits the actual cluster and poisons it.

Which brings me to this piece of code, which was the reason for this blog post:

This is where we actually throw the error if we find a command that is too big (the check is done by the caller, not important here).

Looking at the code, it does what is needed, but it is missing a couple of really important features:

  • We mention the size of the command, but not the actual size limit.
  • We don’t mention that this isn’t a hard-coded limit.

The fix here would be to include both of those details in the message. The idea is that the user will not only be informed about what the problem is, but also be made aware of how they can fix it themselves. No need to contact support (and if support is called, we can tell right away what is going on).
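As a rough sketch of what such a message could look like, here is a small Python example. It is not the actual RavenDB code: the exception type, the function names, and the configuration key `Cluster.MaxCommandSizeInMb` are all hypothetical, invented for illustration. Only the 128MB default comes from this post.

```python
# Illustrative sketch only: every name here is hypothetical, including the
# configuration key. Only the 128MB default is taken from the post.
DEFAULT_MAX_COMMAND_SIZE = 128 * 1024 * 1024  # bytes
CONFIG_KEY = "Cluster.MaxCommandSizeInMb"     # hypothetical option name


class CommandTooLargeError(Exception):
    """Raised when a cluster command exceeds the configured size limit."""


def too_large_message(command_type: str, actual: int, limit: int) -> str:
    """Say what went wrong, what the limit is, and how to change it."""
    return (
        f"Cluster command '{command_type}' is {actual:,} bytes, which exceeds "
        f"the maximum allowed size of {limit:,} bytes. This limit is not hard "
        f"coded; it can be raised with the '{CONFIG_KEY}' configuration option."
    )


def ensure_command_size(command_type: str, actual: int,
                        limit: int = DEFAULT_MAX_COMMAND_SIZE) -> None:
    """Caller-side check: reject the command before it reaches the cluster."""
    if actual > limit:
        raise CommandTooLargeError(too_large_message(command_type, actual, limit))


try:
    ensure_command_size("ExampleCommand", 200 * 1024 * 1024)
except CommandTooLargeError as e:
    print(e)
```

The message carries three things: the offending size, the limit it was compared against, and the knob that changes the limit. Anyone reading the log, whether the user or support, can act on it without digging through the source.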

This idea, the notion that we should be quite explicit about not only what the problem is but also how to fix it, is very important to the overall design of RavenDB. It allows us to produce software that is self-supporting: instead of ErrorCode: 413, you get not only the full details, but also how you can fix it.

Admittedly, I fully expect to never ever hear about this issue again in my lifetime. But in case I’m wrong, we’ll be in a much better position to respond to it.