Robustness fixes are hard

time to read 2 min | 311 words

The following is a fix we did to resolve a production crash of RavenDB. Take a look at the code, and consider what kind of error we are trying to handle here.

Hint: The logger class can never throw.

image

The underlying issue was simple. We run out of memory, which is an expected occurrence, and is handled by the very function that we are looking at above.

However, under low memory conditions, allocations can fail. In the code above, the allocation of the log string statement failed, which threw an error. This caused an exception to escape a thread boundary and kill the entire process.

Moving the log statement to the inside of the try statement allows us to recover from it, attempt to report the error, and release any currently held memory and attempt to reduce our memory utilization.

This particular error is annoying. A string allocation will always allocate, but even if you run out of actual memory, such allocations will often succeed because it can be served out of the already existing GC heap,without the need to trigger actual allocation from the OS. This is just a reminder that anything that can go wrong will, and with just the right set of circumstances to cause us pain.

I’ll use this opportunity to recommend, once again, reading How Complex Systems Fail. Even in this small example, you can see that it takes multiple separate things to align just right for an error to actually happen. You have to have low memory and the GC heap should be empty and only then you’ll get the actual issue. Low memory without the GC heap being full, and the code works as intended. GC heap being full and no low memory, no problemo. Smile