Invisible race conditionsThe cache has poisoned us

time to read 2 min | 262 words

imageWe got a memory corruption error one of those days that was quite interesting. It was in a place where we previous fixed a memory corruption error and was, at a glance, quite impossible.

The code would checkout an item from the cache and increment its ref count, which will keep it alive for as long as we were using it. But something made it fail, and quite horribly, too. We finally tracked the code down to this piece of code, which is run when we update the cache:

When the ref count goes to zero, we’ll release the memory, and _items is a Concurrent Dictionary.

Do you see the error?

The AddOrUpdate method will call the updateValueFactory when it needs to update a value, but it makes no promises with regards to its atomicity. In other words, if you have two threads calling this method, the update lambda will be called twice with the same item, resulting in early release of the value and hence memory corruption.

This can be seen here:

As you can see, we are looking at a loop that may be executed several times, as such the updateValueFactory can be called several times, and the only guarantee we have is that after the method has returned, the last value we were called with was the value that was in the cache and we replaced.

Here is the fix:

That was quite hard to figure out, because at a glance, this looks just fine.

More posts in "Invisible race conditions" series:

  1. (02 Jan 2018) The cache has poisoned us
  2. (27 Dec 2017) The sometimes failing script
  3. (26 Dec 2017) The async query