Invisible race conditionsThe cache has poisoned us
We got a memory corruption error one of those days that was quite interesting. It was in a place where we previous fixed a memory corruption error and was, at a glance, quite impossible.
The code would checkout an item from the cache and increment its ref count, which will keep it alive for as long as we were using it. But something made it fail, and quite horribly, too. We finally tracked the code down to this piece of code, which is run when we update the cache:
When the ref count goes to zero, we’ll release the memory, and _items is a Concurrent Dictionary.
Do you see the error?
The AddOrUpdate method will call the updateValueFactory when it needs to update a value, but it makes no promises with regards to its atomicity. In other words, if you have two threads calling this method, the update lambda will be called twice with the same item, resulting in early release of the value and hence memory corruption.
This can be seen here:
As you can see, we are looking at a loop that may be executed several times, as such the updateValueFactory can be called several times, and the only guarantee we have is that after the method has returned, the last value we were called with was the value that was in the cache and we replaced.
Here is the fix:
That was quite hard to figure out, because at a glance, this looks just fine.
More posts in "Invisible race conditions" series:
- (02 Jan 2018) The cache has poisoned us
- (27 Dec 2017) The sometimes failing script
- (26 Dec 2017) The async query
Comments
Hello 😊 what is your view on System.Collections.Immutable? I find them easier to reason about, but have no idea how un/performant they actually are… probably nothing to write home about?
Julie, Conceptually, I find them wonderful. I wish we could use them. Performance wise, there are a no go for us. See the discussion around them here: https://ayende.com/blog/164739/immutable-collections-performance
And in the related posts.
I see that sort of mistake commonly made with concurrent dictionaries and the conceptual "Add or get existing" call. Using a ConcurrentDictionary<K, Lazy<V>> tends to solve that but those who haven't read the docs closely don't realise that multiple callers are allowed to race when adding.
What's worse is that the MSDN docs at https://msdn.microsoft.com/en-us/library/ee378665%28v=vs.110%29.aspx?f=255&MSPPError=-2147217396 don't really call out this race condition. The remarks found on a linked webpage for one of the method overloads does call this out... https://msdn.microsoft.com/en-us/library/ee378675(v=vs.110).aspx
That's not the best wording but it gets the point across, but only mentions the addValueFactory.
Reading for the other overload - https://msdn.microsoft.com/en-us/library/ee378664(v=vs.110).aspx - finds a horrible example. It goes about calculating some concurrency level using the number of processors available. But then it does some for loops and regular single-threaded AddOrUpdate calls. It's like whoever wrote the docs didn't have their brain engaged that day???
The same bad code is copied across to the newer docs.microsoft.com site (https://docs.microsoft.com/en-au/dotnet/api/system.collections.concurrent.concurrentdictionary-2.addorupdate?view=netframework-4.7.1), although the example at the top at least does something in parallel.
It's interesting that the docs do not call out that updateFunc may be called twice. Certainly that section of code demonstrates this. But, without seeing the source, one could argue that perhaps the add Func is special because it allows race to initialisation, but once there's a key/value in the dictionary, there's at least something for the dictionary to serialise and thus the updates would be processed atomically.
I might make it a bit of a mission later this evening to submit an update to the new docs.microsoft.com site :)
Last paragraph here explains it pretty good.
https://docs.microsoft.com/en-us/dotnet/standard/collections/thread-safe/how-to-add-and-remove-items
Comment preview