Finding and tracking a race condition in MemoryCache
Following my previous posts, about the use after free bug that was found in a pull request, I put a lot of effort into testing the validity of the fix. As it turned out, my code was right and the fix worked properly. However, I uncovered a race condition in the .NET MemoryCache implementation. Here is what I got when I put the system under load:
Unhandled exception. System.ArgumentException: Unable to sort because the IComparer.Compare() method returns inconsistent results. Either a value does not compare equal to itself, or one value repeatedly compared to another value yields different results. IComparer: 'System.Comparison`1[Microsoft.Extensions.Caching.Memory.CacheEntry]'. at System.Collections.Generic.ArraySortHelper`1.Sort(Span`1 keys, Comparison`1 comparer) in System.Private.CoreLib.dll:token 0x60066cd+0x1d at System.Collections.Generic.List`1.Sort(Comparison`1 comparison) in System.Private.CoreLib.dll:token 0x600688b+0x3 at Microsoft.Extensions.Caching.Memory.MemoryCache.g__ExpirePriorityBucket|27_0(Int64& removedSize, Int64 removalSizeTarget, Func`2 computeEntrySize, List`1 entriesToRemove, List`1 priorityEntries) in Microsoft.Extensions.Caching.Memory.dll:token 0x6000061+0x21 at Microsoft.Extensions.Caching.Memory.MemoryCache.Compact(Int64 removalSizeTarget, Func`2 computeEntrySize) in Microsoft.Extensions.Caching.Memory.dll:token 0x600005b+0xff at Microsoft.Extensions.Caching.Memory.MemoryCache.OvercapacityCompaction(MemoryCache cache) in Microsoft.Extensions.Caching.Memory.dll:token 0x6000059+0xad at System.Threading.ThreadPoolWorkQueue.Dispatch() in System.Private.CoreLib.dll:token 0x6002b7c+0x110 at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart() in System.Private.CoreLib.dll:token 0x6002c66+0x190
There are a few interesting things here. First, we can see that this killed the process, this isn’t just an error, this is an error from a background thread that ended up unhandled and killed everything. That is a nope, for server applications. The second issue is that the error is strange. What exactly is going on here?
Here is the relevant piece of code that throw the error, inside the MemoryCache:
priorityEntries.Sort((e1, e2) => e1.LastAccessed.CompareTo(e2.LastAccessed));
This is a really interesting line, because of what it does. The priorityEntries is a local list of cache entries, which we need to sort by the last access, to figure out what we can evict. What can go wrong here?
Well, the MemoryCache is a naturally concurrent instance, of course. What happens when we have an access to the entry in question? We’ll update the LastAccessed value. And if we do this just right, we may give the sort function different values for the same cache entry, leading to this problem.
The bug in question was in place for as long as I can track the MemoryCache codebase. In the best case scenario, it will cause evictions of data that shouldn’t be evicted. Not a major issue, and unlikely to be noticed. But if it fail in this manner, it will kill the process, so very likely to be noticed.
My test scenario had a lot of concurrency and a very small cache (leading to a lot of evictions / compactions), I’m guessing that is why I’m able to trigger this so easily.
Comments
Concurrency testing should be done on all concurrent collections and caches. It's so easy and powerful.
See the John Hughes talk:
concurrency-testing
Comment preview