The deadlock in the fairness assumption

time to read 4 min | 744 words

Looking into some benchmark code, I noticed that the developer has put some stuff there that shouldn't be there (and is likely slowing down the benchmark). When I commented on that, the answer was: "If I don't do that, this breaks".

Indeed, he was able to show me that this will actually break. And that was very bad news. Because that was a pretty sensitive spot. In fact, afterward he actually filed several bugs, including scary looking deadlock detected one. The problem is that this code works. In fact, it has been deployed, in production, for years, and I have very high confidence that it works.

But eventually I was able to reproduce this locally, and then track it down to the actual root cause. First, let me show you the reproduction. First, the sample code:

public class Program
{
    public static void Main(string[] args)
    {
        var fair = new FairPractice();
        var thread = new Thread(fair.OtherWork);
        thread.Start();

        while (thread.IsAlive)
        {
            fair.DoMainWork();
        }
    }
}

public class FairPractice
{
    SemaphoreSlim _s = new SemaphoreSlim(1);

    public void OtherWork()
    {
        while (true)
        {
            Thread.Sleep(100);
            if (_s.Wait(1000) == false)
            {
                Console.WriteLine("\r\nCan't get it!");
                return;
            }
            Console.WriteLine("\r\nGot IT!!!!");
            _s.Release();
        }
    }

    private int _work;
    public void DoMainWork()
    {
        if (_s.Wait(500) == false)
            throw new TimeoutException("Can't get semaphore");

        Console.Write("\r{0}", ++_work);
        Thread.Sleep(30);
        _s.Release();
    }
}

Now, this code seems pretty simple. We have a bunch of work that happens in the main loop. Taking the semaphore and holding it, then doing a small amount of work, then releasing the semaphore.

On the other thread, we have a (long) wait for this semaphore, and that is it. Now, the interesting thing here is that we would expect the 2nd thread to register a wait on the semaphore so when it is released, the 2nd thread wakes up, and get the semaphore. In practice, it looks like the main thread is the one having all the fun, and it is fast enough to re-acquire the semaphore before the 2nd thread can catch it. Looking into the code of SemaphoreSlim seems to confirm it. There are actually a bunch of stuff going on there, and this particular scenario (very short wait, just a single entry in the semaphore, a second waiter with a high threshold) seems to hit it very hard, and violate all fairness assumptions.

The good thing is that in order to hit this issue, you need to have very high speed. The reason this is good is that we are benchmarking, and that means that our code (which actually also does disk I/O) is now fast enough to his this issue.