The deadlock in the fairness assumption

Apr 11 2016

The deadlock in the fairness assumption

time to read 4 min | 744 words

Looking into some benchmark code, I noticed that the developer has put some stuff there that shouldn't be there (and is likely slowing down the benchmark). When I commented on that, the answer was: "If I don't do that, this breaks".

Indeed, he was able to show me that this will actually break. And that was very bad news. Because that was a pretty sensitive spot. In fact, afterward he actually filed several bugs, including scary looking deadlock detected one. The problem is that this code works. In fact, it has been deployed, in production, for years, and I have very high confidence that it works.

But eventually I was able to reproduce this locally, and then track it down to the actual root cause. First, let me show you the reproduction. First, the sample code:

public class Program
{
    public static void Main(string[] args)
    {
        var fair = new FairPractice();
        var thread = new Thread(fair.OtherWork);
        thread.Start();

        while (thread.IsAlive)
        {
            fair.DoMainWork();
        }
    }
}

public class FairPractice
{
    SemaphoreSlim _s = new SemaphoreSlim(1);

    public void OtherWork()
    {
        while (true)
        {
            Thread.Sleep(100);
            if (_s.Wait(1000) == false)
            {
                Console.WriteLine("\r\nCan't get it!");
                return;
            }
            Console.WriteLine("\r\nGot IT!!!!");

            _s.Release();
        }
    }

    private int _work;
    public void DoMainWork()
    {
        if (_s.Wait(500) == false)
            throw new TimeoutException("Can't get semaphore");

        Console.Write("\r{0}", ++_work);
        Thread.Sleep(30);
        _s.Release();
    }
}

Now, this code seems pretty simple. We have a bunch of work that happens in the main loop. Taking the semaphore and holding it, then doing a small amount of work, then releasing the semaphore.

On the other thread, we have a (long) wait for this semaphore, and that is it. Now, the interesting thing here is that we would expect the 2nd thread to register a wait on the semaphore so when it is released, the 2nd thread wakes up, and get the semaphore. In practice, it looks like the main thread is the one having all the fun, and it is fast enough to re-acquire the semaphore before the 2nd thread can catch it. Looking into the code of SemaphoreSlim seems to confirm it. There are actually a bunch of stuff going on there, and this particular scenario (very short wait, just a single entry in the semaphore, a second waiter with a high threshold) seems to hit it very hard, and violate all fairness assumptions.

The good thing is that in order to hit this issue, you need to have very high speed. The reason this is good is that we are benchmarking, and that means that our code (which actually also does disk I/O) is now fast enough to his this issue.

Tweet Share Share 13 comments

Tags:

Comments

11 Apr 2016
12:59 PM

mark

That seems unfair indeed. Try reporting it on GitHub. Stephen Toub seems to be in charge of the TPL, he is very competent.

11 Apr 2016
15:42 PM

Oren Eini

Mark, https://github.com/dotnet/coreclr/issues/4267

11 Apr 2016
20:35 PM

archiver

I got a question related to wait() Once wait(1000) is called it'll have to wait for 1000 ms even if release is executed?

Thanks!

11 Apr 2016
20:40 PM

Oren Eini

Archiver, No, the 1000 is the max amount of time it will wait for an available spot

11 Apr 2016
20:48 PM

archiver

Thanks for your quick answer!

I'm afraid that I don't understand something. I would expect more or less to see the main thread loops for 4,5 or even 6 times (while second thread sleeps for 100 ms) and after that main thread releases and make second thread to catch the semaphore at that moment. But anyway, maybe this is some kind of certain basic knowledge I could not have. In that case sorry for it :P

11 Apr 2016
20:50 PM

Oren Eini

Archiver, No, you are correct. The main thread should be looping, and then the second thread should be able to get the semaphore. The fact that it doesn't is the bug

11 Apr 2016
20:54 PM

archiver

Ok! Sorry again for the misunderstanding. I'll keep reading all the posts as they're quite interesting!

12 Apr 2016
11:31 AM

Oren Eini

Interesting discussion on this can be found here: http://www.pcdoctor-community.com/wp/posts/2007/10/fairness-in-win32-lock-objects/

12 Apr 2016
18:19 PM

Ian Yates

I'm able to reproduce this in LinqPad v5 (running against .NET framework 4.6 I believe) although often the very first execution does work at least once.

If I switch from SemaphoreSlim to Semaphore then it works a treat (I appreciate that the non-slim class is frowned upon these days).

Nice find!

13 Apr 2016
12:51 PM

Szymon Kulec 'Scooletz'

The mentioned fairness assumption is based on the gut feeling of how a semaphore should work, or is it documented?

13 Apr 2016
13:13 PM

Oren Eini

Szymon, I expect a thread that is waiting on the semaphore to eventually get the lock. This behavior is supported by the fact that literally every other lock object I checked will behave properly. Including Sempahore, which SemaphoreSlim is supposed to be a slim alternative for. The key aspect here is that we don't really care about whatever it is truly fair or not, we care that statistically, all threads will get a chance to execute eventually. It doesn't.

13 Apr 2016
20:07 PM

Szymon Kulec 'Scooletz'

I curious enough to read through the MSDN page of the slim semaphore. The only mention about ordering is following: If multiple threads are blocked, there is no guaranteed order, such as FIFO or LIFO, that controls when threads enter the semaphore. Additionally, there's no mention of a thread/task Waiting for semaphore time after time. It looks like you've entered a grayish area. Anyway, that's an interesting finding.

14 Apr 2016
05:00 AM

Oren Eini

Szymon, There isn't an idea of fairness, sure, and that is pretty much fine. But there should be an idea of "eventually you'll get the queue", rather than you are stuck forever. It might be best to phrase that in terms of lock liveliness, rather than fairness

Comment preview

Comments have been closed on this topic.

Oren Eini

Oren Eini

CEO of RavenDB