Find the bug: The case of the degrading system–Answer

Jan 05 2016

Find the bugThe case of the degrading system–Answer

time to read 3 min | 583 words

In my previous post I showed the following code, and asked what the bug was, and what the implications of that would be.

class Program
{
    private Timer nextcheck;
    public event EventHandler ServerSigFailed;
    static void Main(string[] args)
    {
        var program = new Program();
        if(program.ValidateServerSig() == false) 
            return;
        program.DoOtherStuff();
    }

    public bool ValidateServerSig()
    {
        nextcheck = new Timer(state => ValidateServerSig());

        var response = DoRequest("http://remote-srv/signature");
        if(response.Failed)
        {
            var copy = ServerSigFailed;
            if(copy!=null) copy(this, EventArgs.Empty);
            return false;
        }
        var result = Utils.CheckPublic KeySignatureMatches(response);
        if(result.Valid)
        if(response.Failed)
        {
            var copy = ServerSigFailed;
            if(copy!=null) copy(this, EventArgs.Empty);
            return false;
        }
        
        // setup next check
        nextcheck.Change(TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(15));
        return true;
    }
}

The answer is quite simple, look at the first line of the ValidateServerSig function. It setups a new timer that recursively call this function. So every 15 seconds, we are going to have a new timer to run this function.

And 15 seconds after that, we are going to have 4 timers, and 15 seconds after that…

But the interesting thing here that those are timers, so they get scheduled on the thread pool. So while the growth rate of the number of tasks is phenomenal, in practice, they don't get run all together.So after 20 minutes of this, it will only run about 800 times. Because the thread pool is growing slowly, we'll end up with a much slower growth rate, but eventually the thread pool queues are going to be filled with nothing but this task. And since we typically also process requests on the thread pool, eventually requests cannot be handled, because the threads are always busy running the validation code.

In the real codebase, the timing was 5 minutes, and this issue typically manifested itself only after several weeks of continuous running.

Tweet Share Share 9 comments

Tags:

bugs

More posts in "Find the bug" series:

(29 Feb 2016) When you can't rely on your own identity
(05 Jan 2016) The case of the degrading system–Answer
(04 Jan 2016) The case of the degrading system
(11 Sep 2015) The concurrent memory buster
(20 Apr 2011) Why do I get a Null Reference Exception?
(25 Nov 2010) A broken tree
(13 Aug 2010) RavenDB HiLo implementation
(25 Jul 2010) Accidental code reviews

Comments

05 Jan 2016
12:55 PM

mark

Timers do not run on the thread pool. They are a kernel data structure without stack. The ticks run on the thread-pool, though, and might conflict as you say (which is a bug in itself).

05 Jan 2016
17:23 PM

dhasenan

You're using Timer.Change rather than creating a new Timer. The documentation says this updates the next invocation time and the invocation interval rather than creating a new Timer.

Looking at the source code for CoreRT, Timer.Change eventually calls TimerQueue.UpdateTimer at https://github.com/dotnet/corert/blob/7ae0ee55ca2eaaa62fb7e3fdca217b1aac337598/src/System.Private.Threading/src/System/Threading/Timer.cs#L392 , which updates the existing timer if it's already enqueued. Similarly, for the native timer underlying the TimerQueue, the queue will create a new native timer and then delete the old one, should that be necessary -- but in this case, it won't be.

I tested, and Mono also does the right thing.

So did you identify a bug that only applies to a different version of .NET? If so, I'm not surprised I didn't get the answer.

05 Jan 2016
18:00 PM

csmager

@dhasenan see the first line of the ValidateServerSig function. There's definitely a new timer created!

06 Jan 2016
02:33 AM

Roman

I remember when it was fun to read your blog :(

07 Jan 2016
10:28 AM

ASmits

Why the double check for Response.Failed ?

11 Jan 2016
15:42 PM

Manu

Funny, I remember when I encountered a very similar bug on a system that had two services calling each other with a certain interval. During some refactoring, we started using function calls instead of service calls and whoops, the service shut down approx. every 15 hours without throwing an exception.

It took me a full week to figure out that there was a recursive function call which resulted in a stackoverflow exception (which is one of the few exceptions that cannot be caught!)

Very annoying kind of bug!

21 Jan 2016
14:09 PM

Per Erik Stendahl

Oh that's nasty. An easy mistake to make, hard to spot and hard to reproduce. So how do we avoid this kind of problem? It would be nice to have some static analysis for this but I'm unsure on what rules to apply. Highlight all places where is recursion and threads are created?

22 Jan 2016
06:50 AM

Oren Eini

Per, I don't think there is a good way to do a static analysis of this. If we would have thought this is the issue, I'm sure we could inspect the timer queue, but that wouldn't be obvious unless you already know where to look

22 Jan 2016
06:50 AM

Oren Eini

Comment preview

Comments have been closed on this topic.

Oren Eini

Oren Eini

CEO of RavenDB