Find the bugThe case of the degrading system–Answer
In my previous post I showed the following code, and asked what the bug was, and what the implications of that would be.
class Program { private Timer nextcheck; public event EventHandler ServerSigFailed; static void Main(string[] args) { var program = new Program(); if(program.ValidateServerSig() == false) return; program.DoOtherStuff(); } public bool ValidateServerSig() { nextcheck = new Timer(state => ValidateServerSig()); var response = DoRequest("http://remote-srv/signature"); if(response.Failed) { var copy = ServerSigFailed; if(copy!=null) copy(this, EventArgs.Empty); return false; } var result = Utils.CheckPublic KeySignatureMatches(response); if(result.Valid) if(response.Failed) { var copy = ServerSigFailed; if(copy!=null) copy(this, EventArgs.Empty); return false; } // setup next check nextcheck.Change(TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(15)); return true; } }
The answer is quite simple, look at the first line of the ValidateServerSig function. It setups a new timer that recursively call this function. So every 15 seconds, we are going to have a new timer to run this function.
And 15 seconds after that, we are going to have 4 timers, and 15 seconds after that…
But the interesting thing here that those are timers, so they get scheduled on the thread pool. So while the growth rate of the number of tasks is phenomenal, in practice, they don't get run all together.So after 20 minutes of this, it will only run about 800 times. Because the thread pool is growing slowly, we'll end up with a much slower growth rate, but eventually the thread pool queues are going to be filled with nothing but this task. And since we typically also process requests on the thread pool, eventually requests cannot be handled, because the threads are always busy running the validation code.
In the real codebase, the timing was 5 minutes, and this issue typically manifested itself only after several weeks of continuous running.
More posts in "Find the bug" series:
- (29 Feb 2016) When you can't rely on your own identity
- (05 Jan 2016) The case of the degrading system–Answer
- (04 Jan 2016) The case of the degrading system
- (11 Sep 2015) The concurrent memory buster
- (20 Apr 2011) Why do I get a Null Reference Exception?
- (25 Nov 2010) A broken tree
- (13 Aug 2010) RavenDB HiLo implementation
- (25 Jul 2010) Accidental code reviews
Comments
Timers do not run on the thread pool. They are a kernel data structure without stack. The ticks run on the thread-pool, though, and might conflict as you say (which is a bug in itself).
You're using Timer.Change rather than creating a new Timer. The documentation says this updates the next invocation time and the invocation interval rather than creating a new Timer.
Looking at the source code for CoreRT, Timer.Change eventually calls TimerQueue.UpdateTimer at https://github.com/dotnet/corert/blob/7ae0ee55ca2eaaa62fb7e3fdca217b1aac337598/src/System.Private.Threading/src/System/Threading/Timer.cs#L392 , which updates the existing timer if it's already enqueued. Similarly, for the native timer underlying the TimerQueue, the queue will create a new native timer and then delete the old one, should that be necessary -- but in this case, it won't be.
I tested, and Mono also does the right thing.
So did you identify a bug that only applies to a different version of .NET? If so, I'm not surprised I didn't get the answer.
@dhasenan see the first line of the ValidateServerSig function. There's definitely a new timer created!
I remember when it was fun to read your blog :(
Why the double check for Response.Failed ?
Funny, I remember when I encountered a very similar bug on a system that had two services calling each other with a certain interval. During some refactoring, we started using function calls instead of service calls and whoops, the service shut down approx. every 15 hours without throwing an exception.
It took me a full week to figure out that there was a recursive function call which resulted in a stackoverflow exception (which is one of the few exceptions that cannot be caught!)
Very annoying kind of bug!
Oh that's nasty. An easy mistake to make, hard to spot and hard to reproduce. So how do we avoid this kind of problem? It would be nice to have some static analysis for this but I'm unsure on what rules to apply. Highlight all places where is recursion and threads are created?
Per, I don't think there is a good way to do a static analysis of this. If we would have thought this is the issue, I'm sure we could inspect the timer queue, but that wouldn't be obvious unless you already know where to look
Per, I don't think there is a good way to do a static analysis of this. If we would have thought this is the issue, I'm sure we could inspect the timer queue, but that wouldn't be obvious unless you already know where to look
Comment preview