ChallengeFind the resource leak
I am not sure if I should be embarrassed or furious about this bug, but I spent a lot of time working on issue. Basically, I had a resource leak in my app. I knew that I had it, and even though it took a while to create a reliable reproduction, I managed to narrow it down to a simple reproduction. It was obvious that the issue is leaking resources during the shutdown phase of the application as a result of a race condition.
Now, usually we don’t care about such things because shutting down will, obviously, clear all of those nasty things anyway. Unfortunately, my scenario is to start & shutdown many times in a single process, as part of a unit test. So I had to figure out what was actually going on. The maddening part is that the code should work.
Here is a close approximation of the code in question:
public class QueueActions : IDisposable { UnmanagedDatabaseConnection database; public string Name { get; private set; } public class QueueActions( UnmanagedDatabaseConnectionFactory factory) { database = factory.Create(); database.Open(()=> Name = database.ReadName()); } // assume proper GC finalizer impl public void Dispose() { database.Dispose(); } }
And the code using this:
using(var factory = CreateFactory()) { ThreadPool.QueueUserWorkItem(()=> { using(var actions = new QueueActions(factory)) { actions.Send("abc"); } }); }
Note that this isn’t remotely like the actual code, it is highly simplified
And yes, there is an intentional race condition here.
Here are two pieces of information that are important to remember about this unmanaged library:
- The factory will not close if there are opened connections (exception)
- Trying to open a connection after the factory was closed will fail (exception)
How many problems does the code above have?
More posts in "Challenge" series:
- (01 Jul 2024) Efficient snapshotable state
- (13 Oct 2023) Fastest node selection metastable error state–answer
- (12 Oct 2023) Fastest node selection metastable error state
- (19 Sep 2023) Spot the bug
- (04 Jan 2023) what does this code print?
- (14 Dec 2022) What does this code print?
- (01 Jul 2022) Find the stack smash bug… – answer
- (30 Jun 2022) Find the stack smash bug…
- (03 Jun 2022) Spot the data corruption
- (06 May 2022) Spot the optimization–solution
- (05 May 2022) Spot the optimization
- (06 Apr 2022) Why is this code broken?
- (16 Dec 2021) Find the slow down–answer
- (15 Dec 2021) Find the slow down
- (03 Nov 2021) The code review bug that gives me nightmares–The fix
- (02 Nov 2021) The code review bug that gives me nightmares–the issue
- (01 Nov 2021) The code review bug that gives me nightmares
- (16 Jun 2021) Detecting livelihood in a distributed cluster
- (21 Apr 2020) Generate matching shard id–answer
- (20 Apr 2020) Generate matching shard id
- (02 Jan 2020) Spot the bug in the stream
- (28 Sep 2018) The loop that leaks–Answer
- (27 Sep 2018) The loop that leaks
- (03 Apr 2018) The invisible concurrency bug–Answer
- (02 Apr 2018) The invisible concurrency bug
- (31 Jan 2018) Find the bug in the fix–answer
- (30 Jan 2018) Find the bug in the fix
- (19 Jan 2017) What does this code do?
- (26 Jul 2016) The race condition in the TCP stack, answer
- (25 Jul 2016) The race condition in the TCP stack
- (28 Apr 2015) What is the meaning of this change?
- (26 Sep 2013) Spot the bug
- (27 May 2013) The problem of locking down tasks…
- (17 Oct 2011) Minimum number of round trips
- (23 Aug 2011) Recent Comments with Future Posts
- (02 Aug 2011) Modifying execution approaches
- (29 Apr 2011) Stop the leaks
- (23 Dec 2010) This code should never hit production
- (17 Dec 2010) Your own ThreadLocal
- (03 Dec 2010) Querying relative information with RavenDB
- (29 Jun 2010) Find the bug
- (23 Jun 2010) Dynamically dynamic
- (28 Apr 2010) What killed the application?
- (19 Mar 2010) What does this code do?
- (04 Mar 2010) Robust enumeration over external code
- (16 Feb 2010) Premature optimization, and all of that…
- (12 Feb 2010) Efficient querying
- (10 Feb 2010) Find the resource leak
- (21 Oct 2009) Can you spot the bug?
- (18 Oct 2009) Why is this wrong?
- (17 Oct 2009) Write the check in comment
- (15 Sep 2009) NH Prof Exporting Reports
- (02 Sep 2009) The lazy loaded inheritance many to one association OR/M conundrum
- (01 Sep 2009) Why isn’t select broken?
- (06 Aug 2009) Find the bug fixes
- (26 May 2009) Find the bug
- (14 May 2009) multi threaded test failure
- (11 May 2009) The regex that doesn’t match
- (24 Mar 2009) probability based selection
- (13 Mar 2009) C# Rewriting
- (18 Feb 2009) write a self extracting program
- (04 Sep 2008) Don't stop with the first DSL abstraction
- (02 Aug 2008) What is the problem?
- (28 Jul 2008) What does this code do?
- (26 Jul 2008) Find the bug fix
- (05 Jul 2008) Find the deadlock
- (03 Jul 2008) Find the bug
- (02 Jul 2008) What is wrong with this code
- (05 Jun 2008) why did the tests fail?
- (27 May 2008) Striving for better syntax
- (13 Apr 2008) calling generics without the generic type
- (12 Apr 2008) The directory tree
- (24 Mar 2008) Find the version
- (21 Jan 2008) Strongly typing weakly typed code
- (28 Jun 2007) Windsor Null Object Dependency Facility
Comments
Looks like you could get both exceptions:
1) the factory could dispose before the connection is open and
2) the connection could be open and 'Send()'ing while the factory is trying to close.
Not sure why you would spawn a thread at that point though.
One of problems is the use of FactoryFactoryFactory pattern.
You can have:
An exception on closing the factory
An exception on the costructor of QueueAction
An exception on the send
An exception disposing the QueueAction
So at a first glance there is more possible exception than lines ;)
Making a singleton thread safe Factory does not solve all the problem ?
Both of these can happen, and I don't see any other possible problems (assuming the Unmanaged* classes are thread-safe -- if they aren't, then anything might happen).
The 'send' call cannot fail since the factory won't close if there are open connections. Disposing the connection also shouldn't fail according to your description.
So I'd say this code will always do one of the following:
1) crash when opening the connection (factory.Create() call)
2) crash when closing the factory (factory.Dispose() call)
3) run successfully
Where's the resource leak?
Well obviously, if closing the factory fails with an exception, it'll stay open until the GC calls the finalizer. If that doesn't correctly handle the case of a failed close attempt, you might end up with a permanent resource leak.
The QueueActions (and associated database connection) won't be disposed if an exception is thrown in the constructor.
It's also not a best practice to do expensive calls (like opening connections) in a constructor.
@Ayende ... you gave it all away!
The code
creates a factory
queues a thread to
2a. create a connection
2b. close the connection
But the factory could easily be closed before the queued thread is started or the thread attempts to create a connection, resulting in an exception on creating a connection. Additionally, the factory could easily be closed after the queued thread creates a connection, resulting in an exception on closing the factory.
The solution is either to put the creation of the factory into the queued thread, or to join the outer thread to the queued thread (or use other signalling mechanisms such as WaitHandles) so that the factory is closed in the outer thread only once the queued thread has run to completion, at least of the connection-oriented work.
My guess would be that it is the fact that you're using the ThreadPool.QueueUserWorkItem - INSIDE your using block. It looks to me like the outer using statement is likely to dispose of the factory before the inner using block is ever ran. Now if that does not happen - seeing as this code is now running in 2 different threads - the factory.Dispose() could be called before actions.Dispose() (both by the end of the using blocks when expanded into try/finally) and that could also cast an exception.
That's just my idea anyway.
If database.Open throws an Exception then the reference to the UnmanagedDatabaseConnection instance is lost... database.Dispose() will never be called as the Dispose method of QueueActions cannot be called (an exception will be thrown from the QueueActions constructor so there's no instance of QueueActions to call Dispose on).
Not really worked much with unmanaged code directly but I'm guessing that the CLR has no way to clean this up, so the UnmanagedDatabaseConnection will live on in memory after the application ends.
The factory can get disposed during the QueueActions constructor. If this happens after calling
database = factory.Create();
but before
database.Open(()=> Name = database.ReadName());
.. then opening the connection will throw Exception. This will cause QueueActions.Dispose() to never be called and the database will leak.
I think that in order to fix bug you should move using block for a factory inside of the ThreadPool.QueueUserWorkItem action. At the time of execution of an item in the ThreadPool queue the factory for this item may be already disposed or this factory will be disposed (with an exception) after the creation of a QueueActions instance but before disposing it, so there will be an opened connection.
depending on what you do with the lambda in the Open call, you'd stop to have full control about the lifetime of the database and the QueueActions since they both need to be captured for the lambda to work, but I don't know if that's an issue here...
using(var factory = CreateFactory())
{
ThreadPool.QueueUserWorkItem(()=>
{
}
It is possible that before the ThreadPool work item will have a chance to start execution the "main" main thread will have disposed the factory. Let start with this scenarion
As there are no open connection yet disposing the factory will succeed
When constructor of QueueActions class tries to create connection an exception will be thrown because the factory will have been disposed to the time
As the exception is thrown in the constuctor the reference on the QueueActions is not assigned to "actions" local variable and consecuently the object won't be disposed
Uncaught exception in the thread pool's thread will lead to the AppDomain unload whcih will run finalizer for QueueActions which in turn will call QueueActions.Dispose method
As the factory failed to create a connection and threw exception the QueueActions object will have null value in its "database" member variable = > NullReferenceException in the finalizer thread
I would suggest you not to throw exceptions from constructors and to remove the finalizer from QueueActions class as is doesnt deal DIRECTLY with an unmanaged resource (disposing disposables in the Dispose method is enough in). Having finaliser in order to to deal with exceptional constructors... smells a bit.
But we still havent found the leak of connections as in the aforementioned scenario a connection event were not created.
Th second scenario I can see is:
The main thread places the work item in ThreadPool and preempted by a thread pool thread that starts excecution of the the work item
The thread pool thread in the constructor of QueueActions executes factory.Create() and preemted by main thread before the created connection assigned to "database" member
The factory doesn't close because there is a created connetion and throws an exception.
In the rpocess of unloading the AppDomain a thread abort is injected in the thread pool thread exactly before the assignment to "database" memeber =>
created QueueActions object is not disposed and when it finalizer is executed database == null
"Last but not least" sorry for my English :)
Comment preview