ChallengeThe race condition in the TCP stack
Occasionally, one of our tests hangs. Everything seems to be honky dory, but it just freezes and does not complete. This is a new piece of code, and thus is it suspicious unless proven otherwise, but an exhaustive review of it looked fine. It took over two days of effort to narrow it down, but eventually we managed to point the finger directly at this line of code:
In certain cases, this line would simply not read anything on the server. Even though the client most definitely sent the data. Now, given that TCP is being used, dropped packets might be expected. But we are actually testing on the loopback device, which I expect to be reliable.
We spent a lot of time investigating this, ending up with a very high degree of certainty that the problem was in the TCP stack somewhere. Somehow, on the loopback device, we were losing packets. Not always, and not consistently, but we were absolutely losing packets, which led the server to wait indefinitely for the client to send the message it already did.
Now, I’m as arrogant as the next developer, but even I don’t think I found that big a bug in TCP. I’m pretty sure that if it was this broken, I would have known about it. Beside, TCP is supposed to retransmit lost packets, so even if there were lost packets on the loopback device, we should have recovered from that.
Trying to figure out what was going on there sucked. It is hard to watch packets on the loopback device in WireShark, and tracing just told me that a message is sent from the client to the server, but the server never got it.
But we continued, and we ended up with a small reproduction of the issue. Here is the code, and my comments are below:
This code is pretty simple. It starts a TCP server, and listens to it, and then it reads and writes to the client. Nothing much here, I think you’ll agree.
If you run it, however, it will mostly work, except that sometimes (anywhere between 10 runs and 500 runs on my machine), it will just hang. I’ll save you some time and let you know that there are no dropped packets, TCP is working properly in this case. But the code just doesn’t. What is frustrating is that it is mostly working, it takes a lot of work to actually get it to fail.
Can you spot the bug? I’ll continue discussion of this in my next post.
More posts in "Challenge" series:
- (04 Jan 2023) what does this code print?
- (14 Dec 2022) What does this code print?
- (01 Jul 2022) Find the stack smash bug… – answer
- (30 Jun 2022) Find the stack smash bug…
- (03 Jun 2022) Spot the data corruption
- (06 May 2022) Spot the optimization–solution
- (05 May 2022) Spot the optimization
- (06 Apr 2022) Why is this code broken?
- (16 Dec 2021) Find the slow down–answer
- (15 Dec 2021) Find the slow down
- (03 Nov 2021) The code review bug that gives me nightmares–The fix
- (02 Nov 2021) The code review bug that gives me nightmares–the issue
- (01 Nov 2021) The code review bug that gives me nightmares
- (16 Jun 2021) Detecting livelihood in a distributed cluster
- (21 Apr 2020) Generate matching shard id–answer
- (20 Apr 2020) Generate matching shard id
- (02 Jan 2020) Spot the bug in the stream
- (28 Sep 2018) The loop that leaks–Answer
- (27 Sep 2018) The loop that leaks
- (03 Apr 2018) The invisible concurrency bug–Answer
- (02 Apr 2018) The invisible concurrency bug
- (31 Jan 2018) Find the bug in the fix–answer
- (30 Jan 2018) Find the bug in the fix
- (19 Jan 2017) What does this code do?
- (26 Jul 2016) The race condition in the TCP stack, answer
- (25 Jul 2016) The race condition in the TCP stack
- (28 Apr 2015) What is the meaning of this change?
- (26 Sep 2013) Spot the bug
- (27 May 2013) The problem of locking down tasks…
- (17 Oct 2011) Minimum number of round trips
- (23 Aug 2011) Recent Comments with Future Posts
- (02 Aug 2011) Modifying execution approaches
- (29 Apr 2011) Stop the leaks
- (23 Dec 2010) This code should never hit production
- (17 Dec 2010) Your own ThreadLocal
- (03 Dec 2010) Querying relative information with RavenDB
- (29 Jun 2010) Find the bug
- (23 Jun 2010) Dynamically dynamic
- (28 Apr 2010) What killed the application?
- (19 Mar 2010) What does this code do?
- (04 Mar 2010) Robust enumeration over external code
- (16 Feb 2010) Premature optimization, and all of that…
- (12 Feb 2010) Efficient querying
- (10 Feb 2010) Find the resource leak
- (21 Oct 2009) Can you spot the bug?
- (18 Oct 2009) Why is this wrong?
- (17 Oct 2009) Write the check in comment
- (15 Sep 2009) NH Prof Exporting Reports
- (02 Sep 2009) The lazy loaded inheritance many to one association OR/M conundrum
- (01 Sep 2009) Why isn’t select broken?
- (06 Aug 2009) Find the bug fixes
- (26 May 2009) Find the bug
- (14 May 2009) multi threaded test failure
- (11 May 2009) The regex that doesn’t match
- (24 Mar 2009) probability based selection
- (13 Mar 2009) C# Rewriting
- (18 Feb 2009) write a self extracting program
- (04 Sep 2008) Don't stop with the first DSL abstraction
- (02 Aug 2008) What is the problem?
- (28 Jul 2008) What does this code do?
- (26 Jul 2008) Find the bug fix
- (05 Jul 2008) Find the deadlock
- (03 Jul 2008) Find the bug
- (02 Jul 2008) What is wrong with this code
- (05 Jun 2008) why did the tests fail?
- (27 May 2008) Striving for better syntax
- (13 Apr 2008) calling generics without the generic type
- (12 Apr 2008) The directory tree
- (24 Mar 2008) Find the version
- (21 Jan 2008) Strongly typing weakly typed code
- (28 Jun 2007) Windsor Null Object Dependency Facility
I guess since stream probably buffered write data, it doesn't send due to missing 'await networkSteam.FlushAsync();' in line 64
I thought it could be some weird timing thing and the fact that the ReadLineAsync method must automagically deal with all possible line endings - \r, \n and \r\n. But at best, all I've managed to convince myself of so far is that relying on bare \r endings is a bad idea if you want things to be responsive since so far as I can see, the ReadLineAsync won't return until it's definitely managed to sample the next character.
The buffered data in the first stream reader is discarded.
Further, AFAIK flushing a socket or a NetworkStream is not possible. According to Reflector, networkStream.FlushAsync() does nothing. I assume you want to flush the nagling buffer. I never found a flushing API for that. Is there one?
while (string.IsNullOrEmpty(reply)) is a bug because it spins forever in case of a closed stream (null). This can happen even in this code when the server has an exception.
Isn't there a race-condition within TcpListener due to the fact that 5 AcceptTcpClientAsync calls are active simultaneously all the time?
Huberto, No, that is fine
The code needs more await.
StreamReader reads via a buffer, so creating a second StreamReader discards whatever was already read "ahead" in the first StreamReader. This would be a timing issue for any kind of stream (not just TCP/IP), but it would just happen to fail reliably on local streams such as file/memory streams, due to their implementation.
Finding a true bug in the TCP/IP stack would be an extremely rare occurrence. :)
It's not that the first StreamReader buffers ahead of the second StreamReader. It's that the first StreamReader closes the stream when it is disposed. It usually works because it's a classic race condition: can the second StreamReader finish before the first StreamReader is disposed at the end of the using block?
John, The first stream reader isn't disposed or closed.
Oops. Well, I was close. It's the network stream that is disposed in ListenToNewTcpConnection(). This is the race condition.
John, Note that this is a using statement that contains an
await ProcessConnection(networkStream);, so the connection is alive while it is running
Encoding issue? Default encoding on StreamReader / StreamWriter is UTF16, I think ... and you're using a mix of UTF 8 and UTF 16 in reads / writes?
Thilak, That isn't the issue. From the docs: " It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text" https://msdn.microsoft.com/en-us/library/ms143456(v=vs.110).aspx
I put a break point on line 18 to QuickWatch the TcpListener and by the time I resumed stepping through the code execution would reliably hang on the first attempted client connection. This appears to happen every time I pause the execution for any length of time before starting the client connections. Is this something to do with the aysnc methods setup to handle the incoming connections getting cleaned up or timing out silently?
Adam, You are just using the debugger to trigger the race condition every time