Challenges: Spot the bug in the stream – answer
In my previous post, I asked you to find the bug in the following code:
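The original code block did not survive here; the following is a sketch of the kind of timeout wrapper the post describes. The method name `ReadWithTimeout` and the use of `Task.WhenAny` are illustrative assumptions, not necessarily the exact original code:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class StreamTimeoutBug
{
    // Illustrative sketch of the buggy pattern: a timeout wrapper over
    // Stream.ReadAsync. If the timeout fires we throw, but the ReadAsync
    // task is still running and still owns the buffer we handed it.
    public static async Task<int> ReadWithTimeout(
        Stream stream, byte[] buffer, int offset, int count, TimeSpan timeout)
    {
        Task<int> readTask = stream.ReadAsync(buffer, offset, count);
        Task completed = await Task.WhenAny(readTask, Task.Delay(timeout));
        if (completed != readTask)
            throw new TimeoutException(); // the pending read is NOT aborted here
        return await readTask;
    }
}
```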
This code looks okay at a glance, but it turns out to hide a really nasty data corruption bug waiting to happen. Here is what the problematic usage looks like:
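The usage code is also missing here; this sketch shows the shape of the problem, assuming a timeout wrapper along the lines described in the post (`ReadWithTimeout` is an illustrative name) and a buffer rented from the shared `ArrayPool`:

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Threading.Tasks;

public static class PooledBufferUsage
{
    // Illustrative sketch: the buffer comes from a shared pool, and the
    // finally block returns it even when the read timed out.
    public static async Task ReadMessage(Stream stream)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(4096);
        try
        {
            int read = await StreamTimeoutBug.ReadWithTimeout(
                stream, buffer, 0, buffer.Length, TimeSpan.FromSeconds(5));
            Console.WriteLine($"Read {read} bytes");
        }
        finally
        {
            // On timeout, the buffer goes back to the pool while the pending
            // ReadAsync can still write into it. Whoever rents it next may
            // observe data written by the stale I/O operation.
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```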
Do you see the error now?
If the operation times out, an exception is raised, but the underlying I/O operation isn't over. We are using a shared pool, so the buffer we use may be handed over to someone else. At that point, that someone does something with the buffer, but the still-pending I/O operation will write data into it, meaning it will probably contain garbage when actually used.
For this to actually happen, you need a timeout, reuse of the buffer, and the I/O operation completing at just the wrong time. In other words, a sequence of highly unlikely events that would assuredly happen within an hour of pushing something like that to production. For extra fun, this will reliably happen the moment you have network issues. So imagine that you have a slow node, which then causes memory corruption, which ends up as a visible bug (instead of, perhaps, an aborted request) very rarely, and with no indication of how it happened.
How do you fix this? Like this:
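The fixed code block is missing here as well; a sketch of the fix the post describes, pushing the timeout down to the stream as a cancellation token (names illustrative):

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class StreamTimeoutFixed
{
    // Illustrative sketch of the fix: a CancellationTokenSource constructed
    // with a timeout cancels the token after the given delay, and the token
    // is passed to ReadAsync so the stream itself aborts the read.
    public static async Task<int> ReadWithTimeout(
        Stream stream, byte[] buffer, int offset, int count, TimeSpan timeout)
    {
        using var cts = new CancellationTokenSource(timeout);
        try
        {
            return await stream.ReadAsync(buffer, offset, count, cts.Token);
        }
        catch (OperationCanceledException)
        {
            // The read was aborted at the stream level, so the buffer is
            // no longer in use and can safely go back to the pool.
            throw new TimeoutException();
        }
    }
}
```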
This uses a cancellation token, which causes the operation to be aborted at the stream level, meaning that we can safely reuse the buffers we passed to the underlying stream.
More posts in "Challenges" series:
- (03 Jan 2020) Spot the bug in the stream – answer
- (15 Feb 2010) Where is the optimization?
Comments
I assume this isn't a theoretical question, but an actual issue you encountered... within an hour of being used
Peter,
We ran into this in code review.
I think this still has a race condition. At least the default implementation of Stream only checks whether the operation has been cancelled before it calls BeginRead. Later cancellations via the linked CancellationTokenSource will have no effect on an already running read operation, e.g. a slow network read.
The code above looks better, but it will perform just as badly as the original version if the underlying stream does not override the default implementation. Cancellation is always asynchronous, and you never know when you have reached a safe state unless you wait for the operation to complete and time out properly.
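To illustrate the point being made: the default implementation of `Stream.ReadAsync` consults the token only once, up front, before falling back to the `BeginRead`/`EndRead` path. This is a rough paraphrase, not the exact runtime source:

```csharp
using System.Threading;
using System.Threading.Tasks;

// Rough paraphrase of the base Stream.ReadAsync overload: if the token is
// already cancelled the call short-circuits; otherwise the read runs via the
// APM-based path, which never observes the token again.
public virtual Task<int> ReadAsync(
    byte[] buffer, int offset, int count, CancellationToken cancellationToken)
{
    return cancellationToken.IsCancellationRequested
        ? Task.FromCanceled<int>(cancellationToken)
        : BeginEndReadAsync(buffer, offset, count); // token not observed here
}
```

A stream like `NetworkStream` overrides this and plumbs the token through to the OS-level I/O, which is why the fix works there.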
Alois,
If there are calls that need to be cancelled, the underlying stream likely already handles them. For example, when using NetworkStream, you'll end up here:
https://github.com/dotnet/runtime/blob/4f9ae42d861fcb4be2fcd5d3d55d5f227d30e723/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.Windows.cs#L233
https://github.com/dotnet/runtime/blob/4f9ae42d861fcb4be2fcd5d3d55d5f227d30e723/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs#L796
If the underlying stream impl can't be cancelled, you can't safely abort the operation anyway, so nothing really can be done.
Yes, NetworkStream seems to handle this properly. Thanks for the NetworkStream links. Async code is much harder to understand and riddled with race conditions. Personally, I am still not convinced that everything-is-async is the right thing to do. If you have chained thousands of async operations that work nicely in a healthy system, everything is fine, but how do you handle, at every layer, a sudden slowdown because e.g. the network is slow or file operations take longer than before? In that case, thousands of tasks pile up and will eventually hit the thread pool worker thread limit, leading to thread pool starvation and a sudden drop in throughput. If you have dependent tasks, you can even get a deadlock by awaiting tasks that are never scheduled to run because of the worker thread limit. Those are hard-to-debug issues. You need to reverse engineer all the task dependencies to get things working again, which is no fun. Are there any libraries you would recommend that make such things easier?
Alois,
The key here is that you don't try to handle them all at a granular level. That is why the CancellationToken abstraction is so great. You get the ability to set policy at the highest level.

Then you must define when you want to cancel. Is the system just slow, or did it deadlock? How would you decide to cancel running operations? Cancel with a timeout, as the code above implies? On a slow machine, timing is very different compared to a big server machine. In principle it is easy, but in practice it is still a hard problem.
Alois,
We aren't trying to provide a generic overall system. That is typically not viable. We know that the underlying stream is going to be either a file or a network stream, that is it. And there are mechanisms to cancel those sorts of I/O operations.
Deciding to cancel is typically easy:
Those sorts of things give me something to react to; I can then cancel the operation and clean up the rest of the system.