Ayende @ Rahien

Refunds available at head office

Riddle me this: File access & Multi threading

What would be the output of the following code?

var list = new List<Task>();

var fileStream = File.Open("test", FileMode.Create, FileAccess.ReadWrite, FileShare.None);
for (int i = 0; i < 150; i++)
{
    var index = i;
    var task = Task.Factory.StartNew(() =>
    {
        var buffer = Encoding.ASCII.GetBytes(index + "\r\n");
        fileStream.Write(buffer, 0, buffer.Length);
    });
    list.Add(task);
}

Task.WaitAll(list.ToArray());
fileStream.Flush(true);
fileStream.Dispose();

Please note that:

  • What order the data is saved is not important.
  • Preventing data corruption is important.

Can you guess?

Comments

Petar Repac
10/12/2010 10:29 AM

fileStream.Write(buffer, 0, buffer.Length) can be interrupted, so you end up with corrupted data on disk?

Writes to the disk should be serialized?
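
For example, one way to serialize them would be to take a shared lock around each write; a sketch against the code above (the writeLock object is mine, not from the post):

var writeLock = new object(); // declared once, before the loop

// ... and inside the loop the task body becomes:
var task = Task.Factory.StartNew(() =>
{
    var buffer = Encoding.ASCII.GetBytes(index + "\r\n");
    lock (writeLock)
    {
        // Only one task at a time touches the stream, so its internal
        // buffer and file position are updated consistently.
        fileStream.Write(buffer, 0, buffer.Length);
    }
});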

vmykyt
10/12/2010 10:38 AM

The index variable looks immutable.

Also, the file is only flushed at the end, after all the tasks.

So the question is: can something like MemoryStream (or whatever underlying stream is used for buffering) support simultaneous write access?

I don't know exactly, but as I remember Console.Out allows writing from different threads.

So we will probably get a mess of numbers.

Duarte
10/12/2010 10:59 AM

I’m pretty sure the file position will get corrupted, meaning that the code will probably overwrite some data (although it may take far more than 150 iterations for the error to manifest), which the CLR may detect and then throw an IOException. However, I don’t see the point of potentially launching 150 threads (the thread pool will inject more threads when they block, which is the case with Write, since all IO on Windows is asynchronous). You should use continuations to serialize the access to the underlying stream and, if you think the cost of asynchronous IO justifies itself (it involves a kernel APC and more context switching), use TaskFactory.FromAsync to build the tasks.
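
As a rough sketch of the continuation idea (reusing the fileStream from the post, and keeping the synchronous Write for brevity instead of TaskFactory.FromAsync):

// Each write is chained onto the previous one, so at most one write runs
// at a time while the work still happens off the calling thread.
Task chain = Task.Factory.StartNew(() => { });
for (int i = 0; i < 150; i++)
{
    var index = i;
    chain = chain.ContinueWith(_ =>
    {
        var buffer = Encoding.ASCII.GetBytes(index + "\r\n");
        fileStream.Write(buffer, 0, buffer.Length);
    });
}
chain.Wait();
fileStream.Flush(true);
fileStream.Dispose();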

Matt
10/12/2010 11:00 AM

You will occasionally get the incorrect sequence of binary bits written to the file because multiple threads are writing to the file stream at the same time.

To a computer or human interpreting the binary data in the resulting file as ASCII, the symptoms would be missing characters or incorrect characters.

Dalibor Čarapić
10/12/2010 11:23 AM

Since the code block which is executed async is a closure which accesses the outer 'i' variable, the code would probably never write the first numbers (see csharpindepth.com/Articles/Chapter5/Closures.aspx).

IMHO the output would probably be something like

148
148
149
149
150
150
150
... repeated 150 times

Rob Kent
10/12/2010 11:36 AM

According to the MS docs you will likely get an exception because each thread does not have an exclusive lock on its handle:

"When a FileStream object does not have an exclusive hold on its handle, another thread could access the file handle concurrently and change the position of the operating system's file pointer that is associated with the file handle...

If an unexpected change in the handle position is detected in a call to the Write method, the contents of the buffer are discarded and an IOException is thrown."

vmykyt
10/12/2010 11:37 AM
   "Since the code block which is executed async is a closure which  accesses the outer 'i' variable the code would probably never write the first numbers (see csharpindepth.com/Articles/Chapter5/Closures.aspx)."

Hmm..

looks like this local variable prevent such behavior

var index = i;
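
A quick illustration of the difference (a sketch, not code from the post):

// Capturing the loop variable directly shares a single 'i' across all closures,
// so by the time the tasks run they may all observe the final value:
for (int i = 0; i < 3; i++)
    Task.Factory.StartNew(() => Console.WriteLine(i));      // often prints 3, 3, 3

// Copying it into a per-iteration local gives each closure its own value,
// which is exactly what the 'var index = i;' line in the riddle does:
for (int i = 0; i < 3; i++)
{
    var index = i;
    Task.Factory.StartNew(() => Console.WriteLine(index));  // prints 0, 1, 2 in some order
}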

tobi
10/12/2010 11:52 AM

Only two things can happen: either whole lines are saved, or partial lines can occur. I believe what actually happens depends on the file system implementation. My guess is that whole lines will be preserved, although this is not guaranteed.

Most speculations in the comments are wrong. No IOException will occur, the file position will not get corrupted and the closure issue has explicitly been avoided.

Lachlan
10/12/2010 12:24 PM

The default TaskFactory is configured with TaskScheduler.MaximumConcurrencyLevel == 1

Chris patterson
10/12/2010 12:30 PM

My guess would be

1
150
149
148
...
2

Patrick Huizinga
10/12/2010 01:07 PM

@Lachlan

According to MSDN, MaximumConcurrencyLevel is actually Int32.MaxValue.

The whole purpose of System.Threading.Tasks is to manage concurrency, not to prevent it.

Matt
10/12/2010 01:44 PM

Does this blog have a comments feed?

Philip
10/12/2010 02:02 PM

Yes, the big issue here is that file access through a writer is not thread safe at all. Not just the file position pointer, but the actual write buffer as well - fileStream.Write() simply queues up the write and returns.

I'm guessing the output of your code is a file with 0-x things written into it, and an IOException.

BTW: There is a way to get a synchronized stream for a text writer, TextWriter.Synchronized(), but this requires using a TextWriter instead of a FileStream. It also has the downside of leaving the FCL in charge of your performance if this code is called constantly.
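
A sketch of that approach, assuming the writes are moved onto a StreamWriter (the file name and structure are only illustrative, not from the post):

// TextWriter.Synchronized wraps the writer so that each call is taken under a
// lock; the ordering is still arbitrary, but the calls no longer overlap.
var tasks = new List<Task>();
using (var writer = TextWriter.Synchronized(new StreamWriter("test")))
{
    for (int i = 0; i < 150; i++)
    {
        var index = i;
        tasks.Add(Task.Factory.StartNew(() => writer.WriteLine(index)));
    }
    Task.WaitAll(tasks.ToArray());
}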

Philip R.
10/12/2010 02:30 PM

  • You could have out of order writes - which you stated isn't a problem. This is pretty likely.

  • You could have "corrupted" entries - the access through FileStream.Write isn't synchronized; it uses an internal buffer, and it uses native methods and passes in buffer offsets to do the write. It would be somewhat easy to have a write "smash" the buffer and end up writing NULL instead of the characters of the correct string, or possibly even duplicate a string. This is fairly likely.

  • The documentation says IOExceptions could occur with multiple-thread writing, but it looks like that only happens when the native handle is exposed. I'm not sure this could happen.

tobi
10/12/2010 04:08 PM

I just confirmed with Reflector that FileStream does not synchronize its buffer as I assumed. This means that the program will likely not complete because of a contract exception inside of the FileStream.

James
10/12/2010 04:26 PM

You'll get garbled content in the file, but no exceptions. This is not platform/OS dependent, because the problem is not with the file writes but with the FileStream's internal buffer.

You are attempting to write a total of 640 bytes which is a lot less than the default 4K internal buffer for FileStream. So you will not actually write to the file until the Flush(). Thus you will not get any threading issues with accessing the file itself.

So the real question is, what happens with multiple threads updating the internal buffer? Reflector shows that both the actual change to the internal buffer, and the change to keep track of the write position are not thread-safe. Both of these could cause corruption of the writes. However they will not cause any exceptions, and so you will have something in the internal buffer to write to the file when Flush() is called, even though it could be corrupted.

Caveat: this is for .NET 2.0; I have not looked into other versions.
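
For reference, the 640-byte figure checks out: 10 one-digit lines of 3 bytes, 90 two-digit lines of 4 bytes, and 50 three-digit lines of 5 bytes. A quick way to verify it (my snippet, not from the comment):

// 10 * 3 + 90 * 4 + 50 * 5 = 30 + 360 + 250 = 640 bytes,
// comfortably under FileStream's default 4096-byte internal buffer.
int total = Enumerable.Range(0, 150)
                      .Sum(i => Encoding.ASCII.GetByteCount(i + "\r\n"));
Console.WriteLine(total);   // 640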

Nick Aceves
10/12/2010 04:47 PM

You're going to get some of the numbers being written over, and some will have invalid data (either 3, 4, or 5 bytes of it?) before them.

This code inside FileStream is the culprit:

Buffer.InternalBlockCopy(array, offset, this._buffer, this._writePos, byteCount);
this._writePos += byteCount;

If two of the tasks are executing this at almost the same time, one may copy over what the other has just copied to the buffer, but _writePos will still get increased twice (probably, at least).
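
Spelled out, one possible interleaving would be (an illustration with made-up offsets, not decompiled code):

// Both tasks read the same _writePos before either has advanced it:
// Task A: InternalBlockCopy(..., _writePos = 100, 5)   // copies "101\r\n" at offset 100
// Task B: InternalBlockCopy(..., _writePos = 100, 5)   // overwrites it with "102\r\n"
// Task A: _writePos += 5;                              // _writePos is now 105
// Task B: _writePos += 5;                              // _writePos is now 110
// Net effect: one value is lost, and bytes 105..109 of the buffer are never
// written, which later shows up in the file as stray 0x00 bytes.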

Gian Maria
10/12/2010 05:01 PM

Since the MSDN help for FileStream (msdn.microsoft.com/.../system.io.filestream.aspx) says

"Any public static (Shared in Visual Basic) members of this type are thread safe. Any instance members are not guaranteed to be thread safe."

it is not safe to use the Write() method from different threads, so the above piece of code could actually corrupt the data. With integer numbers it is less likely that you will encounter corruption, but if you write longer strings you can easily spot that the writes are actually intermixed.

PS
10/12/2010 07:00 PM

Actually, I think that the code will work and satisfy the requirements (i.e. noncorrupted, order-insensitive write) due to FileShare.None. All of the tasks will get scheduled on the thread pool, will run in an arbitrary order, but only one at a time will be able to write to the stream, effectively serializing them.

PS
10/12/2010 07:41 PM

On second thought (and rereading the MSDN doco on the method), I take that back: FileShare.None only applies to the act of opening the stream, not to any actual IO done over it. My bad...

Sam
10/12/2010 09:26 PM

Methinks Nick is teh winnar.

Sam
10/12/2010 09:29 PM

err, I mean James & Nick. Sorry James I pulled a TL;DR on your answer :)

Alex Simkin
10/12/2010 09:35 PM

On my computer this gives unpredictable output with every run, sometimes missing numbers or writing 0x00 in random places.

James
10/12/2010 10:24 PM

@Sam - to be honest, I'd have done that to myself! Nick's answer is definitely clearer :-)

krlm
10/13/2010 09:11 AM

IMO: The correct solution should be based on a separate writer thread which collects data from the other threads and owns the file handle and the whole writing and error-handling logic. Regards.
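
A sketch of that shape, using a BlockingCollection as the hand-off (the structure is my reading of the suggestion, not code from the post):

// A single writer task owns the FileStream; the producer tasks only post strings.
var lines = new BlockingCollection<string>();

var writerTask = Task.Factory.StartNew(() =>
{
    using (var fs = File.Open("test", FileMode.Create, FileAccess.ReadWrite, FileShare.None))
    {
        foreach (var line in lines.GetConsumingEnumerable())
        {
            var buffer = Encoding.ASCII.GetBytes(line);
            fs.Write(buffer, 0, buffer.Length);
        }
    }
}, TaskCreationOptions.LongRunning);

var producers = new List<Task>();
for (int i = 0; i < 150; i++)
{
    var index = i;
    producers.Add(Task.Factory.StartNew(() => lines.Add(index + "\r\n")));
}

Task.WaitAll(producers.ToArray());
lines.CompleteAdding();   // lets GetConsumingEnumerable drain and finish
writerTask.Wait();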

Manu
10/13/2010 05:17 PM

I am not sure what happens, but why would you not choose to use Parallel.For to achieve the same outcome?

Konstantin
10/14/2010 04:17 AM

@Manu Parallel.For is for compute-bound operations. Ayende's example uses IO-bound ones.

Comments have been closed on this topic.