Some observations on saving to file


This post from Dare Obasanjo has led me to some very interesting reading. Broadly, the question is what we should optimize when thinking about disk access.

Traditional thinking would say that we want to write as little as possible, because the disk is slow. But it turns out that this is not quite accurate, at least not in the way I used to think about it. The problem is not that the disk is slow, but that seek time is slow. Let us consider the following problem:

Given a file with N integers in it, with values from 0..N, update every 100th number to twice its current value.

On the face of it, this looks like a good way of solving the problem:

using (var file = File.OpenWrite("my.file"))
{
	for (int i = 0; i < N; i += 100)
	{
		file.Position = i * 4; // byte offset of the i-th number in the file
		var bytes = BitConverter.GetBytes(i * 2);
		file.Write(bytes, 0, 4);
	}
	file.Flush();
}

It turns out that it is actually significantly cheaper to overwrite the old values as well, rewriting the entire file, than to skip ahead and touch only the values that changed. And that is in a situation where we only ever skip forward.
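To make the comparison concrete, here is a minimal sketch of the full-overwrite approach for the same problem (it assumes the same N and the same layout of one 4-byte integer per record): every value gets rewritten, and only every 100th one is doubled.

using (var file = File.OpenWrite("my.file"))
{
	for (int i = 0; i < N; i++)
	{
		// rewrite every record; double only every 100th value
		int value = (i % 100 == 0) ? i * 2 : i;
		var bytes = BitConverter.GetBytes(value);
		file.Write(bytes, 0, 4);
	}
	file.Flush();
}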

I decided that I wanted to test three scenarios:

  • Sequential sparse update - updating every 100th record, basically the code we have above. On the face of it, this looks like it would be the most efficient approach, since we make the fewest modifications and we always seek in one direction.
  • Sequential update - overwrite the entire file.
  • Random update - updating the same number of records, each one individually, but at random positions in the file rather than in order.

Here are the actual numbers, for a file where N is one million (file size is 3.8 MB). The numbers on the Y axis represent the duration in milliseconds.

[Graph: write duration in milliseconds for each scenario, with OS buffering disabled (write-through)]

Note that in order to actually see this in action, you need to disable OS buffering; otherwise the OS will optimize the writes for you to be as sequential as possible. That said, even when using the OS buffers, there is still an interesting effect going on, as you can see in this graph:

[Graph: write duration in milliseconds for each scenario, with OS buffering enabled]

Note, however, that without write-through the data is not actually written to disk immediately; the OS uses delayed writes, so a system crash can cause data loss. (In the vast majority of cases you should not use write-through, mind you; you should let the OS handle this.)
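For reference, here is a minimal sketch of what toggling that flag looks like with the FileStream constructor used in the test below (the variable names are just for illustration):

// with write-through: writes are pushed to the device, not just the OS cache
var direct = new FileStream("test.data", FileMode.Open, FileAccess.Write,
	FileShare.None, 0x1000, FileOptions.WriteThrough);

// without write-through: the OS caches the writes and flushes them lazily
var buffered = new FileStream("test.data", FileMode.Open, FileAccess.Write,
	FileShare.None, 0x1000, FileOptions.None);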

Those numbers are relatively consistent no matter what file size I am talking about. For a file of 38 MB, it takes 8 seconds to overwrite it fully, 64 seconds to perform a sparse update, and over five minutes to perform a random update.

And here is the code that I used to test this:

using System;
using System.Diagnostics;
using System.IO;

var useWriteThrough = FileOptions.WriteThrough; // set to FileOptions.None to run with the OS buffers (delayed write)

using (var log = File.CreateText("log.txt"))
{
	var baseCounts = new[]
	{
		10000,
		100000,
		1000000,
	};
	var path = "test.data";
	log.WriteLine("Bytes\tInitial write\tSequential Sparse Update Write\tSequential Update Write\tRandom Update Write");
	for (int j = 0; j < baseCounts.Length; j++)
	{
		var count = baseCounts[j];
		var sizeInBytes = count*sizeof (int);
		Console.WriteLine("File size is: " + (sizeInBytes/1024m/1024m) + " MB");
		log.Write(sizeInBytes + "\t");
		if (File.Exists(path))
			File.Delete(path);
		Stopwatch stopwatch = Stopwatch.StartNew();
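		// initial sequential write: fill the file with count integers in order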
		using (
			Stream stream = new FileStream(path, FileMode.CreateNew, FileAccess.Write, FileShare.None, 0x1000,
			                               useWriteThrough | FileOptions.SequentialScan))
		{
			for (int i = 0; i < count; i++)
			{
				var bytes = BitConverter.GetBytes(i);
				stream.Write(bytes, 0, bytes.Length);
			}
			stream.Flush();
		}
		stopwatch.Stop();
		log.Write("{0}\t", stopwatch.ElapsedMilliseconds);

		stopwatch = Stopwatch.StartNew();
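		// sequential sparse update: seek to every 100th integer and overwrite it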
		using (
			Stream stream = new FileStream(path, FileMode.Open, FileAccess.Write, FileShare.None, 0x1000,
			                               useWriteThrough | FileOptions.SequentialScan))
		{
			for (int i = 0; i < count; i += 100)
			{
				stream.Position = sizeof (int)*i;
				var bytes = BitConverter.GetBytes(i*2);
				stream.Write(bytes, 0, bytes.Length);
			}
			stream.Flush();
		}
		stopwatch.Stop();
		log.Write("{0}\t", stopwatch.ElapsedMilliseconds);

		stopwatch = Stopwatch.StartNew();
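		// sequential full update: rewrite the entire file, doubling every 100th value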
		using (
			Stream stream = new FileStream(path, FileMode.Open, FileAccess.Write, FileShare.None, 0x1000,
			                               useWriteThrough | FileOptions.SequentialScan))
		{
			for (int i = 0; i < count; i++)
			{
				int value;
				if (i%100 == 0)
					value = i*2;
				else
					value = i;
				var bytes = BitConverter.GetBytes(value);
				stream.Write(bytes, 0, bytes.Length);
			}
			stream.Flush();
		}
		stopwatch.Stop();
		log.Write("{0}\t", stopwatch.ElapsedMilliseconds);

		stopwatch = Stopwatch.StartNew();
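		// random update: overwrite the same number of integers at random positions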
		using (
			Stream stream = new FileStream(path, FileMode.Open, FileAccess.Write, FileShare.None, 0x1000,
			                               useWriteThrough | FileOptions.RandomAccess))
		{
			var rand = new Random();
			for (int i = 0; i < count; i += 100)
			{
				// pick a random position and align it to an int boundary,
				// so that we never write half an int
				stream.Position = rand.Next(0, sizeInBytes);
				stream.Position = stream.Position - stream.Position%sizeof (int);

				var bytes = BitConverter.GetBytes(i*2);
				stream.Write(bytes, 0, bytes.Length);
			}
			stream.Flush();
		}
		stopwatch.Stop();
		log.WriteLine("{0}", stopwatch.ElapsedMilliseconds);
		log.Flush();
	}
}

I wonder if this means that I should treat a disk call like a remote call, and apply the same optimization techniques to it as well.
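As a rough sketch of what that could look like (reusing the count and path variables from the test above, plus System.Collections.Generic for the dictionary; I have not measured this): queue the scattered updates in memory and apply them in one sequential pass, the way you would batch remote calls.

// batch the scattered updates and apply them in one sequential rewrite,
// instead of seeking to each record individually
var updates = new Dictionary<int, int>(); // record index -> new value
for (int i = 0; i < count; i += 100)
	updates[i] = i * 2;

var data = File.ReadAllBytes(path);       // one sequential read
foreach (var pair in updates)
	BitConverter.GetBytes(pair.Value).CopyTo(data, pair.Key * sizeof(int));
File.WriteAllBytes(path, data);           // one sequential write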