The cost of working with strings


Following my last post, I decided that it might be better to actually show what the difference is between direct string manipulation and working at lower levels.

I generated a sample CSV file with 10 million lines and 6 columns. The file size was 658MB. I then wrote the simplest code that I could possibly think of:

    using System.Collections.Generic;
    using System.IO;

    public class TrivialCsvParser
    {
        private readonly string _path;

        public TrivialCsvParser(string path)
        {
            _path = path;
        }

        // Lazily yields one string[] per line, so only the current row is alive.
        public IEnumerable<string[]> Parse()
        {
            using (var reader = new StreamReader(_path))
            {
                while (true)
                {
                    var line = reader.ReadLine();
                    if (line == null)
                        break; // end of file
                    // Naive split: no handling of quoted or escaped commas.
                    var fields = line.Split(',');
                    yield return fields;
                }
            }
        }
    }

This ran in 8.65 seconds (with a no-op action for each row) and kept memory utilization at about 7MB.
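The timing code itself isn't part of this post. As a rough sketch only, a measurement like this could be taken with a Stopwatch around a loop that enumerates every row and does nothing with it. The names ParserTiming and Measure are hypothetical, invented for this illustration, and this measures elapsed time only, not memory:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper, not the code used for the numbers in this post.
public static class ParserTiming
{
    // Enumerates every item, does nothing with it, and returns the elapsed
    // wall-clock time. The number of items seen is returned through 'count'.
    public static TimeSpan Measure<T>(IEnumerable<T> rows, out long count)
    {
        count = 0;
        var stopwatch = Stopwatch.StartNew();
        foreach (var row in rows)
        {
            count++; // stand-in for the no-op action
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

Passing it the result of new TrivialCsvParser(path).Parse() times the full enumeration of the file, and the same helper works unchanged for any other parser that returns an IEnumerable.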

The next thing to try was just reading through the file without doing any parsing. So I wrote this:

    using System.Collections.Generic;
    using System.IO;

    public class NoopParser
    {
        private readonly string _path;

        public NoopParser(string path)
        {
            _path = path;
        }

        // Reads the raw bytes in 1KB chunks and ignores them: no decoding,
        // no line searching, no string allocations.
        public IEnumerable<object> Parse()
        {
            var buffer = new byte[1024];
            using (var stream = new FileStream(_path, FileMode.Open, FileAccess.Read))
            {
                while (true)
                {
                    var result = stream.Read(buffer, 0, buffer.Length);
                    if (result == 0)
                        break; // end of file
                    yield return null; // noop
                }
            }
        }
    }

Note that this isn’t actually doing anything with the data. But it took 0.83 seconds, so we see a pretty big difference here. By the way, the amount of memory used isn’t noticeably different: both use about 7 MB, probably because we aren’t actually holding on to any of the data in any meaningful way.

I ran these tests using a release build, and I ran them multiple times, so the file is probably all in the OS cache and the I/O cost is pretty minimal. However, note that the NoopParser skips a lot of the work that the TrivialCsvParser does: searching for line breaks, splitting each line into fields, and so on. Interestingly enough, just removing the split reduces the cost from 8.65 seconds to 3.55 seconds, so the split alone accounts for roughly 5.1 seconds, well over half of the total.
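The post doesn't show the code for the version without the split. Presumably it looks something like the sketch below: the same ReadLine loop as the TrivialCsvParser, yielding the raw line instead of a field array. LineOnlyParser is a hypothetical name chosen for this illustration, and the 3.55 second figure was not necessarily measured with this exact code:

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical variant of TrivialCsvParser with the Split call removed.
public class LineOnlyParser
{
    private readonly string _path;

    public LineOnlyParser(string path)
    {
        _path = path;
    }

    public IEnumerable<string> Parse()
    {
        using (var reader = new StreamReader(_path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Still allocates one string per line, but no string[]
                // and no separate string for each field.
                yield return line;
            }
        }
    }
}
```

The only difference from the TrivialCsvParser is the missing Split call, which would have allocated one array plus one string per field for every line. With 6 columns, that is 7 extra allocations per line.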