Strings are annoying


I have a love/hate/hate relationship with .NET strings. That is because they are both incredibly convenient and horribly inefficient. Let us look at the following file:

    "first_name","last_name","company_name","address","city","county","state","zip","phone1","phone2","email","web"
    "James","Butt","Benton, John B Jr","6649 N Blue Gum St","New Orleans","Orleans","LA",70116,"504-621-8927","504-845-1427","jbutt@gmail.com","http://www.bentonjohnbjr.com"
    "Josephine","Darakjy","Chanay, Jeffrey A Esq","4 B Blue Ridge Blvd","Brighton","Livingston","MI",48116,"810-292-9388","810-374-9840","josephine_darakjy@darakjy.org","http://www.chanayjeffreyaesq.com"
    "Art","Venere","Chemel, James L Cpa","8 W Cerritos Ave #54","Bridgeport","Gloucester","NJ","08014","856-636-8749","856-264-4130","art@venere.org","http://www.chemeljameslcpa.com"

Reading this is a simple matter of writing something like this:

    // Runs inside an iterator method returning IEnumerable<Dictionary<string, string>>;
    // reader is a StreamReader over the file above.
    var headerLine = reader.ReadLine();
    var headers = headerLine.Split(',').Select(h => h.Trim('"')).ToArray();

    while (!reader.EndOfStream)
    {
        var line = reader.ReadLine();
        // Naive split: a quoted field that contains a comma is cut in two.
        var columns = line.Split(',');
        var dic = new Dictionary<string, string>();
        for (var i = 0; i < headers.Length; i++)
        {
            dic[headers[i]] = columns[i].Trim('"');
        }
        yield return dic;
    }

Now, let us look at the same code again, but this time, I marked places where we are doing string allocation:

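    // ALLOC: one string for the header line
    var headerLine = reader.ReadLine();
    // ALLOC: one string per header from Split, another per header from Trim
    var headers = headerLine.Split(',').Select(h => h.Trim('"')).ToArray();

    while (!reader.EndOfStream)
    {
        // ALLOC: one string per line
        var line = reader.ReadLine();
        // ALLOC: one string per column, plus the array that holds them
        var columns = line.Split(',');
        var dic = new Dictionary<string, string>();
        for (var i = 0; i < headers.Length; i++)
        {
            // ALLOC: a trimmed copy of each quoted column
            dic[headers[i]] = columns[i].Trim('"');
        }
        yield return dic;
    }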

That is a lot of strings to allocate. And if we are reading a large file, that can very quickly turn into a major performance issue. If I were writing the same code in C, for example, I would be reusing the allocated string multiple times, but here we have to allocate and discard them pretty much continuously.

The really sad thing is that it is incredibly easy to do this, usually without paying any attention. And even if you know what you are doing, you pretty much have to roll your own everything to get around it. And that sucks quite badly.
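
To give a feel for what "rolling your own" might look like, here is a minimal sketch of a field reader that walks a line and hands back slices of it instead of new strings. It assumes a runtime that has `ReadOnlySpan<char>` (.NET Core 2.1 or later), which is not something this post relies on. The `CsvFields` class and its `TryReadField` and `CountFields` methods are hypothetical names made up for illustration. It treats a comma inside quotes as part of the field, but it does not handle escaped quotes (`""`), so take it as a starting point rather than a real CSV parser.

```csharp
using System;

// Hypothetical helper, not part of the code above and not a library API.
public static class CsvFields
{
    // Reads the field that starts at `pos` and returns it as a slice of `line`,
    // so no new string is allocated. Surrounding quotes are excluded.
    // `pos` is advanced to the start of the next field; once the last field
    // has been read it is set past the end of the line and the next call
    // returns false.
    public static bool TryReadField(ReadOnlySpan<char> line, ref int pos, out ReadOnlySpan<char> field)
    {
        if (pos > line.Length)
        {
            field = default;
            return false;
        }

        int start = pos;
        int end;
        if (pos < line.Length && line[pos] == '"')
        {
            // Quoted field: runs to the next quote. Escaped quotes ("") are NOT handled.
            start = pos + 1;
            int close = line.Slice(start).IndexOf('"');
            end = close < 0 ? line.Length : start + close;
        }
        else
        {
            // Unquoted field: runs to the next comma or the end of the line.
            int sep = line.Slice(start).IndexOf(',');
            end = sep < 0 ? line.Length : start + sep;
        }

        field = line.Slice(start, end - start);

        // Move past the separator that follows this field, if there is one.
        int comma = line.Slice(end).IndexOf(',');
        pos = comma < 0 ? line.Length + 1 : end + comma + 1;
        return true;
    }

    // Example use: count the fields in a line without allocating any substrings.
    public static int CountFields(string line)
    {
        int count = 0;
        int pos = 0;
        ReadOnlySpan<char> span = line.AsSpan();
        while (TryReadField(span, ref pos, out _))
        {
            count++;
        }
        return count;
    }
}
```

Note that this only removes the per-column allocations from `Split` and `Trim`. The string returned by `ReadLine` is still allocated for every line, and the moment you need a field as a dictionary key or value you are back to calling `ToString()` on the slice. Getting rid of those as well would mean reading into a reusable `char[]` buffer and keeping the data in a form that does not require `string` at all, which is exactly the "roll your own everything" problem.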