The cost of working with strings
Following my last post, I decided that it might be better to actually show what the difference is between direct string manipulation and working at lower levels.
I generated a sample CSV file with 10 million lines and 6 columns. The file size was 658MB. I then wrote the simplest code that I could possibly think of:
public class TrivialCsvParser
{
    private readonly string _path;

    public TrivialCsvParser(string path)
    {
        _path = path;
    }

    public IEnumerable<string[]> Parse()
    {
        using (var reader = new StreamReader(_path))
        {
            while (true)
            {
                var line = reader.ReadLine();
                if (line == null)
                    break;
                var fields = line.Split(',');
                yield return fields;
            }
        }
    }
}
This ran in 8.65 seconds (with a no-op action) and kept memory utilization at about 7MB.
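For context, the measurements come from a harness along these lines; the exact harness isn't shown in the post, so treat this Stopwatch-based sketch (and the file path in it) as an assumption:

using System;
using System.Diagnostics;

class Program
{
    static void Main()
    {
        // "sample.csv" is a placeholder path, not the file name used in the post.
        var parser = new TrivialCsvParser("sample.csv");

        var sw = Stopwatch.StartNew();
        foreach (var fields in parser.Parse())
        {
            // no-op action: enumerate the rows without touching the fields
        }
        sw.Stop();

        Console.WriteLine("Elapsed: {0:N2} sec", sw.Elapsed.TotalSeconds);
        Console.WriteLine("Working set: {0:N0} MB", Environment.WorkingSet / (1024 * 1024));
    }
}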
The next thing to try was just reading through the file without doing any parsing. So I wrote this:
public class NoopParser
{
    private readonly string _path;

    public NoopParser(string path)
    {
        _path = path;
    }

    public IEnumerable<object> Parse()
    {
        var buffer = new byte[1024];
        using (var stream = new FileStream(_path, FileMode.Open, FileAccess.Read))
        {
            while (true)
            {
                var result = stream.Read(buffer, 0, buffer.Length);
                if (result == 0)
                    break;
                yield return null; // noop
            }
        }
    }
}
Note that this isn’t actually doing anything with the data, but it took just 0.83 seconds, so we see a pretty big difference here. By the way, the amount of memory used isn’t noticeably different: both use about 7 MB, probably because neither holds on to any of the data in a meaningful way.
I ran these tests using a release build, and I ran them multiple times, so the file is probably all in the OS cache and the I/O cost is minimal. Note, however, that the NoopParser skips a lot of the work the TrivialCsvParser does, such as finding line boundaries and splitting each line into fields. Interestingly enough, just removing the split reduces the cost from 8.65 seconds to 3.55 seconds.
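For completeness, the no-split variant is essentially the same loop with the Split call removed; a minimal sketch (the class name is mine, since the post doesn't show this version) would be:

public class LineOnlyParser
{
    private readonly string _path;

    public LineOnlyParser(string path)
    {
        _path = path;
    }

    public IEnumerable<string> Parse()
    {
        using (var reader = new StreamReader(_path))
        {
            while (true)
            {
                var line = reader.ReadLine();
                if (line == null)
                    break;
                yield return line; // one string per line, no per-field strings
            }
        }
    }
}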
Comments
The amount of memory in use looks the same, but since lots of the strings allocated in the first example are collected in gen0, it is the total allocation volume that actually differs, not the resident memory.
The split is slow because you are doing an O(n) operation 10 million times, copying char arrays around or calling Substring with marked indexes and putting the results into an array or list (I haven't checked the implementation recently, but I would expect that to happen). This sort of thing could be optimized with little effort by implementing a ReadLineAndSplit method: while reading the buffer, you could do the marking and splitting yourself (either the way the Substring operation does it, or by returning an optimized indexed buffer).
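As a rough illustration of that idea, one could scan each line once and record field boundaries as offsets, materializing strings only on demand. Everything below (the FieldRange type, the MarkFields name) is a hypothetical sketch of the commenter's suggestion, not the BCL's implementation:

public struct FieldRange
{
    public int Start;
    public int Length;
}

public static int MarkFields(string line, char separator, FieldRange[] ranges)
{
    // Assumes ranges is large enough to hold every field in the line.
    int count = 0, start = 0;
    for (int i = 0; i <= line.Length; i++)
    {
        if (i == line.Length || line[i] == separator)
        {
            ranges[count].Start = start;
            ranges[count].Length = i - start;
            count++;
            start = i + 1;
        }
    }
    return count; // the caller materializes strings only for the fields it needs
}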
That again suggests that at least some custom implementation is required if we want fast, memory-efficient string operations :P
Bartosz, The point I am making here is that the mere reading of the data into strings is very costly.
I understand what you are trying to explain to us, that using strings can be expensive. But I can't help thinking that this comparison doesn't make any sense; it's like comparing apples with oranges, because the NoopParser does nothing at all with the bytes it reads, in contrast to the TrivialCsvParser.
If you compare both parsers to car factories, then the TrivialCsvParser represents a real car factory that accepts the different car parts (the file bytes) and returns an assembled car (the IEnumerable[string]), while the NoopParser represents a car factory that accepts the different car parts (the file bytes), only puts them on a shelf (the buffer), and returns no car whatsoever.
As you mentioned yourself, removing the split already reduces the cost, so the less processing the TrivialCsvParser does, the less time it takes; if it doesn't try to build a car, it will be faster.
I think a better comparison would be to actually do something with the bytes, like having both parsers return an instance of the same type containing the data stored in the CSV, because you can delay the creation of a string as long as possible, but if you need the string in the end, you'll have to create it eventually.
Can you post the code used to generate the CSV file?
Josh, I used the Spawner project for this. http://sourceforge.net/projects/spawner/
It looks like you're preparing to (partially) remove the Lucene indexes in RavenDb and switch to the Voron trees. Am I right?
Scooletz, No, that isn't something that is currently under consideration.
What are the column types and subtypes you used when generating the file? Basically, I want to test some stuff with a similar file.
It would also be interesting to know whether encoding makes any difference. With UTF-8, I assume it actually goes through the data parsing the various code points while creating the string. If the encoding were already the same as the in-memory representation, this could be optimized down to a simple memory copy.
Also, how about reading the byte array and converting it to a string without going through StreamReader? Does that make any difference in performance?
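One way to probe that question is to time the raw decode on its own, bypassing StreamReader entirely. This sketch (the method name and buffer size are arbitrary choices of mine) is a timing probe only, not a correct text reader:

using System;
using System.IO;
using System.Text;

public static class DecodeProbe
{
    public static void DecodeOnly(string path)
    {
        var buffer = new byte[1024];
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            while (true)
            {
                var read = stream.Read(buffer, 0, buffer.Length);
                if (read == 0)
                    break;
                // UTF-8 -> UTF-16 conversion. Note this can split a multi-byte
                // code point across buffer boundaries, so the output may be
                // wrong at the edges; a Decoder instance would handle that.
                var chunk = Encoding.UTF8.GetString(buffer, 0, read);
            }
        }
    }
}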
It's pretty clear that the gen0 churn from lots of little strings is causing the difference; a PerfView diff would probably show it. It has nothing to do with the amount of memory in use (?). As Joe Duffy and others have shown, string.Split (and Substring) are strictly forbidden in any perf-oriented C# code. To me it reads like the author was surprised by the outcome, yet it seems totally in line with what you would expect. Slightly confused..
Second thought: OK, the difference between the non-split version and the byte-array version is interesting, but it probably has to do with the CLR trying to pool the strings, along with the UTF-8 parsing as mentioned. So, agreed, that is new info, besides the split() one.
Josh, This is the schema I used:
Id,Numbers,Sequence,1|0|1
SSN,Human,Social Security Number
Name,Human,Full Name,True|True
Email,Human,E-Mail Address (name@domain.tld)
ZipCode,Human,US ZIP Code (#####)
IP,Networking,IPv4 Address,167772229|24|32|32|False|False
Thomas, The actual cost is that you are generating a lot of strings; it doesn't really matter how you do it.
Hannes, The CLR doesn't do string pooling.
@Hannes K, I think that reading strings from a UTF-8 stream does not require parsing in any real sense; it's a byte operation. We are just converting an array of UTF-8 bytes to UTF-16, and since the code points are more or less the same size, the additional cost should be very small if implemented properly (range lookups, no buffer expansion).
But nevertheless, that would be an interesting test to run.
Bartosz, In my tests, a lot of the cost went into the encoding operation, yes.
@Ayende, So in that case I wonder if the conversion operation isn't the one to blame for the performance hiccups; well, from what you are saying, it is. I'm going to take a huge guess that range lookups aren't done while converting. Since this is core functionality that one has very little control over, I guess that improving the situation would be hard.
Extra news: the CLR does do string pooling. ;-)
http://msdn.microsoft.com/en-us/library/system.string.intern.aspx "The common language runtime conserves string storage by maintaining a table, called the intern pool, that contains a single reference to each unique literal string declar....."
That's, by the way, also the reason for some strange C# corner cases: http://stackoverflow.com/questions/8317054/strange-string-literal-comparison and http://stackoverflow.com/questions/194484/whats-the-strangest-corner-case-youve-seen-in-c-sharp-or-net
@Hannes K,
This kind of behavior is only available by using the String.Intern method, therefore it's not very useful. As far as I remember, String.Intern is problematic: no matter what size of strings you were allocating, they were always allocated on the LOH and never freed, because the intern table was holding on to those references, which led to LOH fragmentation in a big way. I'm not sure if this has been fixed, though (it was present in 3.5).
Please reference where this is stated. This is CLR behavior; the String.Intern method is just a way to access it programmatically. See my Stack Overflow links. The CLR does use an internal string table. There is also a post from Jon Skeet explaining this, though I would have to dig it out from his blog..
Here it is.. http://csharpindepth.com/Articles/General/Strings.aspx To quote him "...and unfortunately it's much misunderstood.". Yep, agree on that one...
http://blogs.msdn.com/b/ericlippert/archive/2009/09/28/string-interning-and-string-empty.aspx
@Hannes K, the link that Petar provided actually explains this phenomenon. Even more to the point, the Jon Skeet link you provided states that automatic interning is only done on string literals, which is a very important distinction.
As for the fragmentation, I did my own checks armed with WinDbg, but, looking around on Google, I also found this: http://stackoverflow.com/questions/686950/large-object-heap-fragmentation. My tests were similar: we had lots (!) of small objects flying around on the LOH, and further analysis of what was behind the addresses showed they were strings :P Since we wanted to be clever and smart, we had used String.Intern heavily; removing that code and rolling custom pooling made the problem go away (we still had small strings on the LOH, but to a much smaller extent).
Now that I think of it, it's hard to say that String.Intern is even pooling: the CLR maintains a table of references, and should we change the string in any way, it writes a new reference to the table, so it's more like copy-on-write from Unix :P Just saying.
OK, every day's a school day; learned something new. I bow my head to you, sir. And to anyone who knew that.
@Hannes K, Glad that I could provide some meaningful info :)
Hannes, String intern is only used if you do it explicitly.
The easiest way to show this is:
object.ReferenceEquals("foo" + 1, "foo" + 1);
This returns false.
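To round that out, here is a small sketch contrasting the literal case, the runtime-concatenation case, and explicit String.Intern (the first and third lines are additions for illustration, not from the original comment):

// Literals are interned automatically, so identical literals share one reference.
Console.WriteLine(object.ReferenceEquals("foo1", "foo1"));       // True
// Strings built at runtime are not interned...
Console.WriteLine(object.ReferenceEquals("foo" + 1, "foo" + 1)); // False
// ...unless you intern them explicitly.
Console.WriteLine(object.ReferenceEquals(
    string.Intern("foo" + 1), string.Intern("foo" + 1)));        // True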