The cost of working with strings
Following my last post, I decided that it might be better to actually show what the difference is between direct string manipulation and working at lower levels.
I generated a sample CSV file with 10 million lines and 6 columns. The file size was 658MB. I then wrote the simplest code that I could possibly think of:
public class TrivialCsvParser
{
    private readonly string _path;

    public TrivialCsvParser(string path)
    {
        _path = path;
    }

    public IEnumerable<string[]> Parse()
    {
        using (var reader = new StreamReader(_path))
        {
            while (true)
            {
                var line = reader.ReadLine();
                if (line == null)
                    break;
                var fields = line.Split(',');
                yield return fields;
            }
        }
    }
}
This ran in 8.65 seconds (with a no-op action) and kept memory utilization at about 7MB.
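For context, the measurements come from a harness along these lines; the exact harness isn't shown in the post, so treat this Stopwatch-based sketch (and the file path in it) as an assumption:

using System;
using System.Diagnostics;

class Program
{
    static void Main()
    {
        // "sample.csv" is a placeholder path, not the file name used in the post.
        var parser = new TrivialCsvParser("sample.csv");

        var sw = Stopwatch.StartNew();
        foreach (var fields in parser.Parse())
        {
            // no-op action: enumerate the rows without touching the fields
        }
        sw.Stop();

        Console.WriteLine("Elapsed: {0:N2} sec", sw.Elapsed.TotalSeconds);
        Console.WriteLine("Working set: {0:N0} MB", Environment.WorkingSet / (1024 * 1024));
    }
}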
The next thing to try was just reading through the file without doing any parsing. So I wrote this:
public class NoopParser
{
    private readonly string _path;

    public NoopParser(string path)
    {
        _path = path;
    }

    public IEnumerable<object> Parse()
    {
        var buffer = new byte[1024];
        using (var stream = new FileStream(_path, FileMode.Open, FileAccess.Read))
        {
            while (true)
            {
                var result = stream.Read(buffer, 0, buffer.Length);
                if (result == 0)
                    break;
                yield return null; // noop
            }
        }
    }
}
Note that this isn’t actually doing anything with the data, but it took just 0.83 seconds, so we see a pretty big difference here. By the way, the amount of memory used isn’t noticeably different: both use about 7 MB, probably because neither holds on to any of the data in a meaningful way.
I ran these tests using a release build, and I ran them multiple times, so the file is probably all in the OS cache and the I/O cost is minimal. Note, however, that the NoopParser skips a lot of the work the TrivialCsvParser does, such as finding line boundaries and splitting each line into fields. Interestingly enough, just removing the split reduces the cost from 8.65 seconds to 3.55 seconds.
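For completeness, the no-split variant is essentially the same loop with the Split call removed; a minimal sketch (the class name is mine, since the post doesn't show this version) would be:

public class LineOnlyParser
{
    private readonly string _path;

    public LineOnlyParser(string path)
    {
        _path = path;
    }

    public IEnumerable<string> Parse()
    {
        using (var reader = new StreamReader(_path))
        {
            while (true)
            {
                var line = reader.ReadLine();
                if (line == null)
                    break;
                yield return line; // one string per line, no per-field strings
            }
        }
    }
}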
Comments
The amount of memory in use looks the same, but since lots of the strings allocated in the first example are collected in gen0, it is the total allocation volume that actually differs, not the resident memory.
The split is slow because you are doing an O(n) operation 10 million times, copying char arrays around or calling Substring with marked indexes and putting the results into an array or list (I haven't checked the implementation recently, but I would expect that to happen). This sort of thing could be optimized with little effort by implementing a ReadLineAndSplit method: while reading the buffer, you could do the marking and splitting yourself (either the way the Substring operation does it, or by returning an optimized indexed buffer).
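As a rough illustration of that idea, one could scan each line once and record field boundaries as offsets, materializing strings only on demand. Everything below (the FieldRange type, the MarkFields name) is a hypothetical sketch of the commenter's suggestion, not the BCL's implementation:

public struct FieldRange
{
    public int Start;
    public int Length;
}

public static int MarkFields(string line, char separator, FieldRange[] ranges)
{
    // Assumes ranges is large enough to hold every field in the line.
    int count = 0, start = 0;
    for (int i = 0; i <= line.Length; i++)
    {
        if (i == line.Length || line[i] == separator)
        {
            ranges[count].Start = start;
            ranges[count].Length = i - start;
            count++;
            start = i + 1;
        }
    }
    return count; // the caller materializes strings only for the fields it needs
}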
That again suggests that at least some custom implementation is required if we want fast, memory-efficient string operations :P
Bartosz, The point I am making here is that the mere reading of the data into strings is very costly.
I understand what you are trying to explain to us, that using strings can be expensive. But I can't help thinking that this comparison doesn't make any sense; it's like comparing apples with oranges, because the NoopParser does nothing at all with the bytes it reads, in contrast to the TrivialCsvParser.
If you compare both parsers to car factories, then the TrivialCsvParser represents a real car factory that accepts the different car parts (the file bytes) and returns an assembled car (the IEnumerable[string]), while the NoopParser represents a car factory that accepts the different car parts (the file bytes), only puts them on a shelf (the buffer), and returns no car whatsoever.
As you mentioned yourself, removing the split already reduces the cost, so the less processing the TrivialCsvParser does, the less time it takes; if it doesn't try to build a car, it will be faster.
I think a better comparison would be to actually do something with the bytes, like having both parsers return an instance of the same type containing the data stored in the CSV, because you can delay the creation of a string as long as possible, but if you need the string in the end, you'll have to create it eventually.
Can you post the code used to generate the CSV file?
Josh, I used the Spawner project for this. http://sourceforge.net/projects/spawner/
It looks like you're preparing to (partially) remove the Lucene indexes in RavenDb and switch to the Voron trees. Am I right?
Scooletz, No, that isn't something that is currently under consideration.
What are the column types and subtypes you used when generating the file? Basically, I want to test some stuff with a similar file.
It would also be interesting to know whether encoding makes any difference. With UTF-8, I assume it actually goes through the data parsing the various code points while creating the string. If the encoding were already the same as the in-memory representation, this could be optimized down to a simple memory copy.
Also, how about reading the byte array and converting it to a string without going through StreamReader? Does that make any difference in performance?
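One way to probe that question is to time the raw decode on its own, bypassing StreamReader entirely. This sketch (the method name and buffer size are arbitrary choices of mine) is a timing probe only, not a correct text reader:

using System;
using System.IO;
using System.Text;

public static class DecodeProbe
{
    public static void DecodeOnly(string path)
    {
        var buffer = new byte[1024];
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            while (true)
            {
                var read = stream.Read(buffer, 0, buffer.Length);
                if (read == 0)
                    break;
                // UTF-8 -> UTF-16 conversion. Note this can split a multi-byte
                // code point across buffer boundaries, so the output may be
                // wrong at the edges; a Decoder instance would handle that.
                var chunk = Encoding.UTF8.GetString(buffer, 0, read);
            }
        }
    }
}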
It's pretty clear that the gen0 churn from lots of little strings is causing the difference; a PerfView diff would probably show it. It has nothing to do with the amount of memory in use (?). As Joe Duffy and others have shown, string.Split (and Substring) are strictly forbidden in any perf-oriented C# code. To me it reads like the author was surprised by the outcome, yet it seems totally in line with what you would expect. Slightly confused..
Second thought: OK, the difference between the non-split version and the byte-array version is interesting, but it probably has to do with the CLR trying to pool the strings, along with the UTF-8 parsing as mentioned. So, agreed, that is new info, besides the split() one.
Josh, This is the schema I used:
Id,Numbers,Sequence,1|0|1
SSN,Human,Social Security Number
Name,Human,Full Name,True|True
Email,Human,E-Mail Address (name@domain.tld)
ZipCode,Human,US ZIP Code (#####)
IP,Networking,IPv4 Address,167772229|24|32|32|False|False
Thomas, The actual cost is that you are generating a lot of strings; it doesn't really matter how you do it.
Hannes, The CLR doesn't do string pooling.
@Hannes K, I think that reading strings from a UTF-8 stream does not require parsing in any real sense; it's a byte operation. We are just converting an array of UTF-8 bytes to UTF-16, and since the code points are more or less the same size, the additional cost should be very small if implemented properly (range lookups, no buffer expansion).
But nevertheless, that would be an interesting test to run.
Bartosz, In my tests, a lot of the cost went into the encoding operation, yes.
@Ayende, So in that case I wonder if the conversion operation isn't the one to blame for the performance hiccups; well, from what you are saying, it is. I'm going to take a huge guess that range lookups aren't done while converting. Since this is core functionality that one has very little control over, I guess that improving the situation would be hard.
Extra news: the CLR does do string pooling. ;-)
http://msdn.microsoft.com/en-us/library/system.string.intern.aspx "The common language runtime conserves string storage by maintaining a table, called the intern pool, that contains a single reference to each unique literal string declar....."
That's, by the way, also the reason for some strange C# corner cases: http://stackoverflow.com/questions/8317054/strange-string-literal-comparison and http://stackoverflow.com/questions/194484/whats-the-strangest-corner-case-youve-seen-in-c-sharp-or-net
@Hannes K,
This kind of behavior is only available by using the String.Intern method, therefore it's not very useful. As far as I remember, String.Intern is problematic: no matter what size of strings you were allocating, they were always allocated on the LOH and never freed, because the intern table was holding on to those references, which led to LOH fragmentation in a big way. I'm not sure if this has been fixed, though (it was present in 3.5).
Please reference where this is stated. This is CLR behavior; the String.Intern method is just a way to access it programmatically. See my Stack Overflow links. The CLR does use an internal string table. There is also a post from Jon Skeet explaining this, though I would have to dig it out from his blog..
Here it is.. http://csharpindepth.com/Articles/General/Strings.aspx To quote him "...and unfortunately it's much misunderstood.". Yep, agree on that one...
http://blogs.msdn.com/b/ericlippert/archive/2009/09/28/string-interning-and-string-empty.aspx
@Hannes K, the link that Petar provided actually explains this phenomenon. Even more to the point, the Jon Skeet link you provided states that automatic interning is only done on string literals, which is a very important distinction.
As for the fragmentation, I did my own checks armed with WinDbg, but, looking around on Google, I also found this: http://stackoverflow.com/questions/686950/large-object-heap-fragmentation. My tests were similar: we had lots (!) of small objects flying around on the LOH, and further analysis of what was behind the addresses showed they were strings :P Since we wanted to be clever and smart, we had used String.Intern heavily; removing that code and rolling custom pooling made the problem go away (we still had small strings on the LOH, but to a much smaller extent).
Now that I think of it, it's hard to say that String.Intern is even pooling: the CLR maintains a table of references, and should we change the string in any way, it writes a new reference to the table, so it's more like copy-on-write from Unix :P Just saying.
OK, every day's a school day; learned something new. I bow my head to you, sir. And to anyone who knew that.
@Hannes K, Glad that I could provide some meaningful info :)
Hannes, String intern is only used if you do it explicitly.
The easiest way to show this is:
object.ReferenceEquals("foo" + 1, "foo" + 1);
This returns false.
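To round that out, here is a small sketch contrasting the literal case, the runtime-concatenation case, and explicit String.Intern (the first and third lines are additions for illustration, not from the original comment):

// Literals are interned automatically, so identical literals share one reference.
Console.WriteLine(object.ReferenceEquals("foo1", "foo1"));       // True
// Strings built at runtime are not interned...
Console.WriteLine(object.ReferenceEquals("foo" + 1, "foo" + 1)); // False
// ...unless you intern them explicitly.
Console.WriteLine(object.ReferenceEquals(
    string.Intern("foo" + 1), string.Intern("foo" + 1)));        // True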