Strings are annoying
I have a love/hate relationship with .NET strings. That is because they are both incredibly convenient and horribly inefficient. Let us look at the following file:
1: "first_name","last_name","company_name","address","city","county","state","zip","phone1","phone2","email","web"2: "James","Butt","Benton, John B Jr","6649 N Blue Gum St","New Orleans","Orleans","LA",70116,"504-621-8927","504-845-1427","jbutt@gmail.com","http://www.bentonjohnbjr.com"3: "Josephine","Darakjy","Chanay, Jeffrey A Esq","4 B Blue Ridge Blvd","Brighton","Livingston","MI",48116,"810-292-9388","810-374-9840","josephine_darakjy@darakjy.org","http://www.chanayjeffreyaesq.com"4: "Art","Venere","Chemel, James L Cpa","8 W Cerritos Ave #54","Bridgeport","Gloucester","NJ","08014","856-636-8749","856-264-4130","art@venere.org","http://www.chemeljameslcpa.com"
Reading this is a simple matter of writing something like this:
var headerLine = reader.ReadLine();
var headers = headerLine.Split(',').Select(h => h.Trim('"')).ToArray();

while (reader.EndOfStream == false)
{
    var line = reader.ReadLine();
    var columns = line.Split(',');
    var dic = new Dictionary<string, string>();
    for (var i = 0; i < headers.Length; i++)
    {
        dic[headers[i]] = columns[i].Trim('"');
    }
    yield return dic;
}
Now, let us look at the same code again, but this time, I marked places where we are doing string allocation:
var headerLine = reader.ReadLine();                                      // allocates the header string
var headers = headerLine.Split(',').Select(h => h.Trim('"')).ToArray();  // allocates an array, a string per column, plus a trimmed copy of each

while (reader.EndOfStream == false)
{
    var line = reader.ReadLine();                                        // allocates a string per line
    var columns = line.Split(',');                                       // allocates an array and a string per column
    var dic = new Dictionary<string, string>();                          // allocates a dictionary per line
    for (var i = 0; i < headers.Length; i++)
    {
        dic[headers[i]] = columns[i].Trim('"');                          // allocates a trimmed copy of each column
    }
    yield return dic;
}
Those are a lot of strings that we are allocating. And if we are reading a large file, that can very quickly turn into a major performance issue. If I were writing the same code in C, for example, I would be reusing the allocated string multiple times, but here we have to allocate and discard them pretty much continuously.
The really sad thing about it is that it is incredibly easy to do this, usually without paying any attention. But even if you know what you are doing, you pretty much have to roll your own everything to get it to work. And that sucks quite badly.
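To make the cost visible, here is a minimal measurement sketch (not from the original post): it counts Gen0 collections while consuming the parser above. CsvParser.ParseCsv and the "customers.csv" file name are hypothetical stand-ins for the code and data shown earlier.

using System;
using System.IO;

class AllocationPressureDemo
{
    static void Main()
    {
        // "customers.csv" and CsvParser.ParseCsv are assumed names for the
        // data file and the iterator shown above.
        using (var reader = new StreamReader("customers.csv"))
        {
            var gen0Before = GC.CollectionCount(0);

            var rows = 0;
            foreach (var row in CsvParser.ParseCsv(reader))
                rows++;

            // The Gen0 collections counted here are driven almost entirely by the
            // temporary strings, arrays and dictionaries the parser creates.
            Console.WriteLine("Rows: {0}, Gen0 collections: {1}",
                rows, GC.CollectionCount(0) - gen0Before);
        }
    }
}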
Comments
It's even worse when you start dealing with your own hand-rolled string handling and have to introduce CultureInfo, StartsWith, etc. Joe Duffy, who is currently working on a brand new C#, wrote an interesting post about it: http://joeduffyblog.com/2012/10/30/beware-the-string/
I'm assuming this is just an example for a blog post because I don't believe splitting on "," would ever work. Your data wouldn't support that.
"Benton, John B Jr"
Your data contains a comma so you would split in the wrong place.
Phillip, Yes, it was just sample data that I didn't look at.
Is the allocation really that expensive for most scenarios?
I expect not sharing underlying buffers makes memory management more space efficient and simpler when garbage collection time comes.
Also just to spite your example it's probably a good scenario to use a stream reader.
Jahmai, I am using a stream reader here. And yes, it would be expensive for any scenario where this runs for a while. We generate a LOT of temp strings here. That leads to a lot of GC pressure, which leads to GC collections, which lead to pauses and high costs.
As a side note, you also have a lot of allocations of char[] - every time you call Trim(char) it will allocate a new char[] - http://marcgravell.blogspot.co.il/2013/11/allocaction-allocation-allocation.html
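For reference, a minimal sketch of the workaround for that particular allocation: cache the char[] once instead of letting the params array be created on every Trim call. The QuoteTrimmer/TrimQuotes names are just for illustration.

static class QuoteTrimmer
{
    // Trim('"') builds a new params char[] on every call; caching the array
    // removes that per-call allocation (the trimmed string itself is still allocated).
    private static readonly char[] Quote = { '"' };

    public static string TrimQuotes(string value)
    {
        return value.Trim(Quote);
    }
}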
I quite like the fact that strings behave the way they do in .net, but I agree string manipulation functions aren't appropriate for this type of usage on potentially large input files.
The problem with sharing buffers, which isn't apparent in this example, is that the GC would need every substring reference to the parent buffer to be gone before it could collect it - a substring could be, for example, 20 bytes of a 1MB buffer.
What will you do? Call Read() and make a stateful tokenizer yourself?
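For what it's worth, here is a rough sketch of the sort of hand-rolled, stateful tokenizer that question hints at (my own illustration, not code from the post): it walks a line once, records column boundaries as offsets into a caller-supplied list, and leaves it to the caller to decide when (or whether) to materialize substrings. Quoted commas are deliberately ignored.

using System.Collections.Generic;

struct ColumnSegment
{
    public int Start;
    public int Length;
}

static class LineTokenizer
{
    // Reuses the caller's list, so the only per-line allocations left are the
    // line itself and whatever substrings the caller chooses to create.
    public static List<ColumnSegment> Columns(string line, List<ColumnSegment> reusable)
    {
        reusable.Clear();
        var start = 0;
        while (start <= line.Length)
        {
            var comma = line.IndexOf(',', start);
            var end = comma == -1 ? line.Length : comma;
            reusable.Add(new ColumnSegment { Start = start, Length = end - start });
            if (comma == -1)
                break;
            start = comma + 1;
        }
        return reusable;
    }
}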
One needs to be very careful with using the built-in string methods. In a couple of commercial projects we had to roll our own string operations, as the built-in methods allocate way too much memory.
But it gets worse, as most of the built-in .NET types (see System.Xml) are actually very inefficient with respect to how they handle string operations and allocation, so in high frequency scenarios one needs to replace lots of existing types just to handle basic platform operations.
One nice idea is to create an off-heap region just for strings and use it like an object pool.
For the new JIT I proposed on User Voice to add escape analysis so that small objects of known lifetime can be stack allocated. This is a well-known compiler technique. I think the JVM does that to a certain extent.
The problem with .Net strings is that the most natural thing to write tends to be the wrong thing.
.Net should have an OverlappedString class that allows creating strings as segments of an original string without copying. But then you get into the memory leak problems (which used to be so common in Java).
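For illustration, a hedged sketch of what such an "OverlappedString" might look like (the StringSlice name and shape are my own, not an actual .NET type): a view over part of an existing string that only copies when ToString() is called, with exactly the keep-the-parent-alive drawback the comment mentions.

using System;

struct StringSlice
{
    private readonly string _source;
    private readonly int _offset;
    private readonly int _length;

    public StringSlice(string source, int offset, int length)
    {
        if (source == null)
            throw new ArgumentNullException("source");
        if (offset < 0 || length < 0 || offset + length > source.Length)
            throw new ArgumentOutOfRangeException();
        _source = source;
        _offset = offset;
        _length = length;
    }

    public int Length { get { return _length; } }

    public char this[int index]
    {
        get { return _source[_offset + index]; }
    }

    // The only place a new string is actually allocated - and until then, the slice
    // keeps the whole parent string alive (the leak problem the comment warns about).
    public override string ToString()
    {
        return _source.Substring(_offset, _length);
    }
}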
What is the best way to catch things like this? Memory profiler? I'm sure I'm guilty of this many times over but it's hard to catch every instance.
Mike, A memory profiler would help, but you usually won't see this as used memory. You'll want to do allocation tracking to see this.
why not allocate the vars before the loop? You'd be re-using the allocated memory then
jmrjr, No you will not. Variables != what they point at.
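To illustrate that reply, here is a small sketch (mine, not from the post) showing why hoisting the variable changes nothing: the variable is only a reference, and ReadLine() and Split() still allocate fresh objects on every iteration.

using System.IO;

static class HoistedVariableExample
{
    public static int CountColumns(TextReader reader)
    {
        var total = 0;
        string line;                        // declared once, outside the loop
        while ((line = reader.ReadLine()) != null)
        {
            var columns = line.Split(',');  // still a new array + new strings every pass
            total += columns.Length;
        }
        return total;
    }
}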
I have used CsvHelper (https://github.com/JoshClose/CsvHelper) on a number of projects. It is a pretty well written project. It makes it easy to map CSV to your POCO, handles embedded commas, customizable delimiters and much more. Take a look.
Yes, this is a bit of a problem in string-manipulation-heavy applications. However, I'd like to point out a few things that mitigate the issue a bit:

One, most of the string manipulation methods will return the original string if no change is done, i.e. "Hello".Trim() doesn't allocate a new string.

Second, the variables (and the returned strings in chained calls) are considered short lived - they will almost always be collected in Gen 0, and the GC will usually be able to avoid compacting the heap (which is the costly operation) - and of course, the heap itself is an implementation detail; a future version of .NET can do further optimizations (that's the pro of higher abstraction).

Third, if you are writing a string-manipulation-heavy application, you can avoid string allocations a lot, although it usually means more work (that said, it's not that much more work than you'd have in e.g. C++).

So yeah, the biggest issue is that you can easily write something that will crush your performance. That said, the default string concatenation functions on zero terminated "strings" in C can cause even bigger performance issues, and are just as easy to write (thank the maker for C++'s std::string).
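That first point is easy to check; here is a tiny verification sketch (mine, and it relies on behavior that is an implementation detail of the current runtimes):

using System;

class TrimIdentityDemo
{
    static void Main()
    {
        var s = "Hello";
        // Nothing to remove: Trim() hands back the same instance.
        Console.WriteLine(ReferenceEquals(s, s.Trim()));         // True
        // Something to remove: a new, trimmed string is allocated.
        Console.WriteLine(ReferenceEquals(s, (" " + s).Trim())); // False
    }
}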
What's worked for me in the past has been to create string extension methods whose logic relies heavily on internal caches. This has worked great in single-threaded scenarios. What I see are the same strings, like 'hello ', being trimmed over and over again. Caching the result means the .NET Trim method will only be called once.
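A sketch of the kind of caching extension method described here (single-threaded only; the names are illustrative). Note the reply further down: the input string still has to be allocated before the cache can be consulted, and cached entries live much longer.

using System.Collections.Generic;

static class CachedStringExtensions
{
    private static readonly Dictionary<string, string> TrimCache =
        new Dictionary<string, string>();

    public static string TrimCached(this string value)
    {
        string trimmed;
        if (TrimCache.TryGetValue(value, out trimmed))
            return trimmed;                 // e.g. 'hello ' is trimmed only once

        trimmed = value.Trim();
        TrimCache[value] = trimmed;
        return trimmed;
    }
}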
What about ReadAll instead of ReadLine? Maybe the disk access will be faster.
Is it possible to use StringBuilder?
What about the fixed keyword? (Unsafe)
Are there any benchmarks? Does it matter? Is parsing the bottleneck, or e.g. disk access?
Are there high-level constructs like dictionaries which may decrease performance?
See http://readwrite.com/2011/06/06/cpp-go-java-scala-performance-benchmark#awesm=~osRIUhe14IBerT
@Carsten Hansen ReadAll should be faster, but it will introduce a whole new level of problems: now, instead of multiple small Gen0 allocations per call, you will likely have LOH allocations per call (depending on the data size).
Fixed introduces pinning; depending on the scenario, we can end up either with many short-lived pinned blocks or a single large pinned block. Besides, I think that if you are using fixed you need to manipulate raw byte array buffers, so the nice .NET stream manipulations are off the table (I could be wrong).
Phil, That still allocates a lot of strings. The point wasn't really about CSV parsing, it was about string allocations.
Luaan, That works only as long as the amount of work that you are doing is relatively small. Assume you need to read a file that is several GB in size. When you do that using strings, you're going to end up having to do a LOT of work just allocating & managing the space. Sure, a lot of the garbage goes away fast, but that still has a lot of impact on how it works. And yes, null terminated strings in C are a big pain. I wish that the PDP11 had had length-prefixed strings, or something like that. It would have drastically changed the face of computing.
Gary, a) That assumes you have repeated strings, which is very unlikely for a lot of data (SSNs, IP addresses, emails, etc). b) You still need to allocate the string first before you can actually check if it is in the cache. c) You have now made your strings live much longer, and they might survive a Gen0 collection.
Carsten, Congrats, you just blew away all the memory in the machine. Or you just allocated stuff on the LOH, which is MUCH more expensive to work with. StringBuilder isn't relevant here, because we can't get it to do many important operations (trim, starts with, etc). It is also not relevant because you can't get a string builder from a string builder.
And yes, this is very important.
Thanks for the advice; I learned about the LOH (Large Object Heap). See http://blogs.msdn.com/b/mariohewardt/archive/2013/06/26/no-more-memory-fragmentation-on-the-large-object-heap.aspx
You are right. If the data file is small then speed does not matter, unless you have a lot of small files below e.g. 1GB. When doing benchmarking you might consider a 100 GB file or 100 x 1 GB files.
Maybe the reading strategy should be dependent on the file size, which might be known before starting.
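A rough sketch of that idea (my own illustration; the 64 KB threshold is an arbitrary example value): pick the reading strategy up front based on the file length.

using System.IO;

static class ReaderFactory
{
    private const long SmallFileThreshold = 64 * 1024;

    public static TextReader Open(string path)
    {
        // Small files: read in one go and parse from memory.
        // Large files: stream line by line to stay off the LOH.
        if (new FileInfo(path).Length <= SmallFileThreshold)
            return new StringReader(File.ReadAllText(path));
        return new StreamReader(path);
    }
}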
@Carsten Hansen, compacting the LOH is a nice addition, but it does not solve anything with regards to costly string allocations; if anything (again, depending on the scenario) it can make things even worse. By issuing a compaction operation on a LOH which can have a size of several gigabytes (depending on data size and frequency), we are effectively slowing down the entire process, as this sort of operation for the GC means 'Stop the World' (unless we are running in a concurrent GC mode, which has its own set of problems in this scenario).
The other issue is that this is a platform-dependent feature and most codebases will not easily migrate to it.