Ayende @ Rahien

Refunds available at head office

Without strings, it is a dark, cold place…

So I set out to do some non-trivial stuff with file parsing. The file format is CSV, and I am explicitly trying to do it with as few string allocations as possible.

In effect, I am relying on a char array that I manage manually. But as it turns out, this is not so easy. To start with, 65279 should be taken out and shot. That is the byte order mark (BOM, U+FEFF), and it has a very nasty habit of showing up when you mix writing with a StreamWriter and reading from a byte stream, even though I made sure to use the UTF8 encoding.
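A minimal sketch of both sides of the BOM problem, under a few assumptions. On the writing side, `Encoding.UTF8` carries a preamble that `StreamWriter` emits at the start of a stream, while `new UTF8Encoding(false)` does not. On the reading side, decoding raw bytes (for example with `Encoding.UTF8.GetChars`) leaves U+FEFF as the first char, and it can be skipped by advancing the start offset instead of copying the buffer. The `BomHelper` class and its method names are hypothetical, not taken from the actual project.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only.
public static class BomHelper
{
    private const char Bom = '\uFEFF'; // 65279

    // Returns the index of the first char after an optional leading BOM.
    // Nothing is allocated: the caller keeps the same buffer and uses the new offset.
    public static int SkipBom(char[] buffer, int start, int length)
    {
        return length > 0 && buffer[start] == Bom ? start + 1 : start;
    }

    // Writes text as UTF-8 with no BOM at the start of the output.
    public static byte[] WriteWithoutBom(string text)
    {
        using (var ms = new MemoryStream())
        {
            // UTF8Encoding(false) has an empty preamble, so StreamWriter emits no BOM.
            using (var writer = new StreamWriter(ms, new UTF8Encoding(false)))
            {
                writer.Write(text);
            }
            return ms.ToArray();
        }
    }
}
```

With this shape, the parser would call `SkipBom` once on the first buffer it reads and carry on from the returned offset.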

It is possible, as I said, but it is anything but nice. I set out to do non-trivial stuff using this approach, but I wonder how useful this actually is. From experience, excessive string allocations can kill a system's performance. And it is not just my experience: http://joeduffyblog.com/2012/10/30/beware-the-string

Of course, the moment that you start dealing with your own string type, it is all back in the good bad days of C++ and BSTR vs. cstr vs. std::string vs. MyString vs. OmgStr. For example, how do you look at the value while debugging…

I am pretty sure that in general, that isn’t something that you’ll want to do. In my spike, quite a lot of the issues that came up were directly associated with this. On the other hand, this did let me do things like string pooling, efficient parsing with no allocations, etc.
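To make "string pooling" and "parsing with no allocations" concrete, here is a rough sketch, not the code from the spike. It assumes a simple comma-separated line with no quoted fields. `CharSegment`, `CsvLine` and `StringPool` are hypothetical names. The `DebuggerDisplay` attribute (from `System.Diagnostics`) is one way to make such a custom type readable in the debugger.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical value type: a window over a shared char[]. Creating one does not touch the heap.
[DebuggerDisplay("{ToString(),nq}")]
public struct CharSegment : IEquatable<CharSegment>
{
    public readonly char[] Buffer;
    public readonly int Offset;
    public readonly int Length;

    public CharSegment(char[] buffer, int offset, int length)
    {
        Buffer = buffer;
        Offset = offset;
        Length = length;
    }

    // Compares by content, so segments over different buffers can be equal.
    public bool Equals(CharSegment other)
    {
        if (Length != other.Length) return false;
        for (int i = 0; i < Length; i++)
            if (Buffer[Offset + i] != other.Buffer[other.Offset + i]) return false;
        return true;
    }

    public override bool Equals(object obj)
    {
        return obj is CharSegment && Equals((CharSegment)obj);
    }

    public override int GetHashCode()
    {
        int hash = 17;
        for (int i = 0; i < Length; i++)
            hash = unchecked(hash * 31 + Buffer[Offset + i]);
        return hash;
    }

    // Allocates a string; meant for the debugger and for pool misses only.
    public override string ToString()
    {
        return new string(Buffer, Offset, Length);
    }
}

public static class CsvLine
{
    // Splits one line on commas (no quoting support) into caller-supplied storage.
    // Returns the number of fields written; no strings are created.
    public static int Split(char[] buffer, int start, int length, CharSegment[] fields)
    {
        int count = 0, fieldStart = start, end = start + length;
        for (int i = start; i <= end; i++)
        {
            if (i == end || buffer[i] == ',')
            {
                if (count == fields.Length)
                    throw new ArgumentException("Too many fields", "fields");
                fields[count++] = new CharSegment(buffer, fieldStart, i - fieldStart);
                fieldStart = i + 1;
            }
        }
        return count;
    }
}

public sealed class StringPool
{
    private readonly Dictionary<CharSegment, string> pool =
        new Dictionary<CharSegment, string>();

    // Returns the same string instance for equal content; allocates only on a miss.
    public string Get(CharSegment segment)
    {
        string value;
        if (pool.TryGetValue(segment, out value)) return value;
        value = segment.ToString();
        // The key gets its own copy of the chars, because the parse buffer will be reused.
        pool[new CharSegment(value.ToCharArray(), 0, value.Length)] = value;
        return value;
    }
}
```

The trade-off in this sketch is that every value that does need to become a real string goes through the pool, so a column with few distinct values (country codes, say) costs one string per distinct value rather than one per row.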

But I’ll talk about that specific project in my next post.

Comments

01/14/2014 11:55 AM by kpvleeuwen

This is an area where the DebuggerDisplay attribute is very helpful :) String indexing is really nontrivial with all the Unicode stuff decorating the characters, so is it not better to use a single string as backing instead of a manually managed char array, or does this have the same issues? For your previous example, that would allocate just a single string per line.
