Stupid smart code

time to read 3 min | 549 words

We had the following code:

public static void WriteDataToRequest(HttpWebRequest req, string data)
{
    var byteArray = Encoding.UTF8.GetBytes(data);

    req.ContentLength = byteArray.Length;

    using (var dataStream = req.GetRequestStream())
    {
        dataStream.Write(byteArray, 0, byteArray.Length);
        dataStream.Flush();
    }
}

And that is a problem, because it allocates the memory twice, once for the string, once for the buffer. I changed that to this:

public static void WriteDataToRequest(HttpWebRequest req, string data)
{
    var byteCount = Encoding.UTF8.GetByteCount(data);
    req.ContentLength = byteCount;
    using (var dataStream = req.GetRequestStream())
    {
        if(byteCount <= 0x1000) // small size, just let the system allocate it
        {
            var bytes = Encoding.UTF8.GetBytes(data);
            dataStream.Write(bytes, 0, bytes.Length);
            dataStream.Flush();
            return;
        }

        var buffer = new byte[0x1000];
        var maxCharsThatCanFitInBuffer = buffer.Length / Encoding.UTF8.GetMaxByteCount(1);
        var charBuffer = new char[maxCharsThatCanFitInBuffer];
        int start = 0;
        var encoder = Encoding.UTF8.GetEncoder();
        while (start < data.Length)
        {
            var charCount = Math.Min(charBuffer.Length, data.Length - start);

            data.CopyTo(start, charBuffer, 0, charCount);
            var bytes = encoder.GetBytes(charBuffer, 0, charCount, buffer, 0, false);
            dataStream.Write(buffer, 0, bytes);
            start += charCount;
        }
        dataStream.Flush();
    }
}

And I was quite proud of myself.

Then I realized that I was stupid. Why?

Tweet Share Share 49 comments

Tags:

Performance

Comments

20 Dec 2011
10:03 AM

tobi

Lots of scanning/copying of data. Worse than zeroing and filling an array once.

20 Dec 2011
10:10 AM

tobi

btw, take the Flush out. Why are people always doing this? Why would the stream delete data on dispose?

20 Dec 2011
10:16 AM

Patrick Huizinga

There's also an overload of GetBytes that accepts a string instead of a char buffer. http://msdn.microsoft.com/en-us/library/c20ss51h.aspx

so you can remove the charBuffer and the while loop can be changed to:
var charCount = Math.Min(maxCharsThatCanFitInBuffer, data.Length - start);
var bytes = encoder.GetBytes(data, start, charCount, buffer, 0); // could not find any GetBytes overload that takes a bool
dataStream.Write(buffer, 0, bytes);
start += charCount

However, I fear this is not the stupid part of the code. For that I'll wait till tomorow.

20 Dec 2011
10:16 AM

wouter

Couldn't you just have used a StreamWriter? You can specify encoding and buffer size in its constructor and then write the string directly...

20 Dec 2011
10:21 AM

Patrick Huizinga

tobi: where exactly is the scanning happening? I can only spot operations that are / should be O(1)

And are you sure copying 100 times to an array of size 4K is slower that copying once to an array of size 400K?

20 Dec 2011
10:23 AM

Fredrik Mörk

The charBuffer seems unnecessary, given that a string can also act as a char[], so instead of shuffling those chars, you can have the encoder.GetBytes call pull the characters directly from data.

20 Dec 2011
10:25 AM

Richard Dingwall

What is the expected size of 'string data'?

20 Dec 2011
10:36 AM

Ryan Heath

You run the risk to split the string's surrogate pairs?

// Ryan

20 Dec 2011
10:38 AM

Thomas Levesque

@wouter, this doesn't allow you to know the length in bytes before you start writing the data, and headers must be sent before you start writing the body

20 Dec 2011
10:49 AM

Dave Hng

What is in 'string data'? Isn't that already a buffer of some sorts?

20 Dec 2011
11:02 AM

tobi

Patrick, GetByteCount is O(N) because UTF8 is variable length. Need to decode the entire string for this. I like the smaller chunks, however, for cache efficiency reasons.

20 Dec 2011
11:06 AM

Jarrett

Because you started with code that was crystal clear and anyone should be able to read & understand and replaced it with code that you need to read, reread, think about, and read again.

20 Dec 2011
11:07 AM

sam

my guess

public virtual int GetByteCount(string s)
{
if (s == null)
{
throw new ArgumentNullException("s");
}
char[] chars = s.ToCharArray();
return this.GetByteCount(chars, 0, chars.Length);
}

20 Dec 2011
11:12 AM

Mose

GetRequestStream() returns a synchrounous stream.

That means until the stream is closed or flushed, data is not sent,
it's stored in memory.

20 Dec 2011
11:15 AM

Ant

Writing data to a request? Huh?

20 Dec 2011
11:26 AM

Sam

I'm with wouter - my first thought was: "he's reimplemented StreamWriter.Write()".

@Thomas - ContentLength is being determined by Encoding.UTF8.GetByteCount(), not by examining the size of the encoded array (although I'm trusting Microsoft to have not implemented GetByteCount like that).

20 Dec 2011
11:28 AM

tobi

Sam, I believe there is no O(1) way to calculate GetByteCount.

20 Dec 2011
11:28 AM

Sam

@Ant - for instance, the body of a POST or PUT request.

20 Dec 2011
11:31 AM

"bytes" var is declared inside the loop so N many will be allocated. Of course these will eventually be GCed but the point of the exercise was to sacrifice simplicity for efficiency.

20 Dec 2011
11:35 AM

Sam

@tobi - my understanding is that the optimisation was intended to avoid allocating memory for the (potentially large) request body twice, not to minimise encoding time.

20 Dec 2011
11:56 AM

Sam

@JD - 'bytes' is actually an int. The joys of var and ambiguous variable names!

20 Dec 2011
11:56 AM

Patrick Huizinga

@tobi, missed that one, although I only looked inside the loop

@JD, bytes is an int, not much GC-ing there :)

20 Dec 2011
12:45 PM

IgorM

You can use the lambda expression and execute it in one line:

Var reader = new StringReader(data);
reader.CopyTo(req.GetRequestStream();

20 Dec 2011
14:26 PM

Rajeesh C V

If the method is not going to write big chunk of data, then
you have increased the complexity instead of keeping it simple

20 Dec 2011
14:44 PM

Dennis

Jarrett is the only one who's got it so far. Do you really need the extra code complexity and lack of maintainability for that performance gain? Do you have evidence that the extra memory allocation was really hurting the system?

20 Dec 2011
15:09 PM

Mike

I'm guessing this code is stupid because you probably don't have any profiling metrics that show it actually needs optimization in the first place. So, rather than fixing true bottlenecks or implementing new features, you have needlessly invested a lot of time and introduced much more complexity.

20 Dec 2011
15:35 PM

Frank Quednau

Confused by the first version in the first place. The simplest abstraction to write a string to a stream is a streamwriter, is it not?

using (var w = new StreamWriter(r.GetRequestStream(), Encoding.UTF8))
sw.Write(data);

20 Dec 2011
15:54 PM

Dennis

Jarrett indeed is the first one who mentions the first thing that popped into my mind when reading this: "why change something which intent initially was perfectly obvious into something that you need to read multiple times and is a lot harder to understand for a micro-optimization that may not even necessarily be an issue?"

Doesn't seem like it's worth the extra complexity it introduces at a cost of read- and maintainability.

20 Dec 2011
16:42 PM

Bart Verkoeijen

What's the gain in performance compared to the extra complexity of the code?

20 Dec 2011
16:51 PM

James_2JS

@Frank Quednau - the first version is like it is due to the need to set ContentLength before writing to the stream

20 Dec 2011
16:54 PM

James_2JS

I believe the problem is two-fold:

GetByteCount(string s) converts the string into char[s.Length] - so nothing has been gained there
worse still, GetByteCount has to encode the characters to produce its result, then the characters are encoded a 2nd time to write to the stream

20 Dec 2011
18:32 PM

João Angelo

As other pointed out I'm on the side that the "I'm stupid" epiphany was due to realizing that you just reimplemented logic that is already implemented by StreamWriter.

20 Dec 2011
18:33 PM

Michael McDaniel

How about flushing inside the loop if you want the data streamed? Let's call it lots of "courtesy flushes"

20 Dec 2011
19:00 PM

Karep

I'm not sure I understand.
In first solution you say memory is allocated twice. One for the string (parameter?) and for the buffer.

In second solution memory is allocated for string and then for buffer but multiple times.
var buffer...
var charBuffer...
and in a while loop
var bytes

So what you gain is that when data is very big you don't end up allocating a lot of memory on LOH. Instead you do a lot of small allocations if data is big.

Don't know what you mean by being stupid. I don't like the code before and after modification if data is very big. If it is not then why bother changing original code?

20 Dec 2011
20:00 PM

dotnetchris

Tell us what you see Ayende

20 Dec 2011
20:05 PM

Jay

Use Stream.Copy

20 Dec 2011
21:36 PM

Steve Py

Jarret + 0x1000!

Seeing as these are strings being passed back on a request I cannot see them being HUGE. (many MBs) I'd expect if anything was a memory drain it would be from the server serving many requests in which case the complex code would solve nothing. Temporarily allocating a few extra K for byte arrays to maintain readable code is certainly worth it.

20 Dec 2011
22:35 PM

Ben

data.CopyTo(...)

copies the value and doesn't reference the original char?

21 Dec 2011
00:56 AM

Ryan

My guess is that since you instantiate a new "bytes" var in every loop it actually ends up holding everything in memory twice anyways(at least until the GC kicks in).

21 Dec 2011
07:50 AM

meisinger

hmm... don't know; this seems a little like "select ain't broken" but...
clearly i'm not the smartest guy in the room but i would think that you would want to flush the internal state of the encoder

of course i have been bitten by this type of post in the past to where it could be that there was no check for the "data" parameter being null or empty

21 Dec 2011
11:37 AM

tobi

It is almost always wrong to flush. The only reason I can come up with is to harden data on disk immediately for durability semantics. Flush = fsync. It is always slower and does not change the data written at all.

21 Dec 2011
11:52 AM

Ayende Rahien

Tobi,
No, it isn't. There are a lot of reasons to flush. Reducing memory usage, for example, ensuring lower latency in network scenarios, for another.

21 Dec 2011
13:07 PM

tobi

Ayende,
but data consistency is not one. Memory usage should be controlled by configuring the buffer size. You cannot release memory by flushing. The OS will flush to disk but still have the data cached. Network latency: true. Rarely needed and reduces throughput.

21 Dec 2011
13:11 PM

Ayende Rahien

Tobi,
That is why I moved to stream writer there in the first place.
And the flushing merely ensures that the data goes out. Without it, you are reliant on the Dispose calling Flush, and there are good reasons not of call it, sometimes.
For example, you might call Dispose on the writer, which will dispose the stream, but you don't want that.
One example why you want that.

21 Dec 2011
18:57 PM

Torvin

req.GetRequestStream() returns an instance of internal ConnectStream class, whose Flush() method for some reason does exactly nothing...

22 Dec 2011
10:30 AM

Nitin

var byteCount = Encoding.UTF8.GetByteCount(data);
GetByteCount creates one more array of characters?

22 Dec 2011
10:33 AM

Ayende Rahien

Nitin,
GetBytesCount does NOT create an array

22 Dec 2011
10:47 AM

Nitin

Internally before counting the number of characters?

22 Dec 2011
11:09 AM

Ayende Rahien

Nitin,
No, it doesn't do that

Comment preview

Comments have been closed on this topic.

Markdown turns plain text formatting into fancy HTML formatting.

Phrase Emphasis

*italic*   **bold**
_italic_   __bold__

Links

Inline:

An [example](http://url.com/ "Title")

Reference-style labels (titles are optional):

An [example][id]. Then, anywhere
else in the doc, define the link:
  [id]: http://example.com/  "Title"

Images

Inline (titles are optional):

![alt text](/path/img.jpg "Title")

Reference-style:

![alt text][id]
[id]: /url/to/img.jpg "Title"

Headers

Setext-style:

Header 1
========
Header 2
--------

atx-style (closing #'s are optional):

# Header 1 #
## Header 2 ##
###### Header 6

Lists

Ordered, without paragraphs:

1.  Foo
2.  Bar

Unordered, with paragraphs:

*   A list item.
    With multiple paragraphs.
*   Bar

You can nest them:

*   Abacus
    * answer
*   Bubbles
    1.  bunk
    2.  bupkis
        * BELITTLER
    3. burper
*   Cunning

Blockquotes

> Email-style angle brackets
> are used for blockquotes.
> > And, they can be nested.
> #### Headers in blockquotes
> 
> * You can quote a list.
> * Etc.

Horizontal Rules

Three or more dashes or asterisks:

---
* * *
- - - -

Manual Line Breaks

End a line with two or more spaces:

Roses are red,   
Violets are blue.

Fenced Code Blocks

Code blocks delimited by 3 or more backticks or tildas:

```
This is a preformatted
code block
```

Header IDs

Set the id of headings with {#<id>} at end of heading line:

## My Heading {#myheading}

Tables

Fruit    |Color
---------|----------
Apples   |Red
Pears	 |Green
Bananas  |Yellow

Definition Lists

Term 1
: Definition 1
Term 2
: Definition 2

Footnotes

Body text with a footnote [^1]
[^1]: Footnote text here

Abbreviations

MDD <- will have title
*[MDD]: MarkdownDeep

Oren Eini

Oren Eini

CEO of RavenDB