Ayende @ Rahien

It's a girl

Stupid smart code

We had the following code:

public static void WriteDataToRequest(HttpWebRequest req, string data)
{
    var byteArray = Encoding.UTF8.GetBytes(data);

    req.ContentLength = byteArray.Length;

    using (var dataStream = req.GetRequestStream())
    {
        dataStream.Write(byteArray, 0, byteArray.Length);
        dataStream.Flush();
    }
}

And that is a problem, because it allocates the memory twice, once for the string, once for the buffer. I changed that to this:

public static void WriteDataToRequest(HttpWebRequest req, string data)
{
    var byteCount = Encoding.UTF8.GetByteCount(data);
    req.ContentLength = byteCount;
    using (var dataStream = req.GetRequestStream())
    {
        if(byteCount <= 0x1000) // small size, just let the system allocate it
        {
            var bytes = Encoding.UTF8.GetBytes(data);
            dataStream.Write(bytes, 0, bytes.Length);
            dataStream.Flush();
            return;
        }

        var buffer = new byte[0x1000];
        var maxCharsThatCanFitInBuffer = buffer.Length / Encoding.UTF8.GetMaxByteCount(1);
        var charBuffer = new char[maxCharsThatCanFitInBuffer];
        int start = 0;
        var encoder = Encoding.UTF8.GetEncoder();
        while (start < data.Length)
        {
            var charCount = Math.Min(charBuffer.Length, data.Length - start);

            data.CopyTo(start, charBuffer, 0, charCount);
            var bytes = encoder.GetBytes(charBuffer, 0, charCount, buffer, 0, false);
            dataStream.Write(buffer, 0, bytes);
            start += charCount;
        }
        dataStream.Flush();
    }
}

And I was quite proud of myself.

Then I realized that I was stupid. Why?

Comments

tobi
12/20/2011 10:03 AM by
tobi

Lots of scanning/copying of data. Worse than zeroing and filling an array once.

tobi
12/20/2011 10:10 AM by
tobi

btw, take the Flush out. Why are people always doing this? Why would the stream delete data on dispose?

Patrick Huizinga
12/20/2011 10:16 AM by
Patrick Huizinga

There's also an overload of GetBytes that accepts a string instead of a char buffer. http://msdn.microsoft.com/en-us/library/c20ss51h.aspx

so you can remove the charBuffer and the while loop can be changed to: var charCount = Math.Min(maxCharsThatCanFitInBuffer, data.Length - start); var bytes = encoder.GetBytes(data, start, charCount, buffer, 0); // could not find any GetBytes overload that takes a bool dataStream.Write(buffer, 0, bytes); start += charCount

However, I fear this is not the stupid part of the code. For that I'll wait till tomorow.

wouter
12/20/2011 10:16 AM by
wouter

Couldn't you just have used a StreamWriter? You can specify encoding and buffer size in its constructor and then write the string directly...

Patrick Huizinga
12/20/2011 10:21 AM by
Patrick Huizinga

tobi: where exactly is the scanning happening? I can only spot operations that are / should be O(1)

And are you sure copying 100 times to an array of size 4K is slower that copying once to an array of size 400K?

Fredrik Mörk
12/20/2011 10:23 AM by
Fredrik Mörk

The charBuffer seems unnecessary, given that a string can also act as a char[], so instead of shuffling those chars, you can have the encoder.GetBytes call pull the characters directly from data.

Richard Dingwall
12/20/2011 10:25 AM by
Richard Dingwall

What is the expected size of 'string data'?

Ryan Heath
12/20/2011 10:36 AM by
Ryan Heath

You run the risk to split the string's surrogate pairs?

// Ryan

Thomas Levesque
12/20/2011 10:38 AM by
Thomas Levesque

@wouter, this doesn't allow you to know the length in bytes before you start writing the data, and headers must be sent before you start writing the body

Dave Hng
12/20/2011 10:49 AM by
Dave Hng

What is in 'string data'? Isn't that already a buffer of some sorts?

tobi
12/20/2011 11:02 AM by
tobi

Patrick, GetByteCount is O(N) because UTF8 is variable length. Need to decode the entire string for this. I like the smaller chunks, however, for cache efficiency reasons.

Jarrett
12/20/2011 11:06 AM by
Jarrett

Because you started with code that was crystal clear and anyone should be able to read & understand and replaced it with code that you need to read, reread, think about, and read again.

sam
12/20/2011 11:07 AM by
sam

my guess

public virtual int GetByteCount(string s) { if (s == null) { throw new ArgumentNullException("s"); } char[] chars = s.ToCharArray(); return this.GetByteCount(chars, 0, chars.Length); }

Mose
12/20/2011 11:12 AM by
Mose

GetRequestStream() returns a synchrounous stream.

That means until the stream is closed or flushed, data is not sent, it's stored in memory.

Ant
12/20/2011 11:15 AM by
Ant

Writing data to a request? Huh?

Sam
12/20/2011 11:26 AM by
Sam

I'm with wouter - my first thought was: "he's reimplemented StreamWriter.Write()".

@Thomas - ContentLength is being determined by Encoding.UTF8.GetByteCount(), not by examining the size of the encoded array (although I'm trusting Microsoft to have not implemented GetByteCount like that).

tobi
12/20/2011 11:28 AM by
tobi

Sam, I believe there is no O(1) way to calculate GetByteCount.

Sam
12/20/2011 11:28 AM by
Sam

@Ant - for instance, the body of a POST or PUT request.

JD
12/20/2011 11:31 AM by
JD

"bytes" var is declared inside the loop so N many will be allocated. Of course these will eventually be GCed but the point of the exercise was to sacrifice simplicity for efficiency.

Sam
12/20/2011 11:35 AM by
Sam

@tobi - my understanding is that the optimisation was intended to avoid allocating memory for the (potentially large) request body twice, not to minimise encoding time.

Sam
12/20/2011 11:56 AM by
Sam

@JD - 'bytes' is actually an int. The joys of var and ambiguous variable names!

Patrick Huizinga
12/20/2011 11:56 AM by
Patrick Huizinga

@tobi, missed that one, although I only looked inside the loop

@JD, bytes is an int, not much GC-ing there :)

IgorM
12/20/2011 12:45 PM by
IgorM

You can use the lambda expression and execute it in one line:

Var reader = new StringReader(data); reader.CopyTo(req.GetRequestStream();

Rajeesh C V
12/20/2011 02:26 PM by
Rajeesh C V

If the method is not going to write big chunk of data, then you have increased the complexity instead of keeping it simple

Dennis
12/20/2011 02:44 PM by
Dennis

Jarrett is the only one who's got it so far. Do you really need the extra code complexity and lack of maintainability for that performance gain? Do you have evidence that the extra memory allocation was really hurting the system?

Mike
12/20/2011 03:09 PM by
Mike

I'm guessing this code is stupid because you probably don't have any profiling metrics that show it actually needs optimization in the first place. So, rather than fixing true bottlenecks or implementing new features, you have needlessly invested a lot of time and introduced much more complexity.

Frank Quednau
12/20/2011 03:35 PM by
Frank Quednau

Confused by the first version in the first place. The simplest abstraction to write a string to a stream is a streamwriter, is it not?

using (var w = new StreamWriter(r.GetRequestStream(), Encoding.UTF8)) sw.Write(data);

Dennis
12/20/2011 03:54 PM by
Dennis

Jarrett indeed is the first one who mentions the first thing that popped into my mind when reading this: "why change something which intent initially was perfectly obvious into something that you need to read multiple times and is a lot harder to understand for a micro-optimization that may not even necessarily be an issue?"

Doesn't seem like it's worth the extra complexity it introduces at a cost of read- and maintainability.

Bart Verkoeijen
12/20/2011 04:42 PM by
Bart Verkoeijen

What's the gain in performance compared to the extra complexity of the code?

James_2JS
12/20/2011 04:51 PM by
James_2JS

@Frank Quednau - the first version is like it is due to the need to set ContentLength before writing to the stream

James_2JS
12/20/2011 04:54 PM by
James_2JS

I believe the problem is two-fold:

1) GetByteCount(string s) converts the string into char[s.Length] - so nothing has been gained there

2) worse still, GetByteCount has to encode the characters to produce its result, then the characters are encoded a 2nd time to write to the stream

João Angelo
12/20/2011 06:32 PM by
João Angelo

As other pointed out I'm on the side that the "I'm stupid" epiphany was due to realizing that you just reimplemented logic that is already implemented by StreamWriter.

Michael McDaniel
12/20/2011 06:33 PM by
Michael McDaniel

How about flushing inside the loop if you want the data streamed? Let's call it lots of "courtesy flushes"

Karep
12/20/2011 07:00 PM by
Karep

I'm not sure I understand. In first solution you say memory is allocated twice. One for the string (parameter?) and for the buffer.

In second solution memory is allocated for string and then for buffer but multiple times. var buffer... var charBuffer... and in a while loop var bytes

So what you gain is that when data is very big you don't end up allocating a lot of memory on LOH. Instead you do a lot of small allocations if data is big.

Don't know what you mean by being stupid. I don't like the code before and after modification if data is very big. If it is not then why bother changing original code?

dotnetchris
12/20/2011 08:00 PM by
dotnetchris

Tell us what you see Ayende

Jay
12/20/2011 08:05 PM by
Jay

Use Stream.Copy

Steve Py
12/20/2011 09:36 PM by
Steve Py

Jarret + 0x1000!

Seeing as these are strings being passed back on a request I cannot see them being HUGE. (many MBs) I'd expect if anything was a memory drain it would be from the server serving many requests in which case the complex code would solve nothing. Temporarily allocating a few extra K for byte arrays to maintain readable code is certainly worth it.

Ben
12/20/2011 10:35 PM by
Ben

data.CopyTo(...)

copies the value and doesn't reference the original char?

Ryan
12/21/2011 12:56 AM by
Ryan

My guess is that since you instantiate a new "bytes" var in every loop it actually ends up holding everything in memory twice anyways(at least until the GC kicks in).

meisinger
12/21/2011 07:50 AM by
meisinger

hmm... don't know; this seems a little like "select ain't broken" but... clearly i'm not the smartest guy in the room but i would think that you would want to flush the internal state of the encoder

of course i have been bitten by this type of post in the past to where it could be that there was no check for the "data" parameter being null or empty

tobi
12/21/2011 11:37 AM by
tobi

It is almost always wrong to flush. The only reason I can come up with is to harden data on disk immediately for durability semantics. Flush = fsync. It is always slower and does not change the data written at all.

Ayende Rahien
12/21/2011 11:52 AM by
Ayende Rahien

Tobi, No, it isn't. There are a lot of reasons to flush. Reducing memory usage, for example, ensuring lower latency in network scenarios, for another.

tobi
12/21/2011 01:07 PM by
tobi

Ayende, but data consistency is not one. Memory usage should be controlled by configuring the buffer size. You cannot release memory by flushing. The OS will flush to disk but still have the data cached. Network latency: true. Rarely needed and reduces throughput.

Ayende Rahien
12/21/2011 01:11 PM by
Ayende Rahien

Tobi, That is why I moved to stream writer there in the first place. And the flushing merely ensures that the data goes out. Without it, you are reliant on the Dispose calling Flush, and there are good reasons not of call it, sometimes. For example, you might call Dispose on the writer, which will dispose the stream, but you don't want that. One example why you want that.

Torvin
12/21/2011 06:57 PM by
Torvin

req.GetRequestStream() returns an instance of internal ConnectStream class, whose Flush() method for some reason does exactly nothing...

Nitin
12/22/2011 10:30 AM by
Nitin

var byteCount = Encoding.UTF8.GetByteCount(data); GetByteCount creates one more array of characters?

Ayende Rahien
12/22/2011 10:33 AM by
Ayende Rahien

Nitin, GetBytesCount does NOT create an array

Nitin
12/22/2011 10:47 AM by
Nitin

Internally before counting the number of characters?

Ayende Rahien
12/22/2011 11:09 AM by
Ayende Rahien

Nitin, No, it doesn't do that

Comments have been closed on this topic.