Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 5,971 | Comments: 44,508

filter by tags archive

Stupid smart code


We had the following code:

public static void WriteDataToRequest(HttpWebRequest req, string data)
{
    var byteArray = Encoding.UTF8.GetBytes(data);

    req.ContentLength = byteArray.Length;

    using (var dataStream = req.GetRequestStream())
    {
        dataStream.Write(byteArray, 0, byteArray.Length);
        dataStream.Flush();
    }
}

And that is a problem, because it allocates the memory twice, once for the string, once for the buffer. I changed that to this:

public static void WriteDataToRequest(HttpWebRequest req, string data)
{
    var byteCount = Encoding.UTF8.GetByteCount(data);
    req.ContentLength = byteCount;
    using (var dataStream = req.GetRequestStream())
    {
        if(byteCount <= 0x1000) // small size, just let the system allocate it
        {
            var bytes = Encoding.UTF8.GetBytes(data);
            dataStream.Write(bytes, 0, bytes.Length);
            dataStream.Flush();
            return;
        }

        var buffer = new byte[0x1000];
        var maxCharsThatCanFitInBuffer = buffer.Length / Encoding.UTF8.GetMaxByteCount(1);
        var charBuffer = new char[maxCharsThatCanFitInBuffer];
        int start = 0;
        var encoder = Encoding.UTF8.GetEncoder();
        while (start < data.Length)
        {
            var charCount = Math.Min(charBuffer.Length, data.Length - start);

            data.CopyTo(start, charBuffer, 0, charCount);
            var bytes = encoder.GetBytes(charBuffer, 0, charCount, buffer, 0, false);
            dataStream.Write(buffer, 0, bytes);
            start += charCount;
        }
        dataStream.Flush();
    }
}

And I was quite proud of myself.

Then I realized that I was stupid. Why?


Comments

tobi

Lots of scanning/copying of data. Worse than zeroing and filling an array once.

tobi

btw, take the Flush out. Why are people always doing this? Why would the stream delete data on dispose?

Patrick Huizinga

There's also an overload of GetBytes that accepts a string instead of a char buffer. http://msdn.microsoft.com/en-us/library/c20ss51h.aspx

so you can remove the charBuffer and the while loop can be changed to: var charCount = Math.Min(maxCharsThatCanFitInBuffer, data.Length - start); var bytes = encoder.GetBytes(data, start, charCount, buffer, 0); // could not find any GetBytes overload that takes a bool dataStream.Write(buffer, 0, bytes); start += charCount

However, I fear this is not the stupid part of the code. For that I'll wait till tomorow.

wouter

Couldn't you just have used a StreamWriter? You can specify encoding and buffer size in its constructor and then write the string directly...

Patrick Huizinga

tobi: where exactly is the scanning happening? I can only spot operations that are / should be O(1)

And are you sure copying 100 times to an array of size 4K is slower that copying once to an array of size 400K?

Fredrik Mörk

The charBuffer seems unnecessary, given that a string can also act as a char[], so instead of shuffling those chars, you can have the encoder.GetBytes call pull the characters directly from data.

Richard Dingwall

What is the expected size of 'string data'?

Ryan Heath

You run the risk to split the string's surrogate pairs?

// Ryan

Thomas Levesque

@wouter, this doesn't allow you to know the length in bytes before you start writing the data, and headers must be sent before you start writing the body

Dave Hng

What is in 'string data'? Isn't that already a buffer of some sorts?

tobi

Patrick, GetByteCount is O(N) because UTF8 is variable length. Need to decode the entire string for this. I like the smaller chunks, however, for cache efficiency reasons.

Jarrett

Because you started with code that was crystal clear and anyone should be able to read & understand and replaced it with code that you need to read, reread, think about, and read again.

sam
sam

my guess

public virtual int GetByteCount(string s) { if (s == null) { throw new ArgumentNullException("s"); } char[] chars = s.ToCharArray(); return this.GetByteCount(chars, 0, chars.Length); }

Mose

GetRequestStream() returns a synchrounous stream.

That means until the stream is closed or flushed, data is not sent, it's stored in memory.

Ant
Ant

Writing data to a request? Huh?

Sam
Sam

I'm with wouter - my first thought was: "he's reimplemented StreamWriter.Write()".

@Thomas - ContentLength is being determined by Encoding.UTF8.GetByteCount(), not by examining the size of the encoded array (although I'm trusting Microsoft to have not implemented GetByteCount like that).

tobi

Sam, I believe there is no O(1) way to calculate GetByteCount.

Sam
Sam

@Ant - for instance, the body of a POST or PUT request.

JD
JD

"bytes" var is declared inside the loop so N many will be allocated. Of course these will eventually be GCed but the point of the exercise was to sacrifice simplicity for efficiency.

Sam
Sam

@tobi - my understanding is that the optimisation was intended to avoid allocating memory for the (potentially large) request body twice, not to minimise encoding time.

Sam
Sam

@JD - 'bytes' is actually an int. The joys of var and ambiguous variable names!

Patrick Huizinga

@tobi, missed that one, although I only looked inside the loop

@JD, bytes is an int, not much GC-ing there :)

IgorM

You can use the lambda expression and execute it in one line:

Var reader = new StringReader(data); reader.CopyTo(req.GetRequestStream();

Rajeesh C V

If the method is not going to write big chunk of data, then you have increased the complexity instead of keeping it simple

Dennis

Jarrett is the only one who's got it so far. Do you really need the extra code complexity and lack of maintainability for that performance gain? Do you have evidence that the extra memory allocation was really hurting the system?

Mike

I'm guessing this code is stupid because you probably don't have any profiling metrics that show it actually needs optimization in the first place. So, rather than fixing true bottlenecks or implementing new features, you have needlessly invested a lot of time and introduced much more complexity.

Frank Quednau

Confused by the first version in the first place. The simplest abstraction to write a string to a stream is a streamwriter, is it not?

using (var w = new StreamWriter(r.GetRequestStream(), Encoding.UTF8)) sw.Write(data);

Dennis

Jarrett indeed is the first one who mentions the first thing that popped into my mind when reading this: "why change something which intent initially was perfectly obvious into something that you need to read multiple times and is a lot harder to understand for a micro-optimization that may not even necessarily be an issue?"

Doesn't seem like it's worth the extra complexity it introduces at a cost of read- and maintainability.

Bart Verkoeijen

What's the gain in performance compared to the extra complexity of the code?

James_2JS

@Frank Quednau - the first version is like it is due to the need to set ContentLength before writing to the stream

James_2JS

I believe the problem is two-fold:

1) GetByteCount(string s) converts the string into char[s.Length] - so nothing has been gained there

2) worse still, GetByteCount has to encode the characters to produce its result, then the characters are encoded a 2nd time to write to the stream

João Angelo

As other pointed out I'm on the side that the "I'm stupid" epiphany was due to realizing that you just reimplemented logic that is already implemented by StreamWriter.

Michael McDaniel

How about flushing inside the loop if you want the data streamed? Let's call it lots of "courtesy flushes"

Karep

I'm not sure I understand. In first solution you say memory is allocated twice. One for the string (parameter?) and for the buffer.

In second solution memory is allocated for string and then for buffer but multiple times. var buffer... var charBuffer... and in a while loop var bytes

So what you gain is that when data is very big you don't end up allocating a lot of memory on LOH. Instead you do a lot of small allocations if data is big.

Don't know what you mean by being stupid. I don't like the code before and after modification if data is very big. If it is not then why bother changing original code?

dotnetchris

Tell us what you see Ayende

Jay
Jay

Use Stream.Copy

Steve Py

Jarret + 0x1000!

Seeing as these are strings being passed back on a request I cannot see them being HUGE. (many MBs) I'd expect if anything was a memory drain it would be from the server serving many requests in which case the complex code would solve nothing. Temporarily allocating a few extra K for byte arrays to maintain readable code is certainly worth it.

Ben
Ben

data.CopyTo(...)

copies the value and doesn't reference the original char?

Ryan

My guess is that since you instantiate a new "bytes" var in every loop it actually ends up holding everything in memory twice anyways(at least until the GC kicks in).

meisinger

hmm... don't know; this seems a little like "select ain't broken" but... clearly i'm not the smartest guy in the room but i would think that you would want to flush the internal state of the encoder

of course i have been bitten by this type of post in the past to where it could be that there was no check for the "data" parameter being null or empty

tobi

It is almost always wrong to flush. The only reason I can come up with is to harden data on disk immediately for durability semantics. Flush = fsync. It is always slower and does not change the data written at all.

Ayende Rahien

Tobi, No, it isn't. There are a lot of reasons to flush. Reducing memory usage, for example, ensuring lower latency in network scenarios, for another.

tobi

Ayende, but data consistency is not one. Memory usage should be controlled by configuring the buffer size. You cannot release memory by flushing. The OS will flush to disk but still have the data cached. Network latency: true. Rarely needed and reduces throughput.

Ayende Rahien

Tobi, That is why I moved to stream writer there in the first place. And the flushing merely ensures that the data goes out. Without it, you are reliant on the Dispose calling Flush, and there are good reasons not of call it, sometimes. For example, you might call Dispose on the writer, which will dispose the stream, but you don't want that. One example why you want that.

Torvin

req.GetRequestStream() returns an instance of internal ConnectStream class, whose Flush() method for some reason does exactly nothing...

Nitin

var byteCount = Encoding.UTF8.GetByteCount(data); GetByteCount creates one more array of characters?

Ayende Rahien

Nitin, GetBytesCount does NOT create an array

Nitin

Internally before counting the number of characters?

Ayende Rahien

Nitin, No, it doesn't do that

Comment preview

Comments have been closed on this topic.

FUTURE POSTS

  1. Paying the rent online - 3 days from now

There are posts all the way to Aug 03, 2015

RECENT SERIES

  1. Production postmortem (5):
    29 Jul 2015 - The evil licensing code
  2. Career planning (6):
    24 Jul 2015 - The immortal choices aren't
  3. API Design (7):
    20 Jul 2015 - We’ll let the users sort it out
  4. What is new in RavenDB 3.5 (3):
    15 Jul 2015 - Exploring data in the dark
  5. The RavenDB Comic Strip (3):
    28 May 2015 - Part III – High availability & sleeping soundly
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats