Ayende @ Rahien

Refunds available at head office

When mini benchmarks are important

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil" - Donald Knuth

I have expressed my dislike for micro benchmarks in the past, and in general, I still have this attitude, but sometimes, you really care.

A small note: while a lot of the namespaces you are going to see are Google.ProtocolBuffers, this represents my private fork of that library, customized to fit UberProf’s needs. Some of those things aren’t generally applicable (like string interning at the serialization level), so please don’t project from the content of this post onto the library itself.

Let me show you what I mean:

[Image: profiler results]

The profile above comes from running this piece of code:

private static bool StringEqaulsToBuffer(ByteString byteString, ByteBuffer byteBuffer)
{
    if(byteString.Length != byteBuffer.Length)
        return false;
    for (int i = 0; i < byteString.Length; i++)
    {
        if(byteString[i] != byteBuffer.Buffer[byteBuffer.Offset+i])
            return false;
    }
    return true;
}

This looks pretty simple, right?

Now, it is important to understand that this isn’t some fake benchmark that I contrived; these are the profile results from testing a real world scenario. In general, methods such as Equals or GetHashCode, or anything that they call, are likely to be called a lot of times, so paying attention to their performance is something that you should think about.

There are a couple of very easy things that I can do to make this faster: replace the calls to the ByteString indexer (which show up as get_Item in the profiler results) with direct array access, and consolidate the calls to the ByteString.Length property.

After applying those two optimizations, we get the following code:

private static bool StringEqaulsToBuffer(ByteString byteString, ByteBuffer byteBuffer)
{
    var strLen = byteString.Length;
    if(strLen != byteBuffer.Length)
        return false;
    for (int i = 0; i < strLen; i++)
    {
        if(byteString.bytes[i] != byteBuffer.Buffer[byteBuffer.Offset+i])
            return false;
    }
    return true;
}

And this profiler result:

[Image: profiler results after the two optimizations]

You can see that this simple change resulted in a drastic improvement to the StringEqualsToBuffer method. As it stands now, I don’t really see a good way to optimize this any further, so I am going to look at the other stuff that showed up. Let us take a look at ByteBuffer.GetHashCode() now:

public override int GetHashCode()
{
    var ret = 23;
    for (var i = Offset; i < Offset+Length; i++)
    {
        ret = (ret << 8) | Buffer[i];
    }
    return ret;
}

The problem is that I don’t really see a way to optimize that; instead, I am going to cache the value in a field. There is a problem here with the fact that ByteBuffer is mutable, but I can handle that by forcing all call sites that change it to call a method that will force hash recalculation. Note how different this decision is from the usual encapsulation that I would generally want. Placing additional burdens on call sites is a Bad Thing, but by doing so, I think that I can save quite significantly on the hash code calculation overhead.

Next, let us look at the DoCleanupIfNeeded method and see why it is taking so much time.

private void DoCleanupIfNeeded()
{
    if (strings.Count <= limit)
        return;

    // to avoid frequent thrashing, we will remove the bottom 10% of the current pool in one go
    // that means that we will hit the limit fairly infrequently
    var list = new List<KeyValuePair<ByteStringOrByteBuffer, Data>>(strings);
    list.Sort((x, y) => x.Value.Timestamp - y.Value.Timestamp);

    for (int i = 0; i < limit/10; i++)
    {
        strings.Remove(list[i].Key);                
    }
}

From the profiler output, we can see that it is an anonymous method that is causing the holdup. That is pretty interesting, since this anonymous method is the sort lambda. I decided to see if the BCL could do better, and changed the code to:

private void DoCleanupIfNeeded()
{
    if (strings.Count <= limit)
        return;

    // to avoid frequent thrashing, we will remove the bottom 10% of the current pool in one go
    // that means that we will hit the limit fairly infrequently
    var toRemove = strings.OrderBy(x=>x.Value.Timestamp).Take(limit/10).ToArray();

    foreach (var valuePair in toRemove)
    {
        strings.Remove(valuePair.Key);                
    }
}

This isn’t really what I want, since I can’t take a dependency on v3.5 in this code base, but it is a good perf test scenario. Let us see the profiler output after those two changes:

[Image: profiler results after the hash caching and sorting changes]

This is much more interesting, isn’t it?

First, we can see that the call to ByteBuffer.GetHashCode went away, but we have a new one, ByteBuffer.ResetHash. Note, however, that ResetHash took only half as much time as the previous appearance of GetHashCode, and that it is called only half as many times. I consider this a net win.

Now, let us consider the second change that we made. Where previously we spent 11.1 seconds on sorting, we can see that we now spend 18 seconds, even though the number of calls is much lower. That is a net loss, so we will revert that.

And now it is time for the only test that really matters: is it fast enough? I am checking that by simply running the test scenario outside of the profiler and seeing whether its performance is satisfactory. So far, it does meet my performance expectations, so I am going to finish with my micro optimizations and move on to more interesting things.

Comments

Andrew
12/30/2009 10:09 AM by
Andrew

(Nitpick: Equals is spelled wrong in StringEqualsToBuffer)

With ResetHash, are you recalculating the hash immediately, or simply clearing the hash value, allowing it to be calculated again when needed, lazily?

These posts are interesting, thanks for making them!

Greg Beech
12/30/2009 11:32 AM by
Greg Beech

Are you sure the first problem with getLength/getItem isn't caused by running the code under the profiler (a Heisenbug, in effect)? In optimised code running normally I'd expect to see the property accessor methods inlined by the CLR. Often profilers disable inlining to get more 'accurate' results which can lead to false positives like this when the method call overhead becomes significant; you may want to check that your profiler settings and ensure that it was not disabling method inlining.

In addition, your GetHashCode function doesn't appear to provide a true hash. As you're simply left-shifting all the previous data by 8, you're actually just discarding all but the last four bytes of data. I'd consider replacing the aggregation term with a multiply/xor combination, e.g. "(ret * 23) ^ Buffer[i]", which will give you a true hash.

Patrick Smacchia
12/30/2009 11:50 AM by
Patrick Smacchia

Micro-optimizations are needed I think:

codebetter.com/.../...need-micro-optimization.aspx

I wrote also concerning optimization on for loop:

codebetter.com/.../...e-net-code-performances.aspx

Btw, you can still have a win on the first example by creating a temporary reference to byteBuffer.Buffer (if it is a reference type):

for (int i = 0; i < strLen; i++) {
    if(byteString.bytes[i] != byteBuffer.Buffer[byteBuffer.Offset+i])
        return false;
}
Ayende Rahien
12/30/2009 12:27 PM by
Ayende Rahien

Andrew,

Yes, I found that out after I did the profiling and didn't want to redo it.

And I recalc immediately.

Ayende Rahien
12/30/2009 12:35 PM by
Ayende Rahien

Greg

I haven't considered the effect of running under the profiler; you are right. But the perf boost was noticeable even without the profiler.

You are correct about the hash func as well; in my defence, I am matching the behaviour of a third party.

Daniel
12/30/2009 02:29 PM by
Daniel

If you have to use that stupid hash function, surely you can implement it by looking only at the last 4 characters? That would save that loop which can get expensive for long strings.

In fact, if we're talking x86 assembly, GetHashCode() for Length>=4 can be implemented in a few instructions

// given EBX=Buffer+Offset and ECX=Length
MOV EAX, EBX[ECX-4]
BSWAP EAX // convert little endian to big endian
RET

I don't know if there's a way to make the .NET JIT output something like that; I would try something like

public override int GetHashCode()
{
    if (Length >= 4) {
        return IPAddress.HostToNetworkOrder(BitConverter.ToInt32(Buffer, Offset + Length - 4));
    } else {
        // your code
    }
}

Rafal
12/30/2009 03:51 PM by
Rafal

A bad hash function will make hashtable operations more costly, so it's probably better to fix the function than to try to make the bad version run faster.

configurator
12/30/2009 04:20 PM by
configurator

The JIT has optimizations for array access in a loop when the loop variable is bound to the array size (no bounds checking). If ByteBuffer.Buffer is actually an array, and if the entire buffer is used commonly (Offset = 0, Length = buffer.Length), you could try:

private static bool StringEqaulsToBuffer(ByteString byteString, ByteBuffer byteBuffer)
{
    var buffer = byteBuffer.Buffer;
    if(byteString.Length != buffer.Length)
        return false;
    for (int i = 0; i < buffer.Length; i++)
    {
        if(byteString.bytes[i] != buffer[i])
            return false;
    }
    return true;
}

You'd have to make the Offset-using version a special case though, so if you use it often this will be no optimization at all...

configurator
12/30/2009 04:29 PM by
configurator

I agree with Daniel and Rafal. This hash function is horrible and will cost you. Here's a good, fast and tested hash function - it will give different results on 32-bit and 64-bit machines but that doesn't matter. Hope you don't mind it's unsafe.

public override unsafe int GetHashCode()
{
    fixed (char* str = ((char*) Buffer))
    {
        char* chPtr = str;
        int num = 0x15051505;
        int num2 = num;
        int* numPtr = (int*) chPtr;
        for (int i = Buffer.Length; i > 0; i -= 4)
        {
            num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
            if (i <= 2)
            {
                break;
            }
            num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
            numPtr += 2;
        }
        return (num + (num2 * 0x5d588b65));
    }
}

firefly
12/30/2009 07:27 PM by
firefly

Greg is right; most of the time the JIT compiler is really good at inlining property accessors in a loop.

I also agree with the commenters. C and assembly programmers are extremely good at bit manipulation, so I would study their code to see how they do it. Perhaps a search on bit masking, bit shifting and bitwise operations will get you to the right place. Something similar to what configurator is doing, though his is specific, so you might want to come up with your own.

Alex Yakunin
12/30/2009 08:15 PM by
Alex Yakunin

Nice you started to care about such benchmarks ;)

For exactly this case the optimization suggested by configurator must be ideal:

  • Strings more frequently differ at the end, so it is better to compare them in reverse order. If you take a look at the built-in string comparers in .NET, you'll find they use this assumption.

  • It's really better to use pointers to access integer values, especially if you're going to use the above assumption (the JIT compiler does not recognize backward array enumeration), and it is anyway a bit faster than byte-by-byte comparison.

Alex Yakunin
12/30/2009 08:18 PM by
Alex Yakunin

Hmm... Just noticed he wrote a GetHashCode example, so it's not clear why he uses reverse enumeration in this case. But what I wrote above still stands for comparison.

tobi
12/30/2009 08:20 PM by
tobi

Your hash function is really bad because it only includes the last 4 bytes. If that is enough for you, you can optimize this method to be a lot faster^^

configurator
12/30/2009 10:22 PM by
configurator

@Alex: This example is the actual implementation of string.GetHashCode(), so I'm not the person to ask questions about it. I will note though that the enumeration is not reverse; only 'i' is reversed, and that is to prevent accessing the Length property multiple times. The pointer is incremented in each step though, so the enumeration is actually forward.

Note that this is an implementation for a char array. A similar one for bytes (This should be checked before using it, it's notepad code):

public override unsafe int GetHashCode()
{
    fixed (byte* str = ((byte*) Buffer))
    {
        byte* bytePtr = str;
        int num = 0x15051505;
        int num2 = num;
        int* numPtr = (int*) bytePtr;
        for (int i = Buffer.Length; i > 0; i -= 8)
        {
            num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
            if (i <= 4)
            {
                break;
            }
            num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
            numPtr += 2;
        }
        return (num + (num2 * 0x5d588b65));
    }
}

What's changed? 'char' changed to 'byte', 'i -= 4' changed to 'i -= 8' (because a char is two bytes), and '(i <= 2)' changed to '(i <= 4)'. I think that's pretty much it and this should work for byte arrays.

Andrew
12/31/2009 03:18 AM by
Andrew

RE: Custom hash function, it's probably also worth checking how even the distribution is between buckets in hashtables/dictionaries etc.

Perhaps there's a performance problem somewhere because the hashcode is too similar for many strings etc.

Dennis
12/31/2009 03:44 AM by
Dennis

private static bool StringEqaulsToBuffer(ByteString byteString, ByteBuffer byteBuffer)
{
    var strLen = byteString.Length;
    if(strLen != byteBuffer.Length)
        return false;
    var bytestr = byteString.bytes;
    var bufferstr = byteBuffer.Buffer;
    var offset = byteBuffer.Offset;
    for (int i = 0; i < strLen; offset++, i++)
    {
        if(bytestr[i] != bufferstr[offset])
            return false;
    }
    return true;
}

But how is byteBuffer.Buffer[] implemented? That might allow for further optimization

Ayende Rahien
12/31/2009 06:50 AM by
Ayende Rahien

Greg, Rafal & Daniel,

I was able to fix the hash function on both locations, thanks.

Ayende Rahien
12/31/2009 06:57 AM by
Ayende Rahien

Configurator,

I implemented something similar, but note that I can't do it in StringEqaulsToBuffer, for the simple reason that the byte buffer has an offset that would kill that optimization

Ayende Rahien
12/31/2009 07:05 AM by
Ayende Rahien

Dennis

I ended up doing something very much like yours, and byteBuffer.Buffer is an array of bytes.

Ayende Rahien
12/31/2009 07:06 AM by
Ayende Rahien

Andrew,

Agreed, and it was changed.

Dennis
12/31/2009 07:19 AM by
Dennis

In that case you can get it about twice as fast with unsafe code (or 5% of total runtime according to your profile):

  [ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail)]
  private static unsafe bool UnSafeArrayEquals(byte[] strA, byte[] strB)
  {
     int length = strA.Length;
     if (length != strB.Length)
        return false;

     fixed (byte* str = strA, str2 = strB)
     {
        long* chPtr3 = (long*)str;
        long* chPtr4 = (long*)str2;
        while (length >= 32)
        {
           if ((*chPtr3 != *chPtr4) ||
              (*(chPtr3 + 1) != *(chPtr4 + 1)) ||
              (*(chPtr3 + 2) != *(chPtr4 + 2)) ||
              (*(chPtr3 + 3) != *(chPtr4 + 3)))
           {
              return false;
           }
           chPtr3 += 4;
           chPtr4 += 4;
           length -= 32;
        }
        while (length >= 8)
        {
           if (*chPtr3 != *chPtr4)
           {
              return false;
           }
           chPtr3 += 1;
           chPtr4 += 1;
           length -= 8;
        }
        switch (length)
        {
           case 0: return true;
           case 1: return *(byte*)chPtr3 == *(byte*)chPtr4;
           case 2: return *(short*)chPtr3 == *(short*)chPtr4;
           case 3: return *(short*)chPtr3 == *(short*)chPtr4 && *(byte*)(((short*)chPtr3) + 1) == *(byte*)(((short*)chPtr4) + 1);
           case 4: return *(int*)chPtr3 == *(int*)chPtr4;
           case 5: return *(int*)chPtr3 == *(int*)chPtr4 && *(byte*)(((int*)chPtr3) + 1) == *(byte*)(((int*)chPtr4) + 1);
           case 6: return *(int*)chPtr3 == *(int*)chPtr4 && *(short*)(((int*)chPtr3) + 1) == *(short*)(((int*)chPtr4) + 1);
           case 7: return *(int*)chPtr3 == *(int*)chPtr4 && *(short*)(((int*)chPtr3) + 1) == *(short*)(((int*)chPtr4) + 1) && *(((byte*)chPtr3) + 6) == *(((byte*)chPtr4) + 6);
           default: return false; //dead code
        }
     }
  }
Ayende Rahien
12/31/2009 07:25 AM by
Ayende Rahien

Dennis,

That is a nice impl, although I'll admit it took me a moment to understand it.

Wouldn't it be just as fast to do?

[ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail)]
private static unsafe bool UnSafeArrayEquals(byte[] strA, byte[] strB)
{
    int length = strA.Length;
    if (length != strB.Length)
       return false;

    fixed (byte* str = strA, str2 = strB)
    {
       long* chPtr3 = (long*)str;
       long* chPtr4 = (long*)str2;
       while (length >= 8)
       {
          if (*chPtr3++ != *chPtr4++)
          {
             return false;
          }
          length -= 8;
       }
       switch (length)
       {
          case 0: return true;
          case 1: return *(byte*)chPtr3 == *(byte*)chPtr4;
          case 2: return *(short*)chPtr3 == *(short*)chPtr4;
          case 3: return *(short*)chPtr3 == *(short*)chPtr4 && *(byte*)(((short*)chPtr3) + 1) == *(byte*)(((short*)chPtr4) + 1);
          case 4: return *(int*)chPtr3 == *(int*)chPtr4;
          case 5: return *(int*)chPtr3 == *(int*)chPtr4 && *(byte*)(((int*)chPtr3) + 1) == *(byte*)(((int*)chPtr4) + 1);
          case 6: return *(int*)chPtr3 == *(int*)chPtr4 && *(short*)(((int*)chPtr3) + 1) == *(short*)(((int*)chPtr4) + 1);
          case 7: return *(int*)chPtr3 == *(int*)chPtr4 && *(short*)(((int*)chPtr3) + 1) == *(short*)(((int*)chPtr4) + 1) && *(((byte*)chPtr3) + 6) == *(((byte*)chPtr4) + 6);
          default: return false; //dead code
       }
    }
}

Dennis
12/31/2009 07:29 AM by
Dennis

Another quick optimization, due to your GetHashCode optimization (and since they use the same implementation), is to compare the two precomputed hashes if they are currently valid. But it is only useful if the common case is that the buffers are NOT equal.

Dennis
12/31/2009 07:37 AM by
Dennis

I'm not sure how smart the JIT is...

If it is really intelligent, it can use the repe cmpsb instruction in this case, which would make it pretty much as fast as your memory is.

But even without that, the 32-byte version will allow the CPU to fill all the ALUs (assuming 4 ALUs with 2 ops/clock) with useful stuff to do. The 8-byte version has some overhead in loop accounting.

For my benchmarking the 32-byte one is about 20% faster in the TRUE case for 38 values; your luck may vary depending on your input. But I would say that in general, for longer strings the 32-byte version will win.

And the x64 version is a lot faster than the x86 version (for obvious reasons, but it may also be that the x64 JIT is optimized for performance rather than quick compilation).

Using ints instead of longs on 32-bit is not any faster.

José Romaniello
12/31/2009 11:28 AM by
José Romaniello

Sorry, but the hashcode should be constant for the life of the object. This is a very basic guideline. Am I missing something?

Patrick Smacchia
12/31/2009 01:42 PM by
Patrick Smacchia

On a side note Dennis, to avoid dead code, instead of

case 6: YYY
case 7: XXX
default: return false; //dead code

I usually do

case 6: YYY
default:
    Debug.Assert(length == 7);
    XXX
    break;

Ayende Rahien
12/31/2009 02:08 PM by
Ayende Rahien

Jose,

There are reasons to violate this rule; in this case, ByteBuffer is a flyweight, which I use to reduce memory consumption.

configurator
01/01/2010 12:48 AM by
configurator

Ayende, May we know which hash function you're using now that you changed it?

Comments have been closed on this topic.