Excerpts from the RavenDB Performance team reportOptimizing Memory Comparisons, Digging into the IL

time to read 5 min | 832 words

Note, this post was written by Federico. Where I had notes or stuff to extend, I explicitly marked it as such. In the previous post after inspecting the decompiled source using ILSpy  we were able to uncover potential things we could do.

Did you remember how dotPeek and ILSpy didn’t agree on the last for-loop?

dotPeek

image

ILSpy

image

Well to really know which one is right, lets dig deeper. Looks like dotPeek is just too smart for our purposes.

image

MSIL is an stack machine, so everything has to be pushed to the stack to be operated. And the lower you go the less context you have to make optimization choices. The compiler knows a lot more, therefore it can make sensible choices that the JIT can’t. Well this is one of those, the problem here is that the compiler is treating those native memory references in a very “un-native” way, leaving small room to the JIT to do its magic. Therefore we are going to give the compiler a nudge to point him in the right direction. 

We know that most architecture have a set of indexed instructions that will allow you to load from memory at a base address plus an offset and special ones optimized to operate with constants. Yeah all that magic in a single opcode.

Therefore if we can find a way that the compiler would emit such  a sequence, there is a high chance that the JIT will understand it and emit such a load statement. What could appear to be a long shot is actually quite easy. Instead of doing pointer arithmetic (pre/post increment and dereferencing) as usual, we will do something we would never do in C/C++; we will just ask for it at face value.

So what would be:

var v = *(lhs++) – *(rhs++); 

Now becomes:

var v = lhs[0] - rhs[0]; 

lhs++; 

rhs++;

What if we need the next one?

var v = lhs[1] - rhs[1]; 

And so on… However, that is true if and only if the number can be loaded into the stack using an special short instruction (a shortcut) that encodes the value to load as a named constant.

Why this work?

Because the MSIL pattern is unequivocal:

image

We push the first pointer (lhs) 

We load a byte from it and put it into an int32 register in the stack 

We push the second pointer (rhs) 

We load a byte from it and put it into an int32 register in the stack 

We subtract the two loaded int32. 

We store it into an stack variable (v) 

We load it into the stack from (v) 

We check if it is distinct from (0) 

The JIT now can figure out how to optimize this with a load + offset instruction easily. Moreover the offset is also a constant, anyone said “special opcode”?. Now let’s compare the IL code  from each approach.

Before Optimization:

image

After Optimization

image

While the amount of instructions is the same and the avid reader would have figured out by now; the code is not that different either.

However, the former translate to far more native instructions than the latter. Why? We will have to ask the JIT or the compiler guys, but my hypothesis is that the first version requires a much more deeper analysis than the second and in an effort to keep the JIT overhead low, that pattern can’t be optimized so much.

The bottom line is: “Do not optimize pointers in C# as you do in C/C++. Translating an optimized algorithm that uses pointers from C/C++ to C# will not be optimal.”

Remember this, it will make sense soon, because in the next post, we’ll tie it all together.

More posts in "Excerpts from the RavenDB Performance team report" series:

  1. (20 Feb 2015) Optimizing Compare – The circle of life (a post-mortem)
  2. (18 Feb 2015) JSON & Structs in Voron
  3. (13 Feb 2015) Facets of information, Part II
  4. (12 Feb 2015) Facets of information, Part I
  5. (06 Feb 2015) Do you copy that?
  6. (05 Feb 2015) Optimizing Compare – Conclusions
  7. (04 Feb 2015) Comparing Branch Tables
  8. (03 Feb 2015) Optimizers, Assemble!
  9. (30 Jan 2015) Optimizing Compare, Don’t you shake that branch at me!
  10. (29 Jan 2015) Optimizing Memory Comparisons, size does matter
  11. (28 Jan 2015) Optimizing Memory Comparisons, Digging into the IL
  12. (27 Jan 2015) Optimizing Memory Comparisons
  13. (26 Jan 2015) Optimizing Memory Compare/Copy Costs
  14. (23 Jan 2015) Expensive headers, and cache effects
  15. (22 Jan 2015) The long tale of a lambda
  16. (21 Jan 2015) Dates take a lot of time
  17. (20 Jan 2015) Etags and evil code, part II
  18. (19 Jan 2015) Etags and evil code, Part I
  19. (16 Jan 2015) Voron vs. Esent
  20. (15 Jan 2015) Routing