Excerpts from the RavenDB Performance team reportOptimizing Memory Comparisons, Digging into the IL
Note, this post was written by Federico. Where I had notes or stuff to extend, I explicitly marked it as such. In the previous post after inspecting the decompiled source using ILSpy we were able to uncover potential things we could do.
Did you remember how dotPeek and ILSpy didn’t agree on the last for-loop?
Well to really know which one is right, lets dig deeper. Looks like dotPeek is just too smart for our purposes.
MSIL is an stack machine, so everything has to be pushed to the stack to be operated. And the lower you go the less context you have to make optimization choices. The compiler knows a lot more, therefore it can make sensible choices that the JIT can’t. Well this is one of those, the problem here is that the compiler is treating those native memory references in a very “un-native” way, leaving small room to the JIT to do its magic. Therefore we are going to give the compiler a nudge to point him in the right direction.
We know that most architecture have a set of indexed instructions that will allow you to load from memory at a base address plus an offset and special ones optimized to operate with constants. Yeah all that magic in a single opcode.
Therefore if we can find a way that the compiler would emit such a sequence, there is a high chance that the JIT will understand it and emit such a load statement. What could appear to be a long shot is actually quite easy. Instead of doing pointer arithmetic (pre/post increment and dereferencing) as usual, we will do something we would never do in C/C++; we will just ask for it at face value.
So what would be:
var v = *(lhs++) – *(rhs++);
var v = lhs - rhs; lhs++; rhs++;
What if we need the next one?
var v = lhs - rhs;
And so on… However, that is true if and only if the number can be loaded into the stack using an special short instruction (a shortcut) that encodes the value to load as a named constant.
Why this work?
Because the MSIL pattern is unequivocal:
We push the first pointer (lhs) We load a byte from it and put it into an int32 register in the stack We push the second pointer (rhs) We load a byte from it and put it into an int32 register in the stack We subtract the two loaded int32. We store it into an stack variable (v) We load it into the stack from (v) We check if it is distinct from (0)
The JIT now can figure out how to optimize this with a load + offset instruction easily. Moreover the offset is also a constant, anyone said “special opcode”?. Now let’s compare the IL code from each approach.
While the amount of instructions is the same and the avid reader would have figured out by now; the code is not that different either.
However, the former translate to far more native instructions than the latter. Why? We will have to ask the JIT or the compiler guys, but my hypothesis is that the first version requires a much more deeper analysis than the second and in an effort to keep the JIT overhead low, that pattern can’t be optimized so much.
The bottom line is: “Do not optimize pointers in C# as you do in C/C++. Translating an optimized algorithm that uses pointers from C/C++ to C# will not be optimal.”
Remember this, it will make sense soon, because in the next post, we’ll tie it all together.
More posts in "Excerpts from the RavenDB Performance team report" series:
- (20 Feb 2015) Optimizing Compare – The circle of life (a post-mortem)
- (18 Feb 2015) JSON & Structs in Voron
- (13 Feb 2015) Facets of information, Part II
- (12 Feb 2015) Facets of information, Part I
- (06 Feb 2015) Do you copy that?
- (05 Feb 2015) Optimizing Compare – Conclusions
- (04 Feb 2015) Comparing Branch Tables
- (03 Feb 2015) Optimizers, Assemble!
- (30 Jan 2015) Optimizing Compare, Don’t you shake that branch at me!
- (29 Jan 2015) Optimizing Memory Comparisons, size does matter
- (28 Jan 2015) Optimizing Memory Comparisons, Digging into the IL
- (27 Jan 2015) Optimizing Memory Comparisons
- (26 Jan 2015) Optimizing Memory Compare/Copy Costs
- (23 Jan 2015) Expensive headers, and cache effects
- (22 Jan 2015) The long tale of a lambda
- (21 Jan 2015) Dates take a lot of time
- (20 Jan 2015) Etags and evil code, part II
- (19 Jan 2015) Etags and evil code, Part I
- (16 Jan 2015) Voron vs. Esent
- (15 Jan 2015) Routing
It surprises me that there is no opcode for increase - see http://en.wikipedia.org/wiki/List_of_CIL_instructions
Needs three opcodes ldc.i4.1 conv.i add
In Java there is an opcode http://en.wikipedia.org/wiki/Java_bytecode
25: iinc 2, 1
Hope this will be correct in the ASM code
@Carsten Actually, it shouldnt be surprising at all. Based on what we know of MSIL (I am in guess-land here). It seems that the design principle behind MSIL was to keep the surface as small as possible. That can be seen in very low level decisions like using an stack based machine when most (if not all) the target architectures are register based.
Another hint is the low quantity of trully unique opcodes. There are a ton of opcodes, but most are just shorthand notations or restricted versions of more general ones (read restricted as special purpose).
Based on that it does make sense that increment doesn't exist at all. Because increment can be generated from 2 simple opcodes. Conv.i is there because we are using a 64bits extension of the ldc opcode. The JIT can have fast-tracks to handle increments by one among other usual patterns, while at the same time the MSIL is more accesible to static analysis tools (and the JIT itself) to do its analysis work.