Excerpts from the RavenDB Performance team reportOptimizers, Assemble!
Note, this post was written by Federico. In the previous post on the topic after inspecting the decompiled source using ILSpy we were able to uncover potential things we could do.
Do you remember the loop from a couple of posts behind?
This loop was the responsible to check for the last bytes; either because the words compare found a difference or we are at the end of the memory block and the last block is not 8 bytes wide. You may remember it easier because of the awful code it generated.
After optimization work that loop looked sensibly different:
And we can double-check that the optimization did create a packed MSIL.
Now let’s look at the different operations involved. In red, blue and orange we have unavoidable instructions, as we need to do the comparison to be able to return the result.
13 instructions to do the work and 14 for the rest. Half of the operations are housekeeping in order to prepare for the next iteration.
The experienced developer would notice that if we would have done this on the JIT output each one of the increments and decrements can be implemented with a single assembler operation using specific addressing mode. However, we shouldn’t underestimate the impact of memory loads and stores.
How would the following loop look in pseudo-assembler?
:START Load address of lhs into register Load address of R into raddr-inregister Move value of lhs into R Load byte into a 32 bits register (lhs-inregister) Substract rhs-inregister, [raddr-inregister], Load int32 from R into r-inregister Jump :WEAREDONE if non zero r-inregister, Load address of lhs into lhsaddr-register Add 4 into [lhsaddr-register] (immediate-mode) Load address of rhs into rhsaddr-register Add 4 into [rhsaddr-register] (immediate-mode) Load address of n into naddr-register Increment [naddr-register] Load content of n into n-register Jump :START if bigger than zero n-register Push 0 Return :WEAREDONE Push r-inregister Return
As you can see the housekeeping keeps being a lot of operations. J
Armed with this knowledge. How would you optimize this loop?
More posts in "Excerpts from the RavenDB Performance team report" series:
- (20 Feb 2015) Optimizing Compare – The circle of life (a post-mortem)
- (18 Feb 2015) JSON & Structs in Voron
- (13 Feb 2015) Facets of information, Part II
- (12 Feb 2015) Facets of information, Part I
- (06 Feb 2015) Do you copy that?
- (05 Feb 2015) Optimizing Compare – Conclusions
- (04 Feb 2015) Comparing Branch Tables
- (03 Feb 2015) Optimizers, Assemble!
- (30 Jan 2015) Optimizing Compare, Don’t you shake that branch at me!
- (29 Jan 2015) Optimizing Memory Comparisons, size does matter
- (28 Jan 2015) Optimizing Memory Comparisons, Digging into the IL
- (27 Jan 2015) Optimizing Memory Comparisons
- (26 Jan 2015) Optimizing Memory Compare/Copy Costs
- (23 Jan 2015) Expensive headers, and cache effects
- (22 Jan 2015) The long tale of a lambda
- (21 Jan 2015) Dates take a lot of time
- (20 Jan 2015) Etags and evil code, part II
- (19 Jan 2015) Etags and evil code, Part I
- (16 Jan 2015) Voron vs. Esent
- (15 Jan 2015) Routing
Comments
Not if you are stuck with the availability of SSE.
Next version of .net will support SIMD instructions, in this case, it would be possible to vectorize the instructions in parallel, which can boost the performance 2x-4x .
Well, SIMD wouldn't really help that much because they are already packing multiple bytes to ulong in order to compare them at once. And use case for this loop is to compare tail of the arrays that was shorter than 8 bytes. You could save some instructions by indexing into the arrays instead of incrementing pointers.
for (int i = 0; i < n; ++i) { var r = lhs[i] - rhs[i]; if (r != 0) return r; }
return 0;
In the case of memcmp there is a potential use of SIMD operations in the high bandwidth checks 256 bits at once instead of 64 bits, however, until RyuJIT is final there is no way to know for sure if using RyuJIT will be useful.
Furthermore, for such an optimization to work (if it actually works) we must be able to detect that RyuJIT is active and load an alternative assembly. If not we will probably be decreasing the performance for non RyuJIT enabled instances.
Federico Lois - Totally Agree, in any case, both of the implementations should be done
Comment preview