Note, this post was written by Federico. In the previous post after inspecting the decompiled source using ILSpy we were able to uncover potential things we could do. In this fragment we have a pretty optimized method to compare an entire 4 bytes per loop. What if we could do that on 8 bytes?
To achieve that we will use a ulong instead of a uint. This type of optimization makes sense for 2 reasons.
Most of our users are already running RavenDB in x64 where the native word is 8 bytes and Voron is compiled on x64 only. But even if that were not true, since the late 2000’ most CPUs would have a 64 bytes L1 cache line with half a cycle cost for a hit. So even if you can’t handle 64 bits in one go and the JIT or processor have to issue 2 instructions you are still getting a L1 cache hit and no pipeline stall. Which is GREAT .
So without farther ado, this is the resulting code:
Ayende’s note: In the code, the lp += (IntPtr)8/8; is actually defined as lp += 1; What is actually happening is that we are increasing by 8 bytes (size of ulong), and this is how ILSpy decided to represent that for some reason.
The actual IL generated for this is good:
It is just that the translation here is kind of strange.
Therefore the question to ask here is: Will skipping over the parts of the memory block that is equal at a faster rate will compensate for the cost of doing a final check with 8 bytes instead of 4 bytes?
Well the answer is a resounding yes. It won’t have much impact in the first 32 bytes (around 3% or less). We won’t lose, but we won’t win much either. But after that it skyrocket.
// Bandwidth optimization kicks in
Size: 32 Original: 535 Optimized: 442 Gain: 5.01%
Size: 64 Original: 607 Optimized: 493 Gain: 7.08%
Size: 128 Original: 752 Optimized: 573 Gain: 11.77%
Size: 256 Original: 1,080 Optimized: 695 Gain: 35.69%
Size: 512 Original: 1,837 Optimized: 943 Gain: 74.40%
Size: 1,024 Original: 3,200 Optimized: 1,317 Gain: 122.25%
Size: 2,048 Original: 5,135 Optimized: 2,110 Gain: 123.13%
Size: 4,096 Original: 8,753 Optimized: 3,690 Gain: 117.29%
Those are real measurements. You can see that when bandwidth optimization kicks in the gains start to get really high. This means that changing the bandwidth size alone from 4 byte to 8 bytes got us an order of magnitude improvement stabilizing around 120%.
Not bad for 2 lines of work.
More posts in "Excerpts from the RavenDB Performance team report" series:
- (20 Feb 2015) Optimizing Compare – The circle of life (a post-mortem)
- (18 Feb 2015) JSON & Structs in Voron
- (13 Feb 2015) Facets of information, Part II
- (12 Feb 2015) Facets of information, Part I
- (06 Feb 2015) Do you copy that?
- (05 Feb 2015) Optimizing Compare – Conclusions
- (04 Feb 2015) Comparing Branch Tables
- (03 Feb 2015) Optimizers, Assemble!
- (30 Jan 2015) Optimizing Compare, Don’t you shake that branch at me!
- (29 Jan 2015) Optimizing Memory Comparisons, size does matter
- (28 Jan 2015) Optimizing Memory Comparisons, Digging into the IL
- (27 Jan 2015) Optimizing Memory Comparisons
- (26 Jan 2015) Optimizing Memory Compare/Copy Costs
- (23 Jan 2015) Expensive headers, and cache effects
- (22 Jan 2015) The long tale of a lambda
- (21 Jan 2015) Dates take a lot of time
- (20 Jan 2015) Etags and evil code, part II
- (19 Jan 2015) Etags and evil code, Part I
- (16 Jan 2015) Voron vs. Esent
- (15 Jan 2015) Routing