Excerpts from the RavenDB Performance team report: Optimizing Memory Comparisons, size does matter
Note: this post was written by Federico. In the previous post, after inspecting the decompiled source using ILSpy, we were able to uncover potential things we could do. In this fragment we have a fairly well optimized method that compares 4 bytes per loop iteration. What if we could do that with 8 bytes?
To achieve that, we will use a ulong instead of a uint. This optimization makes sense for two reasons.
Most of our users are already running RavenDB on x64, where the native word is 8 bytes, and Voron is compiled for x64 only. But even if that were not true, since the late 2000s most CPUs have had a 64-byte L1 cache line with roughly a half-cycle cost for a hit. So even if you can't handle 64 bits in one go and the JIT or the processor has to issue two instructions, you are still getting an L1 cache hit and no pipeline stall. Which is GREAT.
So without further ado, this is the resulting code:
Ayende’s note: In the code, the lp += (IntPtr)8/8; is actually defined as lp += 1; what is actually happening is that we are advancing the pointer by 8 bytes (the size of a ulong), and this is how ILSpy decided to represent that for some reason.
The actual IL generated for this is good:
It is just that the translation here is kind of strange.
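For illustration, here is a minimal sketch of what such an 8-bytes-per-iteration compare can look like in C#. This is hypothetical code written for this excerpt, not the exact Voron implementation, and the tail handling is simplified to a byte-wise loop:

```csharp
public static unsafe class MemoryCompareSketch
{
    // Returns <0, 0 or >0, following memcmp semantics.
    public static int Compare(byte* lhs, byte* rhs, int size)
    {
        ulong* lp = (ulong*)lhs;
        ulong* rp = (ulong*)rhs;

        // Main loop: 8 bytes per iteration. lp++ advances the pointer by
        // 8 bytes (the size of a ulong), which is what ILSpy rendered as
        // lp += (IntPtr)8/8 above.
        while (size >= sizeof(ulong) && *lp == *rp)
        {
            lp++;
            rp++;
            size -= sizeof(ulong);
        }

        // Tail (or the first differing word): fall back to a byte-wise scan
        // to locate the exact difference and return its sign.
        byte* lb = (byte*)lp;
        byte* rb = (byte*)rp;
        while (size > 0)
        {
            int diff = *lb - *rb;
            if (diff != 0)
                return diff;
            lb++;
            rb++;
            size--;
        }

        return 0;
    }
}
```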
Therefore the question to ask here is: will skipping over the equal parts of the memory block at a faster rate compensate for the cost of doing the final check with 8 bytes instead of 4 bytes?
Well, the answer is a resounding yes. It won't have much impact for the first 32 bytes (around 3% or less); we won't lose, but we won't win much either. But after that it skyrockets.
// Bandwidth optimization kicks in
Size: 32 Original: 535 Optimized: 442 Gain: 5.01%
Size: 64 Original: 607 Optimized: 493 Gain: 7.08%
Size: 128 Original: 752 Optimized: 573 Gain: 11.77%
Size: 256 Original: 1,080 Optimized: 695 Gain: 35.69%
Size: 512 Original: 1,837 Optimized: 943 Gain: 74.40%
Size: 1,024 Original: 3,200 Optimized: 1,317 Gain: 122.25%
Size: 2,048 Original: 5,135 Optimized: 2,110 Gain: 123.13%
Size: 4,096 Original: 8,753 Optimized: 3,690 Gain: 117.29%
Those are real measurements. You can see that when the bandwidth optimization kicks in, the gains start to get really high. This means that changing the comparison width alone from 4 bytes to 8 bytes got us more than a 2x improvement, stabilizing at around 120%.
Not bad for 2 lines of work.
More posts in "Excerpts from the RavenDB Performance team report" series:
- (20 Feb 2015) Optimizing Compare – The circle of life (a post-mortem)
- (18 Feb 2015) JSON & Structs in Voron
- (13 Feb 2015) Facets of information, Part II
- (12 Feb 2015) Facets of information, Part I
- (06 Feb 2015) Do you copy that?
- (05 Feb 2015) Optimizing Compare – Conclusions
- (04 Feb 2015) Comparing Branch Tables
- (03 Feb 2015) Optimizers, Assemble!
- (30 Jan 2015) Optimizing Compare, Don’t you shake that branch at me!
- (29 Jan 2015) Optimizing Memory Comparisons, size does matter
- (28 Jan 2015) Optimizing Memory Comparisons, Digging into the IL
- (27 Jan 2015) Optimizing Memory Comparisons
- (26 Jan 2015) Optimizing Memory Compare/Copy Costs
- (23 Jan 2015) Expensive headers, and cache effects
- (22 Jan 2015) The long tale of a lambda
- (21 Jan 2015) Dates take a lot of time
- (20 Jan 2015) Etags and evil code, part II
- (19 Jan 2015) Etags and evil code, Part I
- (16 Jan 2015) Voron vs. Esent
- (15 Jan 2015) Routing
Comments
I guess I should wait for the rest of the series, but I am assuming that you are limiting this to architectures where alignment is not _required_, and misalignment of either lhs or rhs is not that costly.
Alex, Yes and no. We are currently looking at x64 systems, yes, which are very forgiving of alignment errors, but we actually can't enforce alignment. We make sure to have at least 2-byte alignment, but we need to pack data tightly, and more than that would be very wasteful, so we don't bother.
@alex While there are architectures where alignment is required (older ARM), the newer ARM11 and Cortex-A/R already support unaligned access. However, I seriously doubt we will run the RavenDB engine on such a processor :)
I haven't done the benchmarks yet to corroborate it (so I am still performing a couple of operations in the current code to keep accesses aligned), but there are plenty of reports that the performance hit on newer architectures is a myth.
http://lemire.me/blog/archives/2012/05/31/data-alignment-for-speed-myth-or-reality/
"On the Sandy Bridge, there is no performance penalty for reading or writing misaligned memory operands, except for the fact that it uses more cache banks so that the risk of cache conflicts is higher when the operand is misaligned. Store-to-load forwarding also works with misaligned operands in most cases." http://www.agner.org/optimize/blog/read.php?i=142&v=t
Needless to say, I will get rid of those instructions once that is benchmarked.
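For context, the kind of alignment prologue being referred to is roughly the following. This is a sketch only; the helper name and ref-pointer signature are made up for this example, and it only aligns the left-hand side:

```csharp
public static unsafe class AlignmentSketch
{
    // Compare bytes one at a time until lhs sits on an 8-byte boundary, so a
    // subsequent main loop reads aligned words from lhs. Note that rhs may
    // still be misaligned unless both pointers share the same alignment offset.
    public static int CompareUnalignedHead(ref byte* lhs, ref byte* rhs, ref int size)
    {
        while (size > 0 && ((long)lhs & 7) != 0)
        {
            int diff = *lhs - *rhs;
            if (diff != 0)
                return diff;
            lhs++;
            rhs++;
            size--;
        }
        return 0; // equal so far; the caller continues with the word-sized loop
    }
}
```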
Did you benchmark it against a P/Invoke call to memcmp? In my own benchmark memcmp is a bit faster.
Thomas, Note that you have to be careful about whether you are benchmarking it on x86 or x64.
Good point... Also, I just realized that my test was run with optimizations disabled, which made the results useless.
I tested both, and here's the result:
32 bits: memcmp: 00:00:00.1036329 custom: 00:00:00.0593725
64 bits: memcmp: 00:00:00.0462231 custom: 00:00:00.0501936
So on 64 bits, memcmp is slightly faster, but on 32 bits the custom comparison is clearly faster...
Here's the code for my benchmark: https://gist.github.com/thomaslevesque/558763165d7e49341076
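For reference, the P/Invoke declaration for the CRT memcmp used in this kind of benchmark typically looks like this (a sketch; the exact benchmark code is in the gist above):

```csharp
using System;
using System.Runtime.InteropServices;

static class NativeMethods
{
    // msvcrt's memcmp: returns <0, 0 or >0, just like the managed comparison.
    [DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern int memcmp(byte[] b1, byte[] b2, UIntPtr count);
}

// Usage: int result = NativeMethods.memcmp(left, right, (UIntPtr)left.Length);
```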
Thomas, What sizes are you testing this on? At smaller sizes (very common for us), our optimized code is really much faster than memcmp. On larger stuff, memcmp is somewhat faster.
I tested on rather large arrays (4MB), but I get the same results with 4K. Anyway, my implementation isn't exactly the same as yours (since I don't have your full implementation), which might explain the difference.
@Thomas There are at least 2 reasons why you are getting better results. I will also speak about similar mistakes I made. The perils of microbenchmarking :)
The first is that you are reusing your buffer. That means that for smaller arrays you get an L1/L2 cache hit, which in real life is a condition that would never hold. It also means that you are not really checking the case of 4MB buffers; you are checking the position where the buffers differ, which is the worst-case scenario every single time (they are equal). That will skew your distribution and show the case where P/Invoke really shines, but that is not a measurement of your application's behavior. You can't beat SSE2 operations in .NET YET :D.
The second is that you are running too few iterations, therefore the jitter caused by CPU scheduling, the GC, or even other processes will be very high.
In a couple of posts (I don't remember which one) we work out the details of how to perform an amortized analysis based on some more reasonable assumptions. Those are the optimization benchmark tests we used before unleashing the optimized algorithm on RavenDB itself.
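To make the first point concrete, one way to avoid measuring warm-cache behavior is to rotate through a pool of buffers much larger than any CPU cache. This is a sketch with made-up sizes and a placeholder comparison, not the harness actually used:

```csharp
using System;
using System.Diagnostics;

class CacheColdBenchSketch
{
    const int BufferSize = 4096;
    const int PoolCount = 16 * 1024;   // ~128 MB total, far larger than L1/L2/L3

    // Placeholder for whichever comparison is being measured.
    static bool AreEqual(byte[] a, byte[] b)
    {
        for (int i = 0; i < a.Length; i++)
            if (a[i] != b[i]) return false;
        return true;
    }

    static void Main()
    {
        var rnd = new Random(42);
        var left = new byte[PoolCount][];
        var right = new byte[PoolCount][];
        for (int i = 0; i < PoolCount; i++)
        {
            left[i] = new byte[BufferSize];
            rnd.NextBytes(left[i]);
            right[i] = (byte[])left[i].Clone();   // equal buffers: the worst case for a compare
        }

        long equal = 0;
        var sw = Stopwatch.StartNew();
        for (int iter = 0; iter < 200000; iter++)
        {
            int idx = iter % PoolCount;           // rotate so buffers are not cache-resident
            if (AreEqual(left[idx], right[idx])) equal++;
        }
        sw.Stop();
        Console.WriteLine("{0} equal comparisons in {1} ms", equal, sw.ElapsedMilliseconds);
    }
}
```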
@Thomas I have been working on your code and devised a couple of ideas based on it. They will be added in a future commit. :)
This is the final tally:
Size: 2     Optimized: 193   Native: 448   Gain: 132.1244%
Size: 4     Optimized: 261   Native: 426   Gain: 63.21839%
Size: 8     Optimized: 277   Native: 388   Gain: 40.0722%
Size: 16    Optimized: 317   Native: 427   Gain: 34.70031%
Size: 32    Optimized: 351   Native: 457   Gain: 30.19943%
Size: 64    Optimized: 411   Native: 497   Gain: 20.92458%
Size: 128   Optimized: 544   Native: 628   Gain: 15.44118%
Size: 256   Optimized: 887   Native: 949   Gain: 6.989849%
Size: 512   Optimized: 1530  Native: 1552  Gain: 1.437914%
Size: 1024  Optimized: 2961  Native: 3141  Gain: 6.07903%
Size: 2048  Optimized: 4505  Native: 4418  Gain: -1.93119%
Size: 4096  Optimized: 6760  Native: 6841  Gain: 1.19822%
As you can see, the native implementation still suffers when faced with a non-worst-case scenario (where you don't need to look at all the bytes to decide). If you know you are comparing lots of statistically equal memory, you would certainly not use our version; ours is tuned for the case where we expect many of the buffers to be different (the usual case when traversing the trees).
In your benchmark, even though your results are correct for the worst case, most of the jitter was caused by the test harness; it accounted for more than 50% of the whole cost of executing t.Run(). I will talk about it in a later post, but the .Start()/.Stop() pattern causes lots of jitter in the results, and those errors are additive.
Having said that, I also discovered a source of jitter in my own benchmark code which penalized the native version more than it should have. It didn't radically change the results, but for the sake of completeness I have to mention it.
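One common way to keep that harness overhead out of the numbers is to put a single Start()/Stop() around a whole batch of iterations and report the average, rather than timing each call individually. A minimal sketch, with a trivial stand-in for the operation under test:

```csharp
using System;
using System.Diagnostics;

class BatchTimingSketch
{
    static void Main()
    {
        const int iterations = 1000000;
        var a = new byte[64];
        var b = new byte[64];

        // One Start/Stop around the whole batch, so the Stopwatch overhead is
        // not added (and accumulated) once per iteration.
        var sw = Stopwatch.StartNew();
        long acc = 0;
        for (int i = 0; i < iterations; i++)
            acc += a[i & 63] ^ b[i & 63];   // stand-in for the comparison under test
        sw.Stop();

        Console.WriteLine("total: {0} ticks, avg: {1:F4} ticks/op (acc={2})",
            sw.ElapsedTicks, (double)sw.ElapsedTicks / iterations, acc);
    }
}
```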
Thomas, Both 4MB and 4KB are huge; when talking about smaller sizes, we mean 8 bytes, 32 bytes, etc. Think of comparing keys in an index.
@Federico. At the peril of focusing too much on a single microbenchmark: it does seem (on at least my system, and likely on many modern Intel processors) that a little misalignment is not that bad. Also, at a certain size of the compared buffer (around 256/512) it makes more sense to incur the cost of a call to memcmp (which will likely be doing REP SCAS* ops).
See some results vs. the "MemoryUtils.Compare" version I just copied from the Voron source: https://gist.github.com/anonymous/67f270d9ce353ea19d1a
@alex Actually I saw this comment today. I benchmarked your routine against our latest one (not in the code yet: https://gist.github.com/anonymous/d113f06e6b007e87173b).
The results:
Size: 2    Optimized: 231   Alt: 245   Native: 514   Gain: -5.714285%
Size: 4    Optimized: 250   Alt: 226   Native: 368   Gain: 10.61947%
Size: 8    Optimized: 275   Alt: 276   Native: 342   Gain: -0.3623188%
Size: 16   Optimized: 291   Alt: 317   Native: 383   Gain: -8.201891%
Size: 32   Optimized: 318   Alt: 339   Native: 396   Gain: -6.194693%
Size: 64   Optimized: 378   Alt: 401   Native: 457   Gain: -5.73566%
Size: 128  Optimized: 541   Alt: 556   Native: 576   Gain: -2.697843%
Size: 256  Optimized: 880   Alt: 879   Native: 929   Gain: 0.1137614%
Size: 512  Optimized: 1633  Alt: 1669  Native: 1665  Gain: -2.156979%
As you can see we followed a similar path.
About the REP SCAS: I haven't looked into memcmp itself, but benchmarks suggest that when you account for the P/Invoke overhead, the difference is not big enough to surpass a managed version for anything less than 64KB. We still call memcmp above 512 bytes because the performance is similar, and if memcmp is improved for specific architectures we will pick up that improvement when we need it.
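A sketch of that kind of size-threshold dispatch, using the 512-byte cutoff mentioned above (the class name, threshold constant and the managed loop are illustrative, not the actual Voron code):

```csharp
using System;
using System.Runtime.InteropServices;

public static unsafe class ThresholdCompareSketch
{
    const int NativeThreshold = 512;   // above this, hand off to the CRT memcmp

    [DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl)]
    static extern int memcmp(IntPtr b1, IntPtr b2, UIntPtr count);

    public static int Compare(byte* lhs, byte* rhs, int size)
    {
        if (size > NativeThreshold)
            return memcmp((IntPtr)lhs, (IntPtr)rhs, (UIntPtr)size);

        // Small sizes: managed 8-bytes-per-iteration loop with a byte-wise tail.
        ulong* lp = (ulong*)lhs;
        ulong* rp = (ulong*)rhs;
        while (size >= sizeof(ulong) && *lp == *rp)
        {
            lp++; rp++; size -= sizeof(ulong);
        }

        byte* lb = (byte*)lp;
        byte* rb = (byte*)rp;
        while (size > 0)
        {
            int diff = *lb - *rb;
            if (diff != 0) return diff;
            lb++; rb++; size--;
        }
        return 0;
    }
}
```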
@Federico. I see that it is indeed the same approach with some optimizations along the decision path. Good to know this approach holds.
Regarding the P/Invoke overhead: what I am seeing is that when the average size to compare is above around 1K (i.e. well before 64K), memcmp is faster than 8-byte (unaligned) compares in a loop.
The issue is that we always have measurement bias; basically, the computer itself :)
We had to retire some optimizations that worked great on older hardware but not on newer, and vice versa. Testing optimizations at this level requires being very careful: you should not overfit for a particular architecture, because you will pay for it on the rest.
We had an optimization that was AWESOME on first-generation i5/i7 and Core 2, but on fourth-generation i7 saying it was bad is an understatement; it was awful.
Knowing those differences is why we switch to memcmp at the 512-byte threshold. We prefer to err on the low side rather than the high side, where it hurts more.