Excerpts from the RavenDB Performance team reportDo you copy that?

time to read 4 min | 655 words

Note, this post was written by Federico. This series relies heavily on the optimization work we did on the optimizing memory comparisons, if you haven’t read it, I suggest you start there instead.

If memory compare is the bread, memory copy is the butter in an storage engine. Sooner or later we will need to send the data to disk, in the case of Voron we use memory mapped files so we are doing the copy ourselves. Voron also uses a copy-on-write methodology to deal with tree changes, so there are plenty of copy commands around. Smile

Differently from the memory compare case where an optimized version already existed. In the case of memory copy we relied on a p/invoke call to memcpy because we usually move lots of memory around and it is hard to compete in the general case with an assembler coded version.  Scrap that, it is not hard, it is extremely hard!!! Don’t underestimate the impact that SSE extensions and access to the prefetching operation can have on memcpy. [1]

However, usually not all memory copies are created equally and there is plenty opportunity to do some smart copy; our code wasn’t exactly an exception to the rule. The first work involved isolating the places where the actual “big copies” happen, especially where the cost of actually doing a p/invoke call gets diluted by the sheer amount of data copied [2] in the statistically representative case. You guessed right, for that we used our FreeDB example and the results were very encouraging, there  were a couple of instances of “big copies”. In those cases using the P/Invoke memcpy was not an option, but for the rest we had plenty of different alternatives to try.

The usual suspects to take over our P/Invoke implementation for small copies would be Buffer.BlockCopy, the MSIL cpblk operation and Buffer.Memcpy which is internal, but who cares, we can still clone it Smile.

The general performance landscape for all our alternatives is:

image

What we can get from this is: Buffer.Memcpy should be the base for any optimization effort until we hit the 2048 bytes where all behave more or less in the same way. If  we have to choose between Buffer.BlockCopy and memcpy though, we will select the latter because when running in 32 bits the former is pretty bad. [3]

Having said that, the real eye opener here is the for-loop which is always a bad bet against Buffer.Memcpy. Specially because that’s usual strategy followed when copying less than 32 bytes.

There is also another interesting tidbit here, Buffer.Memcpy has some pretty nasty discontinuities around 16 bytes and 48 bytes.

Size: 14 Memcpy: 128 Stdlib: 362 BlockCopy: 223 ForLoop: 213
Size: 16 Memcpy: 126 Stdlib: 336 BlockCopy: 220 ForLoop: 235
Size: 18 Memcpy: 170 Stdlib: 369 BlockCopy: 262 ForLoop: 303
Size: 20 Memcpy: 160 Stdlib: 368 BlockCopy: 247 ForLoop: 304
Size: 22 Memcpy: 165 Stdlib: 399 BlockCopy: 245 ForLoop: 312
 
Size: 44 Memcpy: 183 Stdlib: 499 BlockCopy: 257 ForLoop: 626
Size: 46 Memcpy: 181 Stdlib: 563 BlockCopy: 264 ForLoop: 565
Size: 48 Memcpy: 272 Stdlib: 391 BlockCopy: 257 ForLoop: 587
Size: 50 Memcpy: 342 Stdlib: 447 BlockCopy: 290 ForLoop: 674
Size: 52 Memcpy: 294 Stdlib: 561 BlockCopy: 269 ForLoop: 619

What would you do here?

[1] With RyuJIT there are new vectorised operations (SIMD). We will certainly look for opportunities to implement a fully managed  version of memcpy if possible.

[2] For a detailed analysis of the cost of P/Invoke and the Managed C++/CLI Interface you can go the this article: http://www.codeproject.com/Articles/253444/PInvoke-Performance

[3] For detailed metrics check: http://code4k.blogspot.com.ar/2010/10/high-performance-memcpy-gotchas-in-c.html

More posts in "Excerpts from the RavenDB Performance team report" series:

  1. (20 Feb 2015) Optimizing Compare – The circle of life (a post-mortem)
  2. (18 Feb 2015) JSON & Structs in Voron
  3. (13 Feb 2015) Facets of information, Part II
  4. (12 Feb 2015) Facets of information, Part I
  5. (06 Feb 2015) Do you copy that?
  6. (05 Feb 2015) Optimizing Compare – Conclusions
  7. (04 Feb 2015) Comparing Branch Tables
  8. (03 Feb 2015) Optimizers, Assemble!
  9. (30 Jan 2015) Optimizing Compare, Don’t you shake that branch at me!
  10. (29 Jan 2015) Optimizing Memory Comparisons, size does matter
  11. (28 Jan 2015) Optimizing Memory Comparisons, Digging into the IL
  12. (27 Jan 2015) Optimizing Memory Comparisons
  13. (26 Jan 2015) Optimizing Memory Compare/Copy Costs
  14. (23 Jan 2015) Expensive headers, and cache effects
  15. (22 Jan 2015) The long tale of a lambda
  16. (21 Jan 2015) Dates take a lot of time
  17. (20 Jan 2015) Etags and evil code, part II
  18. (19 Jan 2015) Etags and evil code, Part I
  19. (16 Jan 2015) Voron vs. Esent
  20. (15 Jan 2015) Routing