Excerpts from the RavenDB Performance team report

Do you copy that?
If memory compare is the bread, memory copy is the butter of a storage engine. Sooner or later we need to send the data to disk, and since Voron uses memory-mapped files we do the copying ourselves. Voron also uses a copy-on-write methodology to deal with tree changes, so there are plenty of copy operations around.
Unlike the memory compare case, where an optimized version already existed, for memory copy we relied on a P/Invoke call to memcpy, because we usually move lots of memory around and it is hard to compete with an assembler-coded version in the general case. Scrap that: it is not hard, it is extremely hard! Don't underestimate the impact that SSE extensions and access to prefetching instructions can have on memcpy.
However, not all memory copies are created equal, and there is plenty of opportunity for smart copying; our code was no exception to the rule. The first task was isolating the places where the actual "big copies" happen, especially where the cost of doing the P/Invoke call itself gets diluted by the sheer amount of data copied in the statistically representative case. You guessed right: for that we used our FreeDB example, and the results were very encouraging, as there were only a couple of instances of "big copies". In those cases replacing the P/Invoke memcpy was not an option, but for the rest we had plenty of different alternatives to try.
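The split between "big copies" and everything else can be sketched as a size dispatcher. This is C rather than the C# the team works in, and both the `BIG_COPY_THRESHOLD` value and the word-at-a-time small path are illustrative assumptions, not Voron's actual code:

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative cutoff: the measurements below suggest the alternatives
   converge somewhere around 2048 bytes. */
#define BIG_COPY_THRESHOLD 2048

/* Word-at-a-time copy for small sizes; the fixed-size memcpy calls
   compile down to single unaligned loads and stores. */
static void small_copy(uint8_t *dst, const uint8_t *src, size_t n)
{
    while (n >= sizeof(uint64_t)) {
        uint64_t v;
        memcpy(&v, src, sizeof v);   /* one 8-byte load */
        memcpy(dst, &v, sizeof v);   /* one 8-byte store */
        src += sizeof v;
        dst += sizeof v;
        n -= sizeof v;
    }
    while (n--)                      /* byte-wise tail */
        *dst++ = *src++;
}

/* Big copies go straight to memcpy, where the call overhead is diluted;
   everything else takes the specialized small path. */
void smart_copy(void *dst, const void *src, size_t n)
{
    if (n >= BIG_COPY_THRESHOLD)
        memcpy(dst, src, n);
    else
        small_copy((uint8_t *)dst, (const uint8_t *)src, n);
}
```

Like memcpy, this sketch assumes the source and destination do not overlap.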
The usual suspects to take over from our P/Invoke implementation for small copies were Buffer.BlockCopy, the MSIL cpblk instruction, and Buffer.Memcpy, which is internal (but who cares, we can still clone it).
The general performance landscape for all our alternatives is:
What we can take from this is: Buffer.Memcpy should be the base for any optimization effort until we hit 2048 bytes, where all the alternatives behave more or less the same. If we have to choose between Buffer.BlockCopy and memcpy, though, we will select the latter, because the former is pretty bad when running in 32 bits.
Having said that, the real eye-opener here is the for-loop, which is always a bad bet against Buffer.Memcpy. That matters especially because a for-loop is the usual strategy for copying fewer than 32 bytes.
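A minimal harness along these lines shows how the comparison was framed. This is a C sketch rather than the team's C# benchmark, and the sizes and iteration counts are arbitrary choices for illustration:

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>
#include <time.h>

/* Naive byte-wise loop: the usual strategy for small copies. */
static void for_loop_copy(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Wrapper so memcpy can be timed through the same function pointer. */
static void memcpy_copy(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Times `iters` copies of `n` bytes (n <= 64). The write-back into src
   creates a loop-carried dependency so the optimizer cannot discard
   the copies as dead code. */
static double time_copies(void (*copy)(uint8_t *, const uint8_t *, size_t),
                          size_t n, long iters)
{
    static uint8_t src[64], dst[64];
    for (size_t i = 0; i < n; i++)
        src[i] = (uint8_t)i;

    clock_t start = clock();
    for (long i = 0; i < iters; i++) {
        copy(dst, src, n);
        src[0] = (uint8_t)(dst[n - 1] + 1);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_copies(for_loop_copy, 24, 10000000)` against `time_copies(memcpy_copy, 24, 10000000)` from a main compiled with -O2 should, on most machines, reproduce the shape of the numbers reported below, though the absolute values will differ.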
There is also another interesting tidbit here: Buffer.Memcpy has some pretty nasty discontinuities around 16 bytes and 48 bytes.
Size | Memcpy | Stdlib | BlockCopy | ForLoop
  14 |    128 |    362 |       223 |     213
  16 |    126 |    336 |       220 |     235
  18 |    170 |    369 |       262 |     303
  20 |    160 |    368 |       247 |     304
  22 |    165 |    399 |       245 |     312
  44 |    183 |    499 |       257 |     626
  46 |    181 |    563 |       264 |     565
  48 |    272 |    391 |       257 |     587
  50 |    342 |    447 |       290 |     674
  52 |    294 |    561 |       269 |     619
What would you do here?
With RyuJIT there are new vectorised operations (SIMD). We will certainly look for opportunities to implement a fully managed version of memcpy if possible.
For a detailed analysis of the cost of P/Invoke and the Managed C++/CLI interface you can go to this article: http://www.codeproject.com/Articles/253444/PInvoke-Performance
For detailed metrics check: http://code4k.blogspot.com.ar/2010/10/high-performance-memcpy-gotchas-in-c.html