Excerpts from the RavenDB Performance team report: Do you copy that?

Feb 06 2015

Note: this post was written by Federico. This series relies heavily on the optimization work we did on memory comparisons; if you haven’t read it, I suggest you start there instead.

If memory compare is the bread, memory copy is the butter of a storage engine. Sooner or later we need to send the data to disk; in the case of Voron we use memory-mapped files, so we are doing the copy ourselves. Voron also uses a copy-on-write methodology to deal with tree changes, so there are plenty of copy commands around.

Unlike the memory compare case, where an optimized version already existed, for memory copy we relied on a P/Invoke call to memcpy, because we usually move lots of memory around and it is hard to compete in the general case with an assembler-coded version. Scratch that, it is not hard, it is extremely hard! Don’t underestimate the impact that SSE extensions and access to the prefetching operation can have on memcpy. [1]

However, not all memory copies are created equal, and there is plenty of opportunity to do some smart copying; our code wasn’t exactly an exception to the rule. The first task was isolating the places where the actual “big copies” happen, especially where the cost of doing a P/Invoke call gets diluted by the sheer amount of data copied [2] in the statistically representative case. You guessed right: for that we used our FreeDB example, and the results were very encouraging. There were a couple of instances of “big copies”. In those cases using the P/Invoke memcpy was not an option, but for the rest we had plenty of different alternatives to try.

The usual suspects to take over from our P/Invoke implementation for small copies would be Buffer.BlockCopy, the MSIL cpblk operation and Buffer.Memcpy, which is internal, but who cares, we can still clone it.

The general performance landscape for all our alternatives is:

What we can take from this: Buffer.Memcpy should be the base for any optimization effort until we hit 2048 bytes, where all the alternatives behave more or less the same way. If we have to choose between Buffer.BlockCopy and memcpy, though, we will select the latter, because when running in 32 bits the former is pretty bad. [3]
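
The size-based split described above can be sketched as follows. This is a minimal illustration written in C rather than the managed code the post is about; `copy_dispatch`, `copy_small` and `SMALL_COPY_CUTOFF` are hypothetical names, and the 2048 value is only borrowed from the observation above, not a tuned threshold.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cut-over: the post only says the alternatives converge
 * around 2048 bytes, so treat this value as an assumption. */
#define SMALL_COPY_CUTOFF 2048

/* Small path: move 8 bytes at a time, then finish the tail byte by byte. */
static void copy_small(unsigned char *dst, const unsigned char *src, size_t n)
{
    while (n >= sizeof(uint64_t)) {
        uint64_t word;
        memcpy(&word, src, sizeof word);  /* fixed-size copy, safe for any alignment */
        memcpy(dst, &word, sizeof word);
        src += sizeof word;
        dst += sizeof word;
        n -= sizeof word;
    }
    while (n > 0) {
        *dst++ = *src++;
        n--;
    }
}

/* Dispatcher: small copies stay inline, big ones go to the library memcpy.
 * As with memcpy, the two buffers must not overlap. */
void copy_dispatch(void *dst, const void *src, size_t n)
{
    if (n < SMALL_COPY_CUTOFF)
        copy_small(dst, src, n);
    else
        memcpy(dst, src, n);
}
```

Calling `copy_dispatch(dst, src, 13)` takes the inline path, while `copy_dispatch(dst, src, 4096)` falls through to `memcpy`. Where the cut-over belongs is exactly what the measurements are meant to decide.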

Having said that, the real eye-opener here is the for-loop, which is always a bad bet against Buffer.Memcpy, especially because that is the usual strategy followed when copying less than 32 bytes.

There is another interesting tidbit here: Buffer.Memcpy has some pretty nasty discontinuities around 16 bytes and 48 bytes.

Size: 14 Memcpy: 128 Stdlib: 362 BlockCopy: 223 ForLoop: 213
Size: 16 Memcpy: 126 Stdlib: 336 BlockCopy: 220 ForLoop: 235
Size: 18 Memcpy: 170 Stdlib: 369 BlockCopy: 262 ForLoop: 303
Size: 20 Memcpy: 160 Stdlib: 368 BlockCopy: 247 ForLoop: 304
Size: 22 Memcpy: 165 Stdlib: 399 BlockCopy: 245 ForLoop: 312
 
Size: 44 Memcpy: 183 Stdlib: 499 BlockCopy: 257 ForLoop: 626
Size: 46 Memcpy: 181 Stdlib: 563 BlockCopy: 264 ForLoop: 565
Size: 48 Memcpy: 272 Stdlib: 391 BlockCopy: 257 ForLoop: 587
Size: 50 Memcpy: 342 Stdlib: 447 BlockCopy: 290 ForLoop: 674
Size: 52 Memcpy: 294 Stdlib: 561 BlockCopy: 269 ForLoop: 619
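
To put a number on those discontinuities, the Memcpy column above can be fed into a small helper. This is only a sketch: `samples` and `relative_jump` are hypothetical names, the figures are copied from the listing, and the post does not state their unit.

```c
#include <stddef.h>

/* One row of the listing above: copy size in bytes and the Memcpy column. */
struct sample { int size; int memcpy_cost; };

static const struct sample samples[] = {
    {14, 128}, {16, 126}, {18, 170}, {20, 160}, {22, 165},
    {44, 183}, {46, 181}, {48, 272}, {50, 342}, {52, 294},
};

/* Relative increase between row i-1 and row i, e.g. 0.35 for +35%.
 * Requires i >= 1. Rows 4 and 5 (22 and 44 bytes) are not adjacent
 * sizes, so the value for i == 5 is not meaningful. */
double relative_jump(size_t i)
{
    return (double)samples[i].memcpy_cost / samples[i - 1].memcpy_cost - 1.0;
}
```

Going from 16 to 18 bytes (`relative_jump(2)`) costs roughly 35% more, and going from 46 to 48 bytes (`relative_jump(7)`) costs roughly 50% more, while the neighbouring steps barely move.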

What would you do here?

[1] With RyuJIT there are new vectorised operations (SIMD). We will certainly look for opportunities to implement a fully managed version of memcpy if possible.

[2] For a detailed analysis of the cost of P/Invoke and the Managed C++/CLI Interface you can go to this article: http://www.codeproject.com/Articles/253444/PInvoke-Performance

[3] For detailed metrics check: http://code4k.blogspot.com.ar/2010/10/high-performance-memcpy-gotchas-in-c.html

Comments

06 Feb 2015
11:20 AM

alex

@Federico: it appears that a similar strategy to the memcmp problem is a winner here. This obviously very closely mirrors what Buffer.Memcpy is doing, without caring about alignment. However instead of the P/Invoke to memcpy for larger sizes, at least on my system a delegate call to a generated CPBLK IL instruction is substantially faster (up to at least average buffer sizes of 8K).

Again, I just grabbed the latest copy from Voron's MemUtil source and added that to a little bench: https://gist.github.com/anonymous/ae90b03efd9082d781f1

From the results you can see that <disclaimer> on my system in this microbench,</disclaimer> for small sizes the switch over the first 16 bytes, accessing bytes in memory sequential order and not doing any jumps, a tight loop for sizes < 1024 and CPBLK for larger buffers is a winner on x64.

Given the bad press for CPBLK on x86, I found the results there rather surprising. It starts to be faster than anything else at around an average buffer size of 24 bytes.

06 Feb 2015
11:47 AM

alex

Also, interestingly, vectorised implementations are not necessarily superior, because it appears new intel processors are heavily optimized to do fast 'rep movsb' as a preferred copy implementation. Refer to a discussion (with some comparisons) here: https://forums.handmadehero.org/index.php/forum?view=topic&catid=4&id=142

06 Feb 2015
12:35 PM

Federico Lois

@Alex, I wrote this 3 weeks ago. Lots of things have happened since then. I like the CPBLK alternative because up to 512 bytes in my tests here (I will have to confirm that on a fourth-generation i7) it is faster than the alternatives. And if that continues to be true on the i7, we will surely use it for copies smaller than 512 bytes.

However, the current measurements with the very latest version (committed in my branch yesterday) show this:

1024
MemUtils.Copy : 841 ms, 8,50 GB/s, 1,50 times fastest
MemUtils.Current : 658 ms, 10,87 GB/s, 1,18 times fastest
CPBLK Raw : 774 ms, 9,24 GB/s, 1,38 times fastest
CPBLK : 639 ms, 11,19 GB/s, 1,14 times fastest
memcpy : 559 ms, 12,79 GB/s, 1,00 times fastest

4096
MemUtils.Copy : 857 ms, 17,81 GB/s, 1,35 times fastest
MemUtils.Current : 691 ms, 22,09 GB/s, 1,08 times fastest
CPBLK Raw : 884 ms, 17,26 GB/s, 1,39 times fastest
CPBLK : 860 ms, 17,75 GB/s, 1,35 times fastest
memcpy : 637 ms, 23,96 GB/s, 1,00 times fastest

16384
MemUtils.Copy : 1078 ms, 28,32 GB/s, 1,06 times fastest
MemUtils.Current : 1054 ms, 28,96 GB/s, 1,03 times fastest
CPBLK Raw : 1416 ms, 21,56 GB/s, 1,39 times fastest
CPBLK : 1404 ms, 21,74 GB/s, 1,38 times fastest
memcpy : 1021 ms, 29,90 GB/s, 1,00 times fastest

You have to make a very small change to achieve this, and I already wrote a blog post about it last night, but we will have to ask Oren when it is scheduled to be published ;)

06 Feb 2015
12:39 PM

Ayende Rahien

Alex, The post Federico is talking about is here: http://ayende.com/blog/170114/excerpts-from-the-ravendb-performance-team-report-optimizing-compare-the-circle-of-life-a-post-mortem?key=deee125618ca474d84ba5711d1751c75

It is set to be published in two weeks, but you can get a sneak peek at it now.

06 Feb 2015
13:39 PM

Federico Lois

@Alex I have been testing your benchmark a bit and there is something fishy in it. Somehow, even if I copy the whole CPBLK code, I cannot achieve the same time for 4 to 24 bytes. Therefore there must be some cache effects that we are missing; I believe the problem is that for less than 4096 bytes you are using fixed buffers, so everything is loaded into the L1 cache.

But even if I get rid of the cache effects, the code is consistently better in my own test harness (which accounts for L1/L2 eviction). I am investigating the generated assembly as we speak.

This code is what we want to test when RyuJIT is released: apparently the fastest version of aligned memcpy known to man. :) http://www.wenda.io/questions/115387/very-fast-memcpy-for-image-processing.html

06 Feb 2015
14:11 PM

alex

Thanks for the sneak preview. Using this change (i.e. adding these attributes to memcpy) and modifying "MemUtilsCopy" to mirror the "MemUtils.Compare" from the sneak preview, I see memcpy and "MemUtilsCopy" perform roughly the same as the "CPBLK Raw" variant on all sizes. The optimization for smaller buffer sizes (<1024) in the "CPBLK" variant is still the fastest up to around 1K sizes though on my system (Core i7-3615 QM). See https://gist.github.com/anonymous/b9c983d144776cd67ca8

So clearly results will vary a bit with different architectures.

btw, MS releasing CoreCLR into the open is pretty awesome. This for sure will give me some hours of fun while browsing through some interesting stuff.

06 Feb 2015
14:16 PM

alex

@federico. regarding cache effects: if I increase the minimum size of the buffers to 8 * 4096, I still get the same results.

06 Feb 2015
14:28 PM

alex

or more extreme, increasing to 128 * 4096, there are still the same relative speeds, although overall performance drops significantly

06 Feb 2015
14:47 PM

Federico Lois

Found out the difference. It seems that the direct call to the unmanaged memcpy makes the call use 3 more registers. That, believe it or not, causes the JIT to generate a sub-par switch statement (in order to avoid using more registers), which accounts for the difference. I fixed it by calling BulkCopy, which is managed, instead.

http://i.imgur.com/KorbNMq.png

These are the results on our test harness that ensures the CPU cannot guess the memory location or use the L1/L2 cache (that's why megabytes per second is smaller) and also pays the cost of fixing the array to be "consistent" with BlockCopy which doesn't require it but has to do it internally.

I am still waiting for the benchmarks on the i7 to know where to do the cut-over... As you can see the 1024 threshold is not always as efficient, so we are going to choose something in the middle. ;)

07 Feb 2015
16:13 PM

alex

@Federico, that image looks decidedly odd. Why would throughput drop that dramatically after around 1K and again after 64K?

Check out this beautiful sigmoid, which does what I would expect (increase with buffer size and start to plateau). http://i.imgur.com/Fl9wcB5.png.

09 Feb 2015
16:20 PM

Federico Lois

Cache boundaries can explain that. We don't have the ability to tell the processor that we are not going to touch the written memory, therefore it will evict the L1/L2/L3 lines and pollute the cache. The read/write cycle of big chunks of memory can, in the end, cause that difference.

Our benchmark may be wrong (nobody is perfect), but I would certainly distrust a benchmark that does not show those boundaries. Not having those discontinuities means that you cannot compare 4K with 16K and so on, because the general scenario is different for each one of those. Ensuring that the CPU must read into the cache (cannot use cached data) is a good thing.

BTW, I didn't consider it odd, because I have seen other benchmarks of memcpy that have exactly the same behavior.
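
One way to keep a copy benchmark from running entirely out of cached, predictable buffers, as discussed in this thread, is to pick source and destination regions pseudo-randomly from arenas much larger than the cache. The sketch below is an assumption about how such a harness could look, not the harness used for the measurements above; `struct arena`, `random_region` and `copy_pass` are hypothetical names, and timing code is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct arena {
    unsigned char *base;
    size_t size;      /* pick something well above the last-level cache size */
    uint32_t state;   /* xorshift32 state, must be non-zero */
};

/* Small deterministic pseudo-random generator (xorshift32). */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Picks a pseudo-random slot of `len` bytes inside the arena.
 * Requires 0 < len <= a->size. */
unsigned char *random_region(struct arena *a, size_t len)
{
    size_t slots = a->size / len;
    return a->base + (xorshift32(&a->state) % slots) * len;
}

/* Performs `iterations` copies of `len` bytes between random slots.
 * `src` and `dst` must describe separate, non-overlapping arenas. */
void copy_pass(struct arena *src, struct arena *dst, size_t len, size_t iterations)
{
    for (size_t i = 0; i < iterations; i++)
        memcpy(random_region(dst, len), random_region(src, len), len);
}
```

Wrapping `copy_pass` in a timer for each `len` would give one data point per size; with arenas smaller than the cache the same code measures the cached case instead, which is the difference being argued about here.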
