Excerpts from the RavenDB Performance team report: Optimizing Compare, Don’t you shake that branch at me!
Note: this post was written by Federico. In the previous post, after inspecting the decompiled source using ILSpy, we were able to uncover potential things we could do.
By now we have already squeezed out almost all the obvious inefficiencies that we had uncovered through static analysis of the decompiled code, so we will need another strategy. For that we need to analyze the runtime behavior in the average case. We did something similar in this post, when we worked through an example using a 16 byte compare with equal arrays.
To perform that analysis live, we needed to somehow know the size of the typical memory block while running the test under a line-by-line profiler. We built a modified version of the code that stored the size of each memory chunk to compare and then built a histogram from that data (this is why exact replicability matters). For our workload, the histogram showed a couple of clusters for the length of the memory being compared. The first cluster was near 0 bytes, but not exactly 0. The other cluster was centered around 12 bytes, which makes sense as the keys of the documents were around 11 bytes. This gave us a very interesting insight. Armed with that knowledge, we made our first statistics-based optimization.
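A minimal sketch of the kind of instrumentation involved (the class name, method names, and bucket count here are illustrative, not the actual RavenDB code): wrap the compare, record each requested size into a histogram bucket, and dump the counts after the run.
[code]
using System.Threading;

internal static unsafe class CompareInstrumentation
{
    // One bucket per compared length up to 63; anything larger falls into the last bucket.
    private static readonly int[] SizeHistogram = new int[64];

    public static int InstrumentedCompare(byte* p1, byte* p2, int size)
    {
        var bucket = size < SizeHistogram.Length ? size : SizeHistogram.Length - 1;
        Interlocked.Increment(ref SizeHistogram[bucket]);

        return Compare(p1, p2, size); // the real comparison under test
    }

    private static int Compare(byte* p1, byte* p2, int size)
    {
        for (int i = 0; i < size; i++)
        {
            int diff = p1[i] - p2[i];
            if (diff != 0)
                return diff;
        }
        return 0;
    }
}
[/code]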
You can notice the if statement at the start of the method, which is a pretty standard bounding condition: if the memory blocks are empty, they are equal. In a normal environment such a check is so simple that nobody would bother, but in our case, when we are measuring runtimes in nanoseconds, 3 extra instructions and a potential branch miss do count.
That code looks like this:
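Roughly, the guard in question has the following shape (a sketch only; the naive byte loop below stands in for the real comparison body):
[code]
public static unsafe int Compare(byte* p1, byte* p2, int size)
{
    // The bounding condition in question: empty blocks are trivially equal.
    if (size == 0)
        return 0;

    // ... the actual comparison (a naive byte loop stands in for it here) ...
    for (int i = 0; i < size; i++)
    {
        int diff = p1[i] - p2[i];
        if (diff != 0)
            return diff;
    }
    return 0;
}
[/code]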
That means that not only are we making the check, we are also forcing a short jump every single time. But our histogram also tells us that memory blocks of size 0 almost never happen, so we are paying 3 instructions and a branch for something that almost never occurs. However, we also knew there was a cluster near 0 that we could exploit. The problem is that we would be paying 3 cycles (1 nanosecond on our idealized processor) per branch. As our average is 97.5 nanoseconds, we gain roughly a 1% improvement on almost any call (except the statistically unlikely case) if we are able to get rid of it.
Resistance is futile, that branch needs to go.
In C, in assembler, and on almost any low-level architecture like GPUs, there are 2 common approaches to optimize this kind of scenario:
- The ostrich method (hide your head in the sand and pray it just works).
- Use a lookup table.
The first is simple: if you don’t check and the algorithm can deal with the condition in the body, zero instructions always beat more than zero instructions (this is a corner case anyway, so no damage is done). This approach is usually used in massively parallel computing, where the cost of instructions is negligible while memory access is not. But it has its uses in more traditional superscalar and branch-predicting architectures too (you just don’t have as large an instruction budget to burn).
The second is more involved. You need to be able to somehow “index” the input, paying fewer instructions than the actual branches would cost (at a minimum of 1 nanosecond each, aka 3 instructions on our idealized processor). Then you create a branch table and jump to the appropriate index, which itself will jump to the proper code block, using just 2 instructions.
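To make the idea concrete in C#, the closest hand-rolled equivalent is a table of delegates indexed by the input; this is purely an illustration of the technique, not the code we ended up with:
[code]
using System;

internal static class BranchTableSketch
{
    // The input (0..3) is already the index, so the lookup is direct.
    private static readonly Func<int>[] Handlers =
    {
        () => 0,   // handle size == 0
        () => 1,   // handle size == 1
        () => 2,   // handle size == 2
        () => 3,   // handle size == 3
    };

    public static int Dispatch(int n)
    {
        // Load the target from the table, then jump to it: no chain of
        // conditional branches, just an indexed load and an indirect call.
        return Handlers[n]();
    }
}
[/code]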
Note: Branch tables are very well explained at http://en.wikipedia.org/wiki/Branch_table. If you made it this far you should read it; don’t worry, I will wait.
The key takeaway: if your algorithm has a sequence of 0..n, you are in the best possible world, because you already have your index. Which we did.
I know what you are thinking: Will the C# JIT compiler be smart enough to convert such a pattern into a branch table?
The short answer is yes, if we give it a bit of help. The if-then-elseif pattern won’t cut it, but what about switch-case?
The compiler will create a switch opcode (in short, our branch table) if our values are small and contiguous.
Therefore that is what we did. The impact? Big, but this is just the beginning. Here is what this looks like in our code:
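The actual method isn’t reproduced here, but the shape of the change is roughly the following (the specific cases and the CompareLong fallback are simplified stand-ins, not our production code):
[code]
public static unsafe int Compare(byte* p1, byte* p2, int size)
{
    // Small, contiguous case values let the JIT emit a single jump table
    // instead of a chain of conditional branches.
    switch (size)
    {
        case 0:
            return 0;
        case 1:
            return p1[0] - p2[0];
        case 2:
        {
            int diff = p1[0] - p2[0];
            return diff != 0 ? diff : p1[1] - p2[1];
        }
        case 3:
        {
            int diff = p1[0] - p2[0];
            if (diff != 0)
                return diff;
            diff = p1[1] - p2[1];
            return diff != 0 ? diff : p1[2] - p2[2];
        }
        default:
            return CompareLong(p1, p2, size); // general path for larger blocks
    }
}

private static unsafe int CompareLong(byte* p1, byte* p2, int size)
{
    for (int i = 0; i < size; i++)
    {
        int diff = p1[i] - p2[i];
        if (diff != 0)
            return diff;
    }
    return 0;
}
[/code]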
I’ll talk about the details of branch tables in C# more in the next post, but I didn’t want to leave you hanging too much.
More posts in "Excerpts from the RavenDB Performance team report" series:
- (20 Feb 2015) Optimizing Compare – The circle of life (a post-mortem)
- (18 Feb 2015) JSON & Structs in Voron
- (13 Feb 2015) Facets of information, Part II
- (12 Feb 2015) Facets of information, Part I
- (06 Feb 2015) Do you copy that?
- (05 Feb 2015) Optimizing Compare – Conclusions
- (04 Feb 2015) Comparing Branch Tables
- (03 Feb 2015) Optimizers, Assemble!
- (30 Jan 2015) Optimizing Compare, Don’t you shake that branch at me!
- (29 Jan 2015) Optimizing Memory Comparisons, size does matter
- (28 Jan 2015) Optimizing Memory Comparisons, Digging into the IL
- (27 Jan 2015) Optimizing Memory Comparisons
- (26 Jan 2015) Optimizing Memory Compare/Copy Costs
- (23 Jan 2015) Expensive headers, and cache effects
- (22 Jan 2015) The long tale of a lambda
- (21 Jan 2015) Dates take a lot of time
- (20 Jan 2015) Etags and evil code, part II
- (19 Jan 2015) Etags and evil code, Part I
- (16 Jan 2015) Voron vs. Esent
- (15 Jan 2015) Routing
Comments
So what's the point of these optimizations of 1%, 3%, etc.? Do they have a sizable impact on Raven?
Are you preparing for some benchmarking competition against other products, or is this just a push to squeeze the most out of Raven?
Catalin, A 3% here and a 2% there adds up to quite a lot, especially when we are talking about optimizing methods that are called tens of millions of times. In the case of compare, it is called all the time in our code (for every get request, for every put, for every scan, pretty much all the time). Even small improvements will yield a good return.
The difference between 200 req/sec and 206 req/sec isn't that big, but it means another half a million requests a day per node, for example. And the total result isn't just a 3% improvement overall.
@Catalin just to put it in context: a bulk insert of 50000 small documents will cause (if I recall correctly, and I am being conservative here) more or less 3M calls to this function, or about half a second for every 50000 elements. And it is not even a heavily used function when bulk inserting, because that is a very optimized path which avoids searching the trees as much as possible.
Are you sure the switch provided a speedup? Last time I looked small switches were compiled to if-else by the JIT. The threshold was 2 or 3 so I'm not sure what you have been getting here.
Also, a table-based switch has quite a few instructions in the header. At the very least there is a branch for the default case. It has an indirect branch (to a dynamic location). Those are not good for performance. In all cases it is more expensive than a single predicted branch.
A 100% predicted branch (like the n == 0 branch) should cost 1-2 cycles, which may even overlap with other work, so it is nearly free.
The methodology of looking at IL to determine what to do is really unsound. Look at the generated x64.
As an idea you could have a switch with, say, 20 cases for [0, 19] and for each of those you hard-code the comparison in a loop-free way. That might hurt because of code size though.
Tobi, No, we are pretty sure that this was responsible for a significant speed up :-) And the JIT doesn't turn switch to if/else. See: http://www.dotnetperls.com/if-switch-performance
And code size is an issue that we need to take into account. In particular, we want to make sure that the entire method can fit inside L1 easily, otherwise we will need to move it back and forth when it is called.
Tobi, Note that the difference is whether you have a jump table that is full or sparse. A sparse jump table would likely go into if/else mode, but a full jump table is more efficient and will be the one used.
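To illustrate the point (a generic sketch of the JIT behavior being described, not RavenDB code): contiguous case values are eligible for a dense jump table, while widely spaced values are typically lowered to a chain of compares.
[code]
internal static class SwitchShapes
{
    // 0..4 are contiguous: a good candidate for a single jump table.
    public static int Dense(int n)
    {
        switch (n)
        {
            case 0: return 10;
            case 1: return 11;
            case 2: return 12;
            case 3: return 13;
            case 4: return 14;
            default: return -1;
        }
    }

    // 5, 50, 500, 5000 are sparse: typically compiled to compares instead.
    public static int Sparse(int n)
    {
        switch (n)
        {
            case 5: return 10;
            case 50: return 11;
            case 500: return 12;
            case 5000: return 13;
            default: return -1;
        }
    }
}
[/code]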
OK, the switch is only being removed if there are at most two cases (tested it just now).
Still, the switch header is 8 instructions including one direct and one indirect branch. Wherever the speedup came from it is unexplained why the switch would cause it. The switch is inferior by every measure. Any idea why it would be faster in this particular case?
Tobi, You might be counting instructions, but you are ignoring the fact that we are reducing the number of branches that the CPU has to deal with.
On my count the branch count increased from one to two including one indirect branch.
Tobi, Where do you see that? The compiler is probably doing something like this:
jumpTable[ n ] ();
No branching at all.
Tobi, Note that what we are counting is the conditional branching. That is what we care about, because it means that the CPU needs to go into speculative mode, and maybe throw work out if the prediction turns out to be wrong.
In order to answer such questions I advise looking at the x86 code. There you see:
[code]
if ((uint)n >= 2) goto default; //branch
var switchOffset = table[n];
var switchAddr = switchOffset + 0x123456;
goto switchAddr; //branch
[/code]
A jump to a dynamic address requires prediction. That's why it counts as a dynamic branch and not as a plain jump. It stalls the speculation pipeline.
@tobi the stalls do exist as you noted, however there is a very important distinction: we are optimizing for RavenDB, not for a general-purpose memcmp routine, so you have to assess the behavior based on your typical workload.
The compare is usually used to decide in what direction we will be moving on the trees, therefore it has 2 important characteristics:
While this may change in the (near) future due to polyglot persistence, today we can exploit that, and that is why the optimized method is so counterintuitive (assembler-wise).
On the other hand, even Thomas' method (http://ayende.com/blog/169828/excerpts-from-the-ravendb-performance-team-report-optimizing-memory-comparisons-size-does-matter), which was the inspiration for a new iteration, cannot exploit the clusters near 0. I will probably write a follow-up about that too.
About the switch, in assembler it is pretty tight. While we use the assembler output for optimization purposes, I wasn't showing it because the series was long enough even without looking at the different assembler alternatives, but for reference this is the switch: http://imgur.com/j2xqlfs
@tobi One thing I didn't mention, but makes sense to now, is the following: as long as you don't touch memory (L1, L2, etc.) you have a lot of register instructions to burn. Not as many as on GPUs, but the budget is not negligible and you can exploit that.
For example, while the switch itself may be composed of 8 instructions plus a potential branch misprediction, note that all of them are register-based. That means that for each L1-hit memory read, you still have 4 cycles to burn (non-math register ops typically are in the 1-cycle cost range).
Both memcmp and memcpy are memory read/write intensive; those reads will hide many branch mispredictions and/or extra operations performed on registers. Finding ways to exploit that imbalance is the trick we used here.
Needless to say, I hate not having access to an intrinsic to do prefetching, enough that I actually suggested it as a future enhancement of the CLR. https://github.com/dotnet/roslyn/issues/166#issuecomment-72403213
It should have read: "That means that for each L1-miss memory read"