Excerpts from the RavenDB Performance team report: Comparing Branch Tables

time to read 5 min | 875 words

Note: this post was written by Federico. In the previous post, after inspecting the decompiled source using ILSpy, we were able to uncover potential things we could do.

In the last post we found out that the housekeeping of our last loop is non-negligible, but we also know its execution is bounded by n. From our static analysis we know that n cannot be bigger than a word, that is, 8 bytes.

To build an efficient loop, what we can do is exploit our knowledge of how to “hint” the compiler into building a branch table, with customized code for each one of the cases.

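Here is a minimal sketch of the shape such a loop can take, assuming 8-byte words; the exact layout, and the way each small case gets its own customized code, differ in our implementation, so treat this as an illustration rather than the actual RavenDB code.

// An endless loop where the switch over the remaining length n compiles to a
// branch table; the 0..7 cases terminate the loop, so there is no explicit
// "n > 0" test on every iteration.
internal static unsafe class BranchTableCompare
{
    public static int Compare(byte* p1, byte* p2, int n)
    {
        while (true)
        {
            switch (n)
            {
                case 0:
                    return 0;               // blocks are equal
                case 1:
                    return *p1 - *p2;       // single byte left
                case 2:
                case 3:
                case 4:
                case 5:
                case 6:
                case 7:
                    // Tail smaller than a word; each case could get dedicated
                    // code, a simple byte loop keeps the sketch short.
                    for (int i = 0; i < n; i++)
                    {
                        int diff = p1[i] - p2[i];
                        if (diff != 0)
                            return diff;
                    }
                    return 0;
                default:
                    // A full 8-byte word is available: compare it in one shot
                    // and only look for the exact byte when the words differ.
                    if (*(ulong*)p1 != *(ulong*)p2)
                    {
                        for (int i = 0; i < 8; i++)
                        {
                            int diff = p1[i] - p2[i];
                            if (diff != 0)
                                return diff;
                        }
                    }
                    p1 += 8;
                    p2 += 8;
                    n -= 8;
                    break;
            }
        }
    }
}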

We ended up paying some housekeeping cost because the code gets unwieldy pretty fast. We know we can still go down that road if we need to.

The performance of this loop comes from trading comparisons for branch tables and getting rid of the comparison on n (notice that we have an endless loop here). But to squeeze out the last few nanoseconds, there is a very important tidbit that could easily go unnoticed.

Mileage will vary in real workloads, but let’s make a couple of assumptions just for the purpose of illustrating the optimization.

Let’s say that in half of the requests the memory blocks are equal, and in the other half they are not. That means that half of our requests will hit our worst case scenario; the good news is that we have optimized the hell out of it. But what about the other half? For simplicity, we will assume that if two memory blocks are different, there is a uniform probability that they differ at position n.

That means:

  • If we are comparing 3-byte memory blocks, there is a 1/3 chance that byte-1 is different and a 2/3 chance that it is not.
  • If byte-1 of the two memory blocks is equal, then there is a 1/3 chance that byte-2 will be different.
  • And so on.

For the mathematically inclined, this looks like a geometric distribution (http://en.wikipedia.org/wiki/Geometric_distribution).


That is, the probability distribution of the number X of Bernoulli trials needed to get a desired outcome, supported on the set { 1, 2, 3, ... }.

So let’s say that we need X comparisons in order to rule out equality of the memory blocks (our worst case scenario).

From Wikipedia, the probability mass function is:

Pr(X = k) = (1 − p)^(k−1) · p,  for k = 1, 2, 3, …

In our example, the probability p = 0.66. Why? Because that is the probability of “successes” until we “fail”. You will notice that success and failure are inverted in our case, because what matters to us is the failure. But mathematically speaking, using the other probability would be wrong; it is just a matter of interpretation.


Now the important tidbit here is: no matter how long the memory block is (or how low the p-value), the cumulative distribution function will always grow pretty fast. To us, that means that if the memory blocks are different, the chance of finding a difference within the first few bytes is pretty high. In the context of Voron, let’s say we are looking for a particular key out there in a tree. If the keys are spread apart, we are going to be able to rule them out pretty fast, until we are in the neighborhood of the key, where we will need to check more bytes to know for sure.
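To see how quickly that cumulative probability climbs, here is a quick back-of-the-envelope check (not from the original post); q is the assumed per-byte probability of a difference, 1/3 in the example above.

// Prints the geometric CDF 1 - (1 - q)^k: the probability that the first
// differing byte shows up within the first k comparisons.
using System;

class GeometricCdf
{
    static void Main()
    {
        foreach (var q in new[] { 1.0 / 3.0, 0.1, 0.05 })
        {
            Console.Write($"q = {q:0.000}:  ");
            for (int k = 1; k <= 8; k++)
                Console.Write($"k={k}: {1 - Math.Pow(1 - q, k):0.00}  ");
            Console.WriteLine();
        }
    }
}

Running it shows how the probability of finding the first difference grows with k for a few values of q.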

The code to achieve this optimization is quite simple; we already used it to optimize the last-loop.

So we just copy the switch block into the entrance of the method. If the size of the memory block is 0 to 3, we will have a fast return. In the case of bigger memory blocks, we will consume the first 4 bytes (32 bits) before starting to test with words.
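A sketch of what that entrance might look like, again illustrative rather than the actual RavenDB code; it hands off to the branch-table loop sketched earlier for the rest of the block.

// Sizes 0 to 3 fast-return through the branch table; bigger blocks consume
// the first 4 bytes (32 bits) before starting to test with words.
internal static unsafe class EntranceCompare
{
    public static int Compare(byte* p1, byte* p2, int n)
    {
        switch (n)
        {
            case 0:
                return 0;
            case 1:
                return *p1 - *p2;
            case 2:
            case 3:
                for (int i = 0; i < n; i++)
                {
                    int diff = p1[i] - p2[i];
                    if (diff != 0)
                        return diff;
                }
                return 0;
            default:
                // Differences are most likely to show up early, so check the
                // first 32 bits before handing off to the word-sized loop.
                if (*(uint*)p1 != *(uint*)p2)
                {
                    for (int i = 0; i < 4; i++)
                    {
                        int diff = p1[i] - p2[i];
                        if (diff != 0)
                            return diff;
                    }
                }
                // Continue with the branch-table loop sketched earlier.
                return BranchTableCompare.Compare(p1 + 4, p2 + 4, n - 4);
        }
    }
}

Under the per-byte model above (q = 1/3), roughly 80% of the unequal comparisons (1 − (2/3)^4 ≈ 0.80) are resolved before the word loop even starts.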

If I were to modify the original implementation and just add this optimization, what would the speed-up be under this theoretical workload? (I will publish that in the comments section in a couple of days, so make your educated guess.)

Ayende’s note: That said, there is actually another thing to consider here. It is very common for us to have keys with shared prefixes. For example, “users/1”, “users/2”, etc. In fact, we are trying hard to optimize for that route because that gives us a better structure for the internal B+Tree that we use in Voron.

More posts in "Excerpts from the RavenDB Performance team report" series:

  1. (20 Feb 2015) Optimizing Compare – The circle of life (a post-mortem)
  2. (18 Feb 2015) JSON & Structs in Voron
  3. (13 Feb 2015) Facets of information, Part II
  4. (12 Feb 2015) Facets of information, Part I
  5. (06 Feb 2015) Do you copy that?
  6. (05 Feb 2015) Optimizing Compare – Conclusions
  7. (04 Feb 2015) Comparing Branch Tables
  8. (03 Feb 2015) Optimizers, Assemble!
  9. (30 Jan 2015) Optimizing Compare, Don’t you shake that branch at me!
  10. (29 Jan 2015) Optimizing Memory Comparisons, size does matter
  11. (28 Jan 2015) Optimizing Memory Comparisons, Digging into the IL
  12. (27 Jan 2015) Optimizing Memory Comparisons
  13. (26 Jan 2015) Optimizing Memory Compare/Copy Costs
  14. (23 Jan 2015) Expensive headers, and cache effects
  15. (22 Jan 2015) The long tale of a lambda
  16. (21 Jan 2015) Dates take a lot of time
  17. (20 Jan 2015) Etags and evil code, part II
  18. (19 Jan 2015) Etags and evil code, Part I
  19. (16 Jan 2015) Voron vs. Esent
  20. (15 Jan 2015) Routing