Excerpts from the RavenDB Performance team report: Comparing Branch Tables

time to read 5 min | 875 words

Note: this post was written by Federico. In the previous post, after inspecting the decompiled source using ILSpy, we were able to uncover potential things we could do.

In the last post we found out that the housekeeping of our last loop is non-negligible, but we also know its execution is bounded by n. From our static analysis we know that n cannot be bigger than a word, that is, 8 bytes.

To build an efficient loop, what we can do is exploit our knowledge of how to “hint” the compiler into building a branch table, with customized code for each one of the cases.

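Here is a minimal sketch of the shape such a loop can take, assuming 8-byte words; the exact layout, and the way each small case gets its own customized code, differ in our implementation, so treat this as an illustration rather than the actual RavenDB code.

// An endless loop where the switch over the remaining length n compiles to a
// branch table; the 0..7 cases terminate the loop, so there is no explicit
// "n > 0" test on every iteration.
internal static unsafe class BranchTableCompare
{
    public static int Compare(byte* p1, byte* p2, int n)
    {
        while (true)
        {
            switch (n)
            {
                case 0:
                    return 0;               // blocks are equal
                case 1:
                    return *p1 - *p2;       // single byte left
                case 2:
                case 3:
                case 4:
                case 5:
                case 6:
                case 7:
                    // Tail smaller than a word; each case could get dedicated
                    // code, a simple byte loop keeps the sketch short.
                    for (int i = 0; i < n; i++)
                    {
                        int diff = p1[i] - p2[i];
                        if (diff != 0)
                            return diff;
                    }
                    return 0;
                default:
                    // A full 8-byte word is available: compare it in one shot
                    // and only look for the exact byte when the words differ.
                    if (*(ulong*)p1 != *(ulong*)p2)
                    {
                        for (int i = 0; i < 8; i++)
                        {
                            int diff = p1[i] - p2[i];
                            if (diff != 0)
                                return diff;
                        }
                    }
                    p1 += 8;
                    p2 += 8;
                    n -= 8;
                    break;
            }
        }
    }
}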

We ended up paying some housekeeping cost because the code gets unwieldy pretty fast. We know we can still go down that road if we need to.

The performance of this loop comes from trading comparisons for branch tables and getting rid of the comparison on n (notice that we have an endless loop here). But to squeeze out the last few nanoseconds, there is a very important tidbit that could easily go unnoticed.

Mileage will vary in real workloads, but let’s make a couple of assumptions just for the purpose of illustrating the optimization.

Let’s say that in half of the requests the memory blocks are equal, and in the other half they are not. That means that half of our requests will hit our worst case scenario; the good news is that we have optimized the hell out of it. But what about the other half? For simplicity, we will assume that if two memory blocks are different, there is a uniform probability that they differ at position n.

That means:

  • If we are comparing 3-byte memory blocks, there is a 1/3 chance that byte-1 is different and a 2/3 chance that it is not.
  • If byte-1 of the two memory blocks is equal, then there is a 1/3 chance that byte-2 will be different.
  • And so on.

For the mathematically inclined, this looks like a geometric distribution (http://en.wikipedia.org/wiki/Geometric_distribution).


That is, the probability distribution of the number X of Bernoulli trials needed to get a desired outcome, supported on the set { 1, 2, 3, ... }.

So let’s say that we need X comparisons in order to rule out equality of the memory blocks (our worst case scenario).

From Wikipedia, the probability mass function is:

Pr(X = k) = (1 − p)^(k−1) · p,  for k = 1, 2, 3, …

In our example, the probability p = 0.66. Why? Because that is the probability of “successes” until we “fail”. You will notice that success and failure are inverted in our case, because what matters to us is the failure. But mathematically speaking, using the other probability would be wrong; it is just a matter of interpretation.


Now the important tidbit here is: no matter how long the memory block is (or how low the p-value), the cumulative distribution function will always grow pretty fast. To us, that means that if the memory blocks are different, the chance of finding a difference within the first few bytes is pretty high. In the context of Voron, let’s say we are looking for a particular key out there in a tree. If the keys are spread apart, we are going to be able to rule them out pretty fast, until we are in the neighborhood of the key, where we will need to check more bytes to know for sure.
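To see how quickly that cumulative probability climbs, here is a quick back-of-the-envelope check (not from the original post); q is the assumed per-byte probability of a difference, 1/3 in the example above.

// Prints the geometric CDF 1 - (1 - q)^k: the probability that the first
// differing byte shows up within the first k comparisons.
using System;

class GeometricCdf
{
    static void Main()
    {
        foreach (var q in new[] { 1.0 / 3.0, 0.1, 0.05 })
        {
            Console.Write($"q = {q:0.000}:  ");
            for (int k = 1; k <= 8; k++)
                Console.Write($"k={k}: {1 - Math.Pow(1 - q, k):0.00}  ");
            Console.WriteLine();
        }
    }
}

Running it shows how the probability of finding the first difference grows with k for a few values of q.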

The code to achieve this optimization is quite simple; we already used it to optimize the last-loop.

So we just copy the switch block into the entrance of the method. If the size of the memory block is 0 to 3, we will have a fast return. In the case of bigger memory blocks, we will consume the first 4 bytes (32 bits) before starting to test with words.
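A sketch of what that entrance might look like, again illustrative rather than the actual RavenDB code; it hands off to the branch-table loop sketched earlier for the rest of the block.

// Sizes 0 to 3 fast-return through the branch table; bigger blocks consume
// the first 4 bytes (32 bits) before starting to test with words.
internal static unsafe class EntranceCompare
{
    public static int Compare(byte* p1, byte* p2, int n)
    {
        switch (n)
        {
            case 0:
                return 0;
            case 1:
                return *p1 - *p2;
            case 2:
            case 3:
                for (int i = 0; i < n; i++)
                {
                    int diff = p1[i] - p2[i];
                    if (diff != 0)
                        return diff;
                }
                return 0;
            default:
                // Differences are most likely to show up early, so check the
                // first 32 bits before handing off to the word-sized loop.
                if (*(uint*)p1 != *(uint*)p2)
                {
                    for (int i = 0; i < 4; i++)
                    {
                        int diff = p1[i] - p2[i];
                        if (diff != 0)
                            return diff;
                    }
                }
                // Continue with the branch-table loop sketched earlier.
                return BranchTableCompare.Compare(p1 + 4, p2 + 4, n - 4);
        }
    }
}

Under the per-byte model above (q = 1/3), roughly 80% of the unequal comparisons (1 − (2/3)^4 ≈ 0.80) are resolved before the word loop even starts.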

If I were to modify the original implementation and just add this optimization, what would the speed-up be under this theoretical workload? (I will publish that in the comments section in a couple of days, so make your educated guess.)

Ayende’s note: That said, there is actually another thing to consider here. It is very common for us to have keys with shared prefixes. For example, “users/1”, “users/2”, etc. In fact, we are trying hard to optimize for that route because that gives us a better structure for the internal B+Tree that we use in Voron.

More posts in "Excerpts from the RavenDB Performance team report" series:

  1. (20 Feb 2015) Optimizing Compare – The circle of life (a post-mortem)
  2. (18 Feb 2015) JSON & Structs in Voron
  3. (13 Feb 2015) Facets of information, Part II
  4. (12 Feb 2015) Facets of information, Part I
  5. (06 Feb 2015) Do you copy that?
  6. (05 Feb 2015) Optimizing Compare – Conclusions
  7. (04 Feb 2015) Comparing Branch Tables
  8. (03 Feb 2015) Optimizers, Assemble!
  9. (30 Jan 2015) Optimizing Compare, Don’t you shake that branch at me!
  10. (29 Jan 2015) Optimizing Memory Comparisons, size does matter
  11. (28 Jan 2015) Optimizing Memory Comparisons, Digging into the IL
  12. (27 Jan 2015) Optimizing Memory Comparisons
  13. (26 Jan 2015) Optimizing Memory Compare/Copy Costs
  14. (23 Jan 2015) Expensive headers, and cache effects
  15. (22 Jan 2015) The long tale of a lambda
  16. (21 Jan 2015) Dates take a lot of time
  17. (20 Jan 2015) Etags and evil code, part II
  18. (19 Jan 2015) Etags and evil code, Part I
  19. (16 Jan 2015) Voron vs. Esent
  20. (15 Jan 2015) Routing