Excerpts from the RavenDB Performance team reportOptimizing Memory Compare/Copy Costs
Note, this post was written by Federico. Where I had notes or stuff to extend, I explicitly marked it as such.
TLDR: Optimizing at this level is really hard. To achieve gains of 20%+ for Compare and from 200% to 6% in Copy (depending on the workload) we will need to dig very deep at the IL level.
Another area we looked deeply into is, how do we move and access the memory. This type of optimization work is especially relevant if you are using Voron to handle big workloads. With small databases the improvements can be felt, but where they shine is when dealing with multi-gigabyte databases or high-throughput key-value retrieves and put workloads (did anyone thought Bulk-Inserts?).
Using FreeDB as in this previous post we build an experiment which we could use to pinpoint the pain points making sure we could also repro it every single time (no indexes, no extra call around). Under the microscope 2 pain points were evident when dealing with memory. Comparing and moving memory around.
We usually compare memory straight from the mmaped unmanaged memory when looking for a particular document in Voron Trees; and to copy from and to Voron pages when storing and retrieving documents. These are very core operations for any storage engine, Voron is not an special case. Before we started the optimization effort we already had a pretty optimized routine.
What this method does is:
- If the memory blocks have zero size, there is no doubt they are equal.
- If the memory blocks are bigger than the size of a word (32 bits) we do a pre-compare over the aligned memory blocks (for performance) in order to rule out all the equals.
- As we cannot use words to calculate the output (handling the Endianness would cost us), we do a byte by byte comparison for the final check.
For our insert workload we were roughly in the 97.5 nanoseconds per memory compare in average. To put in context, if each assembler instruction could be executed in exactly 1 cycle (which usually is not true) then 3 instruction is an entire nanosecond, therefore our average instruction budget is 291 instructions. Remember this idealized processor, we will use this same comparison later for more complex analysis.
Memory compares can be of different sizes that is why controlling the environment is very important for this type of optimization work.
To deal with that and we were using many tricks from the optimization book. From ensuring that memory alignment is optimal to batch compares with bigger primitive sizes to pointer arithmetic. At first sight this one is the kind of method you won't optimize at all, it is pretty damn tight.
Ayende’s node – We have already done a optimization step on memory comparisons. We initially just shelled out to the native memcmp method, but the cost of doing a pinvoke call ended up being noticable, and we wrote the previously optimized version (and had several rounds of that) to alleviate that cost.
However, we took to the challenge because the payoff can be huge. For a very small bulk insert of 50,000 documents inserted in an empty database, we are talking about in the ballpark of 5 million compares (yeah you read it right). Even if we manage to squeeze 1% off, the sheer volume of calls will make it worthwhile. To achieve that we had to do the unthinkable, we had to resort to dig into the MSIL that method was generating. Armed with ILSpy we found out we may have a way to shave off some inefficiencies.
Here is the what this look like when we start actually putting analysis to action. You can see the method code (after decompilation, so we can be closer to the IL) as well as the issues that were discovered in the process.
Because of the size of the method the fastest way was to resort to use a C# decompile, even though we then matched it with the generated IL. The trick to use the C# decompiled version requires that we use a decompiler that is not too smart when dealing with the code. If the decompiler would have understood what was the original code intention and acted upon it, we would have never spotted some of the optimizations at this level. For example, the last loop decompiled with JetBrains dotPeek would look like this:
Always keep around an old version of a decompiler just in case you may need it .
Ayende’s note: In most cases, you can select the level of details that a decompiler can give you. With Reflector, for example, you can select how deeply it will decompile things, but even so, doing stupid decompilation can be very beneficial by showing us what is actually going on.
Understanding where the inefficiencies may be, is one thing, being able to prove them is another matter. And we will tackle all of them in future posts.
We will also leave the memcpy analysis for later because it builds on the optimizations used in memcmp and also requires a deep analysis of the Buffer.Memcpy method already available in the .Net Framework (for internal use of course).
If what we did to the poor Etags was evil. You are now arriving at the gates of the underworld.
Ayende’s note: This is a pretty complex topic, and it goes on for quite a while. In order to maintain interest, and to avoid having people getting lost in the details, I broke it apart for several posts. In the meantime, given the details in this post, how would you suggest improving this?
More posts in "Excerpts from the RavenDB Performance team report" series:
- (20 Feb 2015) Optimizing Compare – The circle of life (a post-mortem)
- (18 Feb 2015) JSON & Structs in Voron
- (13 Feb 2015) Facets of information, Part II
- (12 Feb 2015) Facets of information, Part I
- (06 Feb 2015) Do you copy that?
- (05 Feb 2015) Optimizing Compare – Conclusions
- (04 Feb 2015) Comparing Branch Tables
- (03 Feb 2015) Optimizers, Assemble!
- (30 Jan 2015) Optimizing Compare, Don’t you shake that branch at me!
- (29 Jan 2015) Optimizing Memory Comparisons, size does matter
- (28 Jan 2015) Optimizing Memory Comparisons, Digging into the IL
- (27 Jan 2015) Optimizing Memory Comparisons
- (26 Jan 2015) Optimizing Memory Compare/Copy Costs
- (23 Jan 2015) Expensive headers, and cache effects
- (22 Jan 2015) The long tale of a lambda
- (21 Jan 2015) Dates take a lot of time
- (20 Jan 2015) Etags and evil code, part II
- (19 Jan 2015) Etags and evil code, Part I
- (16 Jan 2015) Voron vs. Esent
- (15 Jan 2015) Routing
The IL is not that interesting in my mind. It is emitted literally from the C# code in most cases.
The x86 disassembly is more interesting. IL concepts such as locals and stack slots are mostly meaningless since they do not exist in the compiled x86. There is no such thing as a "stack allocation" for an IL local. The local disappears in the SSA graph of the JIT. In SSA form every expression becomes a local.
Getting good code out of the poor .NET JIT often amounts to jiggling things around until the output happens to be the best achievable one.
Be aware that you will need to redo this perf work on RyuJIT. I tested it in detail. It has different strength and weaknesses than the old JIT.
Sorry, I know it is out of scope but I would like to advertise for memcmp64.asm in the Agner Fog library. See
It is using all available registers in the current processor and a duff device. Comparing 32,16,8,4,2,1 bytes whatever is left.
I had expected the c# compiler to do loop unrolling but your own benchmarks probably shows that you have to do all the optimazation by yourself.
When you choose to use unsafe maybe the core-application belongs to C++ or C anyway. The new C++ 14 seems interesting with a lot of STL's.
Carsten, Note that we are optimizing very specific things. For the most part, we are very happy with what we can do with the managed parts. Also, note that asmlib would probably enatails a pinvoke overhead that would kill any benefits we can achieve.
I know but I like the asm-library anyway :-)
You have to move all the core programming code to C++ or use pinvoke e.g. only when your are comparing a very big block which you expect to be equal like when comparing the databasereplicas after a crash.
How does the original c#-code look like?
I found a link about the "(intptr)1 / 1" https://github.com/icsharpcode/ILSpy/issues/271
Is there a 1/1 in the IL-code?
Carsten, You may like it, but it isn't relevant. And while C++ is nice, it impose a very big cost in development, we can do more, faster and more efficient in managed code, because a lot of the complexity isn't always in our face.
Sorry about the duplicate.
It seems that I have changed time-zone the first is 01/26/2015 02:26 PM by Carsten Hansen
The second is: 1/26/2015 4:26:06 PM +02:00 by Carsten Hansen
It is probably due to my old keyboard/mouse but I did not notice a double click.
The Captha is gone.
I found some links about P/Invoke overhead. The problem is not the call but marshalling using none native types. See https://msdn.microsoft.com/en-us/library/ms235282.aspx
Quote-begin PInvoke has an overhead of between 10 and 30 x86 instructions per call. In addition to this fixed cost, marshaling creates additional overhead. There is no marshaling cost between blittable types that have the same representation in managed and unmanaged code. For example, there is no cost to translate between int and Int32.
For better performance, have fewer PInvoke calls that marshal as much data as possible, instead of more calls that marshal less data per call. Quote-end
Or managed C++ interopt https://msdn.microsoft.com/en-us/library/ky8kkddw.aspx
Quote-begin In other words, C++ Interop uses the fastest possible method of data marshaling, whereas P/Invoke uses the most robust method. This means that C++ Interop (in a fashion typical for C++) provides optimal performance by default, and the programmer is responsible for addressing cases where this behavior is not safe or appropriate. Quote-end
See also: http://www.codeproject.com/Articles/253444/PInvoke-Performance
Quote-begin Certainly not what I would have expected. According to my research, I was expecting to see the C++/CLI interface be at least an order of magnitude faster - as it does less error checking than a P/Invoke call. Even in the case of a million function calls, the C++/CLI interface is barely faster than using P/Invoke.
As we would expect, the cost of calling any native function from a CLI application is very high if we are calling it many times - in the case of using a very talkative API, it may even be worth writing a second C++ API that takes an aggregated set of parameters and calls the functions many times - the managed to unmanaged boundary is expensive. Quote-end
@tobi while it is generally true that often optimizing code becomes a jiggling game. It is also truth with optimizing compilers. There are plenty examples of optimization tricks needed for compilers to be able to know they can safely unroll a loop and/or generate especially crafted code (less instructions, more efficient instructions, less data hazzards, etc). This one is just an example: http://www.danielvik.com/2010/02/fast-memcpy-in-c.html
And to avoid only talking about C/C++, that also happens in GPU land where given the fast processors it is even more relevant than here.
Don't forget, when working at the nanosecond level, even a single instruction that can be more efficient will have a huge impact. For instance, in the case of that constant that becomes a local variable in the stack, the difference is huge. I will talk about that in a later post, but the opcode for such is 2 x64 assembler instructions against a single instruction with an especially crafted opcode (using constant addressing).
The JIT compiler does not have all the time in the world, so he wont be able to know for sure and therefore cannot emit such crafted code if we don't hint him in that direction.
There are also other cases where 2 equivalent instructions will create 2 different sets of equivalent IL. On one the JIT does a great job and the in the other one a very subpar job. I will also talk about that.
Knowing what IL you are generating does matters. And in fact, you can hint the compiler in the right direction if you know what you are looking for at the x64 assembler level.
About RyuJIT, we do consider important to optimize later for it but until it is released, optimizing for a moving target is useless. In fact, for RyuJIT an special area we are looking to exploit is the ability to use the SSE intrinsec to do the work. That would trigger great gains for big buffers compares (more on that on a latter post too).
@Carsten While the assembler version will give you the most efficient routine, calling that from C# does have a huge impact that may never be overcome (even when dealing with a lot of work). That will be evident on the memcpy analysis.