GoTo based optimizations
One of the things that we did recently was go over our internal data structures in RavenDB and see if we can optimize them. Some of those changes are pretty strange if you aren’t following what is actually going on. Here is an example:
[Before / After code comparison]
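The original side-by-side snippets are not reproduced here, but judging from the `ThrowForEmptyStack` one-liner quoted in the comments below (which uses `_array` and `_size`), the shape of the change is roughly the following. This is a sketch under those assumptions, not the actual RavenDB code; the member names are taken from that quoted snippet.

```csharp
using System;

// Sketch of the before/after transformation discussed in the post.
// The member names (_array, _size, ThrowForEmptyStack) come from the
// snippet quoted in the comments; the real RavenDB code is more complex.
public class MiniStack<T>
{
    private T[] _array = new T[16];
    private int _size;

    public void Push(T item) => _array[_size++] = item;

    // Before: the throw sits inline, so the JIT emits the exception
    // path in the middle of the method body.
    public T PeekBefore()
    {
        if (_size == 0)
            throw new InvalidOperationException("Stack is empty");
        return _array[_size - 1];
    }

    // After: the cold path is pushed behind a goto into a throw helper
    // that *returns* T, so the call site needs no cleanup code and the
    // hot path falls straight through with no taken branch.
    public T PeekAfter()
    {
        if (_size == 0)
            goto Error;
        return _array[_size - 1];

    Error:
        return ThrowForEmptyStack();
    }

    private T ThrowForEmptyStack()
    {
        throw new InvalidOperationException("Stack is empty");
    }
}
```

The trick is that `ThrowForEmptyStack` is declared to return `T` even though it always throws; that lets the cold block end in a plain `return`, which is what keeps the caller's epilogue so small.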
What is the point in this kind of change? Well, let us look at the actual assembly generated by this, shall we?
[Before / After assembly comparison]
As you can see, the second option is much shorter, and in the common case it involves no actual jumping. This ends up being extremely efficient. Note that because we return a value from ThrowForEmptyStack, the generated assembly is extremely short, since we can rely on the caller to clean up after us.
This was run in release mode, CoreCLR, x64. I got the assembly from the debugger, so it is possible that there are some optimizations that haven't been applied because the debugger is attached, but it is fairly close to what should happen for real, I think. Note that ThrowForEmptyStack is inlined, even though it is an exception-only method. Using [MethodImpl(MethodImplOptions.NoInlining)] will prevent that, but the goto version will still generate better code.
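For reference, the attribute mentioned above goes directly on the throw helper; this is the standard `System.Runtime.CompilerServices` API. The helper shown here is a generic illustration, not the actual RavenDB method:

```csharp
using System;
using System.Runtime.CompilerServices;

public static class ThrowHelper
{
    // NoInlining forces the JIT to keep this as a real call, so the
    // exception-construction code is never copied into the caller's body.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static T ThrowForEmptyStack<T>()
    {
        throw new InvalidOperationException("Stack is empty");
    }
}
```

Because the helper is declared to return `T`, a call site can simply write `return ThrowHelper.ThrowForEmptyStack<T>();`, keeping the cold path down to a single call instruction.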
The end result is that we are running much less code, and that makes me happy. In general, a good guide for assembly reading is that shorter == faster, and if you are reading assembly, you are very likely in optimization mode, or debugging the compiler.
I’m pretty sure that the 2.0 release of CoreCLR has already fixed this kind of issue, by the way, and it should allow us to write more idiomatic code that generates very tight machine code.
Comments
In your after.asm file, the `je` instruction jumps to the address 00007FFB4DC92015, which is not present in your file. Did you try to disassemble from that address to the next ret?
Would you get similar short hot-path assembly with the following:

I think this would be cleaner to read than `goto` where your if statement is above and your main path is so short. I do favor goto when you have many exit paths and common cleanup for all of them, though.

Also, I think Guillaume is correct: the code for `ThrowForEmptyStack()` in the after.asm is at address 00007FFB4DC92015, which is not included in your disassembly. However, the critical path is short, which is the important thing, so I don't think it's super relevant what happens after that address. It does mean that the claim that "because we return a value from the ThrowForEmptyStack, the assembly generated is extremely short" is incomplete, since you are not examining the code for the return from that point.

@Guillaume the code must live somewhere, it doesn't disappear. The point is that the CPU decoder won't have to decode it (which is the actual trick). This is exactly what the CPU sees and works on; the extra 88 bytes are never seen by the CPU, and even if it can eventually predict the jump, the decoder has no way to know there is a jump until the instruction is decoded.
Also, if it is a small method it can be brought in with a single L1 cache miss (instead of multiple ones). We are talking about sub-20-nanosecond operations here; the code layout is probably 30% to 40% of the total execution cost (if not more). Think about it: an L1 cache line is 64 bytes. Now you have polluted 2 cache lines with useless code, taking 4 cache lines instead of half a cache line (essentially 1).
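The cache-line arithmetic in the comment above can be sanity-checked with simple ceiling division, assuming the 64-byte L1 line size it states; the byte counts below are illustrative, not measured from the post's assembly:

```csharp
using System;

// Footprint of a method body in 64-byte L1 cache lines: ceil(bytes / 64).
static int CacheLines(int bytes) => (bytes + 63) / 64;

Console.WriteLine(CacheLines(30));       // a ~30-byte hot path fits in 1 line
Console.WriteLine(CacheLines(30 + 88));  // plus 88 bytes of cold code: 2 lines
```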
@Stuart, you are right, but this is a single tree we planted in our backyard; the real code is far more complex than the demonstration we cooked up to show how the optimization operates. Think about the impact on code readability when instead of 1 condition (the precondition) you now have 3. Or what if those operations happen inside a loop? You still have to stop to throw, and avoiding polluting your loop is essential at this level.
This is an actual example of real database code: https://github.com/Corvalius/ravendb/commit/f502c192c49d3939b8fb9ca1a21553ac4fbc941b#diff-448d65505ebe234daaa8954fdca2bf1aR283
Now you will also see that we even put the `return true` and `return false` at the extremes here. This code is so hot that with just that change we were able to shave 1% (1200ms) off processing 1GB of data. That is an entire second.

@Stuart also something I forgot to mention. We are fairly positive that this is solved in CoreCLR 2.0; the JIT now has the ability to mark code as cold, and will do so when you only throw in your method. I know because I reported the issue myself. Changing the layout rather than the logic means that when CoreCLR 2.0 is released and our code is migrated to run there, we can just grep for Error* and put the calls back into their original positions. The revert is then pretty easy and mostly error free, which would not be the case if we had changed the logic.
The return case will still be necessary though.
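The pattern described in the linked commit - precondition checks inside a hot loop, with the throw and the returns pushed to the extremes of the method - looks roughly like this. This is a sketch under those assumptions; the method and names are hypothetical, not the RavenDB code:

```csharp
using System;

public static class LoopSketch
{
    // Sketch: the loop body keeps only the compare-and-branch; the cold
    // throw lives behind a goto at the very end of the method, so the
    // loop's instruction stream stays compact in the cache.
    public static long SumNonNegative(int[] values)
    {
        long sum = 0;
        for (int i = 0; i < values.Length; i++)
        {
            if (values[i] < 0)
                goto Error;
            sum += values[i];
        }
        return sum;

    Error:
        return ThrowNegative();
    }

    // Returns long so the cold block above can end in a plain return.
    private static long ThrowNegative()
    {
        throw new ArgumentOutOfRangeException("values",
            "Negative value in input");
    }
}
```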
@Federico - wow! I've certainly seen cases where a common exit point improves performance and/or readability, especially with a complex return. I would never have expected that a common exit for `return true;` would affect performance. And I completely agree: when you have multiple pre-conditions and common cleanup, `goto` is a very effective way of improving readability. I was simply commenting on this example, which I did not realize was so heavily simplified.

This is why GCC has the `__builtin_expect` macro. Unfortunately, that doesn't exist for C#.

@dhasenan I have been pushing for that (among other JIT improvements) pretty hard, to the point that I believe we will see it sooner rather than later. :)
Isn't this some sort of tail call optimization?

Onur, no, this isn't a tail call. It just avoids the need for the CPU to go over that code.
This particular case might even be written as a one-liner: `return _size > 0 ? _array[_size - 1] : ThrowForEmptyStack();`. Though I'm not sure whether the asm would be any better.

Hi, can you please elaborate on how you view this assembly? I've created a console application and put your code there, then I looked at the exe file via ildasm, and I believe that's not what you mean, because it only shows the intermediate language... So, where is the assembly file created?
hagai luger, run it in the VS debugger, then go to Debug > Windows > Disassembly.