Filtering negative numbers, fast: AVX
In the previous post, I discussed how we can optimize filtering negative numbers by unrolling the loop, looked into branchless code, and in general was able to improve performance by up to 15% over the initial version we started with. We pushed scalar code about as far as it can go. Now it is time to open a whole new world and see what we can do when we implement this challenge using vector instructions.
The key problem with such tasks is that SIMD, AVX and their friends were designed by… an interesting process using a perspective that makes sense if you can see in a couple of additional dimensions. I assume that at least some of that is implementation constraints, but the key issue is that when you start using SIMD, you realize that you don’t have general-purpose instructions. Instead, you have a lot of dedicated instructions that are doing one thing, hopefully well, and it is your role to compose them into something that would make sense. Oftentimes, you need to turn the solution on its head in order to successfully solve it using SIMD. The benefit, of course, is that you can get quite an amazing boost in speed when you do this.
The algorithm we use is basically to scan the list of entries and copy to the start of the list only those items that are positive. How can we do that using SIMD? The whole point here is that we want to be able to operate on multiple data, but this particular task isn’t trivial. I’m going to show the code first, then discuss what it does in detail:
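(A minimal sketch of the shape of that code; the exact structure, and names such as PermuteTable, are assumptions based on the description that follows, not the exact implementation.)

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static int FilterCmp_Avx(Span<long> items)
{
    // PermuteTable is assumed to be the 16-entry table of int32 lane indices
    // discussed below, exposed as a ReadOnlySpan<byte> so the JIT can embed it
    ref int permuteTable = ref MemoryMarshal.GetReference(
        MemoryMarshal.Cast<byte, int>(PermuteTable));

    int outputIdx = 0, i = 0;
    for (; i + Vector256<long>.Count <= items.Length; i += Vector256<long>.Count)
    {
        var v = Vector256.LoadUnsafe(ref items[i]);
        // one bit per element, set when that int64 is negative
        uint mask = v.ExtractMostSignificantBits();
        if (mask == 0)
        {
            // nothing is negative, write the whole vector as-is
            v.StoreUnsafe(ref items[outputIdx]);
            outputIdx += Vector256<long>.Count;
            continue;
        }
        // pack the non-negative elements to the front of the vector
        var control = Vector256.LoadUnsafe(ref Unsafe.Add(ref permuteTable, (int)(mask * 8)));
        var packed = Avx2.PermuteVar8x32(v.AsInt32(), control).AsInt64();
        packed.StoreUnsafe(ref items[outputIdx]);
        // advance only by the number of elements we actually kept
        outputIdx += Vector256<long>.Count - BitOperations.PopCount(mask);
    }
    // scalar handling for the remainder that doesn't fill a whole vector
    for (; i < items.Length; i++)
    {
        if (items[i] >= 0)
            items[outputIdx++] = items[i];
    }
    return outputIdx;
}
```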
We start with the usual check (if you’ll recall, that ensures that the JIT knows it can elide some range checks), then we load the PermuteTable. For now, just assume that this is magic (and it is). The first interesting thing happens when we start iterating over the loop. Unlike before, we now do that in chunks of 4 int64 elements at a time. Inside the loop, we start by loading a vector of int64 and then we do the first odd thing. We call ExtractMostSignificantBits(), since the sign bit marks whether a number is negative or not. That means that I can use a single instruction to get an integer with a bit set for each of the negative numbers. That is particularly juicy for what we need, since there is no need for comparisons, etc.
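For example (illustrative values):

```csharp
var v = Vector256.Create(1L, -2L, 3L, -4L);
// bit i of the mask is the sign bit of element i, so here bits 1 and 3 are set
uint mask = v.ExtractMostSignificantBits(); // 0b1010
```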
If the mask we got is all zeroes, it means that all the numbers we loaded into the vector are non-negative, so we can write them as-is to the output and move on to the next chunk. Things get interesting when that isn’t the case.
We load a permute value using some shenanigans (we’ll touch on that shortly) and call the PermuteVar8x32() method. The idea here is that we pack all the non-negative numbers to the start of the vector, then we write the vector to the output. The key here is that when we do that, we increment the output index only by the number of valid values. The rest of this method just handles the remainder that does not fit into a vector.
The hard part in this implementation was figuring out how to handle the scenario where we loaded some negative numbers. We need a way to filter them out, after all, but there is no SIMD instruction that does exactly that. Luckily, we have the Avx2.PermuteVar8x32() method to help here. To confuse things, we don’t actually want to deal with 8x32 values, we want to deal with 4x64 values. There is an Avx2.Permute4x64() method, and it would work quite nicely, with a single caveat: it expects you to pass it a constant control value. We don’t have such a constant; we need to be able to provide it based on whatever the mask bits give us.
So how do we deal with this issue of filtering with SIMD? We need to move all the values we care about to the front of the vector. We have a method to do that, PermuteVar8x32(), and we just need to figure out how to actually make use of it. PermuteVar8x32() accepts an input vector as well as a vector describing the permutation you want to apply. In this case, we base that on the 4 sign bits of the 4-element vector of int64, so there are a total of 16 options available to us. We have to deal with 32-bit values rather than 64-bit ones, but that isn’t much of a problem.
Here is the permutation table that we’ll be using:
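(Reconstructed from the description below; each row is one entry of eight 32-bit lane indices for PermuteVar8x32, and the comment shows the sign-bit mask that entry corresponds to.)

```csharp
private static readonly int[] PermuteTableInts =
{
    0, 1, 2, 3, 4, 5, 6, 7, // 0000 - nothing is negative, keep everything
    2, 3, 4, 5, 6, 7, 0, 0, // 0001 - the first element is negative
    0, 1, 4, 5, 6, 7, 0, 0, // 0010 - the second element is negative
    4, 5, 6, 7, 0, 0, 0, 0, // 0011 - the first and second are negative
    0, 1, 2, 3, 6, 7, 0, 0, // 0100 - the third element is negative
    2, 3, 6, 7, 0, 0, 0, 0, // 0101 - the first and third are negative
    // ... and so on, up to 1111 (all four elements are negative)
};
```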
What you can see here is that when a bit is set to 1 (shown in the comments), we will not copy that element to the output vector. Let’s take a look at the entry for 0101, which may be caused by the following values: [-1, 2, -3, 4].
When we look at the entry at index #5 in the table, we get: 2, 3, 6, 7, 0, 0, 0, 0.
What does this mean? It means that we want to take the 2nd int64 element in the source vector and move it to be the first element of the destination vector, take the 4th element from the source as the second element in the destination, and discard the rest (marked as 0,0,0,0 in the table).
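To make that concrete, here is roughly what applying entry #5 does (illustrative values):

```csharp
var source = Vector256.Create(-1L, 2L, -3L, 4L);        // sign mask is 0101 = 5
var control = Vector256.Create(2, 3, 6, 7, 0, 0, 0, 0); // entry #5 from the table
var packed = Avx2.PermuteVar8x32(source.AsInt32(), control).AsInt64();
// packed is now [2, 4, -1, -1]; only the first two elements matter, since the
// output index advances by just two
```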
This is a bit hard to follow because we have to compose the permutation out of the individual 32-bit lanes, but it works quite well. Or, at least, it would work, but not as efficiently as it could. This is because we would need to load the PermuteTableInts into a variable and access it, and there is a better way to deal with it: we can ask the JIT to embed the data directly. The problem is that the pattern the JIT recognizes is limited to ReadOnlySpan<byte>, which means that the already non-trivial int32 table gets turned into this:
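(A truncated sketch of the pattern; the full table is 512 bytes, with each 32-bit index spelled out as four little-endian bytes.)

```csharp
public static ReadOnlySpan<byte> PermuteTable => new byte[]
{
    // 0000 -> lanes 0,1,2,3,4,5,6,7 (keep everything)
    0,0,0,0, 1,0,0,0, 2,0,0,0, 3,0,0,0, 4,0,0,0, 5,0,0,0, 6,0,0,0, 7,0,0,0,
    // 0001 -> lanes 2,3,4,5,6,7,0,0 (drop the first int64)
    2,0,0,0, 3,0,0,0, 4,0,0,0, 5,0,0,0, 6,0,0,0, 7,0,0,0, 0,0,0,0, 0,0,0,0,
    // ... the remaining 14 entries follow the same pattern ...
};
```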
This is the exact same data as before, but using ReadOnlySpan<byte> means that the JIT can package that inside the data section and treat it as a constant value.
The code was heavily optimized, to the point where I noticed a JIT bug where these two versions of the code give different assembly output:
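In essence, the two versions differ only in the parentheses around the constant part of the table offset (sketched here; the variable name is illustrative):

```csharp
// version 1: evaluated left to right as (mask * 4) * 8, so nothing gets folded
var offsetA = mask * 4 * 8;
// version 2: (4 * 8) is a constant sub-expression, so it folds down to mask * 32
var offsetB = mask * (4 * 8);
```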
Here is what we get out:
This looks like an unintended consequence of Roslyn and the JIT each doing their (separate) jobs, but together not reaching the end goal. Constant folding appears to be done mostly by Roslyn, and it scans the expression from the left, so it won’t convert $A * 4 * 8 to $A * 32; it stops folding constants as soon as it finds a variable. When we add parentheses, we isolate the constant expression, and it can now be folded.
Speaking of assembly, here is the annotated assembly version of the code:
And after all of this work, where are we standing?
Method | N | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
---|---|---|---|---|---|---|---|
FilterCmp | 23 | 285.7 ns | 3.84 ns | 3.59 ns | 1.00 | 0.00 | 411 B |
FilterCmp_NoRangeCheck | 23 | 272.6 ns | 3.98 ns | 3.53 ns | 0.95 | 0.01 | 397 B |
FilterCmp_Unroll_8 | 23 | 261.4 ns | 1.27 ns | 1.18 ns | 0.91 | 0.01 | 672 B |
FilterCmp_Avx | 23 | 261.6 ns | 1.37 ns | 1.28 ns | 0.92 | 0.01 | 521 B |
FilterCmp | 1047 | 758.7 ns | 1.51 ns | 1.42 ns | 1.00 | 0.00 | 411 B |
FilterCmp_NoRangeCheck | 1047 | 756.8 ns | 1.83 ns | 1.53 ns | 1.00 | 0.00 | 397 B |
FilterCmp_Unroll_8 | 1047 | 640.4 ns | 1.94 ns | 1.82 ns | 0.84 | 0.00 | 672 B |
FilterCmp_Avx | 1047 | 426.0 ns | 1.62 ns | 1.52 ns | 0.56 | 0.00 | 521 B |
FilterCmp | 1048599 | 502,681.4 ns | 3,732.37 ns | 3,491.26 ns | 1.00 | 0.00 | 411 B |
FilterCmp_NoRangeCheck | 1048599 | 499,472.7 ns | 6,082.44 ns | 5,689.52 ns | 0.99 | 0.01 | 397 B |
FilterCmp_Unroll_8 | 1048599 | 425,800.3 ns | 352.45 ns | 312.44 ns | 0.85 | 0.01 | 672 B |
FilterCmp_Avx | 1048599 | 218,075.1 ns | 212.40 ns | 188.29 ns | 0.43 | 0.00 | 521 B |
FilterCmp | 33554455 | 29,820,978.8 ns | 73,461.68 ns | 61,343.83 ns | 1.00 | 0.00 | 411 B |
FilterCmp_NoRangeCheck | 33554455 | 29,471,229.2 ns | 73,805.56 ns | 69,037.77 ns | 0.99 | 0.00 | 397 B |
FilterCmp_Unroll_8 | 33554455 | 29,234,413.8 ns | 67,597.45 ns | 63,230.70 ns | 0.98 | 0.00 | 672 B |
FilterCmp_Avx | 33554455 | 28,498,115.4 ns | 71,661.94 ns | 67,032.62 ns | 0.96 | 0.00 | 521 B |
So it seems that the idea of using SIMD instructions has a lot of merit. Moving from the original code to the final version, we can complete the same task in less than half the time (a 0.43 ratio for the 1M entries case).
I’m not quite sure why we aren’t seeing the same sort of gain on the 32M case, but I suspect that this is because we far exceed the CPU cache and have to fetch everything from main memory, so that is as fast as it can go.
If you are interested in learning more, Lemire solves the same problem in SVE (SIMD for ARM) and Paul has a similar approach in Rust.
If you can think of further optimizations, I would love to hear your ideas.
More posts in "Filtering negative numbers, fast" series:
- (15 Sep 2023) Beating memcpy()
- (13 Sep 2023) AVX
- (12 Sep 2023) Unroll
- (11 Sep 2023) Scalar
Comments
I'm not so sure the IL difference is a JIT bug. In the parenthesized case, the constant multiplication occurs prior to the variable multiplication, so it can be folded. In the non-parenthesized case, the variable multiplication occurs first, so you could have an overflow after i * 4 and the multiplication by 8 would never occur. There is also the scenario where i * 4 does not overflow, but multiplying the result by 8 does, which would result in an observable difference in behavior. Since you are not in a checked context (as far as I can tell), this doesn't hold as much water, but it makes me wonder if it is the intended behavior.
Nice trick with the 32-bit shuffle in pairs vs the 64-bit shuffle. I wish I'd thought of that myself.
I've played a bit to replace `Avx2.PermuteVar8x32` with `Avx2.Permute4x64` and it was not that difficult. The control byte is the index array with four 2-bit indexes, with the least significant 2 bits corresponding to the first `long` in the Vector256, and the most significant 2 bits to the last `long`. Unfortunately it seems to be a bit slower in some benchmarks, but perhaps somewhat more maintainable. The PermTable preserves all elements, so it can also be used for sorting.
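A rough sketch of that layout (example values, not the actual table entries):

```csharp
// Four 2-bit source indexes packed into one byte; the lowest two bits pick the
// source of the first long in the result. For mask 0101 (first and third negative),
// move elements 1 and 3 to the front and push the dropped ones to the back:
const byte control = 1 | (3 << 2) | (0 << 4) | (2 << 6); // lanes 1, 3, 0, 2
// var packed = Avx2.Permute4x64(source, control);
```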
These are my results:
Chris,
The .NET team confirmed that it is Roslyn that does the constant folding.
But I still consider that a bug, given that there is no observable difference in any scenario.
As for a checked context, I don't think it would matter; you only need to check the end result, not the intermediates. But I'm not running checked here.
Catalin,
What you aren't seeing here is that `Avx2.Permute4x64` is not actually something that you can use in this context. Check this out: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/Avx2.cs#L2172

```csharp
Permute4x64(Vector256<long> value, [ConstantExpected] byte control)
```

The issue is that this is not a single instruction. Rather, this will go to a method, since you aren't passing a constant, you are passing a variable. That is why I didn't use this method.
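To illustrate the difference (a sketch; GetControlByte stands in for whatever lookup would produce the control value at runtime):

```csharp
var v = Vector256.Create(1L, -2L, 3L, -4L);

// with a literal, the JIT can encode the control as the immediate of a single vpermq
var a = Avx2.Permute4x64(v, 0b10_00_11_01);

// with a runtime value there is no immediate to encode, so this goes to a method
// instead of a single instruction (GetControlByte is a hypothetical lookup)
byte control = GetControlByte(v.ExtractMostSignificantBits());
var b = Avx2.Permute4x64(v, control);
```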
I just checked the disassembly and the `Avx2.Permute4x64` does not get translated into an instruction but a method call. Ouch. So that's the reason for the slowdown seen in the benchmark.
So I tried to optimize the `Avx2.Permute4x64` version a bit more (seems I'm just not happy with not being able to use the instruction set in a straightforward way) and came up with this:
Even though the code size is 25% bigger than the AVX version using `Avx2.PermuteVar8x32`, the performance of the two versions seems to be identical on my machine (depending on the run, one version wins by a few percent in some benchmarks and loses in others), but overall there's no clear winner, it seems to be 50/50. I guess the fact that it doesn't have to load the extra 256-bit permute vector offsets the bigger code size, making the two versions similar.
Anyway, thank you for this series of posts, it was a fun learning experience for me.
One would think that this:

```csharp
public static ReadOnlySpan<byte> PermuteTable4x64 => new byte[] ...
```

would be better.
Thanks for this series of posts.
I'd love to read/hear your perspective, in a future post or series, on how you monitor/model in production the effects where many Xeon processors will clock themselves down for heat/power reasons if you use too many SIMD/AVX instructions in a short period of time, and what your real-world experiences are in modeling those tradeoffs in your products. https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency/56861355#56861355
Best, Dave M.
David M,
As the post you linked says, only AVX-512 integer and FP multiply cause the processor to switch to L2 (the lowest frequency level). There's no AVX integer and no AVX FP multiply being used in the code in these blog posts. Also, the transition to L1 requires heavy 256-bit instructions as well, like FP and integer multiply, which are also not used in the blog posts.
From the article:
"Furthermore, you never have to be worried about light 256-bit wide instructions either, since they also don't cause downclocking. " "If you aren't using a lot of vectorized FP math, you aren't likely to be using heavy instructions, so this would apply to you{"
I doubt any generic database engine can generate the workloads necessary to throttle a Xeon processor, because they are not Vector math heavy, the vectorized workloads are small and usually focused on data movement (not expensive computations) and usually employ just a few SIMD instructions because they don't have SIMD dense code.
Paurl,
I have no idea if the JIT would treat that as readonly data in this case.
Note that we aren't talking about code hygiene. We are talking specifically about the pattern that the JIT understands and optimizes.
David,
My reaction to that is that it either means we cannot use AVX512 or that we should get better hardware. A lot of those issues are greatly reduced or not applicable on modern systems. See here, for example; it looks like Rocket Lake doesn't have this problem: https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html