Filtering negative numbers, fast: AVX

Sep 13 2023

In the previous post, I discussed how we can optimize the filtering of negative numbers by unrolling the loop and by using branchless code, which improved performance by up to 15% over the initial version we started with. We pushed scalar code about as far as it can go. Now it is time to open a whole new world and see what we can do when we implement this challenge using vector instructions.

The key problem with such tasks is that SIMD, AVX and their friends were designed by… an interesting process using a perspective that makes sense if you can see in a couple of additional dimensions. I assume that at least some of that is implementation constraints, but the key issue is that when you start using SIMD, you realize that you don’t have general-purpose instructions. Instead, you have a lot of dedicated instructions that are doing one thing, hopefully well, and it is your role to compose them into something that would make sense. Oftentimes, you need to turn the solution on its head in order to successfully solve it using SIMD. The benefit, of course, is that you can get quite an amazing boost in speed when you do this.

The algorithm we use is basically to scan the list of entries and copy to the start of the list only those items that are positive. How can we do that using SIMD? The whole point here is that we want to be able to operate on multiple data, but this particular task isn’t trivial. I’m going to show the code first, then discuss what it does in detail:

We start with the usual check (if you’ll recall, that ensures that the JIT knows to elide some range checks), then we load the PermuteTable. For now, just assume that this is magic (and it is). The first interesting thing happens when we start iterating over the loop. Unlike before, we now do that in chunks of 4 int64 elements at a time. Inside the loop, we start by loading a vector of int64 and then we do the first odd thing. We call ExtractMostSignificantBits(), since the sign bit marks whether a number is negative or not. That means that I can use a single instruction to get an integer with one bit set for each negative number in the vector. That is particularly juicy for what we need, since there is no need for comparisons, etc.
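To make the sign-bit trick concrete, here is a minimal sketch (not the code from this post). The class and method names are made up for illustration, and it assumes .NET 7 or later, where `Vector256<T>.ExtractMostSignificantBits()` is available. Bit k of the result corresponds to element k of the vector.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class SignMaskSketch
{
    // Scalar reference: set bit k when values[k] is negative.
    public static uint ScalarSignMask(ReadOnlySpan<long> values)
    {
        uint mask = 0;
        for (int k = 0; k < values.Length; k++)
        {
            if (values[k] < 0)
                mask |= 1u << k;
        }
        return mask;
    }

    // Vector version: one call collects the sign bits of all four lanes.
    public static uint VectorSignMask(long a, long b, long c, long d)
        => Vector256.Create(a, b, c, d).ExtractMostSignificantBits();
}
```

For the values -1, 2, -3, 4 both methods return 0b0101 (5), because elements 0 and 2 are negative.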

If the mask we got is all zeroes, it means that none of the numbers we loaded into the vector are negative, so we can write them as-is to the output and move to the next part. Things get interesting when that isn’t the case.

We load a permute value using some shenanigans (we’ll touch on that shortly) and call the PermuteVar8x32() method. The idea here is that we pack all the non-negative numbers to the start of the vector, then we write the vector to the output. The key here is that when we do that, we increment the output index only by the number of valid values. The rest of this method just handles the remainder that does not fit into a vector.
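The steps described above can be sketched roughly as follows. This is a reconstruction under assumptions, not the listing from this post: the names `FilterSketch`, `BuildTable` and `FilterNegatives` are hypothetical, the permute table is computed at startup as a plain `int[]` instead of being embedded as a constant, and a scalar path is used when AVX2 is not available. It assumes .NET 7 or later.

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class FilterSketch
{
    // 16 possible sign masks, 8 int32 slot indexes per mask.
    private static readonly int[] PermuteTableInts = BuildTable();

    private static int[] BuildTable()
    {
        var table = new int[16 * 8];
        for (int mask = 0; mask < 16; mask++)
        {
            int dst = mask * 8;
            for (int lane = 0; lane < 4; lane++)
            {
                if ((mask & (1 << lane)) != 0)
                    continue;                 // bit set: negative, drop it
                table[dst++] = lane * 2;      // low 32 bits of the int64
                table[dst++] = lane * 2 + 1;  // high 32 bits of the int64
            }
            // Unused trailing slots stay 0; their values are never consumed.
        }
        return table;
    }

    // Compacts the non-negative values to the front, returns how many there are.
    public static int FilterNegatives(Span<long> items)
    {
        int len = items.Length;
        if (len == 0)
            return 0;

        ref long start = ref MemoryMarshal.GetReference(items);
        int outputIdx = 0;
        int i = 0;
        if (Avx2.IsSupported)
        {
            ref int table = ref MemoryMarshal.GetArrayDataReference(PermuteTableInts);
            for (; i + Vector256<long>.Count <= len; i += Vector256<long>.Count)
            {
                var v = Vector256.LoadUnsafe(ref Unsafe.Add(ref start, i));
                uint bits = v.ExtractMostSignificantBits();
                if (bits != 0) // some negatives: pack the survivors to the front
                {
                    var permute = Vector256.LoadUnsafe(ref Unsafe.Add(ref table, (int)bits * 8));
                    v = Avx2.PermuteVar8x32(v.AsInt32(), permute).AsInt64();
                }
                // Always store a full vector, but advance only past the valid values.
                v.StoreUnsafe(ref Unsafe.Add(ref start, outputIdx));
                outputIdx += Vector256<long>.Count - BitOperations.PopCount(bits);
            }
        }
        // Remainder (or everything, without AVX2) in scalar fashion.
        for (; i < len; i++)
        {
            long cur = Unsafe.Add(ref start, i);
            if (cur >= 0)
                Unsafe.Add(ref start, outputIdx++) = cur;
        }
        return outputIdx;
    }
}
```

Storing a full vector is safe in place because the output index never runs ahead of the read index, so the store only touches slots that were already loaded.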

The hard part in this implementation was to figure out how to handle the scenario where we loaded some negative numbers. We need a way to filter them, after all. But there is no single AVX2 instruction that allows us to do so. Luckily, we have the Avx2.PermuteVar8x32() method to help here. To confuse things, we don’t actually want to deal with 8x32 values. We want to deal with 4x64 values. There is an Avx2.Permute4x64() method, and it would work quite nicely, with a single caveat. This method assumes that you are going to pass it a constant value. We don’t have such a constant; we need to be able to provide that based on whatever the masked bits will give us.

So how do we deal with this issue of filtering with SIMD? We need to move all the values we care about to the front of the vector. We have a method to do that, PermuteVar8x32(), and we just need to figure out how to actually make use of it. PermuteVar8x32() accepts an input vector as well as a vector describing the permutation you want to make. In this case, we are basing this on the sign bits of the 4 elements in the int64 vector. As such, there are a total of 16 options available to us. We have to deal with 32-bit values rather than 64-bit ones, but that isn’t that much of a problem.

Here is the permutation table that we’ll be using:

What you can see here is that when we have a 1 in the bits (shown in comments) we’ll not copy that element to the vector. Let’s take a look at the entry of 0101, which may be caused by the following values [-1,2,-3,4] (the lowest bit corresponds to the first element).

When we look at the right entry at index #5 in the table: 2,3,6,7,0,0,0,0

What does this mean? It means that we want to take the 2nd int64 element of the source vector (32-bit slots 2 and 3) and move it to the first element of the destination vector, take the 4th element of the source (32-bit slots 6 and 7) as the second element of the destination, and discard the rest (marked as 0,0,0,0 in the table).
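A scalar model of what the permute does may help here. The sketch below is only an illustration with a made-up helper name. It treats the four int64 values as eight int32 slots and applies the rule `dst[k] = src[control[k]]`, which is how a variable 8x32 permute behaves (only the low 3 bits of each index are used). It assumes a little-endian machine, so the low half of each int64 sits in the even slot.

```csharp
using System;
using System.Runtime.InteropServices;

public static class PermuteEmulationSketch
{
    // Scalar stand-in for an 8 x 32-bit variable permute over 4 int64 values.
    public static long[] Permute8x32(long[] source, int[] control)
    {
        if (source.Length != 4 || control.Length != 8)
            throw new ArgumentException("expected 4 values and 8 slot indexes");

        Span<int> src = MemoryMarshal.Cast<long, int>(source.AsSpan());
        var result = new long[4];
        Span<int> dst = MemoryMarshal.Cast<long, int>(result.AsSpan());
        for (int k = 0; k < 8; k++)
            dst[k] = src[control[k] & 7]; // slot k of the output picks a source slot
        return result;
    }
}
```

Applying the index #5 entry (2,3,6,7,0,0,0,0) to [-1,2,-3,4] gives a result that starts with 2 and 4. The last two elements are filler and are ignored, since the output index only advances by two.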

This is a bit hard to follow because we have to compose the value out of the individual 32-bit words, but it works quite well. Or, at least, it would work, but not as efficiently. This is because we would need to load the PermuteTableInts into a variable and access it, but there are better ways to deal with it. We can also ask the JIT to embed the value directly. The problem is that the pattern that the JIT recognizes is limited to ReadOnlySpan<byte>, which means that the already non-trivial int32 table got turned into this:

This is the exact same data as before, but using ReadOnlySpan<byte> means that the JIT can package that inside the data section and treat it as a constant value.
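As a small illustration of the pattern (not the table from this post), here is a single row spelled out as bytes. The names `ByteTableSketch`, `Row5` and `Row5AsInts` are invented for this sketch, and the byte layout assumes a little-endian machine, where each int32 is stored low byte first.

```csharp
using System;
using System.Runtime.InteropServices;

public static class ByteTableSketch
{
    // The index #5 row (2,3,6,7,0,0,0,0) written as 8 little-endian int32 values.
    // A property returning ReadOnlySpan<byte> over a constant byte array is the
    // shape the text describes as being placed in the data section.
    public static ReadOnlySpan<byte> Row5 => new byte[]
    {
        2, 0, 0, 0,  3, 0, 0, 0,  6, 0, 0, 0,  7, 0, 0, 0,
        0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,
    };

    // Reinterpret the bytes as int32 values to check they match the int table.
    public static int[] Row5AsInts() => MemoryMarshal.Cast<byte, int>(Row5).ToArray();
}
```

Reading the row back as integers yields 2,3,6,7,0,0,0,0, the same entry shown earlier.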

The code was heavily optimized, to the point where I noticed a JIT bug where these two versions of the code give different assembly output:

Here is what we get out:

This looks like an unintended consequence of Roslyn and the JIT each doing their (separate) jobs, but not reaching the end goal. Constant folding looks like it is done mostly by Roslyn, but it scans the expression from the left, so it wouldn’t convert $A * 4 * 8 to $A * 32. That is because it stops evaluating the constants when it finds a variable. When we add parentheses, we isolate the constant part and the compiler now understands that it can fold it.
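The shape of the two expressions can be shown with a tiny sketch. These are not the listings from this post (the method names are invented); they only illustrate why the grouping matters to a left-to-right constant folder while the results stay identical in an unchecked context.

```csharp
using System;

public static class FoldingSketch
{
    // Parsed as (i * 4) * 8: the variable comes first, so there is
    // no constant-only subexpression for the compiler to fold.
    public static int LeftToRight(int i) => unchecked(i * 4 * 8);

    // 4 * 8 is a constant expression, so it is folded to 32 at compile time.
    public static int Grouped(int i) => unchecked(i * (4 * 8));
}
```

Both methods return the same value for every input, including inputs that overflow, because unchecked 32-bit multiplication wraps the same way regardless of grouping.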

Speaking of assembly, here is the annotated assembly version of the code:

And after all of this work, where are we standing?

| Method                 | N        | Mean            | Error        | StdDev       | Ratio | RatioSD | Code Size |
|----------------------- |--------- |----------------:|-------------:|-------------:|------:|--------:|----------:|
| FilterCmp              | 23       |        285.7 ns |      3.84 ns |      3.59 ns |  1.00 |    0.00 |     411 B |
| FilterCmp_NoRangeCheck | 23       |        272.6 ns |      3.98 ns |      3.53 ns |  0.95 |    0.01 |     397 B |
| FilterCmp_Unroll_8     | 23       |        261.4 ns |      1.27 ns |      1.18 ns |  0.91 |    0.01 |     672 B |
| FilterCmp_Avx          | 23       |        261.6 ns |      1.37 ns |      1.28 ns |  0.92 |    0.01 |     521 B |
|                        |          |                 |              |              |       |         |           |
| FilterCmp              | 1047     |        758.7 ns |      1.51 ns |      1.42 ns |  1.00 |    0.00 |     411 B |
| FilterCmp_NoRangeCheck | 1047     |        756.8 ns |      1.83 ns |      1.53 ns |  1.00 |    0.00 |     397 B |
| FilterCmp_Unroll_8     | 1047     |        640.4 ns |      1.94 ns |      1.82 ns |  0.84 |    0.00 |     672 B |
| FilterCmp_Avx          | 1047     |        426.0 ns |      1.62 ns |      1.52 ns |  0.56 |    0.00 |     521 B |
|                        |          |                 |              |              |       |         |           |
| FilterCmp              | 1048599  |    502,681.4 ns |  3,732.37 ns |  3,491.26 ns |  1.00 |    0.00 |     411 B |
| FilterCmp_NoRangeCheck | 1048599  |    499,472.7 ns |  6,082.44 ns |  5,689.52 ns |  0.99 |    0.01 |     397 B |
| FilterCmp_Unroll_8     | 1048599  |    425,800.3 ns |    352.45 ns |    312.44 ns |  0.85 |    0.01 |     672 B |
| FilterCmp_Avx          | 1048599  |    218,075.1 ns |    212.40 ns |    188.29 ns |  0.43 |    0.00 |     521 B |
|                        |          |                 |              |              |       |         |           |
| FilterCmp              | 33554455 | 29,820,978.8 ns | 73,461.68 ns | 61,343.83 ns |  1.00 |    0.00 |     411 B |
| FilterCmp_NoRangeCheck | 33554455 | 29,471,229.2 ns | 73,805.56 ns | 69,037.77 ns |  0.99 |    0.00 |     397 B |
| FilterCmp_Unroll_8     | 33554455 | 29,234,413.8 ns | 67,597.45 ns | 63,230.70 ns |  0.98 |    0.00 |     672 B |
| FilterCmp_Avx          | 33554455 | 28,498,115.4 ns | 71,661.94 ns | 67,032.62 ns |  0.96 |    0.00 |     521 B |

So it seems that the idea of using SIMD instructions has a lot of merit. Moving from the original code to the final version, we see that we can complete the same task in less than half the time in the best case (a ratio of 0.43 at about 1M items).

I’m not quite sure why we aren’t seeing the same sort of performance on the 32M, but I suspect that this is likely because we far exceed the CPU cache and we have to fetch it all from memory, so that is as fast as it can go.
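A quick back-of-the-envelope calculation on the numbers in the table is consistent with that suspicion, although it does not prove it. The sketch below (hypothetical helper name) counts only the bytes read once, 8 per item, and ignores the writes.

```csharp
using System;

public static class ThroughputSketch
{
    // Bytes per nanosecond is numerically the same as GB/s (10^9 bytes per second).
    public static double GigabytesPerSecond(long items, double meanNanoseconds)
        => items * sizeof(long) / meanNanoseconds;
}
```

With the means reported above, FilterCmp_Avx reads at roughly 38.5 GB/s for 1,048,599 items but only about 9.4 GB/s for 33,554,455 items, and the baseline FilterCmp sits at about 9.0 GB/s at that size. All four methods converging on a similar rate is what we would expect if memory, not the instruction mix, were the limit.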

If you are interested in learning more, Lemire solves the same problem in SVE (SIMD for ARM) and Paul has a similar approach in Rust.

If you can think of further optimizations, I would love to hear your ideas.

Comments

13 Sep 2023
13:32 PM

Chris B

I'm not so sure the IL difference is a JIT bug. In the parenthesized case, the constant multiplication occurs prior to the variable multiplication so it can be folded. In the non-parenthesized case, the variable multiplication occurs first, so you could have an overflow after i * 4 and the multiplication by 8 would never occur. There is also the scenario that i *4 does not overflow, but multiplying the result by 8 does, which would result in an observable difference in behavior. Since you are not in a checked context (as far as I can tell), this doesn't hold as much water, but it makes me wonder if it is the intended behavior.

13 Sep 2023
14:58 PM

Catalin Pop

Nice trick with the 32 bit shuffle in pairs vs 64bits shuffle. I wish I'd thought of that myself.

13 Sep 2023
19:25 PM

Catalin Pop

I've played a bit with replacing Avx2.PermuteVar8x32 with Avx2.Permute4x64 and it was not that difficult. The control byte is the index array with four 2-bit indexes, with the least significant 2 bits corresponding to the first long in the Vector256, and the most significant 2 bits to the last long.

Unfortunately it seems to be a bit slower in some benchmarks, but perhaps somewhat more maintainable. The PermTable preserves all elements, so it can also be used for sorting.

        public static ReadOnlySpan<byte> PermuteTable4x64 => new byte[]
        {
              228, //Perm Pattern: 3,2,1,0, Mask: 0000  --- Value is (3 << 6) + (2 << 4) + (1 << 2) + 0 = 228
               57, //Perm Pattern: 0,3,2,1, Mask: 0001
              120, //Perm Pattern: 1,3,2,0, Mask: 0010
               78, //Perm Pattern: 1,0,3,2, Mask: 0011
              180, //Perm Pattern: 2,3,1,0, Mask: 0100
              141, //Perm Pattern: 2,0,3,1, Mask: 0101
              156, //Perm Pattern: 2,1,3,0, Mask: 0110
              147, //Perm Pattern: 2,1,0,3, Mask: 0111
              228, //Perm Pattern: 3,2,1,0, Mask: 1000
              201, //Perm Pattern: 3,0,2,1, Mask: 1001
              216, //Perm Pattern: 3,1,2,0, Mask: 1010
              210, //Perm Pattern: 3,1,0,2, Mask: 1011
              228, //Perm Pattern: 3,2,1,0, Mask: 1100
              225, //Perm Pattern: 3,2,0,1, Mask: 1101
              228, //Perm Pattern: 3,2,1,0, Mask: 1110
              228, //Perm Pattern: 3,2,1,0, Mask: 1111
        };

        public static int FilterCmp_Avx4x64(Span<long> items)
        {
            var len = items.Length;
            if (len <= 0)
                return 0;

            ref var permuteStart = ref Unsafe.AsRef(PermuteTable[0]);
            int outputIdx = 0;
            int i = 0;
            ref var output = ref items[i];
            for (; i + Vector256<long>.Count <= len; i += Vector256<long>.Count)
            {
                var v = Vector256.LoadUnsafe(ref Unsafe.Add(ref output, i));
                var bits = v.ExtractMostSignificantBits();
                if (bits == 0) // do we have _any_ negatives here?
                {
                    v.StoreUnsafe(ref Unsafe.Add(ref output, outputIdx));
                    outputIdx += Vector256<long>.Count;
                    continue;
                }
                // complex case, we have to deal with some negatives
                byte permute = PermuteTable4x64[(int)bits];
                var m = Avx2.Permute4x64(v, permute);
                m.StoreUnsafe(ref Unsafe.Add(ref output, outputIdx));
                outputIdx += 4 - BitOperations.PopCount(bits);
            }

            // remainder, do that in a scalar fashion
            for (; i < len; i++)
            {
                ref var cur = ref Unsafe.Add(ref output, i);
                if (cur < 0)
                    continue;
                Unsafe.Add(ref output, outputIdx++) = cur;

            }
            return outputIdx;
        }

These are my results:

| Method         | N        | Mean            | Error         | StdDev          | Median          | Ratio | RatioSD | Code Size |
|--------------- |--------- |----------------:|--------------:|----------------:|----------------:|------:|--------:|----------:|
| Filter_Avx     | 23       |        356.1 ns |       6.85 ns |        16.80 ns |        352.0 ns |  1.00 |    0.00 |     521 B |
| Filter_Avx4x64 | 23       |        367.7 ns |       7.17 ns |         7.04 ns |        366.7 ns |  1.03 |    0.05 |     570 B |
|                |          |                 |               |                 |                 |       |         |           |
| Filter_Avx     | 1047     |        681.1 ns |      13.37 ns |        20.42 ns |        678.2 ns |  1.00 |    0.00 |     521 B |
| Filter_Avx4x64 | 1047     |        658.7 ns |      11.95 ns |         9.98 ns |        657.3 ns |  0.96 |    0.03 |     570 B |
|                |          |                 |               |                 |                 |       |         |           |
| Filter_Avx     | 1048599  |    559,978.1 ns |  11,316.62 ns |    32,651.03 ns |    548,882.5 ns |  1.00 |    0.00 |     521 B |
| Filter_Avx4x64 | 1048599  |    562,650.5 ns |  10,021.46 ns |    21,356.54 ns |    556,194.7 ns |  1.00 |    0.06 |     570 B |
|                |          |                 |               |                 |                 |       |         |           |
| Filter_Avx     | 33554455 | 28,276,500.8 ns | 559,137.58 ns |   621,479.93 ns | 28,263,209.4 ns |  1.00 |    0.00 |     521 B |
| Filter_Avx4x64 | 33554455 | 31,121,091.4 ns | 620,432.02 ns | 1,739,757.89 ns | 30,963,700.0 ns |  1.08 |    0.08 |     570 B |

13 Sep 2023
19:52 PM

Oren Eini

Chris,

The .NET team confirmed that this is Roslyn that does the constant folding.

But I still consider that a bug given that in all scenarios, there is no difference.

For checked context, I don't think it would matter, you just need to check the end result, not the intermediate. But I'm not running checked here.

13 Sep 2023
19:55 PM

Oren Eini

Catalin,

What you aren't seeing here is that Avx2.Permute4x64 is not actually something that you can use in this context. Check this out: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/Avx2.cs#L2172

```
Permute4x64(Vector256<long> value, [ConstantExpected] byte control)
```

The issue is that this is not a single instruction. Rather, this will go to a method, since you aren't passing a constant, you are passing a variable. That is why I didn't use this method.

13 Sep 2023
20:03 PM

Catalin Pop

I just checked the disassembly and the Avx2.Permute4x64 call does not get translated into an instruction but into a method call. Ouch. So that's the reason for the slowdown seen in the benchmark.

14 Sep 2023
07:03 AM

Catalin Pop

So I tried to optimize the Avx2.Permute4x64 version a bit more (it seems I'm just not happy with not being able to use the instruction set in a straightforward way) and came up with this:

        public static int FilterCmp_Avx4x64(Span<long> items)
        {
            var len = items.Length;
            if (len <= 0)
                return 0;

            ref var permuteStart = ref Unsafe.AsRef(PermuteTable[0]);
            int outputIdx = 0;
            int i = 0;
            ref var output = ref items[i];
            for (; i + Vector256<long>.Count <= len; i += Vector256<long>.Count)
            {
                var v = Vector256.LoadUnsafe(ref Unsafe.Add(ref output, i));
                var bits = v.ExtractMostSignificantBits();
                if (bits == 0) // do we have _any_ negatives here?
                {
                    v.StoreUnsafe(ref Unsafe.Add(ref output, outputIdx));
                    outputIdx += Vector256<long>.Count;
                    continue;
                }
                // complex case, we have to deal with some negatives
                //byte permute = PermuteTable4x64[(int)bits];
                var m = bits switch {
                    0b0000 => Avx2.Permute4x64(v, 228), //Perm Pattern: 3,2,1,0, Mask: 0000
                    0b0001 => Avx2.Permute4x64(v,  57), //Perm Pattern: 0,3,2,1, Mask: 0001
                    0b0010 => Avx2.Permute4x64(v, 120), //Perm Pattern: 1,3,2,0, Mask: 0010
                    0b0011 => Avx2.Permute4x64(v,  78), //Perm Pattern: 1,0,3,2, Mask: 0011
                    0b0100 => Avx2.Permute4x64(v, 180), //Perm Pattern: 2,3,1,0, Mask: 0100
                    0b0101 => Avx2.Permute4x64(v, 141), //Perm Pattern: 2,0,3,1, Mask: 0101
                    0b0110 => Avx2.Permute4x64(v, 156), //Perm Pattern: 2,1,3,0, Mask: 0110
                    0b0111 => Avx2.Permute4x64(v, 147), //Perm Pattern: 2,1,0,3, Mask: 0111
                    0b1000 => Avx2.Permute4x64(v, 228), //Perm Pattern: 3,2,1,0, Mask: 1000
                    0b1001 => Avx2.Permute4x64(v, 201), //Perm Pattern: 3,0,2,1, Mask: 1001
                    0b1010 => Avx2.Permute4x64(v, 216), //Perm Pattern: 3,1,2,0, Mask: 1010
                    0b1011 => Avx2.Permute4x64(v, 210), //Perm Pattern: 3,1,0,2, Mask: 1011
                    0b1100 => Avx2.Permute4x64(v, 228), //Perm Pattern: 3,2,1,0, Mask: 1100
                    0b1101 => Avx2.Permute4x64(v, 225), //Perm Pattern: 3,2,0,1, Mask: 1101
                    0b1110 => Avx2.Permute4x64(v, 228), //Perm Pattern: 3,2,1,0, Mask: 1110
                    0b1111 => Avx2.Permute4x64(v, 228), //Perm Pattern: 3,2,1,0, Mask: 1111
                };
                //var m = Avx2.Permute4x64(v, permute);
                m.StoreUnsafe(ref Unsafe.Add(ref output, outputIdx));
                outputIdx += 4 - BitOperations.PopCount(bits);
            }

            // remainder, do that in a scalar fashion
            for (; i < len; i++)
            {
                ref var cur = ref Unsafe.Add(ref output, i);
                if (cur < 0)
                    continue;
                Unsafe.Add(ref output, outputIdx++) = cur;

            }
            return outputIdx;
        }

Even though the code size is 25% bigger than the Avx version using Avx2.PermuteVar8x32, the performance of the two versions seems to be identical on my machine (depending on the run, one version wins by some % in some benchmarks and loses in others), but overall there's no clear winner, it seems to be 50/50. I guess that not having to load the extra 256-bit permute vector offsets the bigger code size, making the two versions similar.

| Method          | N        | Mean            | Error         | StdDev        | Ratio | Code Size |
|---------------- |--------- |----------------:|--------------:|--------------:|------:|----------:|
| Filter_Baseline | 23       |        363.7 ns |       5.16 ns |       4.83 ns |  1.00 |     411 B |
| Filter_Avx      | 23       |        344.1 ns |       3.03 ns |       2.53 ns |  0.95 |     521 B |
| Filter_Avx4x64  | 23       |        344.8 ns |       2.65 ns |       2.35 ns |  0.95 |     702 B |
|                 |          |                 |               |               |       |           |
| Filter_Baseline | 1047     |      1,040.8 ns |      10.14 ns |       9.48 ns |  1.00 |     411 B |
| Filter_Avx      | 1047     |        565.5 ns |       1.94 ns |       1.81 ns |  0.54 |     521 B |
| Filter_Avx4x64  | 1047     |        585.0 ns |       4.80 ns |       4.26 ns |  0.56 |     702 B |
|                 |          |                 |               |               |       |           |
| Filter_Baseline | 1048599  |    797,403.5 ns |   2,998.88 ns |   2,805.16 ns |  1.00 |     411 B |
| Filter_Avx      | 1048599  |    486,082.8 ns |   5,278.33 ns |   4,937.35 ns |  0.61 |     521 B |
| Filter_Avx4x64  | 1048599  |    482,969.0 ns |   3,632.77 ns |   3,398.09 ns |  0.61 |     702 B |
|                 |          |                 |               |               |       |           |
| Filter_Baseline | 33554455 | 31,475,673.3 ns | 231,737.00 ns | 205,428.81 ns |  1.00 |     411 B |
| Filter_Avx      | 33554455 | 26,957,754.2 ns | 177,501.92 ns | 138,581.82 ns |  0.86 |     521 B |
| Filter_Avx4x64  | 33554455 | 26,980,242.0 ns | 147,098.51 ns | 130,399.00 ns |  0.86 |     702 B |

Anyway, Thank you for this series of posts, it was a fun learning experience for me.

14 Sep 2023
08:56 AM

Paulo Morgado

One would think that this:

```csharp
public static ReadOnlySpan<byte> PermuteTable4x64 => new byte[] ...
```

would create a new array every time and this:

```csharp
private static readonly byte[] permuteTable4x64 = new byte[]
...
public static ReadOnlySpan<byte> PermuteTable4x64 => permuteTable4x64;
```

would be better.

14 Sep 2023
13:40 PM

David M

Thanks for this series of posts.

I'd love to read/hear your perspective in a future post or series of how you monitor/model in production the effects where many Xeon processors will clock themselves down for heat/power if you use too many SIMD/AVX instructions in a short period of time, and what your real world experiences are in modeling those tradeoffs in your products? https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency/56861355#56861355

Best, Dave M.

14 Sep 2023
17:07 PM

Catalin Pop

David M,

As the post you linked says, only AVX512 Integer and FP Multiply cause the processor to switch to L2 (the lowest frequency level). There's no AVX integer and no AVX FP multiply being used in the code in these blog posts. Also, the transition to L1 requires heavy 256-bit instructions such as FP and Integer Multiply, which are also not used in the blog posts.

From the article:

"Furthermore, you never have to be worried about light 256-bit wide instructions either, since they also don't cause downclocking." "If you aren't using a lot of vectorized FP math, you aren't likely to be using heavy instructions, so this would apply to you."

I doubt any generic database engine can generate the workloads necessary to throttle a Xeon processor, because they are not Vector math heavy, the vectorized workloads are small and usually focused on data movement (not expensive computations) and usually employ just a few SIMD instructions because they don't have SIMD dense code.

14 Sep 2023
18:34 PM

Oren Eini

Paulo,

I have no idea if the JIT would treat that as read-only data in this case.

Note that we aren't talking about code hygiene. We are talking specifically about the pattern that the JIT understands and optimizes.

14 Sep 2023
18:50 PM

Oren Eini

David,

My reaction to that is that this either means that we cannot use AVX512 or that we should get better hardware. A lot of those issues are greatly reduced / not applicable to modern systems. See here, for example, looks like Rocket Lake doesn't have this: https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html

Oren Eini

CEO of RavenDB
