Filtering negative numbers, fast: Scalar
While working deep on the guts of RavenDB, I found myself with a seemingly simple task. Given a list of longs, I need to filter out all negative numbers as quickly as possible.
The actual scenario is that we run a speculative algorithm: given a potentially large list of items, we check whether we can fulfill the request in an optimal fashion. If that isn’t possible, we need to switch to a slower code path that does more work.
Conceptually, this looks something like this:
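A minimal sketch of that shape, with purely illustrative bodies (TryRunOptimized and RunManually are stand-in names and toy implementations, not RavenDB APIs):

```csharp
using System;

public static class Speculative
{
    public static long Run(Span<long> items)
    {
        // Speculative fast path: bail out the moment it cannot cope.
        if (TryRunOptimized(items, out var result))
            return result;

        // Slower fallback that always succeeds, at a higher cost.
        return RunManually(items);
    }

    static bool TryRunOptimized(Span<long> items, out long result)
    {
        result = 0;
        foreach (var v in items)
        {
            if (v < 0)
                return false; // cannot handle negated (already-used) entries
            result += v;
        }
        return true;
    }

    static long RunManually(Span<long> items)
    {
        long sum = 0;
        foreach (var v in items)
            sum += Math.Abs(v); // stand-in for the "more work" path
        return sum;
    }
}
```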
That is the setup for this story. The problem is that we now need to filter the results we pass to the RunManually() method.
There is a problem here, however. We marked the entries that we already used in the list by negating them. RunManually() does not allow negative values, and its internal implementation is not friendly to ignoring them.
In other words, given a Span<long>, I need to write the code that would filter out all the negative numbers. Everything else about the list of numbers should remain the same (the order of elements, etc).
From a coding perspective, this is as simple as:
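A sketch of that “simple” version, in the shape the next paragraph critiques (a lambda plus intermediate allocations):

```csharp
using System;
using System.Linq;

public static class NaiveFilter
{
    // Straightforward, but allocates an enumerator, captures a lambda,
    // and builds a brand new array for the result.
    public static long[] FilterNegatives(long[] items)
    {
        return items.Where(x => x >= 0).ToArray();
    }
}
```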
Please note, just looking at this code makes me cringe a lot. This does the work, but it has an absolutely horrible performance profile. It allocates multiple arrays, uses a lambda, etc.
We don’t actually care about the entries here, so we are free to modify them without allocating a new value. As such, let’s see what kind of code we can write to do this work in an efficient manner. Here is what I came up with:
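A sketch of that compaction; the FilterCmp name matches the benchmark tables below, though the exact body is an approximation:

```csharp
using System;

public static class Filters
{
    // Compact all non-negative values to the front of the span, preserving
    // their order, and return how many valid items remain.
    public static int FilterCmp(Span<long> items)
    {
        int output = 0;
        for (int i = 0; i < items.Length; i++)
        {
            if (items[i] >= 0)           // direct comparison against zero
                items[output++] = items[i];
        }
        return output;
    }
}
```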
The way this works is that we scan through the list, skipping the writes for negative values, so we effectively “move down” all the non-negative values on top of the negative ones. This has a cost of O(N) and will modify the entire array; the final output is the number of valid items that we have there.
In order to test the performance, I wrote the following harness:
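A plausible BenchmarkDotNet skeleton matching the description; only the [Params] sizes come from the results tables below, while the names, seed, and sprinkle logic are assumptions:

```csharp
using System;
using BenchmarkDotNet.Attributes;

[DisassemblyDiagnoser]
public class FilterBench
{
    // Sizes taken from the results tables; everything else is a guess.
    [Params(1047, 1048599, 33554455)]
    public int N;

    private long[] _items;
    private Random _random;

    [GlobalSetup]
    public void Setup()
    {
        _random = new Random(2337); // arbitrary seed
        _items = new long[N];
        for (int i = 0; i < N; i++)
            _items[i] = _random.NextInt64(0, long.MaxValue);
    }

    [IterationSetup]
    public void SprinkleNegatives()
    {
        // The benchmark compacts the negatives away, so roughly 0.5% of
        // the entries must be re-negated before every invocation.
        for (int i = 0; i < N / 200; i++)
        {
            int idx = _random.Next(N);
            _items[idx] = -Math.Abs(_items[idx]);
        }
    }

    [Benchmark(Baseline = true)]
    public int FilterCmp()
    {
        var items = _items.AsSpan();
        int output = 0;
        for (int i = 0; i < items.Length; i++)
        {
            if (items[i] >= 0)
                items[output++] = items[i];
        }
        return output;
    }
}
```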
We compare arrays of 1K, 1M, and 32M elements, each of which has about 0.5% negative values, randomly spread across the range. Because we modify the values directly, we need to sprinkle the negatives across the array before each call. In this case, I’m testing two options for this task, one that uses a direct comparison (shown above) and one that uses bitwise or, like so:
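One way to test the sign with a bitwise or instead of a relational comparison; this particular formulation (or-ing with long.MaxValue, so only the sign bit can change the result) is a reconstruction of the idea, not necessarily the original code:

```csharp
using System;

public static class FiltersOr
{
    // Same compaction as FilterCmp, but the sign test is bitwise:
    // for x >= 0, x | long.MaxValue == long.MaxValue (sign bit clear);
    // for x < 0, the result is -1 instead.
    public static int FilterOr(Span<long> items)
    {
        int output = 0;
        for (int i = 0; i < items.Length; i++)
        {
            if ((items[i] | long.MaxValue) == long.MaxValue)
                items[output++] = items[i];
        }
        return output;
    }
}
```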
I’m testing the cost of sprinkling negatives as well, since that has to be done before each benchmark call (since we modify the array during the call, we need to “reset” its state for the next one).
Given the two options, before we discuss the results, what would you expect to be the faster option? How would the size of the array matter here?
I really like this example, because it is simple; there isn’t any real complexity in what we are trying to do, and there is a very straightforward implementation that we can use as our baseline. That also means that I get to analyze what is going on at a very deep level. You might have noticed the disassembler attribute on the benchmark code; we are going to dive deep into that. For the same reason, we aren’t using arrays of exactly 1K, 1M, or 32M elements, but slightly larger than that, so we’ll have to deal with remainders later on.
Let’s first look at what the JIT actually did here. Here is the annotated assembly for the FilterCmp function:
For the FilterOr, the code is pretty much the same, except that the key part is:
As you can see, the cmp option is slightly smaller, in terms of code size. In terms of performance, we have:
Method | N | Mean
---|---|---
FilterOr | 1047 | 745.6 ns
FilterCmp | 1047 | 745.8 ns
FilterOr | 1048599 | 497,463.6 ns
FilterCmp | 1048599 | 498,784.8 ns
FilterOr | 33554455 | 31,427,660.7 ns
FilterCmp | 33554455 | 30,024,102.9 ns
The costs are very close to one another, with Or being very slightly faster on the small input and Cmp slightly faster on the larger sizes. The difference between them is basically noise; they have the same performance.
The question is, can we do better here?
Looking at the assembly, there is an extra range check in the main loop that the JIT couldn’t elide (the write to items[output++]). Can we do something about it, and would it make any difference in performance? Here is how I can remove the range check:
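A sketch of that technique: route the write through an unchecked ref cursor so there is no items[output] bounds check for the JIT to emit. The method name matches the benchmark tables below; the body is an approximation of the approach:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public static class FiltersNoRange
{
    public static int FilterCmp_NoRangeCheck(Span<long> items)
    {
        // Writing via Unsafe.Add bypasses the span indexer, so there is
        // no bounds check on the output position. We rely on output <= i
        // always holding, which the loop structure guarantees.
        ref long first = ref MemoryMarshal.GetReference(items);
        int output = 0;
        for (int i = 0; i < items.Length; i++)
        {
            long cur = items[i];
            if (cur >= 0)
                Unsafe.Add(ref first, output++) = cur;
        }
        return output;
    }
}
```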
Here I’m telling the JIT: “I know what I’m doing”, and it shows.
Let’s look at the assembly changes between those two methods, first the prolog:
Here you can see what is actually happening. Note the last four instructions: we have a range check for the items, and then another check for the loop. The first will get you an exception, the second will just skip the loop. In both cases, we test exactly the same thing. The JIT had a chance to optimize that, but didn’t.
Here is a funny scenario where adding code may reduce the amount of code generated. Let’s do another version of this method:
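A sketch of what that version might look like; the only point of interest is the added early-out for an empty span (the guard is an assumption about the exact form, per the discussion that follows):

```csharp
using System;

public static class FiltersGuarded
{
    // Paradoxically, this extra branch gives the JIT enough information
    // to emit less code for the rest of the method.
    public static int FilterCmp_WithGuard(Span<long> items)
    {
        if (items.IsEmpty)
            return 0;

        int output = 0;
        for (int i = 0; i < items.Length; i++)
        {
            long cur = items[i];
            if (cur >= 0)
                items[output++] = cur;
        }
        return output;
    }
}
```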
In this case, I added a check to handle the scenario of items being empty. What can the JIT do with this now? It turns out, quite a lot. We dropped 10 bytes from the method, which is a nice result of our diet. Here is the annotated version of the assembly:
A lot of the space savings in this case come from just not having to do a range check, but you’ll note that we still do an extra check there (lines 12..13), even though we already checked that. I think that the JIT knows that the value is not zero at this point, but has to consider that the value may be negative.
If we change the initial guard clause to items.Length <= 0, what do you think will happen? At this point, the JIT is smart enough to just elide everything; we are at 55 bytes of code and it is a super clean assembly (not a sentence I ever thought I would use). I’ll spare you going through more assembly listings, but you can find the output here.
And after all of that, where are we at?
Method | N | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
---|---|---|---|---|---|---|---|
FilterCmp | 23 | 274.5 ns | 1.91 ns | 1.70 ns | 1.00 | 0.00 | 411 B |
FilterCmp_NoRangeCheck | 23 | 269.7 ns | 1.33 ns | 1.24 ns | 0.98 | 0.01 | 397 B |
FilterCmp | 1047 | 744.5 ns | 4.88 ns | 4.33 ns | 1.00 | 0.00 | 411 B |
FilterCmp_NoRangeCheck | 1047 | 745.8 ns | 3.44 ns | 3.22 ns | 1.00 | 0.00 | 397 B |
FilterCmp | 1048599 | 502,608.6 ns | 3,890.38 ns | 3,639.06 ns | 1.00 | 0.00 | 411 B |
FilterCmp_NoRangeCheck | 1048599 | 490,669.1 ns | 1,793.52 ns | 1,589.91 ns | 0.98 | 0.01 | 397 B |
FilterCmp | 33554455 | 30,495,286.6 ns | 602,907.86 ns | 717,718.92 ns | 1.00 | 0.00 | 411 B |
FilterCmp_NoRangeCheck | 33554455 | 29,952,221.2 ns | 442,176.37 ns | 391,977.84 ns | 0.99 | 0.02 | 397 B |
There is a very slight benefit to the NoRangeCheck, but even when we talk about 32M items, we aren’t talking about a lot of time.
The question is, what can we do better here?
More posts in "Filtering negative numbers, fast" series:
- (15 Sep 2023) Beating memcpy()
- (13 Sep 2023) AVX
- (12 Sep 2023) Unroll
- (11 Sep 2023) Scalar
Comments
This looks pretty well optimised already, but it also looks like a workload that should fare well with SIMD - guessing that's the next step?
Cocowalla,
Yep, wait for tomorrow's post :-)
Jason,
Unrolling is the next post in the series. And yes, that is the next natural step. What I find amazing is the difference between the original code and what we end up when we optimize.