In the previous post, I used AVX to get some nice speedups. In the best case, I was able to shave up to 57%(!) off the runtime when processing arrays of 1M items. That is really amazing, if you think about it. But my best effort only gave me a 4% improvement when using 32M items.
I decided to investigate what is going on in more depth, and I came up with the following benchmark. Given that I want to filter negative numbers, what would happen if the only negative number in the array was the first one?
In other words, let’s see what happens when we could write this algorithm as the following line of code:
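Conceptually, something like this (a sketch; I'm assuming the data is in a `Span<long>` named `items`, with the lone negative value at index 0):

```csharp
// With only items[0] filtered out, the whole "filter" degenerates into
// shifting the rest of the array one slot to the left:
items[1..].CopyTo(items);
```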
The idea here is that we should measure the speed of raw memory copy and see how that compares to our code.
Before we dive into the results, I want to make a few things explicit. We are dealing here with arrays of long, so when I’m talking about an array with 1M items, I’m actually talking about an 8MB buffer, and for the 32M items, we are talking about 256MB of memory.
I’m running these benchmarks on the following machine:
- AMD Ryzen 9 5950X 16-Core Processor
- Base speed: 3.40 GHz
- L1 cache: 1.0 MB
- L2 cache: 8.0 MB
- L3 cache: 64.0 MB
- Speed: 4.59 GHz
In other words, when we look at this, the 1M items (8MB) can fit into L2 (barely), and it is certainly backed by the L3 cache. For the 32M items (256MB), we are far beyond what can fit in the cache, so we are probably dealing with memory bandwidth issues.
I wrote the following functions to test it out:
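A sketch of the three approaches (the `[Params]` values, field names, and the P/Invoke signature here are my illustrative assumptions, not the exact original code):

```csharp
using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;

public unsafe class CopyBenchmarks
{
    [Params(1048599, 33554455)]
    public int N;

    private long[] _items = Array.Empty<long>();

    [GlobalSetup]
    public void Setup() => _items = new long[N];

    [Benchmark]
    public void CopyTo() =>
        // Span-based copy: shift everything one item to the left
        _items.AsSpan(1).CopyTo(_items);

    [Benchmark]
    public void MemoryCopy()
    {
        fixed (long* p = _items)
        {
            long bytes = (long)(_items.Length - 1) * sizeof(long);
            // Overlapping buffers, so this ends up in the memmove path
            Buffer.MemoryCopy(p + 1, p, bytes, bytes);
        }
    }

    // P/Invoke into the Windows API (RtlMoveMemory handles overlapping buffers)
    [DllImport("kernel32.dll", EntryPoint = "RtlMoveMemory")]
    private static extern void WinMoveMemory(void* dst, void* src, nint len);

    [Benchmark]
    public void MoveMemory()
    {
        fixed (long* p = _items)
            WinMoveMemory(p, p + 1, (nint)((_items.Length - 1) * sizeof(long)));
    }
}
```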
Let’s look at what I’m actually testing here.
- CopyTo() – uses the span’s native copying, which is the most ergonomic way to do things, I think.
- MemoryCopy() – uses a built-in unsafe API in the framework. That eventually boils down to a heavily optimized routine, which falls back to Memmove() if the buffers overlap (as they do in this case).
- MoveMemory() – uses P/Invoke to call the Windows API to do the moving of memory for us.
Here are the results for the 1M case (8MB):
| Method | N | Mean | Error | StdDev | Ratio |
|---|---|---|---|---|---|
| FilterCmp | 1048599 | 441.4 us | 1.78 us | 1.58 us | 1.00 |
| FilterCmp_Avx | 1048599 | 141.1 us | 2.70 us | 2.65 us | 0.32 |
| CopyTo | 1048599 | 872.8 us | 11.27 us | 10.54 us | 1.98 |
| MemoryCopy | 1048599 | 869.7 us | 7.29 us | 6.46 us | 1.97 |
| MoveMemory | 1048599 | 126.9 us | 0.28 us | 0.25 us | 0.29 |
We can see some real surprises here. The baseline is FilterCmp, the very basic implementation that I wrote.
I cannot explain why CopyTo() and MemoryCopy() are so slow.
What is really impressive is that the FilterCmp_Avx() and MoveMemory() are so close in performance, and so much faster. To put it another way, we are already within shouting distance of the MoveMemory() performance. That is… really impressive.
That said, what happens with 32M items (256MB)?
| Method | N | Mean | Error | StdDev | Ratio |
|---|---|---|---|---|---|
| FilterCmp | 33554455 | 22,763.6 us | 157.23 us | 147.07 us | 1.00 |
| FilterCmp_Avx | 33554455 | 20,122.3 us | 214.10 us | 200.27 us | 0.88 |
| CopyTo | 33554455 | 27,660.1 us | 91.41 us | 76.33 us | 1.22 |
| MemoryCopy | 33554455 | 27,618.4 us | 136.16 us | 127.36 us | 1.21 |
| MoveMemory | 33554455 | 20,152.0 us | 166.66 us | 155.89 us | 0.89 |
Now we are faster in the FilterCmp_Avx than MoveMemory. That is… a pretty big wow, and a really nice close for this blog post series, right? Except that we won’t be stopping here.
The way the task I set out is structured, we are actually filtering just the first item out, and then we are basically copying the memory. Let’s do some math: 256MB in 20.1ms means 12.4 GB/sec!
On this system, I have the following memory setup:
- Speed: 2133 MHz
- Slots used: 4 of 4
- Form factor: DIMM
- Hardware reserved: 55.2 MB
I’m using DDR4 memory, so I can expect a maximum speed of about 17GB/sec (2,133 MT/sec × 8 bytes per transfer per channel ≈ 17 GB/sec). In theory, I might be able to get more if the memory is located on different DIMMs, but for the size in question, that is not likely.
I’m going to skip the training montage of VTune, understanding memory architecture and figuring out what is actually going on.
Let’s drop everything and look at what we have with just AVX vs. MoveMemory:
| Method | N | Mean | Error | StdDev | Median | Ratio |
|---|---|---|---|---|---|---|
| FilterCmp_Avx | 1048599 | 141.6 us | 2.28 us | 2.02 us | 141.8 us | 1.12 |
| MoveMemory | 1048599 | 126.8 us | 0.25 us | 0.19 us | 126.8 us | 1.00 |
| FilterCmp_Avx | 33554455 | 21,105.5 us | 408.65 us | 963.25 us | 20,770.4 us | 1.08 |
| MoveMemory | 33554455 | 20,142.5 us | 245.08 us | 204.66 us | 20,108.2 us | 1.00 |
The new baseline is MoveMemory, and in this run, we can see that the AVX code is slightly slower.
It’s sadly not uncommon to see numbers shift by those ranges when we are testing such micro-optimizations, mostly because we are subject to so many variables that can affect performance. In this case, I dropped all the other benchmarks, which may have changed things.
At any rate, using those numbers, we have 12.4GB/sec for MoveMemory() and 11.8GB/sec for the AVX version. The hardware maximum speed is 17GB/sec. So we are quite close to what can be done.
For that matter, I would like to point out that the trivial code completed the task at 11GB/sec, which is quite respectable, and shows that the issue here is literally getting the memory to the CPU fast enough.
Can we do something about that? I made a pretty small change to the AVX version, like so:
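Roughly, the change looks like this. This is a simplified sketch rather than the actual code from the post: the scalar fallback stands in for the real AVX compress step, and all the names here are illustrative.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe int FilterNegativesAvxNext(long* buffer, int len)
{
    long* output = buffer;
    long* ip = buffer;
    long* end = buffer + (len & ~3); // whole vectors of 4 longs
    if (ip < end)
    {
        Vector256<long> cur = Avx.LoadVector256(ip);
        ip += 4;
        while (true)
        {
            bool hasNext = ip < end;
            // Kick off the *next* load before we process the current vector
            Vector256<long> next = hasNext ? Avx.LoadVector256(ip) : default;

            // Process `cur`: which lanes are negative?
            Vector256<long> neg = Avx2.CompareGreaterThan(Vector256<long>.Zero, cur);
            if (Avx.MoveMask(neg.AsDouble()) == 0)
            {
                Avx.Store(output, cur); // no negatives: keep the whole vector
                output += 4;
            }
            else
            {
                for (int i = 0; i < 4; i++) // stand-in for the AVX compress step
                {
                    long v = cur.GetElement(i);
                    if (v >= 0) *output++ = v;
                }
            }

            if (!hasNext) break;
            cur = next;
            ip += 4;
        }
    }
    for (long* t = end; t < buffer + len; t++) // scalar tail
        if (*t >= 0) *output++ = *t;
    return (int)(output - buffer);
}
```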
What are we actually doing here? Instead of loading a value and immediately using it, we load the next value, then execute the loop body; when we iterate again, we start loading the value after that and process the current one. The idea is to parallelize load and compute at the instruction level.
Sadly, that didn’t seem to do the trick. I saw a 19% additional cost for that version compared to the vanilla AVX one on the 8MB run and a 2% additional cost on the 256MB run.
I then realized that there was one really important test that I also had to make, so I wrote the following:
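A sketch of what that benchmark could look like, using RtlFillMemory from the same Windows API family (the declarations and names here are my assumptions):

```csharp
using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;

public unsafe class RawMemoryBenchmarks
{
    [Params(1048599, 33554455)]
    public int N;

    private long[] _items = Array.Empty<long>();

    [GlobalSetup]
    public void Setup() => _items = new long[N];

    [DllImport("kernel32.dll", EntryPoint = "RtlMoveMemory")]
    private static extern void WinMoveMemory(void* dst, void* src, nint len);

    [DllImport("kernel32.dll", EntryPoint = "RtlFillMemory")]
    private static extern void WinFillMemory(void* dst, nint len, int fill);

    [Benchmark] // read + write: pays for bandwidth in both directions
    public void MoveMemory()
    {
        fixed (long* p = _items)
            WinMoveMemory(p, p + 1, (nint)((_items.Length - 1) * sizeof(long)));
    }

    [Benchmark(Baseline = true)] // write only
    public void FillMemory()
    {
        fixed (long* p = _items)
            WinFillMemory(p, (nint)((long)_items.Length * sizeof(long)), 0);
    }
}
```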
In other words, let's test the speed of moving memory and filling memory as fast as we possibly can. Here are the results:
| Method | N | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|---|---|---|---|---|---|---|---|
| MoveMemory | 1048599 | 126.8 us | 0.36 us | 0.33 us | 0.25 | 0.00 | 270 B |
| FillMemory | 1048599 | 513.5 us | 10.05 us | 10.32 us | 1.00 | 0.00 | 351 B |
| MoveMemory | 33554455 | 20,022.5 us | 395.35 us | 500.00 us | 1.26 | 0.02 | 270 B |
| FillMemory | 33554455 | 15,822.4 us | 19.85 us | 17.60 us | 1.00 | 0.00 | 351 B |
This is really interesting: for a small buffer (8MB), MoveMemory is somehow faster than FillMemory. I don’t have a way to explain that, but it has been a pretty consistent result in my tests.
For the large buffer (256MB), we are seeing results that make more sense to me.
- MoveMemory – 12.5 GB/sec
- FillMemory – 15.8 GB/sec
That makes sense: for MoveMemory, we are both reading and writing, so we are paying for memory bandwidth in both directions. For filling the memory, we are only writing, so we can get better performance (no need for reads).
In other words, we are talking about hitting the real physical limits of what the hardware can do. There are all sorts of tricks that one can pull, but when we are this close to the limit, they are almost always context-sensitive and dependent on many factors.
To conclude, here are my final results:
| Method | N | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|---|---|---|---|---|---|---|---|
| FilterCmp_Avx | 1048599 | 307.9 us | 6.15 us | 12.84 us | 0.99 | 0.05 | 270 B |
| FilterCmp_Avx_Next | 1048599 | 308.4 us | 6.07 us | 9.26 us | 0.99 | 0.03 | 270 B |
| CopyTo | 1048599 | 1,043.7 us | 15.96 us | 14.93 us | 3.37 | 0.11 | 452 B |
| ArrayCopy | 1048599 | 1,046.7 us | 15.92 us | 14.89 us | 3.38 | 0.14 | 266 B |
| UnsafeCopy | 1048599 | 309.5 us | 6.15 us | 8.83 us | 1.00 | 0.04 | 133 B |
| MoveMemory | 1048599 | 310.8 us | 6.17 us | 9.43 us | 1.00 | 0.00 | 270 B |
| FilterCmp_Avx | 33554455 | 24,013.1 us | 451.09 us | 443.03 us | 0.98 | 0.02 | 270 B |
| FilterCmp_Avx_Next | 33554455 | 24,437.8 us | 179.88 us | 168.26 us | 1.00 | 0.01 | 270 B |
| CopyTo | 33554455 | 32,931.6 us | 416.57 us | 389.66 us | 1.35 | 0.02 | 452 B |
| ArrayCopy | 33554455 | 32,538.0 us | 463.00 us | 433.09 us | 1.33 | 0.02 | 266 B |
| UnsafeCopy | 33554455 | 24,386.9 us | 209.98 us | 196.42 us | 1.00 | 0.01 | 133 B |
| MoveMemory | 33554455 | 24,427.8 us | 293.75 us | 274.78 us | 1.00 | 0.00 | 270 B |
As you can see, the plain AVX version is comparable to, or even (slightly) beats, the MoveMemory function.
I tried things like prefetching memory (the next item, the next cache line, and the next page), using non-temporal loads and stores, and many other things, but this is a pretty tough challenge.
What is really interesting is that the original, very simple and obvious implementation clocked in at 11 GB/sec. After pulling out pretty much all the stops and tricks, I was able to hit 12.5 GB/sec.
I don’t think anyone could read, understand, or update the resulting code without going through deep meditation first. Given how close the simple version already is to the hardware limit, that is not a bad result at all. And along the way, I learned quite a bit about how the lowest levels of the machine architecture actually work.