In the previous post, I used AVX to get some nice speedups. In the best case, I was able to shave up to 57%(!) off the runtime when processing arrays of 1M items. That is really amazing, if you think about it. But my best effort only gave me a 4% improvement when using 32M items.
I decided to investigate what is going on in more depth, and I came up with the following benchmark. Given that I want to filter negative numbers, what would happen if the only negative number in the array was the first one?
In other words, let’s see what happens when we could write this algorithm as the following line of code:
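Conceptually, something like this (a sketch; I'm assuming the data is in a `Span<long>` named `items`, with the lone negative value at index 0):

```csharp
// With only items[0] filtered out, the whole "filter" degenerates into
// shifting the rest of the array one slot to the left:
items[1..].CopyTo(items);
```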
The idea here is that we should measure the speed of raw memory copy and see how that compares to our code.
Before we dive into the results, I want to make a few things explicit. We are dealing here with arrays of long, so when I’m talking about an array with 1M items, I’m actually talking about an 8MB buffer, and for the 32M items, we are talking about 256MB of memory.
I’m running these benchmarks on the following machine:
- AMD Ryzen 9 5950X 16-Core Processor
- Base speed: 3.40 GHz
- L1 cache: 1.0 MB
- L2 cache: 8.0 MB
- L3 cache: 64.0 MB
- Speed: 4.59 GHz
In other words, when we look at this, the 1M items (8MB) can fit into L2 (barely), and it is certainly backed by the L3 cache. For the 32M items (256MB), we are far beyond what can fit in the cache, so we are probably dealing with memory bandwidth issues.
I wrote the following functions to test it out:
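A sketch of the three approaches (the `[Params]` values, field names, and the P/Invoke signature here are my illustrative assumptions, not the exact original code):

```csharp
using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;

public unsafe class CopyBenchmarks
{
    [Params(1048599, 33554455)]
    public int N;

    private long[] _items = Array.Empty<long>();

    [GlobalSetup]
    public void Setup() => _items = new long[N];

    [Benchmark]
    public void CopyTo() =>
        // Span-based copy: shift everything one item to the left
        _items.AsSpan(1).CopyTo(_items);

    [Benchmark]
    public void MemoryCopy()
    {
        fixed (long* p = _items)
        {
            long bytes = (long)(_items.Length - 1) * sizeof(long);
            // Overlapping buffers, so this ends up in the memmove path
            Buffer.MemoryCopy(p + 1, p, bytes, bytes);
        }
    }

    // P/Invoke into the Windows API (RtlMoveMemory handles overlapping buffers)
    [DllImport("kernel32.dll", EntryPoint = "RtlMoveMemory")]
    private static extern void WinMoveMemory(void* dst, void* src, nint len);

    [Benchmark]
    public void MoveMemory()
    {
        fixed (long* p = _items)
            WinMoveMemory(p, p + 1, (nint)((_items.Length - 1) * sizeof(long)));
    }
}
```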
Let’s look at what I’m actually testing here.
- CopyTo() – uses the span’s native copying, which is the most ergonomic way to do things, I think.
- MemoryCopy() – uses a built-in unsafe API in the framework. That eventually boils down to a heavily optimized routine, which falls back to Memmove() if the buffers overlap (as they do in this case).
- MoveMemory() – uses P/Invoke to call the Windows API to do the moving of memory for us.
Here are the results for the 1M case (8MB):
| Method | N | Mean | Error | StdDev | Ratio |
|---|---|---|---|---|---|
| FilterCmp | 1048599 | 441.4 us | 1.78 us | 1.58 us | 1.00 |
| FilterCmp_Avx | 1048599 | 141.1 us | 2.70 us | 2.65 us | 0.32 |
| CopyTo | 1048599 | 872.8 us | 11.27 us | 10.54 us | 1.98 |
| MemoryCopy | 1048599 | 869.7 us | 7.29 us | 6.46 us | 1.97 |
| MoveMemory | 1048599 | 126.9 us | 0.28 us | 0.25 us | 0.29 |
We can see some real surprises here. The baseline is FilterCmp, the very basic implementation that I wrote.
I cannot explain why CopyTo() and MemoryCopy() are so slow.
What is really impressive is that the FilterCmp_Avx() and MoveMemory() are so close in performance, and so much faster. To put it another way, we are already within shouting distance of the MoveMemory() performance. That is… really impressive.
That said, what happens with 32M items (256MB)?
| Method | N | Mean | Error | StdDev | Ratio |
|---|---|---|---|---|---|
| FilterCmp | 33554455 | 22,763.6 us | 157.23 us | 147.07 us | 1.00 |
| FilterCmp_Avx | 33554455 | 20,122.3 us | 214.10 us | 200.27 us | 0.88 |
| CopyTo | 33554455 | 27,660.1 us | 91.41 us | 76.33 us | 1.22 |
| MemoryCopy | 33554455 | 27,618.4 us | 136.16 us | 127.36 us | 1.21 |
| MoveMemory | 33554455 | 20,152.0 us | 166.66 us | 155.89 us | 0.89 |
Now we are faster in the FilterCmp_Avx than MoveMemory. That is… a pretty big wow, and a really nice close for this blog post series, right? Except that we won’t be stopping here.
The way the task I set out is structured, we are actually filtering just the first item out, and then we are basically copying the memory. Let’s do some math: 256MB in 20.1ms means 12.4 GB/sec!
On this system, I have the following memory setup:
- Speed: 2133 MHz
- Slots used: 4 of 4
- Form factor: DIMM
- Hardware reserved: 55.2 MB
I’m using DDR4 memory, so I can expect a maximum speed of about 17GB/sec (2,133 MT/sec × 8 bytes per transfer per channel ≈ 17 GB/sec). In theory, I might be able to get more if the memory is located on different DIMMs, but for the size in question, that is not likely.
I’m going to skip the training montage of VTune, understanding memory architecture and figuring out what is actually going on.
Let’s drop everything and look at what we have with just AVX vs. MoveMemory:
| Method | N | Mean | Error | StdDev | Median | Ratio |
|---|---|---|---|---|---|---|
| FilterCmp_Avx | 1048599 | 141.6 us | 2.28 us | 2.02 us | 141.8 us | 1.12 |
| MoveMemory | 1048599 | 126.8 us | 0.25 us | 0.19 us | 126.8 us | 1.00 |
| FilterCmp_Avx | 33554455 | 21,105.5 us | 408.65 us | 963.25 us | 20,770.4 us | 1.08 |
| MoveMemory | 33554455 | 20,142.5 us | 245.08 us | 204.66 us | 20,108.2 us | 1.00 |
The new baseline is MoveMemory, and in this run, we can see that the AVX code is slightly slower.
It’s sadly not uncommon to see numbers shift by those ranges when we are testing such micro-optimizations, mostly because we are subject to so many variables that can affect performance. In this case, I dropped all the other benchmarks, which may have changed things.
At any rate, using those numbers, we have 12.4GB/sec for MoveMemory() and 11.8GB/sec for the AVX version. The hardware maximum speed is 17GB/sec. So we are quite close to what can be done.
For that matter, I would like to point out that the trivial code completed the task at 11GB/sec, which is quite respectable, and shows that the issue here is literally getting the memory to the CPU fast enough.
Can we do something about that? I made a pretty small change to the AVX version, like so:
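Roughly, the change looks like this. This is a simplified sketch rather than the actual code from the post: the scalar fallback stands in for the real AVX compress step, and all the names here are illustrative.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe int FilterNegativesAvxNext(long* buffer, int len)
{
    long* output = buffer;
    long* ip = buffer;
    long* end = buffer + (len & ~3); // whole vectors of 4 longs
    if (ip < end)
    {
        Vector256<long> cur = Avx.LoadVector256(ip);
        ip += 4;
        while (true)
        {
            bool hasNext = ip < end;
            // Kick off the *next* load before we process the current vector
            Vector256<long> next = hasNext ? Avx.LoadVector256(ip) : default;

            // Process `cur`: which lanes are negative?
            Vector256<long> neg = Avx2.CompareGreaterThan(Vector256<long>.Zero, cur);
            if (Avx.MoveMask(neg.AsDouble()) == 0)
            {
                Avx.Store(output, cur); // no negatives: keep the whole vector
                output += 4;
            }
            else
            {
                for (int i = 0; i < 4; i++) // stand-in for the AVX compress step
                {
                    long v = cur.GetElement(i);
                    if (v >= 0) *output++ = v;
                }
            }

            if (!hasNext) break;
            cur = next;
            ip += 4;
        }
    }
    for (long* t = end; t < buffer + len; t++) // scalar tail
        if (*t >= 0) *output++ = *t;
    return (int)(output - buffer);
}
```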
What are we actually doing here? Instead of loading a value and immediately using it, we load the next value, then execute the loop body; when we iterate again, we start loading the value after that and process the current one. The idea is to parallelize load and compute at the instruction level.
Sadly, that didn’t seem to do the trick. I saw a 19% additional cost for that version compared to the vanilla AVX one on the 8MB run and a 2% additional cost on the 256MB run.
I then realized that there was one really important test that I also had to make, so I wrote the following:
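A sketch of what that benchmark could look like, using RtlFillMemory from the same Windows API family (the declarations and names here are my assumptions):

```csharp
using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;

public unsafe class RawMemoryBenchmarks
{
    [Params(1048599, 33554455)]
    public int N;

    private long[] _items = Array.Empty<long>();

    [GlobalSetup]
    public void Setup() => _items = new long[N];

    [DllImport("kernel32.dll", EntryPoint = "RtlMoveMemory")]
    private static extern void WinMoveMemory(void* dst, void* src, nint len);

    [DllImport("kernel32.dll", EntryPoint = "RtlFillMemory")]
    private static extern void WinFillMemory(void* dst, nint len, int fill);

    [Benchmark] // read + write: pays for bandwidth in both directions
    public void MoveMemory()
    {
        fixed (long* p = _items)
            WinMoveMemory(p, p + 1, (nint)((_items.Length - 1) * sizeof(long)));
    }

    [Benchmark(Baseline = true)] // write only
    public void FillMemory()
    {
        fixed (long* p = _items)
            WinFillMemory(p, (nint)((long)_items.Length * sizeof(long)), 0);
    }
}
```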
In other words, let's test the speed of moving memory and filling memory as fast as we possibly can. Here are the results:
| Method | N | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|---|---|---|---|---|---|---|---|
| MoveMemory | 1048599 | 126.8 us | 0.36 us | 0.33 us | 0.25 | 0.00 | 270 B |
| FillMemory | 1048599 | 513.5 us | 10.05 us | 10.32 us | 1.00 | 0.00 | 351 B |
| MoveMemory | 33554455 | 20,022.5 us | 395.35 us | 500.00 us | 1.26 | 0.02 | 270 B |
| FillMemory | 33554455 | 15,822.4 us | 19.85 us | 17.60 us | 1.00 | 0.00 | 351 B |
This is really interesting: for a small buffer (8MB), MoveMemory is somehow faster than FillMemory. I don’t have a way to explain that, but it has been a pretty consistent result in my tests.
For the large buffer (256MB), we are seeing results that make more sense to me.
- MoveMemory – 12.5 GB/sec
- FillMemory – 15.8 GB/sec
That makes sense: for MoveMemory, we are both reading and writing, so we are paying for memory bandwidth in both directions. For filling the memory, we are only writing, so we can get better performance (no need for reads).
In other words, we are talking about hitting the real physical limits of what the hardware can do. There are all sorts of tricks that one can pull, but when we are this close to the limit, they are almost always context-sensitive and dependent on many factors.
To conclude, here are my final results:
| Method | N | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|---|---|---|---|---|---|---|---|
| FilterCmp_Avx | 1048599 | 307.9 us | 6.15 us | 12.84 us | 0.99 | 0.05 | 270 B |
| FilterCmp_Avx_Next | 1048599 | 308.4 us | 6.07 us | 9.26 us | 0.99 | 0.03 | 270 B |
| CopyTo | 1048599 | 1,043.7 us | 15.96 us | 14.93 us | 3.37 | 0.11 | 452 B |
| ArrayCopy | 1048599 | 1,046.7 us | 15.92 us | 14.89 us | 3.38 | 0.14 | 266 B |
| UnsafeCopy | 1048599 | 309.5 us | 6.15 us | 8.83 us | 1.00 | 0.04 | 133 B |
| MoveMemory | 1048599 | 310.8 us | 6.17 us | 9.43 us | 1.00 | 0.00 | 270 B |
| FilterCmp_Avx | 33554455 | 24,013.1 us | 451.09 us | 443.03 us | 0.98 | 0.02 | 270 B |
| FilterCmp_Avx_Next | 33554455 | 24,437.8 us | 179.88 us | 168.26 us | 1.00 | 0.01 | 270 B |
| CopyTo | 33554455 | 32,931.6 us | 416.57 us | 389.66 us | 1.35 | 0.02 | 452 B |
| ArrayCopy | 33554455 | 32,538.0 us | 463.00 us | 433.09 us | 1.33 | 0.02 | 266 B |
| UnsafeCopy | 33554455 | 24,386.9 us | 209.98 us | 196.42 us | 1.00 | 0.01 | 133 B |
| MoveMemory | 33554455 | 24,427.8 us | 293.75 us | 274.78 us | 1.00 | 0.00 | 270 B |
As you can see, the plain AVX version is comparable to, or even (slightly) beats, the MoveMemory function.
I tried things like prefetching memory (the next item, the next cache line, and the next page), using non-temporal loads and stores, and many other things, but this is a pretty tough challenge.
What is really interesting is that the original, very simple and obvious implementation clocked in at 11 GB/sec. After pulling out pretty much all the stops and tricks, I was able to hit 12.5 GB/sec.
I don’t think anyone could read, understand, or update the resulting code without going through deep meditation first. Given how close the simple version already is to the hardware limit, that is not a bad result at all. And along the way, I learned quite a bit about how the lowest levels of the machine architecture actually work.