Let us see what happens if we write it from scratch, sans Linq.
The code is still pretty small and idiomatic, but dropping Linq gave us some interesting numbers. It took 10.4 seconds to run, which is comparable to the parallel Linq version. We also allocated 2.9 GB (down from 3.49 GB), and our peak working set didn’t exceed 30 MB.
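A Linq-free aggregation loop might look like the following minimal sketch. It is written in Java rather than the post's C#, and it uses `HashMap<Long, Long>` where the post uses a `Dictionary` of long to long. It assumes input lines of the hypothetical form `<start> <end> <id>`. The class name `PlainLoopTotals` is made up for illustration.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical input format, assumed for illustration only:
//   "<start> <end> <id>", e.g. "2015-01-01T16:44:31 2015-01-01T21:00:23 00043064"
final class PlainLoopTotals {

    // Sums each record's duration (in seconds), grouped by numeric id,
    // using an ordinary loop instead of a query pipeline.
    static Map<Long, Long> sumDurations(List<String> lines) {
        Map<Long, Long> totals = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split(" ");
            LocalDateTime start = LocalDateTime.parse(parts[0]);
            LocalDateTime end = LocalDateTime.parse(parts[1]);
            long seconds = Duration.between(start, end).getSeconds();
            long id = Long.parseLong(parts[2]); // the id is parsed on every line
            // Two map operations per line: a lookup, then a store.
            totals.put(id, totals.getOrDefault(id, 0L) + seconds);
        }
        return totals;
    }
}
```

The `getOrDefault`/`put` pair is the lookup-then-store pattern that a long-to-long dictionary typically requires on every line.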
Taking the next step and parallelizing this approach:
This runs in 8 seconds, with 3.49 GB of allocations and a peak working set of 50 MB. That is good, but we can do better.
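One common way to parallelize such a loop is to give each worker its own map over a slice of the input and then merge the partial maps at the end. The following is a Java sketch under the same assumed `<start> <end> <id>` input format. The contiguous slicing, the fixed thread pool and the `HashMap.merge` step are choices made for this sketch, not necessarily what the post's code does, and `ParallelLoopTotals` is a hypothetical name.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Same hypothetical "<start> <end> <id>" input format as the single-threaded sketch.
final class ParallelLoopTotals {

    static Map<Long, Long> sumDurations(List<String> lines, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            int chunk = (lines.size() + workers - 1) / workers; // ceiling division
            List<CompletableFuture<Map<Long, Long>>> partials = new ArrayList<>();
            for (int from = 0; from < lines.size(); from += chunk) {
                List<String> slice = lines.subList(from, Math.min(from + chunk, lines.size()));
                // Each worker fills its own private map, so no locking is needed.
                partials.add(CompletableFuture.supplyAsync(() -> sumSlice(slice), pool));
            }
            // Merge the per-worker maps on the calling thread.
            Map<Long, Long> totals = new HashMap<>();
            for (CompletableFuture<Map<Long, Long>> partial : partials) {
                partial.join().forEach((id, secs) -> totals.merge(id, secs, Long::sum));
            }
            return totals;
        } finally {
            pool.shutdown();
        }
    }

    // Single-threaded aggregation over one slice of the input.
    private static Map<Long, Long> sumSlice(List<String> slice) {
        Map<Long, Long> local = new HashMap<>();
        for (String line : slice) {
            String[] parts = line.split(" ");
            long seconds = Duration.between(
                    LocalDateTime.parse(parts[0]), LocalDateTime.parse(parts[1])).getSeconds();
            long id = Long.parseLong(parts[2]);
            local.put(id, local.getOrDefault(id, 0L) + seconds);
        }
        return local;
    }
}
```

Per-worker maps avoid contention during the hot loop. The cost is an extra merge pass and one map per worker, which is one reason the memory numbers do not drop when going parallel.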
Now, instead of using a dictionary of long to long, we’re using a dedicated class, and the key is the string representation of the number. Most of the time, this saves us from having to parse the long at all. It also reduces the number of dictionary operations we need to perform.
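A dedicated accumulator class keyed by the id's text could look like the following Java sketch. It uses the same assumed input format, and `SpecializedTotals` and `IdTotal` are hypothetical names. It also assumes ids are written at a fixed width, so the same number always has the same text. Each line costs a single map call: `computeIfAbsent` returns the existing accumulator, and `Long.parseLong` runs only when an id is seen for the first time.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Same hypothetical "<start> <end> <id>" format; ids are assumed to be
// fixed-width, so equal numbers always have equal text.
final class SpecializedTotals {

    // Dedicated mutable accumulator, updated in place after a single lookup.
    static final class IdTotal {
        final long id; // parsed once, when the id is first seen
        long seconds;  // running total of durations
        IdTotal(long id) { this.id = id; }
    }

    static Map<String, IdTotal> sumDurations(List<String> lines) {
        Map<String, IdTotal> totals = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split(" ");
            long seconds = Duration.between(
                    LocalDateTime.parse(parts[0]), LocalDateTime.parse(parts[1])).getSeconds();
            // Key on the id's text; Long.parseLong only runs for ids not seen before.
            IdTotal total = totals.computeIfAbsent(parts[2], text -> new IdTotal(Long.parseLong(text)));
            total.seconds += seconds; // mutate in place, no second map operation
        }
        return totals;
    }
}
```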
This dropped the runtime to 10.2 seconds, compared to 10.4 seconds for the previous single-threaded implementation. That is good, but it is just the first stage. What I really want is to save on all those expensive dictionary calls when running in parallel.
Here is the parallel version:
And that one runs in 4.1 seconds, allocates 3 GB, and has a peak working set of 48 MB.
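The two ideas combine naturally. Each worker builds its own `Map<String, IdTotal>` over a slice of the input, and the partial maps are folded together once all workers finish. As with the earlier sketches, this is Java under an assumed `<start> <end> <id>` input format, with hypothetical names (`ParallelSpecializedTotals`, `IdTotal`). It is not the post's actual implementation.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical "<start> <end> <id>" format with fixed-width ids (assumption).
final class ParallelSpecializedTotals {

    static final class IdTotal {
        final long id; // parsed once per worker, when the id is first seen
        long seconds;  // running total of durations
        IdTotal(long id) { this.id = id; }
    }

    static Map<String, IdTotal> sumDurations(List<String> lines, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            int chunk = (lines.size() + workers - 1) / workers; // ceiling division
            List<CompletableFuture<Map<String, IdTotal>>> partials = new ArrayList<>();
            for (int from = 0; from < lines.size(); from += chunk) {
                List<String> slice = lines.subList(from, Math.min(from + chunk, lines.size()));
                partials.add(CompletableFuture.supplyAsync(() -> sumSlice(slice), pool));
            }
            Map<String, IdTotal> totals = new HashMap<>();
            for (CompletableFuture<Map<String, IdTotal>> partial : partials) {
                for (Map.Entry<String, IdTotal> kv : partial.join().entrySet()) {
                    // First sighting: adopt the worker's object; otherwise add its total.
                    IdTotal existing = totals.putIfAbsent(kv.getKey(), kv.getValue());
                    if (existing != null) {
                        existing.seconds += kv.getValue().seconds;
                    }
                }
            }
            return totals;
        } finally {
            pool.shutdown();
        }
    }

    // One map call per line within a worker's private map.
    private static Map<String, IdTotal> sumSlice(List<String> slice) {
        Map<String, IdTotal> local = new HashMap<>();
        for (String line : slice) {
            String[] parts = line.split(" ");
            long seconds = Duration.between(
                    LocalDateTime.parse(parts[0]), LocalDateTime.parse(parts[1])).getSeconds();
            local.computeIfAbsent(parts[2], text -> new IdTotal(Long.parseLong(text))).seconds += seconds;
        }
        return local;
    }
}
```

Reusing a worker's `IdTotal` object during the merge is safe because `join()` guarantees that worker has finished, and the merge itself runs on a single thread.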
We are now close to 8 times faster than the initial version. But we can probably still do better. I’ll go over that in my next post.
More posts in "Making code faster" series:
- (24 Nov 2016) Micro optimizations and parallel work
- (23 Nov 2016) Specialization make it faster still
- (22 Nov 2016) That pesky dictionary
- (21 Nov 2016) Streamlining the output
- (18 Nov 2016) Pulling out the profiler
- (17 Nov 2016) I like my performance unsafely
- (16 Nov 2016) Going down the I/O chute
- (15 Nov 2016) Starting from scratch
- (14 Nov 2016) The obvious costs
- (11 Nov 2016) The interview question