After looking at the profiler results, I realized that we are actually spending a considerable amount of time just writing the output to a file. That didn’t really matter when our code ran in 30+ seconds; spending another 100 – 200 ms to write the results was just noise. But now that our code finishes in under a second, that is a considerable cost.
I’m running this code on a different machine, so we can’t directly compare. The performance of the initial version is:
38,478 ms and allocated 7,612,741 kb with peak working set of 874,660 kb
And the speed of the latest version is:
842 ms and allocated 208,435 kb with peak working set of 375,452 kb
So we are 45 times faster than the initial version.
The problem is that doing this in parallel costs quite a lot by itself and masks some inefficiencies, so I decided to change it back to a single-threaded approach, which gives:
1,498 ms and allocated 123,787 kb with peak working set of 319,436 kb
Merely 25 times faster than the original version.
And now let us focus on the output.
This is pretty simple code, but it hides a lot of inefficiencies. In particular, it is doing a lot of allocations as it formats the string. We can do much better.
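The original snippet isn’t reproduced here, but as a sketch of the kind of code being described (the names `output` and `stats` are hypothetical), this is the shape that hides the allocations — every record goes through the composite-format machinery:

```csharp
// Sketch only, not the post's actual snippet: each iteration allocates the
// formatted string, boxes the value arguments, and runs the whole
// string.Format pipeline before a single byte reaches the file.
foreach (var kvp in stats)
{
    output.WriteLine($"{kvp.Key:D10} {kvp.Value:D10}");
}
```

Each of those allocations is tiny, but at hundreds of thousands of lines they dominate a sub-second run.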
Merely changing the WriteLine to:
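The changed snippet was shown as an image in the original post, so here is only a plausible sketch of that kind of change, not the author’s actual code: writing each piece directly sidesteps the composite format string entirely.

```csharp
// Hedged sketch of one possible form of the change: write the pieces
// directly instead of building an interpolated string per line. This still
// allocates a string per number internally, but skips the string.Format
// machinery and the boxing of the arguments.
output.Write(kvp.Key);
output.Write(' ');
output.WriteLine(kvp.Value);
```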
Saved us close to 200 ms (!), so there is a lot of room for improvement here. Again, this is mostly a matter of writing highly specific code for this exact scenario. Here is what I did:
I wrote a simple function to format the number into a buffer, then changed the summary code to write each line into a prepared buffer (skipping all the generic formatting machinery), and write that buffer to the file in one shot.
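The post’s snippets were images, so here is a minimal sketch of such a formatting routine — my reconstruction of what a `WriteFormattedInt`-style helper could look like, assuming non-negative values, not the original code:

```csharp
using System;

class NumberFormatter
{
    // Sketch of a WriteFormattedInt-style helper (a reconstruction, not the
    // original code): writes the decimal digits of a non-negative value into
    // the buffer as ASCII bytes and returns the new position. No string is
    // ever allocated.
    public static int WriteFormattedInt(byte[] buffer, int pos, int value)
    {
        var start = pos;
        do
        {
            buffer[pos++] = (byte)('0' + value % 10);
            value /= 10;
        } while (value > 0);

        // The digits came out least-significant first, so flip them in place.
        Array.Reverse(buffer, start, pos - start);
        return pos;
    }
}
```

The summary loop can then append a space and a newline byte after each number and hand the finished buffer to the stream in a single `Write(byte[], int, int)` call.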
And the results are:
1,191 ms and allocated 16,942 kb with peak working set of 311,432 kb
You might have noticed that I have two copies of WriteFormattedInt; this is to skip the implicit cast to long, and yes, it matters: about 50 ms in my tests. This version also reduces our allocations by over 100 MB! So this is great.
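As a sketch of what such a pair of overloads looks like (again my reconstruction with hypothetical bodies, not the original code):

```csharp
using System;

class NumberFormatter
{
    // Two overloads with identical logic. Without the int overload, every
    // call site passing an int would implicitly widen it to long; keeping
    // both lets the compiler bind int arguments to the int version and
    // avoids that conversion in the hot path.
    public static int WriteFormattedInt(byte[] buffer, int pos, int value)
    {
        var start = pos;
        do
        {
            buffer[pos++] = (byte)('0' + value % 10);
            value /= 10;
        } while (value > 0);
        Array.Reverse(buffer, start, pos - start);
        return pos;
    }

    public static int WriteFormattedInt(byte[] buffer, int pos, long value)
    {
        var start = pos;
        do
        {
            buffer[pos++] = (byte)('0' + value % 10);
            value /= 10;
        } while (value > 0);
        Array.Reverse(buffer, start, pos - start);
        return pos;
    }
}
```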
And here are the profiler results from analyzing this method:
This function is now almost 7 times faster! That is pretty awesome, and even looking purely at single-threaded performance, we are 32 times faster than the original version.
Trying the parallel version gives me:
731 ms and allocated 101,565 kb with peak working set of 381,224 kb
And a total improvement of 52 times! But we can do even more… I’ll talk about it in the next post.
More posts in "Making code faster" series:
- (24 Nov 2016) Micro optimizations and parallel work
- (23 Nov 2016) Specialization make it faster still
- (22 Nov 2016) That pesky dictionary
- (21 Nov 2016) Streamlining the output
- (18 Nov 2016) Pulling out the profiler
- (17 Nov 2016) I like my performance unsafely
- (16 Nov 2016) Going down the I/O chute
- (15 Nov 2016) Starting from scratch
- (14 Nov 2016) The obvious costs
- (11 Nov 2016) The interview question