Making code fasterStarting from scratch
After introducing the problem and doing some very obvious things, we have managed to get to 9 seconds instead of 30. That is pretty awesome, but we can do better.
Let us see what would happen if we will write it from scratch, sans Linq.
The code is still pretty small and idiomatic, but not using Linq gave us some interesting numbers. 10.4 seconds to run (so comparable to the parallel Linq), but we also allocated 2.9 GB (down from 3.49 GB) and our peek working set didn’t exceed 30 MB.
Taking the next step and paralleling this approach:
We now have 8 seconds, 3.49 GB of allocations and peak working set of 50 MB. That is good, but we can do better.
Now, instead of using a dictionary of long to long, we’re using a dedicated class, and the key is the string representation of the number. Most of the time, it should save us the need to parse the long. It also means that the number of dictionary operations we need to do is reduced.
This dropped the runtime to 10.2 seconds (compared to 10.4 seconds for the previous single threaded impl). That is good, but this is just the first stage, what I really want to do is save on all those expensive dictionary calls when running in parallel.
Here is the parallel version:
And that one runs at 4.1 seconds, allocates 3 GB and has a peek working set of 48 MB.
We are now close to 8 times faster than the initial version. But we can probably still do better. I’ll go over that in my next post.
More posts in "Making code faster" series:
- (24 Nov 2016) Micro optimizations and parallel work
- (23 Nov 2016) Specialization make it faster still
- (22 Nov 2016) That pesky dictionary
- (21 Nov 2016) Streamlining the output
- (18 Nov 2016) Pulling out the profiler
- (17 Nov 2016) I like my performance unsafely
- (16 Nov 2016) Going down the I/O chute
- (15 Nov 2016) Starting from scratch
- (14 Nov 2016) The obvious costs
- (11 Nov 2016) The interview question
Comments
I wonder what improvements (if any) you'd see by making FastRecord a struct.
Wow, had no idea about Interlocked - thanks. I guess a tiny fraction could be saved by not parsing the 'id' to 'long'.
It looks to me that you are getting ready to hit the point where you trade readability for performance. I'm interested to see how far you are going to go with it and how far would an interviewee have to go to be considered successful.
What about using dedicate Dictionary for each thread (thus avoiding all threads contests), and then merge those dictionaries together?
reading a file will take the majority of time, parsing the strings isn't significant compared to io - why do you think running this in parallel would be the way to go? just read line by line and do the parsing immediately and you shold get more or less same speed without allocating so much memory.
Just playing with this now, feels like the class is a bit overkill. Replaced it with long[1] to see what would happen.
Took: 3,791 ms Allocated: 3,006,652 kb Peak Working Set: 39,604 kb
And with each successive iteration, as performance increases, readability decreases.
@JustPassingBy: Since you don't lock on our add (line "value[0]+=duration;") e.g. by using Interlocked.Add, the results of your code will be inaccurate.
A lot depends on your hardware as well. On my INTEL Core i5 750 2.66G 8MB BOX LGA1156 home cpu and classic HDD it produced this result:
Took: 21.794 ms and allocated 2.976.442 kb with peak working set of 35.096 kb
Only after applying parallel solution, I managed to drop it down significantly. To:
Took: 8.865 ms and allocated 3.030.442 kb with peak working set of 43.132 kb
@Urs: My results were consistent with previous solutions.
Comment preview