After introducing the problem, doing some very obvious things, then doing some pretty non-obvious things, and even writing our own I/O routines, we ended up with an implementation that is 17 times faster than the original one.
And yet we can still do better. But at this point, we need to go native and use a bit of unsafe code. We’ll start by implementing a naïve native record parser, like so:
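The original snippet isn't reproduced here, so the following is a minimal sketch of what such a parser might look like. It assumes a hypothetical fixed-width line layout of "start timestamp,end timestamp,id" (19-character timestamps); the actual offsets would need to match the file format from the earlier posts.

```csharp
using System;

// Sketch only: parse a single record straight out of native memory with
// pointer arithmetic, allocating nothing. Assumes every line ends with a
// line terminator and follows the hypothetical layout described above.
public unsafe struct Record
{
    public long Id;
    public long DurationTicks;

    // Returns the number of bytes consumed, so the caller can advance its pointer.
    public static int Parse(byte* buffer, out Record record)
    {
        DateTime start = ParseTime(buffer);       // first timestamp: bytes 0..18
        DateTime end = ParseTime(buffer + 20);    // second timestamp, after the ','

        long id = 0;
        int pos = 40;                             // id starts after the second ','
        while (buffer[pos] != (byte)'\r' && buffer[pos] != (byte)'\n')
            id = id * 10 + (buffer[pos++] - (byte)'0');

        while (buffer[pos] == (byte)'\r' || buffer[pos] == (byte)'\n')
            pos++;                                // skip the line terminator

        record = new Record { Id = id, DurationTicks = (end - start).Ticks };
        return pos;
    }

    private static DateTime ParseTime(byte* p)
    {
        return new DateTime(
            ParseInt(p, 4), ParseInt(p + 5, 2), ParseInt(p + 8, 2),        // date
            ParseInt(p + 11, 2), ParseInt(p + 14, 2), ParseInt(p + 17, 2)); // time
    }

    private static int ParseInt(byte* p, int digits)
    {
        int value = 0;
        for (int i = 0; i < digits; i++)
            value = value * 10 + (p[i] - '0');
        return value;
    }
}
```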
This is pretty much the same as before, but now we are dealing with pointers. How do we use this?
We memory map the file, and then we go over it, doing no allocations at all throughout.
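A rough sketch of that driving loop, reusing the hypothetical Record.Parse above: map the whole file, acquire a raw pointer to it, and walk it record by record, with only the aggregation dictionary growing.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;

public static unsafe class NativeAggregation
{
    public static Dictionary<long, long> Run(string path)
    {
        var durations = new Dictionary<long, long>();
        long size = new FileInfo(path).Length;

        using (var mmf = MemoryMappedFile.CreateFromFile(path))
        using (var accessor = mmf.CreateViewAccessor(0, size))
        {
            byte* start = null;
            accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref start);
            try
            {
                byte* current = start;
                byte* end = start + size;
                while (current < end)
                {
                    // advance by however many bytes the record took
                    current += Record.Parse(current, out var record);
                    durations.TryGetValue(record.Id, out var total);
                    durations[record.Id] = total + record.DurationTicks;
                }
            }
            finally
            {
                accessor.SafeMemoryMappedViewHandle.ReleasePointer();
            }
        }
        return durations;
    }
}
```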
This gives us 1 second to process the file, 126 MB allocated (probably mostly in the dictionary) and a peak working set of 320 MB.
We are now 30 times faster than the initial implementation, and I wonder if we can do even more. We can, by going parallel, which gives us the following code:
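Again, a sketch rather than the post's exact code, built on the same assumptions as the snippets above: split the mapped file into ranges aligned on line boundaries, hand each thread its own range and its own dictionary, then merge the partial results.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static unsafe class ParallelAggregation
{
    // 'start' is the pointer acquired from the memory mapped view, exactly as
    // in the previous snippet; 'size' is the file length in bytes.
    public static Dictionary<long, long> Run(byte* start, long size, int threadCount = 4)
    {
        var tasks = new Task<Dictionary<long, long>>[threadCount];
        long chunk = size / threadCount;
        long offset = 0;

        for (int i = 0; i < threadCount; i++)
        {
            long from = offset;
            long to = (i == threadCount - 1) ? size : from + chunk;
            // push the end of the range forward to the next line boundary,
            // so no record is split between two threads
            while (to < size && start[to - 1] != (byte)'\n')
                to++;
            offset = to;

            // capture only plain values (no pointers) in the lambda
            var rangeStart = (IntPtr)(start + from);
            long rangeLength = to - from;
            tasks[i] = Task.Run(() => ParseRange(rangeStart, rangeLength));
        }
        Task.WaitAll(tasks);

        // merge the per-thread dictionaries into a single result
        var merged = new Dictionary<long, long>();
        foreach (var task in tasks)
        {
            foreach (var kvp in task.Result)
            {
                merged.TryGetValue(kvp.Key, out var total);
                merged[kvp.Key] = total + kvp.Value;
            }
        }
        return merged;
    }

    private static Dictionary<long, long> ParseRange(IntPtr rangeStart, long length)
    {
        var durations = new Dictionary<long, long>();
        byte* current = (byte*)rangeStart;
        byte* end = current + length;
        while (current < end)
        {
            current += Record.Parse(current, out var record);
            durations.TryGetValue(record.Id, out var total);
            durations[record.Id] = total + record.DurationTicks;
        }
        return durations;
    }
}
```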
This is pretty ugly, but basically we are using 4 threads to run it, giving each one of them its own range of the file as well as its own dedicated records dictionary. After they are done, we merge the per-thread results into a single dictionary, and that is it.
Using this approach, we can get down to a 663 ms run time, 184 MB of allocations, and a 364 MB peak working set.
So we are now about 45(!) times faster than the original version. We are almost done, but in my next post, I’m going to go ahead and pull out the profiler and see if we can squeeze anything else out of it.
More posts in "Making code faster" series:
- (24 Nov 2016) Micro optimizations and parallel work
- (23 Nov 2016) Specialization make it faster still
- (22 Nov 2016) That pesky dictionary
- (21 Nov 2016) Streamlining the output
- (18 Nov 2016) Pulling out the profiler
- (17 Nov 2016) I like my performance unsafely
- (16 Nov 2016) Going down the I/O chute
- (15 Nov 2016) Starting from scratch
- (14 Nov 2016) The obvious costs
- (11 Nov 2016) The interview question