Making code faster: Going down the I/O chute
After introducing the problem, doing some very obvious things, and then doing some pretty non-obvious things, we have managed to get down to 1/8 of the runtime of the original implementation.
But we can do better still. So far, we have relied heavily on the File.ReadLines method, which handles quite a lot of the parsing complexity for us. However, it still allocates a string per line, and our parsing relied on splitting those strings again, meaning even more allocations.
We can take advantage of our knowledge of the file format to do better. The code size blows up, but it is mostly very simple. We create a dedicated record reader class, which reads each record from the file with a minimum of allocations.
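The code itself isn't reproduced in this extract, so here is a sketch of what such a reader might look like. The record layout is an assumption on my part: a 19-character start time, a space, a 19-character end time, a space, an 8-digit id, and a \r\n terminator, which adds up to exactly 50 bytes. The names (RecordReader, MoveNext, and so on) are illustrative rather than taken from the original code.

```csharp
using System;
using System.IO;

public class RecordReader : IDisposable
{
    private const int RecordSize = 50; // 19 + 1 + 19 + 1 + 8 + 2 (\r\n)

    private readonly StreamReader _reader;
    private readonly char[] _buffer = new char[RecordSize]; // the single reusable buffer

    public DateTime Start;
    public DateTime End;
    public long Id;

    public RecordReader(string path)
    {
        _reader = new StreamReader(path);
    }

    public bool MoveNext()
    {
        // Pull one fixed-size record into the buffer; no string is allocated
        if (_reader.ReadBlock(_buffer, 0, RecordSize) != RecordSize)
            return false;

        Start = ParseTime(0);
        End = ParseTime(20);
        Id = ParseInt(40, 8);
        return true;
    }

    private DateTime ParseTime(int pos)
    {
        // Assumes the yyyy-MM-ddTHH:mm:ss layout
        var year = ParseInt(pos, 4);
        var month = ParseInt(pos + 5, 2);
        var day = ParseInt(pos + 8, 2);
        var hour = ParseInt(pos + 11, 2);
        var min = ParseInt(pos + 14, 2);
        var sec = ParseInt(pos + 17, 2);
        return new DateTime(year, month, day, hour, min, sec);
    }

    private int ParseInt(int pos, int size)
    {
        var val = 0;
        for (var i = pos; i < pos + size; i++)
        {
            val *= 10;
            val += _buffer[i] - '0';
        }
        return val;
    }

    public void Dispose()
    {
        _reader.Dispose();
    }
}
```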
There is a non-trivial amount of stuff going on here. We start by noting that the size of the data in characters is fixed, so we can compute the size of a record very easily: each record is exactly 50 bytes long.
The key part here is that we allocate a single buffer, which holds the characters of the current record. We then wrote our own date and integer parsing routines, which are very trivial, specific to our case, and most importantly, don't require us to allocate additional strings.
Using this code is done with:
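The calling code isn't shown in this extract either. A minimal sketch, assuming the RecordReader above and the FastRecord class from the earlier posts in the series (its DurationInTicks field is my guess at its shape), might be:

```csharp
using System.Collections.Generic;

var stats = new Dictionary<long, FastRecord>();
using (var reader = new RecordReader("data.txt")) // file name is illustrative
{
    while (reader.MoveNext())
    {
        FastRecord record;
        if (stats.TryGetValue(reader.Id, out record) == false)
            stats[reader.Id] = record = new FastRecord { Id = reader.Id };
        record.DurationInTicks += (reader.End - reader.Start).Ticks;
    }
}
```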
So we are back to single-threaded mode. Running this code gives us a runtime of 1.7 seconds, 126 MB allocated, and a peak working set of 35 MB.
We are now about 2.5 times faster than the previous parallel version, and over 17 times faster than the original version.
Making this code parallel is fairly trivial now: divide the file into sections and run a record reader over each one. But is there really much point at this stage?
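Purely for illustration (the post doesn't actually do this), the fixed record size means every section can start at an exact record boundary, so the partitioning could be sketched as:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

var sections = Environment.ProcessorCount;
var totalRecords = new FileInfo("data.txt").Length / 50;
var recordsPerSection = totalRecords / sections;

Parallel.For(0, sections, section =>
{
    using (var stream = File.OpenRead("data.txt"))
    {
        // Fixed-size records mean this lands exactly on a record boundary
        stream.Position = section * recordsPerSection * 50;
        // ... read recordsPerSection records here (the last section also
        // picks up any remainder) into a per-section dictionary, then
        // merge the dictionaries at the end ...
    }
});
```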
More posts in "Making code faster" series:
- (24 Nov 2016) Micro optimizations and parallel work
- (23 Nov 2016) Specialization make it faster still
- (22 Nov 2016) That pesky dictionary
- (21 Nov 2016) Streamlining the output
- (18 Nov 2016) Pulling out the profiler
- (17 Nov 2016) I like my performance unsafely
- (16 Nov 2016) Going down the I/O chute
- (15 Nov 2016) Starting from scratch
- (14 Nov 2016) The obvious costs
- (11 Nov 2016) The interview question
Comments
Why do you think reading the file in parallel is faster than doing it sequentially? After all, you've got only one disk doing the I/O, and it won't become faster from being read in parallel.
@Rafal it will be faster if the disk is an SSD.
I played with this code a couple of days ago and went as far as block-reading the 50 chars at a time myself. I sort of stopped when I noticed that there aren't any good built-in functions to parse dates and numbers from a char[], which feels a bit stupid.
Now it looks like my Turbo Pascal programs from 30 years ago, with blockread and my own parsing.
Niklaus Wirth, the creator of Pascal, formulated a law: "The hope is that the progress in hardware will cure all software ills. However, a critical observer may observe that software manages to outgrow hardware in size and sluggishness."
https://en.wikipedia.org/wiki/Wirth%27s_law
Make a reasonable guess at the number of keys in the dict while newing it, to avoid resizing it or getting too many hash collisions.
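For instance, something like the following (the 500000 figure is just a placeholder, not a number from the post):

```csharp
// A good capacity guess up front avoids repeated internal resizing
var stats = new Dictionary<long, FastRecord>(capacity: 500000);
```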
At this point, why retain the FastRecord class?
If you're single threaded, a Dictionary<long, long> would work fine and save another chunk of allocations.
Chris, Actually, no. A dictionary with a long value would require me to do two dictionary operations each time: one to find the value, and a second one to update it.
However, with the FastRecord, I only need to do a single dictionary search.
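In other words, the difference is roughly the one sketched below (FastRecord's exact shape is assumed):

```csharp
// Dictionary<long, long>: a lookup, then a second operation to write back
long total;
if (stats.TryGetValue(id, out total) == false)
    stats.Add(id, duration);       // second dictionary operation
else
    stats[id] = total + duration;  // second dictionary operation

// Dictionary<long, FastRecord>: one lookup, then mutate the object in place
FastRecord record;
if (fastStats.TryGetValue(id, out record) == false)
    fastStats.Add(id, record = new FastRecord { Id = id });
record.DurationInTicks += duration; // no second dictionary operation
```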
I guess the code would be faster by using member fields for the local variables day, month, etc. in the ParseTime() method, and also for the "i" local variable in the ParseInt() method. This way the variables don't have to be put on the stack on every call. Or am I wrong here?
Urs, Stack allocation is _cheap_, and it means that the code can safely be used from multiple threads, etc. Cheap doesn't mean free, and some testing indicates that this can help a tiny bit, but it means that you now have a larger object and need to worry about threading.
@Oren, of course you're right, and it is very messy and dirty, and I would usually never do things like this, but in this special case I would still favor my solution, if testing shows it makes a noticeable difference. The class isn't thread safe anyway, since you use a StreamReader, so I don't think that's a concern here.
BTW, I think another tiny bit could be squeezed out if you get rid of the for loop in the ParseInt method and instead write two separate methods, ParseInt2 and ParseInt4, which hard-code the two cases you actually have (2 and 4 digits). And while you're at it, replace "val *= 10" with "val = (val << 3) + (val << 1)".
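A sketch of that suggestion, in terms of the hypothetical reader above (ParseInt4 shown; ParseInt2 would be the same idea with two digits):

```csharp
private int ParseInt4(int pos)
{
    var val = _buffer[pos] - '0';
    // (val << 3) + (val << 1) is 8*val + 2*val, i.e. val * 10
    val = (val << 3) + (val << 1) + (_buffer[pos + 1] - '0');
    val = (val << 3) + (val << 1) + (_buffer[pos + 2] - '0');
    val = (val << 3) + (val << 1) + (_buffer[pos + 3] - '0');
    return val;
}
```

Whether the shift trick actually beats what the JIT already emits for a constant multiply by ten is worth measuring rather than assuming.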
When can we expect a book on improving performance on .NET? ;) Especially if it covers DB access (N+1 calls), web requests (HTTP), and some low-level stuff (I/O, allocations, custom parsing, pointers)...
Andre, Those are very different things. Reducing network I/O (as in fewer DB calls) is very different from reducing output size (compressing images and minifying assets), which is very different from low-level optimizations. You pretty much need to be able to cross several such levels for high-performance code, but it wouldn't make much sense to mix those concerns.