Making code faster: I like my performance unsafely
After introducing the problem and doing some very obvious things, then doing some pretty non-obvious things and even writing our own I/O routines, we ended up with an implementation that is 17 times faster than the original one.
And yet we can still do better. But at this point, we need to go native and use a bit of unsafe code. We’ll start by implementing a naïve native record parser, like so:
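A minimal sketch of such a parser might look like the following. It assumes, purely for illustration, that each line is `<id>,<start>,<end>` with fixed-width `yyyy-MM-dd HH:mm:ss` timestamps; the `FastRecord` and `NativeParser` names here are placeholders rather than the post's actual types, and the real line layout is the one defined in the first post.

```csharp
using System;

public struct FastRecord
{
    public long Id;
    public long DurationInTicks;
}

public static unsafe class NativeParser
{
    // Reads 'len' ASCII digits starting at 'pos'.
    private static long ParseInt(byte* pos, int len)
    {
        long value = 0;
        for (int i = 0; i < len; i++)
            value = value * 10 + (pos[i] - (byte)'0');
        return value;
    }

    // Fixed-width "yyyy-MM-dd HH:mm:ss" -> ticks, without allocating a string.
    private static long ParseTimeTicks(byte* pos)
    {
        var dt = new DateTime(
            (int)ParseInt(pos, 4),       // year
            (int)ParseInt(pos + 5, 2),   // month
            (int)ParseInt(pos + 8, 2),   // day
            (int)ParseInt(pos + 11, 2),  // hour
            (int)ParseInt(pos + 14, 2),  // minute
            (int)ParseInt(pos + 17, 2)); // second
        return dt.Ticks;
    }

    // Parses a single "<id>,<start>,<end>" line starting at 'line'.
    public static FastRecord Parse(byte* line)
    {
        long id = 0;
        byte* pos = line;
        while (*pos != (byte)',')        // the id is a variable number of digits
        {
            id = id * 10 + (*pos - (byte)'0');
            pos++;
        }
        pos++;                           // skip the comma after the id
        long start = ParseTimeTicks(pos);
        long end = ParseTimeTicks(pos + 19 + 1); // skip the first timestamp and its comma
        return new FastRecord { Id = id, DurationInTicks = end - start };
    }
}
```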
This is pretty much the same as before, but now we are dealing with pointers. How do we use this?
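Here is a sketch of the processing loop, reusing the hypothetical `NativeParser` above and a plain `Dictionary<long, long>` from id to total ticks:

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;

public static unsafe class FileProcessor
{
    public static Dictionary<long, long> Process(string path)
    {
        var stats = new Dictionary<long, long>(); // id -> total duration in ticks
        long length = new FileInfo(path).Length;

        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var accessor = mmf.CreateViewAccessor())
        {
            byte* buffer = null;
            accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref buffer);
            try
            {
                byte* end = buffer + length;
                byte* current = buffer;
                while (current < end)
                {
                    var record = NativeParser.Parse(current);

                    long total;
                    stats.TryGetValue(record.Id, out total);
                    stats[record.Id] = total + record.DurationInTicks;

                    // move to the start of the next line
                    while (current < end && *current != (byte)'\n')
                        current++;
                    current++;
                }
            }
            finally
            {
                accessor.SafeMemoryMappedViewHandle.ReleasePointer();
            }
        }
        return stats;
    }
}
```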
We memory map the file, and then we go over it, doing no allocations at all throughout.
This gives us 1 second to process the file, 126 MB allocated (probably in the dictionary) and a peak working set of 320 MB.
We are now 30 times faster than the initial implementation, and I wonder if I can do more…? We can do that by going parallel, which gives us the following code:
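Roughly along these lines (again a sketch rather than the exact code, reusing the hypothetical `NativeParser` from above):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Threading.Tasks;

public static unsafe class ParallelFileProcessor
{
    public static Dictionary<long, long> Process(string path, int threads = 4)
    {
        long length = new FileInfo(path).Length;
        var partials = new Dictionary<long, long>[threads];

        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var accessor = mmf.CreateViewAccessor())
        {
            byte* buffer = null;
            accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref buffer);
            try
            {
                // Pass the base address as an IntPtr so the lambdas do not need
                // to capture a pointer-typed local.
                IntPtr baseAddress = (IntPtr)buffer;
                long chunk = length / threads;
                var tasks = new Task[threads];

                for (int i = 0; i < threads; i++)
                {
                    int index = i;
                    tasks[i] = Task.Run(() =>
                    {
                        byte* fileStart = (byte*)baseAddress;
                        byte* fileEnd = fileStart + length;
                        byte* current = fileStart + chunk * index;
                        byte* limit = index == threads - 1 ? fileEnd : current + chunk;

                        // Every thread but the first starts mid-line; skip to the next full
                        // line, since the previous thread runs past its limit to finish it.
                        if (index != 0)
                        {
                            while (current < fileEnd && *current != (byte)'\n') current++;
                            current++;
                        }

                        // Each thread accumulates into its own dictionary, so no locking.
                        var stats = new Dictionary<long, long>();
                        while (current < limit)
                        {
                            var record = NativeParser.Parse(current);
                            long total;
                            stats.TryGetValue(record.Id, out total);
                            stats[record.Id] = total + record.DurationInTicks;

                            while (current < fileEnd && *current != (byte)'\n') current++;
                            current++;
                        }
                        partials[index] = stats;
                    });
                }
                Task.WaitAll(tasks);
            }
            finally
            {
                accessor.SafeMemoryMappedViewHandle.ReleasePointer();
            }
        }

        // Merge the per-thread dictionaries into a single one.
        var merged = new Dictionary<long, long>();
        foreach (var partial in partials)
        {
            foreach (var kvp in partial)
            {
                long total;
                merged.TryGetValue(kvp.Key, out total);
                merged[kvp.Key] = total + kvp.Value;
            }
        }
        return merged;
    }
}
```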
This is pretty ugly, but basically we are using 4 threads to run it, and we are giving each of them a range of the file, as well as its own dedicated records dictionary. After we are done, we need to merge the records into a single dictionary, and that is it.
Using this approach, we can get down to 663 ms run time, 184 MB of allocations and 364 MB peak working set.
So we are now about 45(!) times faster than the original version. We are almost done, but in my next post, I’m going to go ahead and pull out the profiler and see if we can squeeze anything else out of it.
More posts in "Making code faster" series:
- (24 Nov 2016) Micro optimizations and parallel work
- (23 Nov 2016) Specialization make it faster still
- (22 Nov 2016) That pesky dictionary
- (21 Nov 2016) Streamlining the output
- (18 Nov 2016) Pulling out the profiler
- (17 Nov 2016) I like my performance unsafely
- (16 Nov 2016) Going down the I/O chute
- (15 Nov 2016) Starting from scratch
- (14 Nov 2016) The obvious costs
- (11 Nov 2016) The interview question
Comments
The first post in the series states that this is used as an interview question. What sort of solution is reasonable in the time frame given? I think there comes a point where you sacrifice readability and maintainability for speed and performance.
JustPassingBy, Just changing the LINQ statement to a for loop with a dictionary would give more than a 2x performance boost (roughly the shape sketched below). Note that we are now in the < 1 sec range.
In an interview, being able to do the changes in the obvious section would be good enough most of the time, along with being able to discuss the options for doing the other things.
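As a sketch of that change (not the actual interview code; it assumes the same hypothetical `<id>,<start>,<end>` line layout as above):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class Summary
{
    // Instead of a GroupBy/Sum LINQ pipeline over all the lines,
    // accumulate into a dictionary while streaming through the file.
    public static Dictionary<long, TimeSpan> Totals(string path)
    {
        var totals = new Dictionary<long, TimeSpan>();
        foreach (var line in File.ReadLines(path))
        {
            var parts = line.Split(',');
            var id = long.Parse(parts[0]);
            var duration = DateTime.Parse(parts[2]) - DateTime.Parse(parts[1]);

            TimeSpan total;
            totals.TryGetValue(id, out total);
            totals[id] = total + duration;
        }
        return totals;
    }
}
```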
Hello Oren. First, I want to note that your solution in the first post took 44 seconds on my computer, with 4.6 GB of allocations and a peak working set of 1.3 GB. I played with it and came up with a solution which is very similar to the one in your previous post. I just did not use an array of chars, but an array of bytes; I use just ints, not longs, and a Dictionary<int, int> instead of FastRecord. https://gist.github.com/satano/f390189ee4c88dfeadac3fce98eab1c7 It now runs in 2 seconds, with 65 MB of allocations and a peak working set of 39 MB.
I tried your unsafe solution. The allocations and peak working set are similar to yours (which is higher than mine). But I was surprised that, on my computer, it is not faster. Instead it is about 100-200 ms slower than my solution. Do you have any idea why this could be?
Stano, Are you memory starved? It might be related to paging / swapping or something like that
@Oren: I get that it is super performant now, but for me personally that comes at the sacrifice of code readability and maintainability. Nonetheless, this series has taught me a lot - not sure I would have thought about working down to the byte level in your previous post. But taking your ideas and writing a slightly different implementation I managed:-
Code: https://gist.github.com/anonymous/7ec170ae4ee70125b391a674faa03c28
Managed to shave it down to 500ms:-
Code: https://gist.github.com/anonymous/e5607eecf2ff37b8874951827fb57186
Looking forward to the next post.
JustPassingBy,
You have this:
https://gist.github.com/anonymous/e5607eecf2ff37b8874951827fb57186#file-byteitharder-cs-L32
Which is a single array that is concurrently accessed by multiple threads in a thread-unsafe manner.
You are also assuming that the number of records is the max id size, which is false.
@Oren:
Yeah, that's what I thought. But I switched to that implementation 3-4 iterations ago and it has yet to break on me. I must have used that same logic at least a thousand times now and it hasn't misbehaved. Maybe I am just getting lucky?
Not sure where you see this?
JustPassingBy, That really depends on what you mean by fail on you. This shouldn't generate an error, just wrong results.
As for the error in the ids, open the file, copy the first 10 lines and run your code on just those. You'll see it.
@Oren:
Using your solution from the first post in the series as the correct output and comparing the last 20 or so results from my current solution they all match.
Yep. I completely missed that. Thanks for highlighting it.
If, in the context of a practical application, the given input file is processed only once, the run time would be dominated by I/O and running things in parallel would not make much difference. But because we are running the solutions multiple times and the file contents are fully in the page cache, parallel processing would seem to be a good choice to obtain the "best benchmark numbers".
I created a 'no input validation' variant of the solution I posted earlier, under the post introducing the problem, which can be run in a single thread or in parallel. It either uses a 2-level (jagged array) "trie" or a 3-level one for id/duration storage. The 2-level one is a tiny bit faster; the 3-level one uses a bit less memory. I ran the different implementations ("Reference" being the linqy code in the input ZIP archive) against each other.
For the wall-time best case the improvement is a factor of 319, and for memory allocation the best-case improvement is a factor of 5211. Performing a significant amount of validation on the input only adds around 50 ms, or 20%, to the run time, so that would be worth it in my opinion.
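A made-up illustration of the 2-level jagged-array idea (not alex's actual code; the bit split and sizes are placeholders): the high bits of an id select an outer bucket, and the low bits index a lazily allocated inner array of duration totals.

```csharp
// Hypothetical 2-level jagged-array "trie": the high bits of an id choose an outer
// bucket, the low bits index a lazily allocated inner array of duration totals.
public class TwoLevelDurationMap
{
    private const int InnerBits = 16;
    private const int InnerSize = 1 << InnerBits;
    private readonly long[][] _buckets;

    public TwoLevelDurationMap(int maxId)
    {
        _buckets = new long[(maxId >> InnerBits) + 1][];
    }

    public void Add(int id, long durationTicks)
    {
        var bucket = _buckets[id >> InnerBits]
                     ?? (_buckets[id >> InnerBits] = new long[InnerSize]);
        bucket[id & (InnerSize - 1)] += durationTicks;
    }

    public long Get(int id)
    {
        var bucket = _buckets[id >> InnerBits];
        return bucket == null ? 0 : bucket[id & (InnerSize - 1)];
    }
}
```

Only the buckets that actually see ids get allocated, which keeps memory use low when the ids are clustered.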
Updated code for the new options (no validation, 2-level & 3-level tries, parallel). Requires code editing to switch between the two map options.
Well it would be nice to have a comment edit option :D The second "2-Level Trie without Validation" should read "3-Level Trie without Validation".
@alex, I like your approach. It is the kind of code I would write (nasty one :D)
@Federico, thanks. I also tried a few other trie designs for storage, but given the distribution of keys for this specific dataset, 2-3 levels seemed to be optimal.
If you liked this approach, you might want to check out the solution with some awesome parallel-processing bit magic that I proposed for an earlier Etag parsing blog post challenge here and here.