Making code faster: Streamlining the output
After looking at the profiler results, I realized that we are actually spending a considerable amount of time just writing the output to a file. That didn't really matter when our code ran in 30+ seconds; spending another 100 – 200 ms to write the results was just noise. But when our code does the whole thing in under a second, that is a considerable cost.
I’m running this code on a different machine, so we can’t directly compare. The performance of the initial version is:
38,478 ms and allocated 7,612,741 kb with peak working set of 874,660 kb
And the speed of the latest version is:
842 ms and allocated 208,435 kb with peak working set of 375,452 kb
So we are 45 times faster than the initial version.
The problem is that doing this in parallel cuts the time quite a lot and masks some inefficiencies, so I decided to change it back to a single threaded approach. Which gives:
1,498 ms and allocated 123,787 kb with peak working set of 319,436 kb
Merely 25 times faster than the original version.
And now let us focus on the output.
This is pretty simple code, but it hides a lot of inefficiencies; in particular, it does a lot of allocations as it formats the string. We can do much better.
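To give a concrete picture, the naive output code looked something like this (a reconstruction based on the names used below and the TimeSpan discussion in the comments; the stats collection and the exact format are my assumptions, not the actual gist):

using System;
using System.IO;

// Naive summary output: a TimeSpan is materialized per record and
// formatted with composite formatting, allocating temporary strings
// (and boxing the arguments) for every single line.
using (var output = new StreamWriter("summary.txt"))
{
    foreach (var entry in stats)
    {
        var duration = TimeSpan.FromTicks(entry.Value.DurationInTicks);
        output.WriteLine("{0} {1:00}:{2:00}", entry.Value.Id,
            duration.Hours, duration.Minutes);
    }
}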
Merely changing the WriteLine to:
output.WriteLine($"{entry.Value.Id} {entry.Value.DurationInTicks}");
Saved us close to 200 ms (!), so there is a lot of room for improvement here. Again, this is mostly a matter of writing highly specific code to solve this exact scenario. Here is what I did:
I wrote a simple function to format a number into a buffer, changed the summary code to write each line into a prepared buffer (and skip all the static stuff), and then wrote the whole thing to the file in one shot.
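A minimal sketch of the idea (not the exact gist; the buffer management details here are my assumptions):

// Format a non-negative long as ASCII digits directly into a byte
// buffer, returning the new write position. No strings, no encoder,
// no allocations.
static int WriteFormattedInt(byte[] buffer, int pos, long value)
{
    if (value == 0)
    {
        buffer[pos] = (byte)'0';
        return pos + 1;
    }
    int digits = 0;
    for (long v = value; v > 0; v /= 10)
        digits++;
    int end = pos + digits;
    for (int i = end - 1; i >= pos; i--)
    {
        buffer[i] = (byte)('0' + (int)(value % 10));
        value /= 10;
    }
    return end;
}

// An identical copy taking an int, so int arguments do not pay for an
// implicit conversion to long on every call.
static int WriteFormattedInt(byte[] buffer, int pos, int value)
{
    if (value == 0)
    {
        buffer[pos] = (byte)'0';
        return pos + 1;
    }
    int digits = 0;
    for (int v = value; v > 0; v /= 10)
        digits++;
    int end = pos + digits;
    for (int i = end - 1; i >= pos; i--)
    {
        buffer[i] = (byte)('0' + value % 10);
        value /= 10;
    }
    return end;
}

// Writing one summary line: fill the buffer, then a single Write call.
int pos = WriteFormattedInt(buffer, 0, entry.Value.Id);
buffer[pos++] = (byte)' ';
pos = WriteFormattedInt(buffer, pos, entry.Value.DurationInTicks);
buffer[pos++] = (byte)'\r';
buffer[pos++] = (byte)'\n';
fileStream.Write(buffer, 0, pos);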
And the results are:
1,191 ms and allocated 16,942 kb with peak working set of 311,432 kb
You might have noticed that I have two copies of WriteFormattedInt; that is to skip the implicit cast to long, and yes, it matters: about 50 ms in my tests. But this version also reduces our allocations by over 100 MB! So this is great.
And here are the profiler results from analyzing this method:
This function is now almost 7 times faster! That is pretty awesome. Even sticking to single threaded performance, we are looking at a 32-fold improvement over the original version.
Trying the parallel version gives me:
731 ms and allocated 101,565 kb with peak working set of 381,224 kb
And a total improvement of 52 times! But we can do even more… I’ll talk about it in the next post.
More posts in "Making code faster" series:
- (24 Nov 2016) Micro optimizations and parallel work
- (23 Nov 2016) Specialization make it faster still
- (22 Nov 2016) That pesky dictionary
- (21 Nov 2016) Streamlining the output
- (18 Nov 2016) Pulling out the profiler
- (17 Nov 2016) I like my performance unsafely
- (16 Nov 2016) Going down the I/O chute
- (15 Nov 2016) Starting from scratch
- (14 Nov 2016) The obvious costs
- (11 Nov 2016) The interview question
Comments
You should have written it in C from the very beginning :)
Ah, yes. This is the know-how I was lacking. I don't know enough about working with bytes to really start writing my own implementation. I will give this output implementation a go later.
Just out of curiosity, are you looking to squeeze out every last drop of performance? Are time, allocation, and peak working set of equal importance? Or would you sacrifice an increase in allocations for a reduction in time?
Really enjoying this series. Looking forward to the next post.
JustPassingBy, I'm mostly interested in time, to be honest. As a side effect, allocations & peak working set tend to drop as well
@Oren: I see. Also, the same snippet of code (optimized-output-summary.cs) is repeated twice in your post.

I'd like to add that this "Making code faster" series is pretty useless for the average developer working on the usual application. In a database, you want it to be as fast as possible regardless of how much work it is to implement and how hard it is to maintain; in your usual application you should never write unsafe code.
This series would be much more useful if you showed us how to squeeze out that extra bit of performance without dropping to unsafe/unmanaged code. For example, when you said that a lot of time is spent formatting each line, the optimization was to write bytes directly. Instead, you could have shown how to optimize that without unsafe code first, and only then with unsafe but fragile code.
I think there's a bug in your implementation: TimeSpan.Hours isn't the total number of hours, it's modulo 24. If the TimeSpan is e.g. 27 hours, the Hours property will return 3, not 27. You should take the Days property into account too. In a scenario like this, it's very likely that the total number of hours for a given car will exceed 24.
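For illustration, here is how those properties behave for a 27-hour span (a quick sketch):

using System;

var span = TimeSpan.FromHours(27);
Console.WriteLine(span.Hours);      // 3  - only the hours component, modulo 24
Console.WriteLine(span.Days);       // 1  - the day component
Console.WriteLine(span.TotalHours); // 27 - the full duration in hours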
I disagree wholeheartedly Mircea. I already know where I am going to use the knowledge I have gained during this series. While I am not going to write unsafe code and get into the nitty gritty with pointers, I can certainly appreciate the focus on speed and performant code.
And the approach you outline is the exact approach I am taking. I do not want to write unsafe code, but I do want it blazing fast. Both approaches are possible.
For the next step, I'd calculate the necessary length of the file, memory map it, and then write to it directly. Then I'm not incurring context switches every 21 bytes.
Another option is to preallocate the entire byte array in advance, but I wouldn't expect that to be as fast as memory mapping. (In a similar vein, I thought about using a giant StringBuilder, but the output is ASCII, so that would incur a large encoding cost.)
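A minimal sketch of that memory-mapping approach, assuming the total output length can be computed up front (ComputeOutputLength and FormatRecord are hypothetical helpers, not code from the post):

using System.IO;
using System.IO.MemoryMappedFiles;

// Pre-size the file to the exact output length, map it once, and write
// bytes straight into the mapping instead of making many small writes.
long totalLength = ComputeOutputLength(stats); // hypothetical helper
var lineBuffer = new byte[64];
using (var mmf = MemoryMappedFile.CreateFromFile(
    "summary.txt", FileMode.Create, null, totalLength))
using (var accessor = mmf.CreateViewAccessor(0, totalLength))
{
    long pos = 0;
    foreach (var entry in stats)
    {
        int len = FormatRecord(lineBuffer, entry); // hypothetical helper
        accessor.WriteArray(pos, lineBuffer, 0, len);
        pos += len;
    }
}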
Oren, how do you actually measure allocated kb & peak working set? By looking at Process Monitor output?
Down to 347 ms now, thanks to this post. Slightly different implementation of WriteOutput. I did play around with BitConverter.GetBytes() but I think I am missing something in my knowledge because it did not return what I had expected. Targeting different platforms changes the time taken by ~50ms too :-)
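For what it's worth, BitConverter.GetBytes returns the raw in-memory representation of the value, not its decimal digits, which is probably the surprise here. A quick sketch of the difference:

using System;
using System.Text;

int value = 42;
byte[] raw = BitConverter.GetBytes(value);
// raw is { 42, 0, 0, 0 } on a little-endian machine: the int's four
// bytes as stored in memory, not printable characters.
byte[] ascii = Encoding.ASCII.GetBytes(value.ToString());
// ascii is { 52, 50 }: the ASCII codes for the characters '4' and '2'.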
@Urs: It was in the first post in the series. Bottom of this gist.
@Mircea >> "pretty useless for the average developer working on the usual application". I agree - that is why they remain "average" doing "usual" stuff.
Mircea, In most applications, you have a few hot spots that are critical for performance. Having the tools in place to handle that is quite important. In one particular example, I saw a painful hotspot in the generation of a tree view with a lot of elements, and fixing that made the entire application MUCH faster as far as the user was concerned, and that was worth whatever we did to make it happen.
Also, remember that just understanding what is going on is useful when you need to optimize. Admittedly, not many people care about micro performance the way we do, but when you have a hot spot, knowing what it is actually doing and having the tools to resolve it is important.
Urs, Look at the actual full code sample; it shows how to do that via the API.
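For reference, numbers in the format used throughout this series can be gathered with the standard APIs, roughly like this (my sketch, not necessarily the exact code from that gist; RunBenchmark is a hypothetical entry point):

using System;
using System.Diagnostics;

// Enable per-AppDomain allocation tracking before doing any work.
AppDomain.MonitoringIsEnabled = true;

var sw = Stopwatch.StartNew();
RunBenchmark(); // hypothetical: the code being measured
sw.Stop();

long allocatedKb = AppDomain.CurrentDomain.MonitoringTotalAllocatedMemorySize / 1024;
long peakKb = Process.GetCurrentProcess().PeakWorkingSet64 / 1024;
Console.WriteLine($"{sw.ElapsedMilliseconds:#,#} ms and allocated {allocatedKb:#,#} kb with peak working set of {peakKb:#,#} kb");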
Peter, Very good point regarding avg. dev.
Mircea Chirea: if you want to argue that "unsafe" code is fragile, you are assuming that a whole world of code written out there (banking solutions, for example) is just safe code... Sorry to tell you, this type of code separates the boys from the men in real world applications and development.
JustPassingBy, peter,
I am not saying that this series or this post in particular are bad; on the contrary, they are quite helpful and very interesting - quite eye-opening to learn how you can squeeze every last drop of performance out of a simple task (a very common one I might add, parsing some text in a strict format). I am saying that this level of optimization is not applicable in most cases, except for the few hot spots in an application - just like Oren said in his reply to me. However, there are many more places in a typical application that could benefit from optimizations without dropping all the way down to unsafe code.
The next post, about replacing a dictionary with an array because the format of the keys allows it, is exactly the kind of post I wish this series had more of :)
Tuschinski,
I don't suggest that just because unsafe code is fragile it is also bad or broken. I mean it should be the last resort in some specific situations when writing C#. Banking solutions written in C++ (for example) aren't necessarily broken because they aren't written in C# or, heck, even JavaScript - there are other considerations there. For most applications written in C# there will be a few parts with unsafe code (the hot spots), where it will absolutely be worth the effort regardless of the fact that the code will be harder to maintain; however, other parts could still benefit from optimizations - see what I said above and Oren's next post.