Making code faster: The obvious costs
In my previous post, I presented a small code sample and asked how we can improve its performance. Note that this code sample has been quite maliciously designed to be:
- Very small.
- Clear in what it is doing.
- The most obvious way to do it.
- Highly inefficient.
- Misleading, nudging people toward non-optimal optimization paths.
In other words, if you don’t understand what is going on, you won’t be able to get the best out of it. And even if you do, it is likely that you’ll go for a “minimal change to the code” approach that isn’t going to do as much for performance.
Let us look at the code again:
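(A reconstruction of the general shape of the code; the exact file layout, date format, and output details here are illustrative assumptions.)

```csharp
using System;
using System.IO;
using System.Linq;

public class Record
{
    private readonly string _line;

    public Record(string line)
    {
        _line = line;
    }

    // Note: every property access re-splits the whole line.
    public long Id => long.Parse(_line.Split(' ')[2]);
    public DateTime Start => DateTime.Parse(_line.Split(' ')[0]);
    public DateTime End => DateTime.Parse(_line.Split(' ')[1]);
}

public static class Program
{
    public static void Main(string[] args)
    {
        var lines = File.ReadAllLines(args[0]);

        var durations = lines
            .Select(line => new Record(line))
            .GroupBy(record => record.Id)
            .Select(g => new
            {
                Id = g.Key,
                Duration = TimeSpan.FromTicks(g.Sum(r => (r.End - r.Start).Ticks))
            });

        using (var output = File.CreateText("summary.txt"))
        {
            foreach (var entry in durations)
                output.WriteLine($"{entry.Id} {entry.Duration.TotalSeconds}");
        }
    }
}
```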
The most obvious optimization is that we are calling _line.Split() multiple times inside the Record class. Let us fix that:
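Something along these lines (a sketch of the change; the field order and parsing format are carried over from the reconstruction above):

```csharp
// Split and parse the line once, in the constructor, and keep only
// the three parsed values instead of the raw line.
public class Record
{
    public long Id { get; }
    public DateTime Start { get; }
    public DateTime End { get; }

    public Record(string line)
    {
        var parts = line.Split(' ');
        Start = DateTime.Parse(parts[0]);
        End = DateTime.Parse(parts[1]);
        Id = long.Parse(parts[2]);
    }
}
```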
This trivial change reduced the runtime by about 5 seconds and saved us 4.2 GB of allocations. The peak working set increased by about 100 MB, which I assume is because the Record class moved from having a single 8-byte field to having three 8-byte fields.
The next change is also pretty trivial: let us drop File.ReadAllLines() in favor of File.ReadLines(), as shown below. This, surprisingly enough, had very little impact on the runtime.
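The change is just the call that produces the lines (sketch):

```csharp
// Stream the lines lazily instead of materializing the whole file
// as a string[] up front.
var lines = File.ReadLines(args[0]);   // was: File.ReadAllLines(args[0])
```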
However, the allocations dropped by 100 MB, and the working set dropped to 280 MB, very close to the size of the file itself.
This is because we no longer read the entire file into an array and hold on to that array for the duration of the program. Instead, the lines can be garbage collected very efficiently as we go.
That concludes the obvious stuff, and we managed to gain a whole 5 seconds of performance improvement here. However, we can do better, and it is sort of obvious, so I’ll put it in this post.
As written, this code is single threaded. And while we are reading from a file, we are still pretty much CPU bound, so why not use all the cores we have?
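A sketch of what that looks like, assuming the same pipeline as above:

```csharp
// The per-line parsing and grouping now run through PLINQ.
var durations = File.ReadLines(args[0])
    .AsParallel()
    .Select(line => new Record(line))
    .GroupBy(record => record.Id)
    .Select(g => new
    {
        Id = g.Key,
        Duration = TimeSpan.FromTicks(g.Sum(r => (r.End - r.Start).Ticks))
    });
```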
As you can see, all we had to do was add AsParallel(), and PLINQ (built on top of the TPL) takes care of the rest for us.
This gives us a runtime of 9 seconds. Allocations are a bit higher (3.45 GB, up from 3.3 GB), but the peak working set exceeded 1.1 GB, which makes a lot of sense.
We are now standing at roughly a third of the initial runtime, which is excellent. But can we do more? We’ll cover that in the next post.
More posts in "Making code faster" series:
- (24 Nov 2016) Micro optimizations and parallel work
- (23 Nov 2016) Specialization make it faster still
- (22 Nov 2016) That pesky dictionary
- (21 Nov 2016) Streamlining the output
- (18 Nov 2016) Pulling out the profiler
- (17 Nov 2016) I like my performance unsafely
- (16 Nov 2016) Going down the I/O chute
- (15 Nov 2016) Starting from scratch
- (14 Nov 2016) The obvious costs
- (11 Nov 2016) The interview question
Comments
@Arseny Kapoulkine, from the previous post, got 135 ms with a C++ implementation, so when you need micro-optimizations and performance, there is no way around going deep and close to the hardware.
The main lesson I learned from it is that there are no free meals. Yes, .NET is awesome and my favorite; it's very easy and fast to write code, but you pay for it with performance and allocations.
I would expect the .NET framework to be more efficient and much closer to C++ performance, but that is for another post.
Again, thanks for the great learning and sharing.
Fun little challenge; this is what I came up with based on your previous post:
12 seconds with 74 MB.
I'm not too familiar with the insides of the TPL stuff. On the face of it, this is reading lines from the file, then each line is being processed by a different CPU core (slight simplification, I know). Given that you need to group the data by ID, how does .AsParallel() help here? If you were calling AsParallel() on the grouped output, so the calc of duration and sum of durations is parallelised, I'd understand the value of it.
@Neil, the grouping is done by using the (concurrent)dictionary.
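Roughly along these lines (an illustrative sketch only, reusing the Record class from the post; not actual code from any of the solutions here):

```csharp
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// Sketch: 'path' and the Record class are assumed from the code in the post.
var totals = new ConcurrentDictionary<long, long>();

Parallel.ForEach(File.ReadLines(path), line =>
{
    var record = new Record(line);
    var ticks = (record.End - record.Start).Ticks;
    // Fold each line's duration into the shared per-id total.
    totals.AddOrUpdate(record.Id, ticks, (_, existing) => existing + ticks);
});
```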
// Ryan
@Uri that's why super intense optimizations in C# look unnatural - C# is for relaxed programmers who don't care if it's 10 sec or 1 sec :)
@uri Managing allocations is key for high-performance code in whatever language you are working in. The bad practice of not controlling allocations, which is prevalent in .NET developer culture, is the culprit, not the platform. You can control the machine code that gets emitted by the JIT in more or less the same fashion you do with the C++ compiler, as long as you are willing to deal with nasty code (a normal thing in C/C++). Achieving more than decent performance out of the 1% of the code that matters, and fast development for the rest, is not a tradeoff I would dismiss lightly.
Using the producer/consumer pattern: Took: 6.792 ms and allocated 3.128.383 kb with peak working set of 99.124 kb. Of course we can't really compare performance directly, as it depends on hardware.
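For reference, the general shape of such a producer/consumer split might look like this (an illustrative sketch only, not the actual implementation; the real code is linked further down in the thread):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

// Sketch: 'path' and the Record class are assumed from the code in the post.
// A bounded queue decouples reading the file from parsing/aggregating it.
var queue = new BlockingCollection<string>(boundedCapacity: 10000);

var producer = Task.Run(() =>
{
    foreach (var line in File.ReadLines(path))
        queue.Add(line);
    queue.CompleteAdding();
});

var totals = new Dictionary<long, long>();
var consumer = Task.Run(() =>
{
    foreach (var line in queue.GetConsumingEnumerable())
    {
        var record = new Record(line);
        totals.TryGetValue(record.Id, out var existing);
        totals[record.Id] = existing + (record.End - record.Start).Ticks;
    }
});

Task.WaitAll(producer, consumer);
```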
@ren But that is not really the case. Well-optimised C++ code might always be a bit faster than well-optimised C# code, but that well-optimised C# code will completely blow away not-so-optimised C++ code. And the C# version will be quite a bit smaller, meaning it is easier to experiment with. That's the beauty of C# compared with C++.
So you guys think C# can't be fast? OK, then. Yesterday I got about 440 ms. After studying @Arseny's and @alex's solutions from yesterday's blog entry I made another attempt.
I'm using 8 threads now. The main thread loads the data chunkwise, and then the 7 remaining threads crunch the data while the main thread loads the next chunk. At the end, all the data from the different threads is added up and written to the output. I also switched from storing ticks to seconds (like in @alex's solution), and I set the process priority to realtime in order to get more consistent results.
And here is the result: Took: 157 ms and allocated 67.549 kb with peak working set of 80.660 kb
Who says C# can't be as fast as C++? ;-) Although, to be fair, @Arseny's solution would surely still run faster on my computer, since I have a 4C/8T machine and he used a 2C/4T.
And by the way: Absolutely no unsafe code.....
Source code: http://pastebin.com/PpJkB9gT
@Michael This is really impressive! But how much time did it take you to write? I stuck to the 1h interview time... This is beautiful as a demo of how performance can be achieved (it took ~600 ms on my laptop), but I think maintaining this code would be nearly impossible: only a few devs (and I don't count myself among them) could understand every bit of it, and by reimplementing everything (even the DateTime subtraction) I think you open the door to some subtle bugs (for example, I know it was not requested, but what if we had to handle a daylight saving time change between the in and out hours?).
The original problem stated "how much time a car spent in the lot based on this file." If you know that car id ahead of time you could skip the DateTime parsing for all the other lines.
Fabrice, Can you show the sample code?
@Oren: original code : http://pastebin.com/cv9XuJ13
I then worked a bit less than an hour more on this, using some code from @Michael (for parsing the date and id) but keeping it quite simple, still with the producer/consumer pattern and changing how it reads the file:
Took: 4.043 ms and allocated 594.374 kb with peak working set of 170.656 kb --> http://pastebin.com/CaVzpBXB
@Fabrice: The version above was my 20th iteration, and it took me three evenings to get there. After one hour (my 5th iteration) I was at 1.300 ms and about 70 MB peak memory. Take a look at the comments section of the first blog post; there is a short explanation of what I did. The (first) version I posted there uses DateTime and is still reasonably maintainable.
After seeing that Arseny got sub-200 ms with C++, I wondered how fast I could get, and I had some spare time, so I started experimenting. I tried out a lot of things that I hadn't done before or hadn't done in some time (like System.Numerics.Vector or struct unions). For me it was a fun challenge and a good opportunity to revisit some things and try out new ones.
Regarding maintainability: Most of the time you need maintainable and expressive. But sometimes you just need AFAP (as fast as possible). :-)
@Michael - It is inspiring to see the level of effort you put into your solution for the simple pleasure of a mental exercise and sharing knowledge! And thanks to @Oren for providing the platform and the fodder!
Given that, I hesitate to report that sDaysPerMonthLeapYear has the same data as sDaysPerMonth.
Ed
@Ed: Thanks for pointing that out! The second number in sDaysPerMonthLeapYear should of course read 29. My Bad!
I have tried out the above code on my laptop with a 2.25 GB file of data with the same schema, but I am getting "Exception of type 'System.OutOfMemoryException' was thrown".
Vinod, yes, that would do it. It can handle up to a bit less than 2 GB as currently written.