Making code faster: Starting from scratch

Nov 15 2016

Making code fasterStarting from scratch

time to read 2 min | 285 words

After introducing the problem and doing some very obvious things, we have managed to get to 9 seconds instead of 30. That is pretty awesome, but we can do better.

Let us see what would happen if we will write it from scratch, sans Linq.

The code is still pretty small and idiomatic, but not using Linq gave us some interesting numbers. 10.4 seconds to run (so comparable to the parallel Linq), but we also allocated 2.9 GB (down from 3.49 GB) and our peek working set didn’t exceed 30 MB.

Taking the next step and paralleling this approach:

We now have 8 seconds, 3.49 GB of allocations and peak working set of 50 MB. That is good, but we can do better.

Now, instead of using a dictionary of long to long, we’re using a dedicated class, and the key is the string representation of the number. Most of the time, it should save us the need to parse the long. It also means that the number of dictionary operations we need to do is reduced.

This dropped the runtime to 10.2 seconds (compared to 10.4 seconds for the previous single threaded impl). That is good, but this is just the first stage, what I really want to do is save on all those expensive dictionary calls when running in parallel.

Here is the parallel version:

And that one runs at 4.1 seconds, allocates 3 GB and has a peek working set of 48 MB.

We are now close to 8 times faster than the initial version. But we can probably still do better. I’ll go over that in my next post.

Tweet Share Share 10 comments

Tags:

More posts in "Making code faster" series:

(24 Nov 2016) Micro optimizations and parallel work
(23 Nov 2016) Specialization make it faster still
(22 Nov 2016) That pesky dictionary
(21 Nov 2016) Streamlining the output
(18 Nov 2016) Pulling out the profiler
(17 Nov 2016) I like my performance unsafely
(16 Nov 2016) Going down the I/O chute
(15 Nov 2016) Starting from scratch
(14 Nov 2016) The obvious costs
(11 Nov 2016) The interview question

Comments

15 Nov 2016
12:38 PM

Victor

I wonder what improvements (if any) you'd see by making FastRecord a struct.

15 Nov 2016
13:54 PM

JustPassingBy

Wow, had no idea about Interlocked - thanks. I guess a tiny fraction could be saved by not parsing the 'id' to 'long'.

15 Nov 2016
14:32 PM

Chris B

It looks to me that you are getting ready to hit the point where you trade readability for performance. I'm interested to see how far you are going to go with it and how far would an interviewee have to go to be considered successful.

15 Nov 2016
15:44 PM

Itay Sagui

What about using dedicate Dictionary for each thread (thus avoiding all threads contests), and then merge those dictionaries together?

15 Nov 2016
17:38 PM

Rafal

reading a file will take the majority of time, parsing the strings isn't significant compared to io - why do you think running this in parallel would be the way to go?
just read line by line and do the parsing immediately and you shold get more or less same speed without allocating so much memory.

16 Nov 2016
00:00 AM

JustPassingBy

Just playing with this now, feels like the class is a bit overkill. Replaced it with long[1] to see what would happen.

var stats = new ConcurrentDictionary<string, long[]>();

Parallel.ForEach(File.ReadLines(InputPath), line =>
{
	var parts = line.Split(' ');
	var entry = DateTime.Parse(parts[0]);
	var exit = DateTime.Parse(parts[1]);
	var duration = (exit - entry).Ticks;
	var id = parts[2];
	var value = stats.GetOrAdd(id, s => new[] {0L});
	value[0] += duration;
});

var summary = stats.Select(entry => $"{entry.Key} {TimeSpan.FromTicks(entry.Value[0]):c}");
File.AppendAllLines(outputPath, summary);

Took: 3,791 ms
Allocated: 3,006,652 kb
Peak Working Set: 39,604 kb

16 Nov 2016
13:31 PM

BudGoode

And with each successive iteration, as performance increases, readability decreases.

16 Nov 2016
19:04 PM

Urs

@JustPassingBy: Since you don't lock on our add (line "value[0]+=duration;") e.g. by using Interlocked.Add, the results of your code will be inaccurate.

16 Nov 2016
20:57 PM

Vladimir Kocjančič

A lot depends on your hardware as well. On my INTEL Core i5 750 2.66G 8MB BOX LGA1156 home cpu and classic HDD it produced this result:
Took: 21.794 ms and allocated 2.976.442 kb with peak working set of 35.096 kb

Only after applying parallel solution, I managed to drop it down significantly. To:
Took: 8.865 ms and allocated 3.030.442 kb with peak working set of 43.132 kb

16 Nov 2016
21:51 PM

JustPassingBy

@Urs: My results were consistent with previous solutions.

Comment preview

Comments have been closed on this topic.

Oren Eini

Oren Eini

CEO of RavenDB