Ayende @ Rahien

Hi! My name is Oren Eini, founder of Hibernating Rhinos LTD and RavenDB.

Making code faster: That pesky dictionary

time to read 2 min | 343 words

It is a fact of life: if you are looking at high performance code, a dictionary is probably going to be one of the painful spots. Let us see how this looks in our current program:

[Profiler screenshot: Dictionary operations taking almost 50% of the run time]

Are you kidding me?! Here I am worrying about every little bit and byte to try to get the most performance out of the system, and Dictionary is eating almost 50% of my performance?

Let us look more deeply into this:

[Profiler screenshot: a closer look at the dictionary costs]

So that is about 3 μs, which is pretty fast. But it isn’t fast enough.

Now, we know that there are about 200,000 unique values in the dictionary (look at the Insert calls in the profiler output), and we have some knowledge about the problem space.

The id that we use has 8 characters, so at most, we can have a hundred million ids. An array of that size would be roughly 762 MB in size, so that is doable. But we can also be fairly certain that the ids are generated in some sequential manner, so there is a strong likelihood that we don’t need all of this space.

So I wrote the following function:
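The function itself is not preserved in this archive, so here is a minimal sketch of the idea, with illustrative names: instead of a dictionary, keep a plain array of longs indexed directly by the numeric id, and grow it on demand.

// A minimal sketch (illustrative names, not the post's exact code).
private long[] _stats = new long[256];

private void AddDuration(long id, long durationInTicks)
{
    // Ids are handed out more or less sequentially, so the array stays
    // close in size to the highest id actually seen.
    while (id >= _stats.Length)
        Array.Resize(ref _stats, _stats.Length * 2);

    _stats[id] += durationInTicks;
}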

I changed the stats to start with an array of 256 longs and ran it; the results are nice.

587 ms and allocated 2,576 kb with peak working set of 296,484 kb

And our costs look like:

[Profiler screenshot: the updated costs]

This is single threaded, of course, and it is faster than all the previous multi-threaded versions.

Making code faster: Streamlining the output

time to read 3 min | 456 words

After looking at the profiler results, I realized that we are actually spending a considerable amount of time just writing the output to a file. That didn’t really matter when our code ran in 30+ seconds; spending another 100 – 200 ms to write the results was just noise. But when our code does that in under a second, it is a considerable cost.

I’m running this code on a different machine, so we can’t directly compare. The performance of the initial version is:

38,478 ms and allocated 7,612,741 kb with peak working set of 874,660 kb

And the speed of the latest version is:

842 ms and allocated 208,435 kb with peak working set of 375,452 kb

So we are 45 times faster than the initial version.

The problem is that running in parallel masks some inefficiencies, so I decided to change it back to a single threaded approach, which gives:

1,498 ms and allocated 123,787 kb with peak working set of 319,436 kb

Merely 25 times faster than the original version.

And now let us focus on the output.

This is pretty simple code, but it hides a lot of inefficiencies; in particular, it does a lot of allocations as it formats the string. We can do much better.

Merely changing the WriteLine to:

output.WriteLine($"{entry.Value.Id} {entry.Value.DurationInTicks}");

That saved us close to 200 ms (!), so there is a lot of room to improve here. Again, this is mostly a matter of writing highly specific code for this exact scenario. Here is what I did:

I wrote a simple function to format the number into a buffer, then changed the summary line to write a single line into a prepared buffer (skipping all the static stuff), and write that to the file in one shot.
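The formatting routine itself is not preserved here; a sketch of what it can look like, assuming non negative values (which holds for the ids and tick counts in this data):

// A sketch of a buffer based formatter (my reconstruction; the post has
// a second overload for int, to avoid the implicit cast to long).
private static int WriteFormattedInt(char[] buffer, int pos, long value)
{
    // Count the digits so we know where the number ends.
    int digits = 1;
    for (long v = value; v >= 10; v /= 10)
        digits++;

    // Fill in the digits from the least significant end backwards.
    for (int i = pos + digits - 1; i >= pos; i--)
    {
        buffer[i] = (char)('0' + (value % 10));
        value /= 10;
    }
    return pos + digits; // the next free position in the buffer
}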

And the results are:

1,191 ms and allocated 16,942 kb with peak working set of 311,432 kb

You might have noticed that I have two copies of WriteFormattedInt; this is to skip the implicit cast to long, and yes, it matters, by about 50 ms in my tests. But this version also reduces our allocations by over 100 MB! So this is great.

And here are the profiler results from analyzing this method:

[Profiler screenshot: costs of the new output code]

This function is now almost 7 times faster! That is pretty awesome, and even talking about single threaded performance, we are looking at 32 times better than the original version.

Trying the parallel version gives me:

731 ms and allocated 101,565 kb with peak working set of 381,224 kb

And a total improvement of 52 times! But we can do even more… I’ll talk about it in the next post.

Making code faster: Pulling out the profiler

time to read 2 min | 377 words

After doing all I could without reaching for the profiler, and managing to get a 45x performance gain, let us see what the profiler actually tells us. We’ll use the single threaded version, since that is easier.

Here it is:

[Profiler screenshot: overall costs of the single threaded version]

We can see that dictionary operations take a lot of time, which is to be expected. But what is very surprising is that the date time calls are extremely expensive in this case.

The relevant code for those is here. You can see that it is pretty nice, but there are a bunch of things in it that are likely costing us. The exception inside the method prevents inlining, and there is error handling that we don’t need, since we can safely assume in this exercise that the data is valid, etc.

So I changed the ParseTime to do this directly, like so:
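The changed method is not preserved here; a sketch of what direct parsing looks like, assuming the timestamps use a fixed "yyyy-MM-dd HH:mm:ss" layout (an assumption that fits the 50 byte record size discussed elsewhere in this series):

// A sketch of validation free, fixed offset parsing (my reconstruction).
private static DateTime ParseTime(char[] buffer, int pos)
{
    int year   = (buffer[pos] - '0') * 1000 + (buffer[pos + 1] - '0') * 100
               + (buffer[pos + 2] - '0') * 10 + (buffer[pos + 3] - '0');
    int month  = (buffer[pos + 5] - '0') * 10 + (buffer[pos + 6] - '0');
    int day    = (buffer[pos + 8] - '0') * 10 + (buffer[pos + 9] - '0');
    int hour   = (buffer[pos + 11] - '0') * 10 + (buffer[pos + 12] - '0');
    int minute = (buffer[pos + 14] - '0') * 10 + (buffer[pos + 15] - '0');
    int second = (buffer[pos + 17] - '0') * 10 + (buffer[pos + 18] - '0');
    return new DateTime(year, month, day, hour, minute, second);
}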

And that saved us 11%, just from this tiny change.

Here are our current costs:

[Profiler screenshot: costs after the ParseTime change]

Note that we reduced the cost of the parse significantly (at the cost of error handling, though), but there is still a lot of work being done here. It turns out that we were actually measuring the time to write to the summary file as well (that is what all those FormatHelpers calls are), so that dirties the results somewhat, but never mind.

The next place that we need to look at is the Dictionary. It is expensive, and even though the usage of FastRecord means that we only need a single call per line, that isn’t much fun. Note that it is using the GenericEqualityComparer; can we do better?

Trying to create my own equality comparer for longs doesn't really help:
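Something along these lines (a sketch of the kind of comparer tried, my reconstruction):

// A trivial comparer for long keys, passed to the dictionary constructor:
// new Dictionary<long, FastRecord>(new LongEqualityComparer())
private sealed class LongEqualityComparer : IEqualityComparer<long>
{
    public bool Equals(long x, long y) => x == y;
    public int GetHashCode(long value) => (int)(value ^ (value >> 32));
}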

[Profiler screenshot: results with the custom comparer]

So we’ll go back to the parallel version with the ParseTime optimization, and we are now running at 628 ms. At this rate, I don’t think there is a lot more room for improvement, so unless someone suggests something, we are done.

Making code faster: I like my performance unsafely

time to read 2 min | 293 words

After introducing the problem and doing some very obvious things, then doing some pretty non-obvious things and even writing our own I/O routines, we ended up with an implementation that is 17 times faster than the original one.

And yet we can still do better. But at this point, we need to go native and use a bit of unsafe code. We’ll start by implementing a naïve native record parser, like so:
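The parser is not preserved in this archive; a sketch of what such a naive pointer based parser can look like, assuming the fixed 50 byte record layout (two "yyyy-MM-dd HH:mm:ss" timestamps and an 8 character id, space separated, with a Windows line ending):

private static unsafe long ParseLong(byte* p, int len)
{
    long value = 0;
    for (int i = 0; i < len; i++)
        value = value * 10 + (p[i] - (byte)'0');
    return value;
}

private static unsafe DateTime ParseDateTime(byte* p)
{
    return new DateTime(
        (int)ParseLong(p, 4),        // year
        (int)ParseLong(p + 5, 2),    // month
        (int)ParseLong(p + 8, 2),    // day
        (int)ParseLong(p + 11, 2),   // hour
        (int)ParseLong(p + 14, 2),   // minute
        (int)ParseLong(p + 17, 2));  // second
}

private static unsafe void ParseRecord(byte* p, out long id, out long durationInTicks)
{
    var entry = ParseDateTime(p);      // bytes 0..18
    var exit = ParseDateTime(p + 20);  // bytes 20..38
    id = ParseLong(p + 40, 8);         // bytes 40..47
    durationInTicks = exit.Ticks - entry.Ticks;
}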

This is pretty much the same as before, but now we are dealing with pointers. How do we use this?

We memory map the file, and then we go over it, doing no allocations at all throughout.
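A sketch of the driving loop (illustrative; the file name and structure are assumed, and this runs inside an unsafe method):

// using System.IO.MemoryMappedFiles;
using (var mmf = MemoryMappedFile.CreateFromFile("parking.txt", FileMode.Open))
using (var accessor = mmf.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read))
{
    byte* start = null;
    accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref start);
    try
    {
        // The view capacity is page rounded, so use the real file length.
        long fileLength = new FileInfo("parking.txt").Length;
        var stats = new Dictionary<long, long>();
        for (byte* p = start; p < start + fileLength; p += 50) // 50 bytes per record
        {
            ParseRecord(p, out long id, out long ticks);
            stats.TryGetValue(id, out long total);
            stats[id] = total + ticks;
        }
    }
    finally
    {
        accessor.SafeMemoryMappedViewHandle.ReleasePointer();
    }
}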

This gives us 1 second to process the file, 126 MB allocated (probably in the dictionary) and a peak working set of 320 MB.

We are now 30 times faster than the initial implementation, and I wonder if I can do more. We can, by going parallel, which gives us the following code:

This is pretty ugly, but basically we are using 4 threads to run it, and we are giving each one of them a range of the file, as well as their own dedicated records dictionary. After we are done, we need to merge the records to a single dictionary, and that is it.
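A sketch of that split (my reconstruction, building on the pieces above):

// using System.Threading.Tasks;
private static unsafe Dictionary<long, long> ProcessRange(IntPtr slice, long length)
{
    var stats = new Dictionary<long, long>();
    byte* end = (byte*)slice + length;
    for (byte* p = (byte*)slice; p < end; p += 50)
    {
        ParseRecord(p, out long id, out long ticks);
        stats.TryGetValue(id, out long total);
        stats[id] = total + ticks;
    }
    return stats;
}

// In the driver, instead of a single loop:
var tasks = new Task<Dictionary<long, long>>[4];
long chunk = fileLength / tasks.Length / 50 * 50; // keep the slices record aligned
for (int i = 0; i < tasks.Length; i++)
{
    long offset = chunk * i;
    long length = i == tasks.Length - 1 ? fileLength - offset : chunk;
    var slice = (IntPtr)(start + offset); // lambdas cannot capture raw pointers
    tasks[i] = Task.Run(() => ProcessRange(slice, length));
}
Task.WaitAll(tasks);

// Merge the per thread dictionaries into one.
var merged = new Dictionary<long, long>();
foreach (var task in tasks)
{
    foreach (var kvp in task.Result)
    {
        merged.TryGetValue(kvp.Key, out long total);
        merged[kvp.Key] = total + kvp.Value;
    }
}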

Using this approach, we can get down to 663 ms run time, 184 MB of allocations and 364 MB peak working set.

So we are now about 45(!) times faster than the original version. We are almost done; in my next post, I’m going to pull out the profiler and see if we can squeeze anything else out of it.

Making code faster: Going down the I/O chute

time to read 2 min | 308 words

After introducing the problem and doing some very obvious things, and then doing some pretty non-obvious things, we have managed to get to 1/8 of the initial time of the original implementation.

But we can do better still. So far, we relied heavily on the File.ReadLines method, which handles quite a lot of the parsing complexity for us. However, it still allocates a string per line, and our parsing relied on splitting those strings again, meaning more allocations.

We can take advantage of our knowledge of the file to do better. The code size blows up, but it is mostly very simple. We create a dedicated record reader class, which will read each line from the file, with a minimum of allocations.

There is a non-trivial amount of stuff going on here. We start by noting that the size in characters of the data is fixed, so we can compute the size of a record very easily. Each record is exactly 50 bytes long.

The key part here is that we allocate a single buffer, which holds the line characters. Then we wrote our own date and integer parsing routines that are very trivial, specific to our case and, most importantly, don’t require us to allocate additional strings.
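The class itself is not preserved in this archive; a sketch of what it can look like (my reconstruction, reading raw bytes, since the data is ASCII):

private sealed class RecordReader : IDisposable
{
    private const int RecordSize = 50;
    private readonly FileStream _stream;
    private readonly byte[] _buffer = new byte[RecordSize]; // reused for every record

    public RecordReader(string path)
    {
        _stream = new FileStream(path, FileMode.Open, FileAccess.Read,
            FileShare.Read, 32 * 1024);
    }

    public bool MoveNext(out long id, out long durationInTicks)
    {
        if (_stream.Read(_buffer, 0, RecordSize) < RecordSize)
        {
            id = durationInTicks = 0;
            return false; // end of file
        }
        var entry = ParseDateTime(0);   // bytes 0..18
        var exit = ParseDateTime(20);   // bytes 20..38
        id = ParseLong(40, 8);          // bytes 40..47
        durationInTicks = (exit - entry).Ticks;
        return true;
    }

    private long ParseLong(int pos, int len)
    {
        long value = 0;
        for (int i = 0; i < len; i++)
            value = value * 10 + (_buffer[pos + i] - (byte)'0');
        return value;
    }

    private DateTime ParseDateTime(int pos)
    {
        // Assumes the fixed "yyyy-MM-dd HH:mm:ss" layout; no validation at all.
        return new DateTime(
            (int)ParseLong(pos, 4), (int)ParseLong(pos + 5, 2), (int)ParseLong(pos + 8, 2),
            (int)ParseLong(pos + 11, 2), (int)ParseLong(pos + 14, 2), (int)ParseLong(pos + 17, 2));
    }

    public void Dispose() => _stream.Dispose();
}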

Using this code is done with:
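The original snippet is not preserved here; a sketch using the reader above, with assumed names:

var stats = new Dictionary<long, long>();
using (var reader = new RecordReader("parking.txt"))
{
    while (reader.MoveNext(out long id, out long ticks))
    {
        stats.TryGetValue(id, out long total);
        stats[id] = total + ticks;
    }
}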

So we are back to single threaded mode. Running this code gives us a runtime of 1.7 seconds, 126 MB allocated and a peak working set of 35 MB.

We are now about 2.5 times faster than the previous parallel version, and over 17 times faster than the original version.

Making this code parallel is fairly trivial now: divide the file into sections and have a record reader work on each section. But is there really much point at this stage?

Making code faster: Starting from scratch

time to read 2 min | 285 words

After introducing the problem and doing some very obvious things, we have managed to get to 9 seconds instead of 30. That is pretty awesome, but we can do better.

Let us see what happens if we write it from scratch, sans Linq.
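The code is not preserved in this archive; a sketch of the shape of such a no-Linq version (my reconstruction; splitting on spaces yields five fields, because each timestamp itself contains a space):

var stats = new Dictionary<long, long>();
foreach (var line in File.ReadLines("parking.txt"))
{
    var parts = line.Split(' ');
    var entry = DateTime.Parse(parts[0] + " " + parts[1]);
    var exit = DateTime.Parse(parts[2] + " " + parts[3]);
    long id = long.Parse(parts[4]);
    stats.TryGetValue(id, out long total);
    stats[id] = total + (exit - entry).Ticks;
}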

The code is still pretty small and idiomatic, but not using Linq gives us some interesting numbers: 10.4 seconds to run (so comparable to the parallel Linq version), but we allocated only 2.9 GB (down from 3.49 GB), and our peak working set didn’t exceed 30 MB.

Taking the next step, let us parallelize this approach:
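Again the code is not preserved; one plausible sketch (a ConcurrentDictionary is my assumption, not necessarily what the post actually used):

// using System.Collections.Concurrent; using System.Threading.Tasks;
var stats = new ConcurrentDictionary<long, long>();
Parallel.ForEach(File.ReadLines("parking.txt"), line =>
{
    var parts = line.Split(' ');
    var entry = DateTime.Parse(parts[0] + " " + parts[1]);
    var exit = DateTime.Parse(parts[2] + " " + parts[3]);
    long id = long.Parse(parts[4]);
    long ticks = (exit - entry).Ticks;
    stats.AddOrUpdate(id, ticks, (_, total) => total + ticks);
});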

We now have 8 seconds, 3.49 GB of allocations and peak working set of 50 MB. That is good, but we can do better.

Now, instead of using a dictionary of long to long, we’re using a dedicated class, and the key is the string representation of the number. Most of the time, this saves us the need to parse the long. It also means that the number of dictionary operations we need to do is reduced.
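A sketch of that dedicated class (FastRecord is the name that appears elsewhere in this series, and the fields match the output code shown there; the rest is my reconstruction):

public sealed class FastRecord
{
    public string Id;              // the raw 8 character id text
    public long DurationInTicks;   // accumulated time in the lot
}

// Per line: a single TryGetValue, then mutate the record in place.
if (stats.TryGetValue(parts[4], out FastRecord record) == false)
    stats[parts[4]] = record = new FastRecord { Id = parts[4] };
record.DurationInTicks += (exit - entry).Ticks;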

This dropped the runtime to 10.2 seconds (compared to 10.4 seconds for the previous single threaded implementation). That is good, but this is just the first stage; what I really want to do is save on all those expensive dictionary calls when running in parallel.

Here is the parallel version:
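The code is not preserved here; a sketch of the idea as described (my reconstruction): Parallel.ForEach with a thread local dictionary, merged once at the end, so the hot loop never touches shared state.

var merged = new Dictionary<string, FastRecord>();
Parallel.ForEach(File.ReadLines("parking.txt"),
    () => new Dictionary<string, FastRecord>(),   // per thread state
    (line, _, stats) =>
    {
        var parts = line.Split(' ');
        var entry = DateTime.Parse(parts[0] + " " + parts[1]);
        var exit = DateTime.Parse(parts[2] + " " + parts[3]);
        if (stats.TryGetValue(parts[4], out FastRecord record) == false)
            stats[parts[4]] = record = new FastRecord { Id = parts[4] };
        record.DurationInTicks += (exit - entry).Ticks;
        return stats;
    },
    stats =>
    {
        lock (merged) // each thread merges its private dictionary once
        {
            foreach (var kvp in stats)
            {
                if (merged.TryGetValue(kvp.Key, out FastRecord record) == false)
                    merged[kvp.Key] = kvp.Value;
                else
                    record.DurationInTicks += kvp.Value.DurationInTicks;
            }
        }
    });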

And that one runs in 4.1 seconds, allocates 3 GB and has a peak working set of 48 MB.

We are now close to 8 times faster than the initial version. But we can probably still do better. I’ll go over that in my next post.

Making code faster: The obvious costs

time to read 3 min | 427 words

In my previous post, I presented a small code sample and asked how we can improve its performance. Note that this code sample has been quite maliciously designed to be:

  • Very small.
  • Clear in what it is doing.
  • The most obvious way to do it.
  • Highly inefficient.
  • Misleading people into non-optimal optimization paths.

In other words, if you don’t get what is going on, you’ll not be able to get the best out of it. And even if you do, it is likely that you’ll try to go in a “minimum change of the code” that isn’t going to be doing as much for performance.

Let us look at the code again:
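The code itself did not survive this archive; a plausible reconstruction from the descriptions in this series (property names are guesses). Every property re-splits the line, which is the trap hiding in plain sight:

public class Record
{
    private readonly string _line;
    public Record(string line) { _line = line; }

    // Each accessor splits the line again, every single time.
    public DateTime Start => DateTime.Parse(_line.Split(' ')[0] + " " + _line.Split(' ')[1]);
    public DateTime End => DateTime.Parse(_line.Split(' ')[2] + " " + _line.Split(' ')[3]);
    public string Id => _line.Split(' ')[4];
}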

The most obvious optimization is that we are calling _line.Split() multiple times inside the Record class. Let us fix that:
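A sketch of the fix (same reconstruction caveat): split once in the constructor and keep the parsed values in fields.

public class Record
{
    // One string field becomes three 8 byte fields, which matches the
    // working set increase noted below.
    public readonly DateTime Start;
    public readonly DateTime End;
    public readonly string Id;

    public Record(string line)
    {
        var parts = line.Split(' ');
        Start = DateTime.Parse(parts[0] + " " + parts[1]);
        End = DateTime.Parse(parts[2] + " " + parts[3]);
        Id = parts[4];
    }
}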

This trivial change reduced the runtime by about 5 seconds, and saved us 4.2 GB of allocations. The peak working set increased by about 100 MB, which I assume is because the Record class moved from having a single 8 byte field to having three 8 byte fields.

The next change is also pretty trivial: let us replace File.ReadAllLines() with File.ReadLines(). This, surprisingly enough, had very little impact on performance.

However, the allocations dropped by 100 MB, and the working set dropped to 280 MB, very much near the size of the file itself.

This is because we no longer have to read the file into an array, and hold on to this array for the duration of the program. Instead, we can collect the garbage from the lines very efficiently.

This concludes the obvious stuff, and we managed to gain a whole 5 seconds of performance improvement here. However, we can do better, and it is sort of obvious, so I’ll put it in this post as well.

As written, this code is single threaded. And while we are reading from a file, we are still pretty much CPU bound, so why not use all the cores we have?
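Illustratively (same reconstructed query shape as above):

var summary = File.ReadLines("parking.txt")
    .AsParallel()
    .Select(line => new Record(line))
    .GroupBy(r => r.Id)
    .Select(g => new { Id = g.Key, Duration = g.Sum(r => (r.End - r.Start).Ticks) });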

As you can see, all we had to do was add AsParallel(), and the TPL will take care of it for us.

This gives us a runtime of 9 seconds; allocations are a bit higher (3.45 GB, up from 3.3 GB), but the peak working set exceeded 1.1 GB, which makes a lot of sense.

We are now standing at 1/3 of the initial runtime, which is excellent, but can we do more? We’ll cover that in the next post.

Making code faster: The interview question

time to read 3 min | 424 words

Interview questions are always tough to design. On the one hand, you need to create something that will not be trivial to do, and on the other hand, there is a pretty hard time limit on a reasonable solution. For example, while implementing a linked list is something that I would expect anyone to be able to do in an interview, implementing a binary tree (including the balancing) is probably not going to be feasible.

Interview tasks (that the candidate can do at home) are somewhat easier, because you don’t have the same time constraints, but at the same time, if you ask for something that takes a week to write, candidates will skip the question and the position entirely. Another issue here is that if you ask a candidate to send a binary tree as an interview task, they are going to google –> copy & paste –> send, and you learn absolutely nothing*.

* Oh, sometimes you learn quite a lot. If a candidate cannot even do that, they have pretty much disqualified themselves, but we could establish that more easily with Fizz Buzz, after all.

So I came up with the following question: we have the following file (the full data set is 276 MB) that contains the entry / exit log for a parking lot.

[Sample lines from the data file]

The first value is the entry time, the second is the exit time and the third is the car id.

Details about this file: it is a UTF8 text file with space separated values, using Windows line endings.

What we need to do is find out how much time each car spent in the lot, based on this file. I started out by writing the following code:
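The full program is linked below; this is a sketch of its shape (my reconstruction, using the Record class reconstructed in "The obvious costs" above): idiomatic, Linq based and deliberately naive, with a stopwatch around the whole thing.

// using System.Diagnostics; using System.Linq;
var sw = Stopwatch.StartNew();

var summary = File.ReadAllLines("parking.txt")
    .Select(line => new Record(line))
    .GroupBy(r => r.Id)
    .Select(g => new { Id = g.Key, Duration = g.Sum(r => (r.End - r.Start).Ticks) });

using (var output = File.CreateText("summary.txt"))
{
    foreach (var entry in summary)
        output.WriteLine("{0} {1}", entry.Id, entry.Duration);
}

Console.WriteLine(sw.ElapsedMilliseconds);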

You can find the full file here. The only additional stuff is that we measure just how much this cost us.

And this code processes the 276 MB file in 30 seconds, using a peak working set of 850 MB and allocating a total of 7.6 GB of memory.

I’m pretty sure that we can do better. And that is the task we give to candidates.

This has the nice advantage that this is a pretty small and compact problem, but to improve upon it you actually need to understand what is going on under the covers.

I’ll discuss possible optimizations for this next week.

High performance field clobbering

time to read 3 min | 562 words

So we are using a particular library in a not so standard way. And in order to gain a 10x performance benefit, we have to reuse a particular class from this library. This class isn’t meant to be reused, but looking at its code, it is clear that it is perfectly possible to do so. All we need is to set the _started field to false, and it will be possible to reuse this instance.

So far, so good. Except that the field is private. Now, we can’t just implement our own copy of this, because this is a field in the base class that we are extending to plug our extension into the system. We could try submitting a patch for this, but this is a popular library, and tying ourselves to a particular version would suck. This code also hasn’t changed since Jan 2012, so it is pretty stable. And yes, we are aware of the risk in doing this, unsupported, etc.

Now that we decided to do it, the question is how. I created the following epic class:

[Code screenshot: the benchmark class]

And here is the simplest option to handle it:

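The original shows this as a screenshot; a sketch of the straightforward reflection version (the library's type is not named in the post, so TheBaseClass is a stand in):

// using System.Reflection;
public void ResetViaReflection(TheBaseClass instance)
{
    var field = typeof(TheBaseClass)
        .GetField("_started", BindingFlags.Instance | BindingFlags.NonPublic);
    field.SetValue(instance, false); // field lookup + boxing on every call
}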

And that gives us:

[Benchmark results]

Can we do better?

What happens if we cache the field lookup?

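A sketch of the cached version (same caveats): resolve the FieldInfo once, pay only for SetValue on each call.

private static readonly FieldInfo StartedField = typeof(TheBaseClass)
    .GetField("_started", BindingFlags.Instance | BindingFlags.NonPublic);

public void ResetViaCachedReflection(TheBaseClass instance)
{
    StartedField.SetValue(instance, false); // still boxes the bool
}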

This is a significant improvement, right?

[Benchmark results]

But that is still quite high for me. Can we do better still?

Let us try some dynamic code generation. In this case, we can’t use the much easier Expression class to do it, and have to go with direct IL generation, which gives:

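A sketch of the DynamicMethod approach (my reconstruction, same stand in type): emit a three instruction method that stores false directly into the private field, and turn it into a delegate.

// using System.Reflection; using System.Reflection.Emit;
private static readonly Action<TheBaseClass> ResetStarted = CreateResetDelegate();

private static Action<TheBaseClass> CreateResetDelegate()
{
    var field = typeof(TheBaseClass)
        .GetField("_started", BindingFlags.Instance | BindingFlags.NonPublic);

    var method = new DynamicMethod("ResetStarted", null,
        new[] { typeof(TheBaseClass) }, typeof(TheBaseClass).Module,
        skipVisibility: true); // required to write a private field

    var il = method.GetILGenerator();
    il.Emit(OpCodes.Ldarg_0);      // load the instance
    il.Emit(OpCodes.Ldc_I4_0);     // load 'false'
    il.Emit(OpCodes.Stfld, field); // store into _started
    il.Emit(OpCodes.Ret);

    return (Action<TheBaseClass>)method.CreateDelegate(typeof(Action<TheBaseClass>));
}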

And the benchmark result?

[Benchmark results]

That is pretty awesome. For comparison purposes, I also did a static delegate and a direct set, to see the baseline costs.

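A sketch of those baselines (my reconstruction): they need legal access to the field, so they use a stand in class; the point is to measure the mechanism, not the real library.

public class StandIn { public bool _started; }

private static readonly Action<StandIn> StaticDelegate = i => i._started = false;

public void ResetViaDelegate(StandIn instance) => StaticDelegate(instance);
public void ResetDirectly(StandIn instance) => instance._started = false;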

And those give me:

[Benchmark results]

But I think that 2.5 ns is fast enough for me here.
