Ayende @ Rahien


Making code faster: Going down the I/O chute

time to read 2 min | 308 words

After introducing the problem and doing some very obvious things, and then some pretty non-obvious things, we have managed to get to 1/8 of the runtime of the original implementation.

But we can do better still. So far, we relied heavily on the File.ReadLines method, which handles quite a lot of the parsing complexity for us. However, it still allocates a string per line, and our parsing relied on splitting those strings again, meaning even more allocations.

We can take advantage of our knowledge of the file to do better. The code size blows up, but it is mostly very simple. We create a dedicated record reader class, which will read each line from the file, with a minimum of allocations.
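
The code itself was an embedded snippet that didn't survive the trip here; the following is a sketch of the shape such a reader can take, assuming a fixed layout of a 19-character entry time, a space, a 19-character exit time, a space, an 8-digit car id and CRLF (19 + 1 + 19 + 1 + 8 + 2 = 50 bytes). The class name and field offsets are assumptions, not the original code:

    using System;
    using System.IO;

    public class RecordReader : IDisposable
    {
        private const int RecordSize = 50;
        private readonly Stream _stream;
        private readonly byte[] _buffer = new byte[RecordSize]; // the single reusable buffer

        public RecordReader(string file)
        {
            _stream = File.OpenRead(file);
        }

        // Read the next 50-byte record into the reusable buffer.
        public bool MoveNext()
        {
            int total = 0;
            while (total < RecordSize)
            {
                int read = _stream.Read(_buffer, total, RecordSize - total);
                if (read == 0)
                    return false; // end of file
                total += read;
            }
            return true;
        }

        public DateTime EntryTime => ParseDateTime(0);
        public DateTime ExitTime => ParseDateTime(20);
        public long CarId => ParseLong(40, 8);

        // Trivial integer parsing, specific to our case: digits only,
        // no sign handling, no validation, and no string allocations.
        private long ParseLong(int pos, int len)
        {
            long value = 0;
            for (int i = pos; i < pos + len; i++)
                value = value * 10 + (_buffer[i] - '0');
            return value;
        }

        // Trivial date parsing for a fixed yyyy-MM-ddTHH:mm:ss layout.
        private DateTime ParseDateTime(int pos)
        {
            return new DateTime(
                (int)ParseLong(pos, 4),       // year
                (int)ParseLong(pos + 5, 2),   // month
                (int)ParseLong(pos + 8, 2),   // day
                (int)ParseLong(pos + 11, 2),  // hour
                (int)ParseLong(pos + 14, 2),  // minute
                (int)ParseLong(pos + 17, 2)); // second
        }

        public void Dispose()
        {
            _stream.Dispose();
        }
    }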

There is a non-trivial amount of stuff going on here. We start by noting that the size in characters of the data is fixed, so we can compute the size of a record very easily. Each record is exactly 50 bytes long.

The key part here is that we are allocating a single buffer, which will hold the line characters. Then we just wrote our own date and integer parsing routines that are very trivial, specific to our case and, most importantly, don't require us to allocate additional strings.

Using this code is done with:
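
The usage snippet is also gone; a plausible usage, matching the sketch above (the file name is a stand-in):

    var durations = new Dictionary<long, long>(); // car id -> total ticks
    using (var reader = new RecordReader("parking.txt"))
    {
        while (reader.MoveNext())
        {
            long id = reader.CarId;
            long total;
            durations.TryGetValue(id, out total);
            durations[id] = total + (reader.ExitTime - reader.EntryTime).Ticks;
        }
    }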

So we are back to single threaded mode. Running this code gives us a runtime of 1.7 seconds, 126 MB allocated and a peak working set of 35 MB.

We are now about 2.5 times faster than the previous parallel version, and over 17 times faster than the original version.

Making this code parallel is fairly trivial now: divide the file into sections and give each section its own record reader. But is there really much point at this stage?

Making code faster: Starting from scratch

time to read 2 min | 285 words

After introducing the problem and doing some very obvious things, we have managed to get to 9 seconds instead of 30. That is pretty awesome, but we can do better.

Let us see what would happen if we write it from scratch, sans Linq.
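
The snippet didn't survive; a sketch of a no-LINQ version along the lines described (path stands in for the data file):

    var durations = new Dictionary<long, long>(); // car id -> total ticks
    foreach (var line in File.ReadLines(path))
    {
        var parts = line.Split(' ');
        var entry = DateTime.Parse(parts[0]);
        var exit = DateTime.Parse(parts[1]);
        var id = long.Parse(parts[2]);
        long total;
        durations.TryGetValue(id, out total);
        durations[id] = total + (exit - entry).Ticks;
    }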

The code is still pretty small and idiomatic, but not using Linq gave us some interesting numbers: 10.4 seconds to run (so comparable to the parallel Linq), but we also allocated 2.9 GB (down from 3.49 GB) and our peak working set didn't exceed 30 MB.

Taking the next step and parallelizing this approach:
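
Again sketched, this time assuming a ConcurrentDictionary for the shared state; the original may well have done this differently:

    var durations = new ConcurrentDictionary<long, long>();
    Parallel.ForEach(File.ReadLines(path), line =>
    {
        var parts = line.Split(' ');
        var duration = (DateTime.Parse(parts[1]) - DateTime.Parse(parts[0])).Ticks;
        durations.AddOrUpdate(long.Parse(parts[2]), duration,
            (id, total) => total + duration);
    });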

We now have 8 seconds, 3.49 GB of allocations and peak working set of 50 MB. That is good, but we can do better.

Now, instead of using a dictionary of long to long, we’re using a dedicated class, and the key is the string representation of the number. Most of the time, it should save us the need to parse the long. It also means that the number of dictionary operations we need to do is reduced.
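
Sketched, with CarRecord as an assumed name for the dedicated class:

    public class CarRecord
    {
        public long TotalTicks; // mutated in place: one dictionary lookup per line
    }

    var durations = new Dictionary<string, CarRecord>();
    foreach (var line in File.ReadLines(path))
    {
        var parts = line.Split(' ');
        CarRecord record;
        if (durations.TryGetValue(parts[2], out record) == false)
            durations[parts[2]] = record = new CarRecord();
        record.TotalTicks += (DateTime.Parse(parts[1]) - DateTime.Parse(parts[0])).Ticks;
    }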

This dropped the runtime to 10.2 seconds (compared to 10.4 seconds for the previous single-threaded implementation). That is good, but this is just the first stage; what I really want to do is save on all those expensive dictionary calls when running in parallel.

Here is the parallel version:
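
Reconstructed as a sketch, assuming the thread-local overload of Parallel.ForEach: each thread aggregates into its own dictionary and merges into the global one exactly once, at the end:

    var global = new Dictionary<string, CarRecord>();
    Parallel.ForEach(File.ReadLines(path),
        () => new Dictionary<string, CarRecord>(), // per-thread dictionary
        (line, state, local) =>
        {
            var parts = line.Split(' ');
            CarRecord record;
            if (local.TryGetValue(parts[2], out record) == false)
                local[parts[2]] = record = new CarRecord();
            record.TotalTicks += (DateTime.Parse(parts[1]) - DateTime.Parse(parts[0])).Ticks;
            return local;
        },
        local =>
        {
            lock (global) // merge each thread's results once, at the end
            {
                foreach (var kvp in local)
                {
                    CarRecord record;
                    if (global.TryGetValue(kvp.Key, out record) == false)
                        global[kvp.Key] = kvp.Value;
                    else
                        record.TotalTicks += kvp.Value.TotalTicks;
                }
            }
        });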

And that one runs at 4.1 seconds, allocates 3 GB and has a peak working set of 48 MB.

We are now close to 8 times faster than the initial version. But we can probably still do better. I’ll go over that in my next post.

Making code faster: The obvious costs

time to read 3 min | 427 words

In my previous post, I presented a small code sample and asked how we can improve its performance. Note that this code sample has been quite maliciously designed to be:

  • Very small.
  • Clear in what it is doing.
  • The most obvious way to do it.
  • Highly inefficient.
  • Misleading people into non-optimal optimization paths.

In other words, if you don’t get what is going on, you’ll not be able to get the best out of it. And even if you do, it is likely that you’ll try a “minimal change to the code” approach that isn’t going to do as much for performance.

Let us look at the code again:
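
The snippet didn't survive; here is a reconstruction sketch of its shape, based on how the rest of the series describes it. Note how each property re-splits the line:

    public class Record
    {
        private readonly string _line;

        public Record(string line)
        {
            _line = line;
        }

        public DateTime Entry => DateTime.Parse(_line.Split(' ')[0]);
        public DateTime Exit => DateTime.Parse(_line.Split(' ')[1]);
        public long Id => long.Parse(_line.Split(' ')[2]);
    }

    var summary = File.ReadAllLines(path)
        .Select(line => new Record(line))
        .GroupBy(r => r.Id)
        .Select(g => new
        {
            Id = g.Key,
            Total = TimeSpan.FromTicks(g.Sum(r => (r.Exit - r.Entry).Ticks))
        })
        .ToList();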

The most obvious optimization is that we are calling _line.Split() multiple times inside the Record class. Let us fix that:
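
A sketch of the fix: split once, in the constructor, and keep the parsed values (one 8-byte field becomes three):

    public class Record
    {
        public readonly DateTime Entry;
        public readonly DateTime Exit;
        public readonly long Id;

        public Record(string line)
        {
            var parts = line.Split(' '); // split once, in the constructor
            Entry = DateTime.Parse(parts[0]);
            Exit = DateTime.Parse(parts[1]);
            Id = long.Parse(parts[2]);
        }
    }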

This trivial change reduced the runtime by about 5 seconds, and saved us 4.2 GB of allocations. The peak working set increased by about 100 MB, which I assume is because the Record class moved from having a single 8-byte field to having three 8-byte fields.

The next change is also pretty trivial: let us replace the File.ReadAllLines() call with File.ReadLines(). This, surprisingly enough, has had very little impact on performance.
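
The change itself, sketched, is a one-liner:

    var lines = File.ReadLines(path); // was: File.ReadAllLines(path)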

However, the allocations dropped by 100 MB, and the working set dropped to 280 MB, very much near the size of the file itself.

This is because we no longer have to read the file into an array, and hold on to this array for the duration of the program. Instead, we can collect the garbage from the lines very efficiently.

This concludes the obvious stuff, and we managed to gain a whole 5 seconds of performance improvement here. However, we can do better, and it is sort of obvious, so I’ll put it in this post.

As written, this code is single threaded. And while we are reading from a file, we are still pretty much CPU bound, so why not use all the cores we have?
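
Sketched against the reconstruction above; the only difference is the AsParallel() call:

    var summary = File.ReadLines(path)
        .AsParallel() // the only change
        .Select(line => new Record(line))
        .GroupBy(r => r.Id)
        .Select(g => new
        {
            Id = g.Key,
            Total = TimeSpan.FromTicks(g.Sum(r => (r.Exit - r.Entry).Ticks))
        })
        .ToList();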

As you can see, all we had to do was add AsParallel(), and the TPL will take care of it for us.

This gives us a runtime of 9 seconds; allocations are a bit higher (3.45 GB, up from 3.3 GB), but the peak working set exceeded 1.1 GB. Which makes a lot of sense.

We are now standing at 1/3 of the initial runtime, which is excellent, but can we do more? We’ll cover that in the next post.

Making code faster: The interview question

time to read 3 min | 424 words

Interview questions are always tough to design. On the one hand, you need to create something that will not be trivial to do, and on the other hand, you have a pretty hard time limit for a reasonable solution. For example, while implementing a linked list is something that I would expect anyone to be able to do in an interview, implementing a binary tree (including the balancing) is probably not going to be feasible.

Interview tasks (that the candidate can do at home) are somewhat easier, because you don’t have the same time constraints, but at the same time, if you ask for something that takes a week to write, candidates will skip the question and the position entirely. Another issue here is that if you ask a candidate to send a binary tree as an interview task, they are going to google –> copy & paste –> send, and you learn absolutely nothing*.

* Oh, sometimes you learn quite a lot; if a candidate cannot even do that, they have pretty much disqualified themselves, but we could do that more easily with Fizz Buzz, after all.

So I came up with the following question: we have the following file (the full data set is 276 MB), which contains the entry / exit log of a parking lot.

[image: a sample of the log file]

The first value is the entry time, the second is the exit time and the third is the car id.

Details about this file: this is a UTF-8 text file with space-separated values, using Windows line endings.

What we need to do is to find out how much time each car spent in the lot, based on this file. I started out by writing the following code:
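
The code itself is in the linked file; as a sketch of its shape, using the naïve Record class reconstructed in the posts above (each property re-splits the line):

    var sw = Stopwatch.StartNew();

    var summary = File.ReadAllLines(path)
        .Select(line => new Record(line))
        .GroupBy(r => r.Id)
        .Select(g => new
        {
            Id = g.Key,
            Total = TimeSpan.FromTicks(g.Sum(r => (r.Exit - r.Entry).Ticks))
        })
        .ToList();

    Console.WriteLine(sw.Elapsed); // measure just how much this cost us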

You can find the full file here. The only additional stuff is that we measure just how much this cost us.

And this code processes the 276 MB file in 30 seconds, using a peak working set of 850 MB and allocating a total of 7.6 GB of memory.

I’m pretty sure that we can do better. And that is the task we give to candidates.

This has the nice advantage that this is a pretty small and compact problem, but to improve upon it you actually need to understand what is going on under the covers.

I’ll discuss possible optimizations for this next week.

High performance field clobbering

time to read 3 min | 562 words

So we are using a particular library in a not so standard way. And in order to gain a 10x performance benefit, we have to reuse a particular class from this library. This class isn’t meant to be reused, but looking at its code, it is clear that it is perfectly possible to do so. All we need is to set the _started field to false, and it will be possible to reuse this instance.

So far, so good. Except that the field is private. Now, we can’t just implement our own copy of this, because this is a field in the base class that we are extending to plug our extension into the system. We could try submitting a patch for this, but this is a popular library, and tying ourselves to a particular version would suck. This code also hasn’t changed since Jan 2012, so it is pretty stable. And yes, we are aware of the risk in doing this, unsupported, etc.

Now that we decided to do it, the question is how. I created the following epic class:

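The screenshot is gone; to make the following snippets concrete, here is an assumed stand-in for the base class with the private _started field (all names here are assumptions):

    public abstract class LibraryComponent
    {
        private bool _started;

        public void Start()
        {
            if (_started)
                throw new InvalidOperationException("Already started");
            _started = true;
            // ... expensive setup we only want to pay once ...
        }
    }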

And here is the simplest option to handle it:

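Also originally a screenshot; the simplest option, sketched, is straight reflection on every call:

    public static void ResetStarted(LibraryComponent instance)
    {
        typeof(LibraryComponent)
            .GetField("_started", BindingFlags.Instance | BindingFlags.NonPublic)
            .SetValue(instance, false);
    }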

And that gives us: [benchmark results]

Can we do better?

What happens if we cache the field lookup?

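Sketched: the FieldInfo lookup is paid once, and only SetValue is paid per call:

    private static readonly FieldInfo StartedField = typeof(LibraryComponent)
        .GetField("_started", BindingFlags.Instance | BindingFlags.NonPublic);

    public static void ResetStarted(LibraryComponent instance)
    {
        StartedField.SetValue(instance, false);
    }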

This is a significant improvement, right? [benchmark results]

But that is still quite high for me. Can we do better still?

Let us try some dynamic code generation. In this case, we can’t use the much easier Expression class to do it, and have to go with direct IL generation, which gives:

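A sketch of what the IL generation can look like with DynamicMethod (the original code was a screenshot):

    private static readonly Action<LibraryComponent> ResetStarted = CreateResetStarted();

    private static Action<LibraryComponent> CreateResetStarted()
    {
        var field = typeof(LibraryComponent)
            .GetField("_started", BindingFlags.Instance | BindingFlags.NonPublic);

        // 'true' asks the runtime to skip visibility checks, so the
        // generated code may write a private field.
        var method = new DynamicMethod("ResetStarted", typeof(void),
            new[] { typeof(LibraryComponent) }, typeof(LibraryComponent).Module, true);

        var il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);        // push the instance
        il.Emit(OpCodes.Ldc_I4_0);       // push 'false'
        il.Emit(OpCodes.Stfld, field);   // instance._started = false
        il.Emit(OpCodes.Ret);

        return (Action<LibraryComponent>)method.CreateDelegate(
            typeof(Action<LibraryComponent>));
    }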

And the benchmark result?

[benchmark results]

That is pretty awesome. For comparison purposes, I also benchmarked a static delegate and a direct set, to see what those cost.

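Sketched against a test class that exposes the field, since a direct set on the real private field won't compile; this establishes the floor cost of a field write:

    public class OpenComponent
    {
        public bool Started;
    }

    private static readonly Action<OpenComponent> StaticDelegate =
        instance => instance.Started = false;

    public static void DirectSet(OpenComponent instance)
    {
        instance.Started = false; // the floor: a plain field write
    }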

And those give me:

[benchmark results]

But I think that 2.5 ns is fast enough for me here. 🙂

HTTP benchmark and pipelining

time to read 29 min | 5795 words

Here is an interesting problem. If you want to load test a server, it is very hard to truly do so. Put simply, after a while, the problem isn’t with your code, it is with the ability of the surrounding systems to actually get the requests to you fast enough.

In this case, let us talk about what is going on when you are actually doing an HTTP request.

We’ll start from the following code:

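The screenshot is gone; a sketch of the sort of code in question, assuming an async context (the URL is a stand-in):

    using (var client = new HttpClient())
    {
        var response = await client.GetAsync("http://localhost:8080/");
        var body = await response.Content.ReadAsStringAsync();
    }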

Seems pretty simple, right? And all we need to do is to actually send enough of those and we’ll be able to put enough load on the server to matter, right? Except that it doesn’t quite work like this. Let us see what the code above is actually doing by stripping away the HTTP layer and dropping down to TCP, shall we?

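Also a screenshot originally; sketched with TcpClient, writing the raw request bytes and reading the response off the stream:

    using (var tcp = new TcpClient())
    {
        await tcp.ConnectAsync("localhost", 8080);
        var stream = tcp.GetStream();

        var request = Encoding.ASCII.GetBytes(
            "GET / HTTP/1.1\r\nHost: localhost\r\n\r\n");
        await stream.WriteAsync(request, 0, request.Length);

        var buffer = new byte[4096];
        var read = await stream.ReadAsync(buffer, 0, buffer.Length);
        Console.WriteLine(Encoding.ASCII.GetString(buffer, 0, read));
    }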


And that looks good, right? Except that it is still hiding some details. I’m too lazy to go down to raw sockets and demonstrate the details, and anyway it would be way too much code to show here.

Here is a diagram that demonstrates what is going over the network for the two code samples above:

+---------+                      +---------+
| Client  |                      | Server  |
+---------+                      +---------+
     |                                |
     | [SYN]                          |
     |------------------------------->|
     |                                |
     |                      [SYN-ACK] |
     |<-------------------------------|
     |                                |
     | [ACK]                          |
     |------------------------------->|
     |                                | -----------------------------\
     |                                |-| Connection now established |
     |                                | |----------------------------|
     |                                |
     | [GET / HTTP 1.1]               |
     |------------------------------->|
     |                                | -------------------\
     |                                |-| The HTTP request |
     |                                | |------------------|
     |                                |
     |      [HTTP/1.1 302 Found ... ] |
     |<-------------------------------|
     |                                | --------------------\
     |                                |-| The HTTP response |
     |                                | |-------------------|
     |                                | -----------------------------------\
     |                                |-| Client now will close connection |
     |                                | |----------------------------------|
     |                                |
     | FIN                            |
     |------------------------------->|
     |                                |
     |                            ACK |
     |<-------------------------------|
     |                                |
     |                            FIN |
     |<-------------------------------|
     |                                |
     | ACK                            |
     |------------------------------->|
     |                                |

Note that this is for the simplest case: assuming that the response is just one packet, assuming no packet drops, and ignoring stuff like HTTPS, which adds another 4 packets to the initialization. We are also accounting for the last 4 packets that are required to properly close a connection. This is important, because if you are trying to run a high load benchmark, creating and not properly closing TCP connections means that you’ll soon run out of available ports (all your connections will be in CLOSE_WAIT or TIME_WAIT state).

Now, the problem is that this is really expensive. As in, wow expensive. So pretty much as soon as the web started to take off (mid 90s or so), people realized that this isn’t going to work, and the notion of Keep-Alive was born.

With Keep-Alive, you are going to reuse the same TCP connection to send multiple requests to the server. The idea is that once the connection is open, there is a strong likelihood that you’ll use it again soon, so why pay the 7-packet cost for opening & closing the TCP connection?

With that optimization, we then have:

+---------+                      +---------+
| Client  |                      | Server  |
+---------+                      +---------+
     |                                |
     | [SYN]                          |
     |------------------------------->|
     |                                |
     |                      [SYN-ACK] |
     |<-------------------------------|
     |                                |
     | [ACK]                          |
     |------------------------------->|
     |                                | -----------------------------\
     |                                |-| Connection now established |
     |                                | |----------------------------|
     |                                |
     | [GET / HTTP 1.1]               |
     |------------------------------->|
     |                                | -------------------\
     |                                |-| The HTTP request |
     |                                | |------------------|
     |                                |
     |      [HTTP/1.1 302 Found ... ] |
     |<-------------------------------|
     |                                | --------------------\
     |                                |-| The HTTP response |
     |                                | |-------------------|
     |                                |
     | [GET /index HTTP 1.1]          |
     |------------------------------->|
     |                                | -------------------\
     |                                |-| 2nd HTTP request |
     |                                | |------------------|
     |                                |
     |           [HTTP/1.1 200  ... ] |
     |<-------------------------------|
     |                                | --------------------\
     |                                |-| 2nd HTTP response |
     |                                | |-------------------|
     |                                | -----------------------------------\
     |                                |-| Client now will close connection |
     |                                | |----------------------------------|
     |                                |
     | FIN                            |
     |------------------------------->|
     |                                |
     |                            ACK |
     |<-------------------------------|
     |                                |
     |                            FIN |
     |<-------------------------------|
     |                                |
     | ACK                            |
     |------------------------------->|
     |                                |

And the more requests we make to the server, the better off we are. Now, there is another trick that we can apply here. Remember that TCP is stream oriented, not packet oriented. That means that as far as the calling code is concerned, we aren’t actually seeing packets, just bytes arriving one after another.

So we can change the way things work to this:

+---------+                                                     +---------+
| Client  |                                                     | Server  |
+---------+                                                     +---------+
     |                                                               |
     | [SYN]                                                         |
     |-------------------------------------------------------------->|
     |                                                               |
     |                                                     [SYN-ACK] |
     |<--------------------------------------------------------------|
     |                                                               |
     | [ACK]                                                         |
     |-------------------------------------------------------------->|
     |                                                               | -----------------------------\
     |                                                               |-| Connection now established |
     |                                                               | |----------------------------|
     |                                                               |
     | [GET / HTTP 1.1, GET /data HTTP 1.1, GET /fast HTTP 1.1]      |
     |-------------------------------------------------------------->|
     |                                                               | -------------------------------------\
     |                                                               |-| 3 HTTP requests in a single packet |
     |                                                               | |------------------------------------|
     |                                                               |
     |              [HTTP/1.1 302 Found ..., HTTP/1.1 200, HTTP 403] |
     |<--------------------------------------------------------------|
     |                                                               | -----------------------------------\
     |                                                               |-| All HTTP responses in one packet |
     |                                                               | |----------------------------------|
     |                                                               | -----------------------------------\
     |                                                               |-| Client now will close connection |
     |                                                               | |----------------------------------|
     |                                                               |
     | FIN                                                           |
     |-------------------------------------------------------------->|
     |                                                               |
     |                                                           ACK |
     |<--------------------------------------------------------------|
     |                                                               |
     |                                                           FIN |
     |<--------------------------------------------------------------|
     |                                                               |
     | ACK                                                           |
     |-------------------------------------------------------------->|
     |                                                               |

What we did is pretty simple. Instead of waiting for the server to respond to a request, and only then reusing the connection to send the next one, we send the requests immediately one after the other, without waiting.

In some cases, we can even package multiple requests into a single TCP packet. And the server shouldn’t care about that.

Here is what this looks like in practice:
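
The snippet didn't survive; a sketch of what pipelining looks like at the TCP level, reusing the stand-in endpoints from above:

    using (var tcp = new TcpClient())
    {
        await tcp.ConnectAsync("localhost", 8080);
        var stream = tcp.GetStream();

        // Three requests, written back to back in a single call, without
        // waiting for any response.
        var requests = Encoding.ASCII.GetBytes(
            "GET / HTTP/1.1\r\nHost: localhost\r\n\r\n" +
            "GET /data HTTP/1.1\r\nHost: localhost\r\n\r\n" +
            "GET /fast HTTP/1.1\r\nHost: localhost\r\n\r\n");
        await stream.WriteAsync(requests, 0, requests.Length);

        // A real client must parse the response boundaries (Content-Length,
        // chunked encoding); here we just dump whatever arrives.
        var buffer = new byte[65536];
        int read;
        while ((read = await stream.ReadAsync(buffer, 0, buffer.Length)) > 0)
            Console.Write(Encoding.ASCII.GetString(buffer, 0, read));
    }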

Now, naïve server code will fail here, because it will read from the socket into a buffer (including some part of the next request) and then forget about the rest. But it isn’t hard to make sure that this works properly, and that is key for all high performance servers.

Basically, the real problem is driving enough packets into the server to generate load. By pipelining requests like that, we reduce the number of packets we need to send while at the same time getting a lot higher load.

The cost of routing a packet is independent of its size, and while the size you send is important for bandwidth, the packet latency is much more important for actual speed (latency vs. bandwidth, again). So if we can pack the data into fewer packets, this is a net win. In other words, this is HTTP doing car pooling.


[image: car pool lane]

And now that you can drive enough requests into your server to actually stress it, you can work your way into actually handling this load.

Idle memory GC memory yanking

time to read 3 min | 552 words

I mentioned in passing that a lot of the actual memory use we have in RavenDB 4.0 is in unmanaged code. That frees us from the benevolent tyranny of the garbage collector, and it allows us to optimize our memory utilization to fit our usage scenarios. In particular, most of our memory is used in a thread-local manner, and there it is used in one of two ways.

Dedicated threads that do a specific type of work (indexing, replication, etc.), and request processing threads that handle all requests.

A typical scenario for us would be a lot of requests coming in all of a sudden and requiring us to allocate a whole bunch of memory, for example, if we have just run a benchmark. Or, for production usage, a spike in traffic that then turns into a lull.

One of the common hotspots in high performance servers is memory allocation; in particular, allocating and de-allocating can kill your performance. So you don’t do that. Instead, you pool the memory and reuse it, hopefully without major costs for doing that.

Which is all well and good, until you get that spike. In this case, you allocated memory to handle the traffic, and after the load is over, you are still holding on to that memory, and we really should give it back so something else can use it. Sure, it will be paged out to disk eventually, and virtual memory is there to cover us, but that is really not the right way to write high performance stuff.

Now, the way we want to handle it, we want to wait until everything has calmed down, and then start de-allocating all that memory that isn’t being used. But that leads to a problem. How can we do this if everything is thread local? And you want to have everything thread local, because synchronization on memory allocation/de-allocation is one of the most common performance killers.

So we handle this by using a MultiReaderSingleWriterStack. The idea is, like in the previous post, that we have a designated data structure that has only a single writer and is held in a thread local value, but that is expected to be read by multiple threads. In particular, the thread that is going to be reading from it is the idle thread. Once a minute, this is going to scan all the threads, find all the saved memory that we have allocated but haven’t used in the past minute, and clear it.

Because this can run concurrently with the thread using its own memory, we’ll only free the memory that is parked (as in, waiting for work to come), not memory that is actively used. We also need to make sure that we aren’t taking the idle memory to be disposed at the same time that the owning thread is reaching into its memory bank to pull a new range. Here we have to have a thread safe operation. We do that using an interlocked operation on the InUse field. Both the cleanup thread and the owning thread will try to set that value, but only one of them will succeed.
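
As a sketch of that handoff (ParkedSegment and the member names are assumptions):

    public class ParkedSegment
    {
        public int InUse;       // 0 = parked, 1 = claimed
        public byte[] Memory;
        public DateTime LastUsed;
    }

    // The owning thread, reaching into its (thread local) memory bank:
    public static byte[] TryReuse(ParkedSegment segment)
    {
        if (Interlocked.CompareExchange(ref segment.InUse, 1, 0) == 0)
            return segment.Memory; // we won the race
        return null;               // the cleanup thread got it; allocate anew
    }

    // The idle thread, scanning all the threads once a minute:
    public static void FreeIfIdle(ParkedSegment segment)
    {
        if (DateTime.UtcNow - segment.LastUsed < TimeSpan.FromMinutes(1))
            return;                // used recently, leave it alone
        if (Interlocked.CompareExchange(ref segment.InUse, 1, 0) == 0)
            segment.Memory = null; // we own it now, release the memory
    }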

In this way, under load, we get to use purely local memory, but we also have automatic cleanup going on if we have an idle period.

Optimizing read transaction startup time: Every little bit helps, a LOT

time to read 2 min | 269 words

Some of the performance work that I have been doing recently was complex. And then some of it isn’t:

[profiler output]

And its implementation:

[code screenshot]

And the optimized version:

[code screenshot]

And the result?

[benchmark results]
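
Since the screenshots did not survive, here is a purely illustrative guess at the shape of such a change, a property that rebuilt its value on every access versus one that computes it once; the names and the actual logic are mine, not RavenDB's:

    // Before (guess): a new string allocated on every single access.
    public string ResourceName => _type + "/" + _name;

    // After (guess): computed once, cached in a field.
    private string _resourceName;
    public string ResourceName =>
        _resourceName ?? (_resourceName = _type + "/" + _name);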

The really interesting thing is that we have an improvement that is about 3 times more than we would expect. Why is that?

We can see that we eliminated the ResourceName call and saved all of that cost, but we have a much lower cost overall; where did that come from?

This is something that we have seen over and over again. The obvious answer is that we now do fewer allocations, so we have less GC work to do, which is true as far as it goes. But it is also true that when we are even slightly faster, things start to align more closely, and we get a chain reaction (slightly less work means more time to process the actual requests, so we can get more done in less time).

Optimizing read transaction startup time: The performance triage

time to read 2 min | 239 words

This series is no longer appropriately named, but we’ll go with that for the sake of grouping everything together.

So far we have seen how we start with a fixed benchmark, and we use a profiler to narrow down all sorts of hotspots. However, we do that in a pretty strange way. Consider where we started:

Now consider this series’ name. Why are we even looking at OpenReadTransaction in the first place? There is a big giant hotspot in GetDocumentsById that we can start with.

Even a small change there is likely to generate a lot more results for our benchmark than eliminating everything else entirely. So why focus on those first?

Put simply, this isn’t just about this particular benchmark. We focused on everything else because those are costs that are shared across the board. Not only for the “read random documents” benchmark but for pretty much anything else in RavenDB.

By reducing those costs, we are enabling ourselves to get much better performance overall. And we’ll get to the actual meat of this benchmark later, when it takes up the bulk of the work.

Look at the relative costs here. We moved from 60% of the request being spent there to over 80% being spent there. (Don’t compare the absolute numbers; different machines, load and timing are responsible for those.)

Now any work that we do here will have much greater impact.

Optimizing read transaction startup time: Unicode ate my perf and all I got was 💩

time to read 3 min | 469 words

As an aside, I wonder how much stuff will break because the title of this post has 💩 in it.

The topic of this post is the following profiler output. GetSliceFromKey takes over 6% of our performance, or about 24.5 seconds out of the total run. That kinda sucks.

[profiler output]

What is the purpose of this function? Well, RavenDB’s document ids are case insensitive, so we need to convert the key to lower case and then do a search on our internal index. That has quite a big cost associated with it.

And yes, we are aware of the pain. Luckily, we are already working with a highly optimized codebase, so we aren’t seeing this function dominate our costs, but still…

Here is the breakdown of this code:

[profiler output: cost breakdown]

As you can see, over 60% of this function is spent in just converting to lower case, which sucks. Now, we have some additional knowledge about this function. For the vast majority of cases, we know that this function will handle only ASCII characters, and that Unicode document ids are possible, but relatively rare. We can utilize this knowledge to optimize this method. Here is what this will look like, in code:

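The screenshot is gone; a sketch of the idea as described (names are assumptions):

    public static void ToLowerKey(string key, byte[] buffer)
    {
        for (int i = 0; i < key.Length; i++)
        {
            char c = key[i];
            if (c > 127)
            {
                ToLowerUnicode(key, buffer); // rare: full string conversion
                return;
            }
            if (c >= 'A' && c <= 'Z')
                buffer[i] = (byte)(c | 0x20); // ASCII trick: bit 5 toggles case
            else
                buffer[i] = (byte)c;
        }
    }

    private static void ToLowerUnicode(string key, byte[] buffer)
    {
        // The slow path: a proper, allocating conversion.
        var lower = key.ToLowerInvariant();
        Encoding.UTF8.GetBytes(lower, 0, lower.Length, buffer, 0);
    }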

Basically, we scan through the string, and if there is a character whose value is over 127 we fall back to the slower method (slower being relative; that is still a full string conversion in less than 25 μs).

Then we just check whether a character is in the upper case range and convert it to lower case (the ASCII layout is funny; it was intentionally designed to be used with bit masking, and all sorts of tricks are possible there) and store it in the buffer, or just store the original value.

The result?

image

This method went from taking 5.89% to taking just 0.6%, and we saved over 22 seconds(!) in the run. Again, this is under the profiler, and the code is heavily multi-threaded. In practice, this means that the run took 3 seconds less.

Either way, we are still doing pretty good, and we don’t have that pile of poo, so we’re good.
