Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 6,524 | Comments: 47,985

filter by tags archive

Optimizing select projectionsPart IV–Understand, don’t do

time to read 2 min | 382 words

This post is what the entire series has been building building toward to. In the previous post we have refactored our code to make it easier to build additional behaviors. Here is one such behavior, which uses the Esprima project to parse the code and see if it uses a pattern that we can optimize for. Let us see the code, then discuss what it does.

The new stuff is mostly in the TryProjectOptimized method. This method is making heavy use of the new C# features to do easy parsing of ASTs, and it really shows that the C# team is using C# to build the compiler (that is one of the common things that compiler writers do, make their own lives easier). Take this code even a couple of language versions back, and it would be much hard to read and work with.

At any rate, what this code is doing is search for an object literal that does simple assignment of properties. If it detect that, instead of calling into the JS engine, it will produce a delegate that will do just that directly.

The result is pretty much the same, but we have to do so much less. If we can’t detect a pattern that we recognize, we just fall back to the JS engine again, and get the pervious behavior.

This is pretty awesome thing to have, because if we detect a particular common pattern, we can optimize that quite easily. The code really lend itself to that approach now.

What about the performance? Well…

And this runs in 0.31 seconds on the 10K, 5K, 1K run.

  • 10K in 83 ms
  • 5K in 200 ms
  • 1K in 31 ms

So that is two orders of magnitude for the 10K run.

An an exercise, try taking this code and teaching it how to recognize simple expressions like + and –. The second projection example is meant to give you some idea about that.

Oh, and one last thought, we are actually biased against the optimization. The Jint code produce results in formats that we need to pull the data from, and we aren’t accounting for that here. The optimize code can generate the output to be already in the format we need it to be ( the difference between the Dictionary and the JsValue).  

Optimizing select projectionsInterlude, refactoring

time to read 2 min | 256 words

So we got an order of magnitude performance boost, without really doing much, to be honest. But the code is starting to look really ugly, and future optimizations are out until we fix it.

Since we don’t really care about the internal details, we can do two things in one move. First, we’ll cleanup the code by introducing a delegate that will abstract the handling and the second is that this delegate will also allow us to handle caching of additional values simple by way of capturing the delegate state.

Here is what the code looks like:

Note that we have it broken apart into distinct steps, and the ProjectViaJint method is self contained and any caching it needs is done internally. This is much needed refactoring if we want to be able to continue to optimize this code, and it runs in about the same speed as the previous version, so we didn’t hurt anything there.

With this in place, we can see how much better we can make things.

Before we go on, I want to emphasis that this code is actually making a lot of assumptions. The passed dictionary must always have the same set of fields, for example, otherwise we may see a value from a previous iteration. No null handling or error handling is in place, we aren’t showing the unpacking of the values from the JS env to our code, and plenty of other stuff. That isn’t that important at this point, because I’m showing the progression of optimizations, rather then production code.

Optimizing select projectionsPart III

time to read 2 min | 229 words

After optimizing the projection performance by so much by just caching the engine, I got to thinking, we are also creating a new JS object to pass the arguments every single time. What would happen if we’ll cache that?

Here is what I ended up with, again, all of this is pretty brute force (mainly because I’m writing these posts while the baby is asleep, and I’m trying to get as many of them out before she wakes up):

I wouldn’t have expected this to be a dramatic boost, but we got:

And this runs in 0.57 seconds on the 10K, 5K, 1K run.

  • 10K in 413 ms
  • 5K in 110 ms
  • 1K in 45 ms

That is also a third of the cost that we just saved.

This is interesting, so it is worth another check, there are other two potentially expensive operations here, the invocation of the method and the sending of arguments.

Trawling of the Jint code shows that we can remove some abstract by using ICallable directly, and we can cache the array of arguments, this all leads to:

And this runs in 0.37 seconds on the 10K, 5K, 1K run.

  • 10K in 279 ms
  • 5K in 59  ms
  • 1K in 32 ms

And that is a wow. Because right now, without really doing much at all, we area already an order of magnitude higher then our first attempt, and we can do more.

Optimizing select projectionsPart II

time to read 2 min | 236 words

In the previous post, I showed how we can handle select projections and setup a perf test. The initial implementation we had run in about 4.4 seconds for our test. But it didn’t give any thought to performance.

Let us see if we can do better. The first thing to do is to avoid building the Jint engine and parsing the code all the time. The way we set things up, we wrap the actual object literal in a function, and there is no state, so we can reuse the previous engine instance without any issues. ;:

That means that we don’t need to pay the cost of creating a new engine, parsing the code, etc. Here is what this looks like:

Note that the cache key here is the raw projection, not the function we send to the entire, this allows us to avoid any string allocations in the common (cached) path.

And this runs in 0.75 seconds on the 10K, 5K, 1K run.

  • 10K in 574 ms
  • 5K in 137 ms
  • 1K in 51 ms

Just this small change boosted our performance by a major margin.

Note that because the engines are not thread safe, to use that in a real application we’ll need to ensure thread safety. The way I did that is to have a pool of these engines for each projection and just use that, so an engine is always access in a single threaded mode.

Optimizing select projectionsPart I

time to read 2 min | 240 words

I spoke about designing for performance in my previous post and I thought it would be in interesting series  of blog posts. The task I have is that we have a data source of some kind, in our case, I decided to make the task as simple as possible and define the source data as a Dictionary<string, object>, without allowing nesting of data.

We want the user to be able to provide a custom way to project data from that dictionary. Here are a few ways to do that:

And here is the data source in question:

Obviously the user can make the select projections as complex as they wish. But I think that these three sample give us a good representation of the complexity. We’ll also treat them a bit differently, with regards to their likelihood. So when testing, we’ll check that the first projection is run 10,000 times, then second projection is run 5,000 times and the last projection is run 1,000 times.

With the layout of the land settled, let us see how we can get this actually working. I’m going to use Jint and solve the issue in a very brute force manner.

Here is the initial implementation:

And this runs in 4.4 seconds on the 10K, 5K, 1K run.

  • 10K in 2,756 ms
  • 5K in 1,310 ms
  • 1K in 314 ms

I’m pretty sure that we can do better, and we’ll look at that in the next post.

The low level trie Rust challenge

time to read 1 min | 139 words

I have tried to build a low level trie impl in Rust and just gave up because it was too much work making the compiler happy. I then went ahead and wrote in in C++ in a couple of evenings. I would still like to know whatever this is possible / viable in Rust.

I’m assuming that this is the case, but I don’t know how to start. Any dear reader feels like taking this upon themselves to port the C++ code to Rust? It is all working and there are unit tests, and I think that the design should match well the Rust design philosophy, but the borrow checker actively worked to me give up doing that in Rust, and I would like to both see this implemented in Rust and hear what the experience was like.

AnswerWhat does this code do?

time to read 2 min | 225 words

This was a surprising shock, this code seems so simple, but it does something very different than what I would expect.

The question is why?

As it turns out, we are missing one character here:

image

Notice the lack of the comma? Let us see how the compiler is treating this code by breaking it apart into steps, shall we?

First, let us break it into two statements:

Note that we moved the name line into the first statement, since there isn’t a command here. But what is it actually doing? This looks very strange, but that is just because we have the dictionary initializer here. If we drop it, we get:

image

And that make a lot more sense. If we’ll break it all down, we have:

And that explains it all, make perfect sense, and a very nasty trap. We run into it accidently in production code, and it was near impossible to figure out what was going on or why it happened.

The performance regression in the optimizationPart II

time to read 4 min | 762 words

In my previous post I showed a performance conundrum. A code that has been optimized to reduced heavy allocation usage that became over twice as slow.

In particular, we had a problem here, the new code it 3.4 times slower than the new one, but how?

image

Now, the real scenario we had involved concurrent access, so it was much harder to figure out, but I cheated a bit when producing this image, I used sampling profiling, instead of tracing one. The major difference between the two is that tracing profiler will also give you the number of calls. This is called out as something that you would typically do because you want to analyze algorithmic complexity, but I find it incredibly useful to figure out what my code is actually doing.

And indeed, looking at the same code using tracing profiler gives us the following two calls:

image

image

And when looking at the diffs between those two, we have:

image

So for some reason we are making 54 million more calls to the Equals method in the optimized version, but why? Both of those are using the exact same dictionary, using the exact same key type and the same keys, even.

In the real scenario we were facing, that wasn’t the case, so that made it hard to analyze the issue. We started looking into whatever we were doing some sort of cache poisoning by having the buffer holder as the dictionary value, instead of the array directly, but that didn’t pan out. We kept circling around the number of Equals calls. Note that the number of calls to TryGetValue is the same, as well as the number of calls to GetHashCode. So what is the diff?

The diff, quite simple, is not here at all.

The problem is in the RemoveBefore method. In the old version, if we removed all the entries, we’ll remove it completely from the dictionary. In the new version, we’ll reset the buffer so it can be used again next time. The problem with that approach is that it means that the dictionary is pretty big, much bigger than it would be in the case of the old version of the code. And that means that we’ll need to find the value (which is empty), then check its content. On the old version, we’ll just do a GetHashCode, then find that the table entry is over, and exit.

Indeed, all we had to do was change RemoveBefore to look like this:

And that gives us:

  • 14.0 seconds & 1.1 GB of memory for old version
  • 12.8 seconds & 0.4 GB of memory for new version

Which is pretty good result overall. It gets better when you break it apart to its component parts.

image

This is actually surprising, since we didn’t really set out to optimize this call very much, and it is pretty much unchanged in both versions. I think that this is likely because we keep the buffers around longer, so they are more likely to be in the cache.

image

This shows more than double the speed we previous had, which is pretty awesome, since this code is actually called per transactions, so anything that reduces that cost is golden.

image

This happens during a flush, and reducing its speed is important to reducing the time we hold the write lock, so this is pretty sweet.

The performance regression in the optimizationPart I

time to read 4 min | 629 words

PageTable is a pretty critical piece of Voron. It is the component responsible for remapping modified pages in transactions and is the reason why we support MVCC and can avoid taking locks for the most part. It has been an incredibly stable part of our software, rarely changing and pretty much the same as it was when it was initially written in 2013. It has been the subject for multiple performance reviews in that time, but acceptable levels of performance from our code in 2013 is no longer acceptable today. PageTable came up recently in one of our performance reviews as a problematic component. It was responsible for too much CPU and far too many allocations.

Here is a drastically simplified implementation, which retain the salient points:

Here is the sample workout for this class, which just simulates ten thousand transactions. This little scenario takes 15.3 seconds and allocates a total of 1.1GB of memory! That is a lot of allocations, and must have tremendous amount of time spent in GC. The most problematic issue here is the SetItems methods, which will allocate two different delegates for each modified page in the transaction. Then we have the total abandon in which we’ll allocate additional memory in there. As you can imagine, we weren’t very happy about this, so we set out to fix this issue.

We can take advantage off the fact that SetItems and RemoveBefore are only called under lock, while TryGetValue is called concurrently with everything else.

So I wrote the following code:

This relies on allowing stale reads from concurrent readers, which we don’t care about since they wouldn’t be able to make use of the data anyway, and it was able to reduce the allocations to just 320 MB, but the runtime actually went up to 32 seconds.

That is quite annoying, as you can imagine, and much cursing enthused as a result. I then pulled my trusty profiler ask it kindly to figure out what piece of code needs to be hit with a rolling pin and have a stern talk to about what is expected from code after it has been laboriously and carefully optimized. It is expected to sit nicely and be fast, or by Git I’ll revert you.

What the hell?! Here are the original implementation costs, and you can clearly see how much time we are spending on garbage collection.

image

And here is the optimized version, which is actually slower, and actually used more memory?!

image

There are a bunch of interesting things going on here. We can see that we are indeed using spending a little less time in GC, and that both RemoveBefore and SetItems methods are much cheaper, but the cost of TryGetValue is so much higher, in fact, if we compare the two, we have:

image

So we are 3.4 times higher, and somehow, the cost of calling the concurrent dictionary TryGetValue has risen by 88%.

But the implementation is pretty much the same, and there isn’t anything else that looks like it can cause that much of a performance gap.

I’ll leave this riddle for now, because it drove me crazy for two whole days and give you the details on what is going on in the next post.

FUTURE POSTS

  1. Queries++ in RavenDB: I suggest you can do better - 2 hours from now
  2. Queries++ in RavenDB: Spatial searches - 3 days from now
  3. PR Review: The simple stuff will trip you - 4 days from now
  4. The married couple component design pattern - 5 days from now
  5. Distributed computing fallacies: There is one administrator - 6 days from now

There are posts all the way to Dec 21, 2017

RECENT SERIES

  1. PR Review (9):
    08 Nov 2017 - Encapsulation stops at the assembly boundary
  2. Queries++ in RavenDB (4):
    11 Dec 2017 - Gimme more like this
  3. Production postmortem (21):
    06 Dec 2017 - data corruption, a view from INSIDE the sausage
  4. API Design (9):
    04 Dec 2017 - The lack of a method was intentional forethought
  5. The best features are the ones you never knew were there (5):
    27 Nov 2017 - You can’t do everything
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats