Ayende @ Rahien


RavenDB on Raspberry PI

time to read 4 min | 645 words

This has been a dream of mine for quite some time, to be able to run RavenDB on the Raspberry PI. The Raspberry PI is a great toy to play with, maybe do some home automation or stream some videos to your TV. But while that is cool, what I really wanted to see is if we could run RavenDB on top of the Raspberry PI, and when we do that, what kind of behavior we’ll see.

We got it working, and we are so excited that we are raffling off a brand new Raspberry PI, in its role as a RavenDB 4.0 server.

The first step along that way, running on Linux, was already achieved over a year and a half ago, but we couldn’t just run on the Raspberry PI: it is an ARM architecture, and we needed the CoreCLR ported there first. Luckily for us, quite a few people are also interested in running .NET on ARM, and we have been tracking the progress of that project for over a year, trying to see if anyone would ever get to the hello world moment.

image

image

For the past month, it looked like things had settled down: the number of failing tests dropped to close to zero, and we decided that it was time to take another stab at running RavenDB on the Raspberry PI.

Actually getting there wasn’t trivial. Unlike the usual CoreCLR setup, we don’t have the dotnet executable to manage things for us; we are directly invoking corerun, and getting everything just right for this was… finicky. For example, trying to compile the CoreCLR on the Raspberry PI resulted in the Raspberry PI experiencing a heat death that caused no little excitement at the office.

 image

RavenDB is running on that using CoreCLR (and quite a bit of sweat) on top of an Ubuntu 16.04 system. To my knowledge, RavenDB is the first significant .NET project that has managed to run on the Pi. As in, the first thing beyond simple hello world.

Naturally, once we got it working, we had to put it through its paces and see what we can get out of this adorable little machine.

The current numbers are pretty good, but on the other hand, we can’t currently compile with -O3, so that limits us to -O1. We also just grabbed a snapshot of our code to test, without actually taking the time to profile and optimize for that machine.

With all of those caveats, allow me to share what is probably the best part of it:

image

This is a benchmark run from a different Ubuntu machine against the Raspberry PI, asking to get a whole bunch of documents.

This is processing requests at a rate of 21,476 req / sec. That is a pretty awesome number.
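
For anyone who wants to take a similar measurement, a minimal load client is enough. The sketch below is just an illustration; the endpoint path, page size and degree of parallelism are my assumptions, not the actual benchmark tool we used:

```csharp
// Minimal req/sec measurement loop; a sketch, not the actual benchmark tool.
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

class LoadTest
{
    static async Task Main()
    {
        var client = new HttpClient { BaseAddress = new Uri("http://raspberry-pi:8080/") };
        const int totalRequests = 100_000;
        const int parallelism = 32;

        var sw = Stopwatch.StartNew();
        var tasks = new Task[parallelism];
        for (int t = 0; t < parallelism; t++)
        {
            tasks[t] = Task.Run(async () =>
            {
                for (int i = 0; i < totalRequests / parallelism; i++)
                {
                    // Ask for a whole bunch of documents in a single request
                    // (hypothetical endpoint, adjust to whatever you are testing).
                    var response = await client.GetAsync("databases/bench/docs?pageSize=128");
                    response.EnsureSuccessStatusCode();
                }
            });
        }
        await Task.WhenAll(tasks);
        sw.Stop();

        Console.WriteLine($"{totalRequests / sw.Elapsed.TotalSeconds:N0} req / sec");
    }
}
```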

How about writes?

image

So that is 2,360 writes / sec.

Our next project is setting up a cluster of those, and you can see that we have already started the work…

image

We’ll be posting a lot more about what you can now do with this in the next few days.

Come to our booth at DotNext Moscow

time to read 1 min | 150 words

image

This Friday, our team is going to be at DotNext Moscow, showing off RavenDB 4.0 and raffling off some really cool prizes.

You can also come and learn optimization techniques that allowed us to get more than 100,000 req/sec with RavenDB 4.0.

It is going to be a lot of fun, and we are expecting some really interesting discussions on the way we are building RavenDB 4.0, so we sent three of our core developers to give you all the details about it.

This is going to be the very first time that we are going to be showing off what RavenDB 4.0 can do.

The performance regression in the optimization: Part II

time to read 4 min | 762 words

In my previous post I showed a performance conundrum: code that had been optimized to reduce heavy allocations became over twice as slow.

In particular, we had a problem here: the new code is 3.4 times slower than the old one. But how?

image

Now, the real scenario we had involved concurrent access, so it was much harder to figure out, but I cheated a bit when producing this image: I used a sampling profiler, instead of a tracing one. The major difference between the two is that a tracing profiler will also give you the number of calls. This is usually called out as something that you would do to analyze algorithmic complexity, but I find it incredibly useful for figuring out what my code is actually doing.

And indeed, looking at the same code using tracing profiler gives us the following two calls:

image

image

And when looking at the diffs between those two, we have:

image

So for some reason we are making 54 million more calls to the Equals method in the optimized version, but why? Both of those are using the exact same dictionary, using the exact same key type and the same keys, even.

In the real scenario we were facing, that wasn’t the case, so that made it hard to analyze the issue. We started looking into whether we were causing some sort of cache poisoning by having the buffer holder as the dictionary value, instead of the array directly, but that didn’t pan out. We kept circling back to the number of Equals calls. Note that the number of calls to TryGetValue is the same, as well as the number of calls to GetHashCode. So what is the diff?

The diff, quite simply, is not here at all.

The problem is in the RemoveBefore method. In the old version, if we removed all the entries for a page, we would remove it completely from the dictionary. In the new version, we reset the buffer so it can be used again next time. The problem with that approach is that the dictionary stays much bigger than it would be with the old version of the code. That means a lookup has to find the value (which is empty), then check its content. In the old version, we would just do a GetHashCode, find that there is no matching entry, and exit.

Indeed, all we had to do was change RemoveBefore to look like this:
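
In rough sketch form (the field and type names here are illustrative, not the actual Voron code), the change boils down to removing a page’s entry entirely once it has no remaining values, instead of leaving an empty, reusable buffer behind:

```csharp
// A sketch of the fix, not the actual code; _pagesByNumber maps a page
// number to the buffer of per-transaction entries for that page.
public void RemoveBefore(long oldestActiveTransaction)
{
    foreach (var kvp in _pagesByNumber)
    {
        kvp.Value.RemoveEntriesOlderThan(oldestActiveTransaction);

        if (kvp.Value.Count == 0)
        {
            // The crucial difference from the slow version: don't keep the (now
            // empty) buffer around for reuse, remove the entry, so lookups for
            // pages that are no longer mapped can fail fast.
            _pagesByNumber.TryRemove(kvp.Key, out _);
        }
    }
}
```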

And that gives us:

  • 14.0 seconds & 1.1 GB of memory for old version
  • 12.8 seconds & 0.4 GB of memory for new version

Which is a pretty good result overall. It gets better when you break it apart into its component parts.

image

This is actually surprising, since we didn’t really set out to optimize this call very much, and it is pretty much unchanged in both versions. I think that this is likely because we keep the buffers around longer, so they are more likely to be in the cache.

image

This shows more than double the speed we previously had, which is pretty awesome, since this code is actually called per transaction, so anything that reduces that cost is golden.

image

This happens during a flush, and reducing its cost is important to reducing the time we hold the write lock, so this is pretty sweet.

The performance regression in the optimization: Part I

time to read 4 min | 629 words

PageTable is a pretty critical piece of Voron. It is the component responsible for remapping modified pages in transactions and is the reason why we support MVCC and can avoid taking locks for the most part. It has been an incredibly stable part of our software, rarely changing and pretty much the same as it was when it was initially written in 2013. It has been the subject of multiple performance reviews in that time, but acceptable levels of performance from our code in 2013 are no longer acceptable today. PageTable came up recently in one of our performance reviews as a problematic component. It was responsible for too much CPU and far too many allocations.

Here is a drastically simplified implementation, which retains the salient points:
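
Roughly along these lines; this is a sketch reconstructed from the description below (a ConcurrentDictionary keyed by page number, immutable value lists, AddOrUpdate with two delegates per modified page), with illustrative names rather than the exact code:

```csharp
// Hedged reconstruction of the drastically simplified PageTable, not the real code.
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Collections.Immutable;

public class PagePosition
{
    public long TransactionId;
    public long ScratchPos;
}

public class PageTable
{
    private readonly ConcurrentDictionary<long, ImmutableList<PagePosition>> _values =
        new ConcurrentDictionary<long, ImmutableList<PagePosition>>();

    public void SetItems(long txId, Dictionary<long, PagePosition> modifiedPages)
    {
        foreach (var page in modifiedPages)
        {
            // Two delegates are allocated here for every modified page, and every
            // update also allocates a brand new immutable list.
            _values.AddOrUpdate(page.Key,
                _ => ImmutableList.Create(page.Value),
                (_, existing) => existing.Add(page.Value));
        }
    }

    public bool TryGetValue(long txId, long pageNumber, out PagePosition value)
    {
        if (_values.TryGetValue(pageNumber, out var list))
        {
            // Return the latest position that is visible to this transaction.
            for (int i = list.Count - 1; i >= 0; i--)
            {
                if (list[i].TransactionId <= txId)
                {
                    value = list[i];
                    return true;
                }
            }
        }
        value = null;
        return false;
    }

    public void RemoveBefore(long oldestActiveTransaction)
    {
        foreach (var kvp in _values)
        {
            var remaining = kvp.Value.RemoveAll(p => p.TransactionId < oldestActiveTransaction);
            if (remaining.IsEmpty)
                _values.TryRemove(kvp.Key, out _);   // empty pages are dropped entirely
            else
                _values[kvp.Key] = remaining;
        }
    }
}
```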

Here is the sample workout for this class, which just simulates ten thousand transactions. This little scenario takes 15.3 seconds and allocates a total of 1.1GB of memory! That is a lot of allocations, and a tremendous amount of time must be spent in GC. The most problematic issue here is the SetItems method, which will allocate two different delegates for each modified page in the transaction. Then there is the total abandon with which we allocate additional memory in there. As you can imagine, we weren’t very happy about this, so we set out to fix this issue.

We can take advantage of the fact that SetItems and RemoveBefore are only called under lock, while TryGetValue is called concurrently with everything else.

So I wrote the following code:
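
In sketch form, the direction was something like this; the PagesBuffer holder and the growth policy are illustrative assumptions building on the PagePosition type from the sketch above, not the real code:

```csharp
// Hedged sketch of the optimized version: a mutable buffer holder per page, so
// the hot path allocates no delegates or new lists. Assumes a single writer.
public class PagesBuffer
{
    public PagePosition[] Positions = new PagePosition[4];
    public int Count;
}

public class OptimizedPageTable
{
    private readonly ConcurrentDictionary<long, PagesBuffer> _values =
        new ConcurrentDictionary<long, PagesBuffer>();

    // Only called by the single writer, under lock.
    public void SetItems(long txId, Dictionary<long, PagePosition> modifiedPages)
    {
        foreach (var page in modifiedPages)
        {
            if (_values.TryGetValue(page.Key, out var buffer) == false)
            {
                buffer = new PagesBuffer();
                _values.TryAdd(page.Key, buffer);
            }
            if (buffer.Count == buffer.Positions.Length)
                System.Array.Resize(ref buffer.Positions, buffer.Positions.Length * 2);
            buffer.Positions[buffer.Count++] = page.Value;
        }
    }

    // Called concurrently; stale reads are acceptable here.
    public bool TryGetValue(long txId, long pageNumber, out PagePosition value)
    {
        if (_values.TryGetValue(pageNumber, out var buffer))
        {
            for (int i = buffer.Count - 1; i >= 0; i--)
            {
                if (buffer.Positions[i].TransactionId <= txId)
                {
                    value = buffer.Positions[i];
                    return true;
                }
            }
        }
        value = null;
        return false;
    }

    // Only called by the single writer, under lock. Note that this version resets
    // the buffers for reuse instead of removing the dictionary entries, which
    // turns out to matter a great deal.
    public void RemoveBefore(long oldestActiveTransaction)
    {
        foreach (var buffer in _values.Values)
        {
            int kept = 0;
            for (int i = 0; i < buffer.Count; i++)
            {
                if (buffer.Positions[i].TransactionId >= oldestActiveTransaction)
                    buffer.Positions[kept++] = buffer.Positions[i];
            }
            buffer.Count = kept;
        }
    }
}
```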

This relies on allowing stale reads from concurrent readers, which we don’t care about since they wouldn’t be able to make use of the data anyway, and it was able to reduce the allocations to just 320 MB, but the runtime actually went up to 32 seconds.

That is quite annoying, as you can imagine, and much cursing ensued as a result. I then pulled out my trusty profiler and asked it kindly to figure out which piece of code needs to be hit with a rolling pin and given a stern talking-to about what is expected from code after it has been laboriously and carefully optimized. It is expected to sit nicely and be fast, or by Git I’ll revert you.

What the hell?! Here are the original implementation costs, and you can clearly see how much time we are spending on garbage collection.

image

And here is the optimized version, which is actually slower, and used more memory?!

image

There are a bunch of interesting things going on here. We can see that we are indeed spending a little less time in GC, and that both the RemoveBefore and SetItems methods are much cheaper, but the cost of TryGetValue is so much higher. In fact, if we compare the two, we have:

image

So we are 3.4 times higher, and somehow, the cost of calling the concurrent dictionary TryGetValue has risen by 88%.

But the implementation is pretty much the same, and there isn’t anything else that looks like it can cause that much of a performance gap.

I’ll leave this riddle for now, because it drove me crazy for two whole days, and I’ll give you the details on what is going on in the next post.

Idle memory GC memory yanking

time to read 3 min | 552 words

I mentioned in passing that a lot of the actual memory use we have in RavenDB 4.0 is in unmanaged code. That frees us from the benevolent tyranny of the garbage collector, and it allows us to optimize our memory utilization to fit our usage scenarios. In particular, most of our memory is used in a thread local manner, and there it is used in one of two ways.

Dedicated threads that do a specific type of work (indexing, replication, etc.) and request processing threads that handle all requests.

A typical scenario for us would be to have a lot of requests come in all of a sudden and require us to allocate a whole bunch of memory, for example, if we have just run a benchmark. Or, for production usage, a spike in traffic that then turns into a lull.

One of the common hotspots in high performance servers is memory allocation; in particular, allocating and de-allocating can kill your performance. So you don’t do that. Instead, you pool the memory and reuse it, hopefully without major costs for doing so.

Which is all well and good, until you get that spike. In this case, you allocated memory to handle the traffic, and after the load is over, you are still holding on to that memory, and we really should give it back so something else can use it. Sure, it will be paged out to disk eventually, and virtual memory is there to cover us, but that is really not the right way to write high performance stuff.

Now, the way we want to handle it is to wait until everything has calmed down, and then start de-allocating all that memory that isn’t being used. But that leads to a problem: how can we do this if everything is thread local? And you want everything to be thread local, because synchronization on memory allocation/de-allocation is one of the most common performance killers.

So we handle this by using a MultiReaderSingleWriterStack. The idea is, like in the previous post, that we have a designated data structure that has only a single writer and is held in a thread local value, but that is expected to be read by multiple threads. In particular, the thread that is going to be reading from it is the idle thread. Once a minute, it is going to scan all the threads, find all the saved memory that we have allocated but haven’t used in the past minute, and clear it.

Because this can run concurrently with the thread using the memory itself, we’ll only free the memory that is parked (as in, waiting for work to come), not memory that is actively used. We also need to make sure that we aren’t taking the idle memory to be disposed at the same time the thread is reaching for its memory bank to pull a new range. Here we have to have a thread safe operation. We do that using an interlocked operation on the InUse field. Both the cleanup thread and the owning thread will try to set that value, but only one of them will succeed.
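
A minimal sketch of that handshake (the type and method names here are my own; only the InUse field comes from the description above): both sides race on the same flag with an interlocked compare-exchange, and whoever loses simply moves on, the owner by allocating a fresh block and the cleaner by leaving the memory for the next pass.

```csharp
// Sketch only; the real code manages unmanaged memory inside RavenDB.
using System;
using System.Threading;

public class PooledSegment
{
    public IntPtr Address;
    public DateTime LastUsed;
    public int InUse; // 1 = owned, 0 = parked and up for grabs

    // Called by the owning thread when it reaches into its memory bank.
    public bool TryReserveForWork()
    {
        return Interlocked.CompareExchange(ref InUse, 1, 0) == 0;
    }

    // Called by the idle thread, once a minute, for segments that haven't
    // been used recently. Only one of the two callers can win the race.
    public bool TryTakeForCleanup()
    {
        return Interlocked.CompareExchange(ref InUse, 1, 0) == 0;
    }
}
```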

In this way, under load, we get to use purely local memory, but we also have automatic cleanup going on if we have an idle period.

Optimizing read transaction startup time: Every little bit helps, a LOT

time to read 2 min | 269 words

Some of the performance work that I have been doing recently is complex, and then, some of it isn’t:

image

And its implementation:

image

And the optimized version:

image

And the result?

image

The really interesting thing is that we have an improvement that is about 3 times more than we would expect. Why is that?

We can see that we eliminated the ResourceName call, and saved all of that cost, but we have a much lower cost overall; where did that come from?

This is something that we have seen over and over again. The obvious answer is that we now do fewer allocations, so we have less GC work to do, which is true as far as it goes, but it is also true that when we are even slightly faster, things start to align more closely, and we get a chain reaction (slightly less work means more time to process the actual requests, so we can get more done in less time).

RavenDB 3.5 RTM released

time to read 1 min | 112 words

I’m very proud to announce that RavenDB 3.5 is out and about, feeling quite awesome about itself.

image

Here is some of our team in Las Vegas celebrating the release:

image

You can go read the full feature highlights page, or you can download it or just start playing with it directly.

This release represents about 3 years of effort and over three thousand commits.

This release makes me very happy.

Optimizing read transaction startup time: The performance triage

time to read 2 min | 239 words

This series is no longer appropriately named, but we’ll go with that for the sake of grouping everything together.

So far we have seen how we start with a fixed benchmark and use a profiler to narrow down all sorts of hotspots. However, we do that in a pretty strange way. Consider where we started:

Now consider this series’ name. Why are we even looking at OpenReadTransaction in the first place? There is a big giant hotspot in GetDocumentsById that we can start with.

Even a small change there is likely to generate a lot more results for our benchmark than eliminating everything else entirely. So why focus on those first?

Put simply, this isn’t just about this particular benchmark. We focused on everything else because those are costs that are shared across the board. Not only for the “read random documents” benchmark but for pretty much anything else in RavenDB.

By reducing those costs, we are enabling ourselves to get much better performance overall. And we’ll get to the actual meat of this benchmark later, when it takes up the bulk of the work.

Look at the relative costs here. We moved from 60% of the request being spent there to over 80% being spent there. (Don’t look at the actual numbers; different machines, load and timing are responsible for those.)

Now any work that we do here will have much greater impact.

Optimizing read transaction startup time: Unicode ate my perf and all I got was

time to read 3 min | 469 words

As an aside, I wonder how much stuff will break because the title of this post has 💩 in it.

The topic of this post is the following profiler output. GetSliceFromKey takes over 6% of our performance, or about 24.5 seconds out of the total run. That kinda sucks.

image

What is the purpose of this function? Well, RavenDB’s document ids are case insensitive, so we need to convert the key to lower case and then do a search on our internal index. That has quite a big cost associated with it.

And yes, we are aware of the pain. Luckily, we are already working with a highly optimized codebase, so we aren’t seeing this function dominate our costs, but still…

Here is the breakdown of this code:

image

As you can see, over 60% of this function is spent in just converting to lower case, which sucks. Now, we have some additional knowledge about this function. For the vast majority of cases, we know that this function will handle only ASCII characters, and that Unicode document ids are possible, but relatively rare. We can utilize this knowledge to optimize this method. Here is what this will look like, in code:

image

Basically, we scan through the string, and if there is a character whose value is over 127 we fall back to the slower method (slower being relative; that is still a string conversion in less than 25 μs).

Then we just check whether a character is in the upper case range and convert it to lower case (ASCII bit analysis is funny; it was intentionally designed to be used with bit masking, and all sorts of tricks are possible there) and store it in the buffer, or just store the original value.
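
The real method is in the screenshot above; as a rough illustration of the same idea (the method names and the slow-path fallback here are simplified stand-ins, not the actual RavenDB code):

```csharp
// Illustrative sketch of the ASCII fast path described above.
private static void ToLowerKey(string key, byte[] buffer)
{
    for (int i = 0; i < key.Length; i++)
    {
        char c = key[i];
        if (c > 127)
        {
            // Rare case: a non-ASCII document id, fall back to the (relatively)
            // slow culture-aware conversion for the whole key.
            ToLowerKeyUnlikely(key, buffer);
            return;
        }

        // ASCII was designed for bit masking: setting bit 0x20 turns an upper
        // case letter into its lower case counterpart.
        buffer[i] = c >= 'A' && c <= 'Z' ? (byte)(c | 0x20) : (byte)c;
    }
}

private static void ToLowerKeyUnlikely(string key, byte[] buffer)
{
    // Simplified stand-in for the slow path: a full string conversion.
    var bytes = System.Text.Encoding.UTF8.GetBytes(key.ToLowerInvariant());
    System.Array.Copy(bytes, buffer, System.Math.Min(bytes.Length, buffer.Length));
}
```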

The result?

image

This method went from taking 5.89% to taking just 0.6%, and we saved over 22 seconds(!) in the run. Again, this is under the profiler, and the code is heavily multi-threaded. In practice, this means that the run took 3 seconds less.

Either way, we are still doing pretty good, and we don’t have that pile of poo, so I guess we’re good.

