Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969


RavenDB on Raspberry PI

time to read 4 min | 645 words

This has been a dream of mine for quite some time, to be able to run RavenDB on the Raspberry PI. The Raspberry PI is a great toy to play with, maybe do some home automation or stream some videos to your TV. But while that is cool, what I really wanted to see is if we could run RavenDB on top of the Raspberry PI, and when we do that, what kind of behavior we’ll see.

We got it working, and we are so excited that we are raffling off a brand new Raspberry PI, in its role as a RavenDB 4.0 server.

The first step along that way, running on Linux, was already achieved over a year and a half ago, but we couldn’t just run on the Raspberry PI: it is an ARM architecture, and we needed the CoreCLR ported there first. Luckily for us, quite a few people are also interested in running .NET on ARM, and we have been tracking the progress of the project for over a year, trying to see if anyone would ever get to the hello world moment.

[image]

[image]

For the past month, it looked like things had settled down, the number of failing tests dropped to close to zero, and we decided that it was time to take another stab at running RavenDB on the Raspberry PI.

Actually getting there wasn’t trivial. Since we are using the raw CoreCLR, we don’t have the dotnet executable to manage things for us; we are directly invoking corerun, and getting everything just right for this was… finicky. For example, trying to compile the CoreCLR on the Raspberry PI resulted in the Raspberry PI experiencing a heat death that caused no little excitement at the office.

[image]

RavenDB is running on that using CoreCLR (and quite a bit of sweat) on top of an Ubuntu 16.04 system. To my knowledge, RavenDB is the first significant .NET project that has managed to run on the Pi. As in, the first thing beyond a simple hello world.

Naturally, once we got it working, we had to put it through its paces and see what we can get out of this adorable little machine.

The current numbers are pretty good, but on the other hand, we can’t currently compile with -O3, so that limits us to -O1. We also just grabbed a snapshot of our code to test, without actually taking the time to profile and optimize for that machine.

With all of those caveats, allow me to share what is probably the best part of it:

[image]

This is a benchmark running from a different Ubuntu machine against the Raspberry PI, asking to get a whole bunch of documents.

This is processing requests at a rate of 21,476 req / sec. That is a pretty awesome number.

How about writes?

[image]

So that is 2,360 writes / sec.

Our next project is setting up a cluster of those; you can see that we have already started the work…

[image]

We’ll be posting a lot more about what you can now do with this in the next few days.

Come to our booth at DotNext Moscow

time to read 1 min | 150 words

This Friday, our team is going to be at DotNext Moscow, showing off RavenDB 4.0 and raffling off some really cool prizes.

You can also come and learn optimization techniques that allowed us to get more than 100,000 req/sec with RavenDB 4.0.

It is going to be a lot of fun, and we are expecting some really interesting discussions on the way we are building RavenDB 4.0, so we sent three of our core developers to give you all the details about it.

This is going to be the very first time that we are going to be showing off what RavenDB 4.0 can do.

10x speedup utilizing the Nagle Algorithm in a business application

time to read 4 min | 793 words

The Nagle algorithm is a pretty annoying thing. Basically, it means that when we write to the network, the TCP stack will wait a bit to see if we have more stuff to send to that destination before actually emitting the packet. Most people run into this when they wonder why the minimum time for a remote operation is 200ms, even on the local network. Then you figure out how to disable the Nagle algorithm, and you are happy again.
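
In .NET, for example, turning it off is just a matter of setting NoDelay on the socket, something like:

```csharp
using System.Net.Sockets;

public static class NagleSettings
{
    // Disabling the Nagle algorithm in .NET: NoDelay = true makes small writes go out
    // immediately instead of being buffered while waiting for more data.
    public static TcpClient CreateLowLatencyClient()
    {
        var client = new TcpClient { NoDelay = true };
        // the same flag exists on the underlying Socket:
        // client.Client.NoDelay = true;
        return client;
    }
}
```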

The Nagle algorithm was designed for remote terminals, where the speed difference between a human typing and the machine sending packets was big enough that each single letter you typed would be sent as a separate packet. That led to a really high overhead, 40 bytes of overhead to send just a single byte to the server, and the number of packets sent could be high enough to cause the pipe to choke. So you buffer it. The basic algorithm goes like this: if we don’t have enough data to send a full packet, and 200 ms haven’t passed since the first buffered data, wait up to 200 ms for more data. In this manner, if you type fast enough, you will send more than a single character to the server, dramatically reducing the cost of talking with the server, and speeding up everything significantly.

So this is the Nagle algorithm. It is a pretty low level TCP detail, and often overlooked by people. If you don’t study networks, you’ll typically only find out about it when you have a perf problem that you can’t figure out. So how does this relate to business applications?

Imagine that you work on a system that does snail mail sending. A user will call the API and then a few days later a physical letter will show up at your door. Nothing really special about that, right? Except that we charge users based on the plan they choose. For simplicity’s sake, we’ll say that we have two plans:

  • Pay as you go – can send as many letters as they want, and we charge a certain amount for each.
  • Budgeted plan – can send up to a certain number of letters per month.

In either case, it is pretty important to us to record that the mail was sent, and if the user is on a budget plan (or has spending alerts on his account), we need to respond to it in certain ways.

Overall, there is nothing really surprising here, and the code to handle this is pretty simple, right?
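
Something along these lines, as a sketch (all the types here are illustrative, not a real API):

```csharp
using System.Threading.Tasks;

// A sketch of the straightforward approach: one billing call per recorded mail.
// IBillingService, MailDetails and BillingResult are illustrative types.
public class MailDetails { public string Address; }
public class BillingResult { public bool OverBudget; }

public interface IBillingService
{
    Task<BillingResult> RecordMailSentAsync(string customerId, MailDetails details);
}

public class MailEndpoint
{
    private readonly IBillingService _billing;

    public MailEndpoint(IBillingService billing)
    {
        _billing = billing;
    }

    public async Task RecordMail(string customerId, MailDetails details)
    {
        // one remote round trip to the billing service for every single letter
        var result = await _billing.RecordMailSentAsync(customerId, details);

        if (result.OverBudget)
        {
            // react according to the plan: reject the request or raise a spending alert
        }
    }
}
```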

The problem is that under load, we’re going to have quite a few requests going to the billing service, and that is likely to be a bottleneck for us. Note that this is the case because while processing a single RecordMail request is pretty fast, we are actually going to have to wait for both the actual processing and the back and forth between the machines. If the cost of processing a RecordMail request is 1 ms, and the cost of going over the network is 0.5 ms, that adds up quickly.

Now, let us see how we can do it better. We’ll start with moving from making the call immediately to placing the details in a queue.

You can see that we use a task completion source to wait for the result of the call. Next, we need to actually handle this; that is done in the following methods:
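
Here is a sketch of the shape of that code, reusing the illustrative types from the sketch above. The important parts are the TaskCompletionSource per request and the batching loop that flushes once per 200 ms or once per 256 items:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// A hypothetical batch endpoint on the billing service.
public interface IBatchBillingService
{
    Task<IReadOnlyList<BillingResult>> RecordMailsSentAsync(
        IReadOnlyList<(string CustomerId, MailDetails Details)> batch);
}

public class BatchingBillingClient
{
    private class PendingMail
    {
        public string CustomerId;
        public MailDetails Details;
        public readonly TaskCompletionSource<BillingResult> Completion =
            new TaskCompletionSource<BillingResult>(TaskCreationOptions.RunContinuationsAsynchronously);
    }

    private readonly BlockingCollection<PendingMail> _queue = new BlockingCollection<PendingMail>();
    private readonly IBatchBillingService _billing;

    public BatchingBillingClient(IBatchBillingService billing)
    {
        _billing = billing;
        Task.Run(ProcessQueueAsync); // background loop that does the actual billing calls
    }

    public Task<BillingResult> RecordMailSentAsync(string customerId, MailDetails details)
    {
        var pending = new PendingMail { CustomerId = customerId, Details = details };
        _queue.Add(pending);
        return pending.Completion.Task; // the caller awaits until its batch has been processed
    }

    private async Task ProcessQueueAsync()
    {
        while (true)
        {
            // block for the first item, then gather up to 256 items or 200 ms worth of them
            var batch = new List<PendingMail> { _queue.Take() };
            var deadline = DateTime.UtcNow.AddMilliseconds(200);
            while (batch.Count < 256 &&
                   _queue.TryTake(out var next,
                       (int)Math.Max(0, (deadline - DateTime.UtcNow).TotalMilliseconds)))
            {
                batch.Add(next);
            }

            try
            {
                // a single remote call for the entire batch
                var results = await _billing.RecordMailsSentAsync(
                    batch.ConvertAll(p => (p.CustomerId, p.Details)));
                for (int i = 0; i < batch.Count; i++)
                    batch[i].Completion.TrySetResult(results[i]);
            }
            catch (Exception e)
            {
                foreach (var pending in batch)
                    pending.Completion.TrySetException(e);
            }
        }
    }
}
```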

What this ends up doing is reducing the network costs. Instead of going to the server once for every request, we’ll go to the server once per 200 ms or once every 256 requests. That dramatically reduces network traffic. And the cost of actually sending 256 requests instead of just one isn’t really that different. Note that this gives us a higher latency per request, since we may have to wait longer for the request to go out to the server, but it also gives me much better throughput, because I can process more requests in a shorter amount of time.

A change like that can push your performance up by an order of magnitude or more. It also gives you economy of scale benefits on the other side. Instead of having to do a fixed amount of work around processing each request, we can now do it over a whole bunch of them.

For example, if I need to load the customer entity, instead of doing it once per mail, I can do that once per batch, and it is likely that this customer will be in the batch more than once, reducing the amount of work I have to do.

The performance regression in the optimization: Part II

time to read 4 min | 762 words

In my previous post I showed a performance conundrum: code that had been optimized to reduce heavy allocation usage became over twice as slow.

In particular, we had a problem here: the new code is 3.4 times slower than the old one, but how?

[image]

Now, the real scenario we had involved concurrent access, so it was much harder to figure out, but I cheated a bit when producing this image: I used a sampling profiler, instead of a tracing one. The major difference between the two is that a tracing profiler will also give you the number of calls. This is called out as something that you would typically do because you want to analyze algorithmic complexity, but I find it incredibly useful for figuring out what my code is actually doing.

And indeed, looking at the same code using a tracing profiler gives us the following two calls:

[image]

[image]

And when looking at the diffs between those two, we have:

[image]

So for some reason we are making 54 million more calls to the Equals method in the optimized version, but why? Both of those are using the exact same dictionary, using the exact same key type and the same keys, even.

In the real scenario we were facing, that wasn’t the case, so that made it hard to analyze the issue. We started looking into whether we were doing some sort of cache poisoning by having the buffer holder as the dictionary value, instead of the array directly, but that didn’t pan out. We kept circling around the number of Equals calls. Note that the number of calls to TryGetValue is the same, as well as the number of calls to GetHashCode. So what is the diff?

The diff, quite simply, is not here at all.

The problem is in the RemoveBefore method. In the old version, if we removed all the entries for a page, we’d remove it completely from the dictionary. In the new version, we reset the buffer so it can be used again next time. The problem with that approach is that the dictionary stays pretty big, much bigger than it would be with the old version of the code. And that means that we’ll need to find the value (which is empty), then check its content. In the old version, we’d just do a GetHashCode, find that there is no matching table entry, and exit.

Indeed, all we had to do was change RemoveBefore to look like this:
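
Roughly like this, as a sketch using illustrative names (_pages is the concurrent dictionary keyed by page number, each value a reusable buffer of entries with Start/End markers); the fix is the TryRemove once a page has no live entries left:

```csharp
// A sketch of the change: once a page has no live entries left, remove it from the
// dictionary entirely, instead of keeping an empty reusable buffer as its value.
public void RemoveBefore(long oldestActiveTransaction)
{
    foreach (var page in _pages.Keys)
    {
        if (_pages.TryGetValue(page, out var buffer) == false)
            continue;

        // skip entries that no active transaction can see anymore
        while (buffer.Start < buffer.End &&
               buffer.Entries[buffer.Start].TxId < oldestActiveTransaction)
            buffer.Start++;

        // the fix: drop drained pages so the dictionary stays small
        if (buffer.Start == buffer.End)
            _pages.TryRemove(page, out _);
    }
}
```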

And that gives us:

  • 14.0 seconds & 1.1 GB of memory for old version
  • 12.8 seconds & 0.4 GB of memory for new version

Which is a pretty good result overall. It gets better when you break it apart into its component parts.

[image]

This is actually surprising, since we didn’t really set out to optimize this call very much, and it is pretty much unchanged in both versions. I think that this is likely because we keep the buffers around longer, so they are more likely to be in the cache.

[image]

This shows more than double the speed we previously had, which is pretty awesome, since this code is actually called per transaction, so anything that reduces that cost is golden.

[image]

This happens during a flush, and reducing its duration is important to reducing the time we hold the write lock, so this is pretty sweet.

The performance regression in the optimization: Part I

time to read 4 min | 629 words

PageTable is a pretty critical piece of Voron. It is the component responsible for remapping modified pages in transactions and is the reason why we support MVCC and can avoid taking locks for the most part. It has been an incredibly stable part of our software, rarely changing and pretty much the same as it was when it was initially written in 2013. It has been the subject of multiple performance reviews in that time, but acceptable levels of performance from our code in 2013 are no longer acceptable today. PageTable came up recently in one of our performance reviews as a problematic component. It was responsible for too much CPU and far too many allocations.

Here is a drastically simplified implementation, which retains the salient points:
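
In outline, it looks something like this (an illustrative stand-in, not the actual Voron code):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Collections.Immutable;

// An illustrative stand-in: each modified page maps to an immutable list of
// (transaction id, new page position) entries.
public struct PagePosition
{
    public long TxId;
    public long Position;
}

public class PageTable
{
    private readonly ConcurrentDictionary<long, ImmutableList<PagePosition>> _pages =
        new ConcurrentDictionary<long, ImmutableList<PagePosition>>();

    public void SetItems(long txId, Dictionary<long, long> modifiedPages)
    {
        foreach (var kvp in modifiedPages)
        {
            // AddOrUpdate takes an 'add' and an 'update' delegate; both lambdas capture
            // txId and kvp, so every modified page allocates closures plus a new
            // immutable list on each update.
            _pages.AddOrUpdate(kvp.Key,
                page => ImmutableList.Create(new PagePosition { TxId = txId, Position = kvp.Value }),
                (page, existing) => existing.Add(new PagePosition { TxId = txId, Position = kvp.Value }));
        }
    }

    public bool TryGetValue(long txId, long page, out long position)
    {
        // called concurrently by readers, without any locking
        position = -1;
        if (_pages.TryGetValue(page, out var entries) == false)
            return false;

        for (int i = entries.Count - 1; i >= 0; i--)
        {
            if (entries[i].TxId <= txId)
            {
                position = entries[i].Position;
                return true;
            }
        }
        return false;
    }

    public void RemoveBefore(long oldestActiveTransaction)
    {
        // only ever called under the write lock, alongside SetItems
        foreach (var page in _pages.Keys)
        {
            if (_pages.TryGetValue(page, out var entries) == false)
                continue;

            var remaining = entries.RemoveAll(p => p.TxId < oldestActiveTransaction);
            if (remaining.IsEmpty)
                _pages.TryRemove(page, out _);
            else
                _pages[page] = remaining;
        }
    }
}
```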

Here is the sample workout for this class, which just simulates ten thousand transactions. This little scenario takes 15.3 seconds and allocates a total of 1.1GB of memory! That is a lot of allocations, and there must be a tremendous amount of time spent in GC. The most problematic issue here is the SetItems method, which will allocate two different delegates for each modified page in the transaction. Then there is the total abandon with which we allocate additional memory in there. As you can imagine, we weren’t very happy about this, so we set out to fix this issue.

We can take advantage of the fact that SetItems and RemoveBefore are only called under lock, while TryGetValue is called concurrently with everything else.

So I wrote the following code:
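
It looked something like this (again an illustrative sketch, reusing the PagePosition struct from above): the dictionary value is now a reusable, mutable buffer holder, writers serialize on a lock, and readers just accept slightly stale data.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// The optimized direction, sketched: the value is a reusable buffer that we mutate in
// place, so SetItems and RemoveBefore stop allocating per page. Writers hold a lock;
// concurrent readers may see stale data, which is acceptable here.
public class PageTable
{
    private class PagesBuffer
    {
        public PagePosition[] Entries = new PagePosition[4];
        public int Start, End;
    }

    private readonly ConcurrentDictionary<long, PagesBuffer> _pages =
        new ConcurrentDictionary<long, PagesBuffer>();
    private readonly object _writeLock = new object();

    public void SetItems(long txId, Dictionary<long, long> modifiedPages)
    {
        lock (_writeLock)
        {
            foreach (var kvp in modifiedPages)
            {
                if (_pages.TryGetValue(kvp.Key, out var buffer) == false)
                    _pages[kvp.Key] = buffer = new PagesBuffer();

                if (buffer.End == buffer.Entries.Length)
                    Array.Resize(ref buffer.Entries, buffer.Entries.Length * 2);

                buffer.Entries[buffer.End++] = new PagePosition { TxId = txId, Position = kvp.Value };
            }
        }
    }

    public void RemoveBefore(long oldestActiveTransaction)
    {
        lock (_writeLock)
        {
            foreach (var buffer in _pages.Values)
            {
                while (buffer.Start < buffer.End &&
                       buffer.Entries[buffer.Start].TxId < oldestActiveTransaction)
                    buffer.Start++;

                if (buffer.Start == buffer.End)
                    buffer.Start = buffer.End = 0; // reset for reuse; the entry stays in the dictionary
            }
        }
    }

    public bool TryGetValue(long txId, long page, out long position)
    {
        position = -1;
        if (_pages.TryGetValue(page, out var buffer) == false)
            return false;

        for (int i = buffer.End - 1; i >= buffer.Start; i--)
        {
            if (buffer.Entries[i].TxId <= txId)
            {
                position = buffer.Entries[i].Position;
                return true;
            }
        }
        return false;
    }
}
```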

This relies on allowing stale reads from concurrent readers, which we don’t care about since they wouldn’t be able to make use of the data anyway, and it was able to reduce the allocations to just 320 MB, but the runtime actually went up to 32 seconds.

That is quite annoying, as you can imagine, and much cursing ensued as a result. I then pulled out my trusty profiler and asked it kindly to figure out what piece of code needs to be hit with a rolling pin and have a stern talk with about what is expected from code after it has been laboriously and carefully optimized. It is expected to sit nicely and be fast, or by Git I’ll revert you.

What the hell?! Here are the original implementation costs, and you can clearly see how much time we are spending on garbage collection.

[image]

And here is the optimized version, which is actually slower, and actually used more memory?!

[image]

There are a bunch of interesting things going on here. We can see that we are indeed spending a little less time in GC, and that both the RemoveBefore and SetItems methods are much cheaper, but the cost of TryGetValue is so much higher. In fact, if we compare the two, we have:

[image]

So we are 3.4 times higher, and somehow, the cost of calling the concurrent dictionary TryGetValue has risen by 88%.

But the implementation is pretty much the same, and there isn’t anything else that looks like it can cause that much of a performance gap.

I’ll leave this riddle for now, because it drove me crazy for two whole days, and I’ll give you the details on what is going on in the next post.

Reviewing pull requests with a large number of commits

time to read 2 min | 343 words

I have to go over a fairly complex pull request, adding a pretty big feature. The feature, if you care, is compressing data in the leaf pages of a B+Tree for use in Map/Reduce entries storage. The current (and not final) results reduce storage costs from about 650MB to 32MB, so we are pretty happy with it, but that isn’t the topic of this post.

At this stage, the feature is not complete, but it is ready for initial review, so we created a temporary PR to review it, here is a partial view:

[image]

There are over 30 commits over a period of three weeks there, and they represent a lot of experimentation as we went along. This is fine & proper as far as we are concerned, since this represents quite a bit of work, and we certainly want to be able to track it. But since this isn’t completed, the commit history is long, which means that reviewing it one commit at a time is not trivial. In particular, there are a lot of back and forth changes in the same place, and I don’t want to review code that has since been changed.

Luckily, git makes it easy to change that view.

I pulled the changes to my local branch, reset to the last commit before the merge, and then just squashed all those commits into a single one. You can check it out here.
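
In concrete terms, the sequence is roughly this (branch names here are placeholders):

```bash
# roughly the sequence described, with placeholder branch names
git checkout -b compression-review origin/compression-feature  # local copy of the PR branch
git reset --soft origin/master       # back to the base commit, keeping every change staged
git commit -m "Leaf page compression, squashed for review"
git push myfork compression-review   # push the single-commit view somewhere reviewable
```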

The final result is:

[image]

Which is a much nicer way to go about reviewing this, since I can look at all the changes in one shot, and more to the point, I can make comments on the changes using our normal workflow.

Digging into the CoreCLR: Some bashing on the cost of hashing

time to read 2 min | 372 words

Note: This post was written by Federico.

Recently at CoreFX there has been a proposal to deal with the typical case of everyone writing their own hash combining logic. I really like the framework pushing that kind of functionality, because hashing is one of those things that, unless you really know what you are doing, can backfire pretty badly (and even if you know, you can step over the cliff too). But this post is not about the issue itself; it showcases a few little details that happen in JIT land (and that we usually abuse, of course). This is the issue: https://github.com/dotnet/corefx/issues/8034#issuecomment-260733285

Let me illustrate with some examples.
ValueTuple is a new type being introduced for some of the new tuple functionality in C# 7. Because hashing is important, of course they implemented the ability to combine hashes. Now, let’s suppose that we take the actual hashing code that is being used for ValueTuple and use constants to call it.
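
Something along these lines (HashHelper.Combine here is a stand-in I wrote for the proposed combining routine, not the real API):

```csharp
using System;

// A sketch of the comparison described: combining two constant hashes via a ValueTuple
// versus a plain static helper. HashHelper.Combine is an illustrative stand-in.
public static class HashHelper
{
    public static int Combine(int h1, int h2)
    {
        // a typical rotate-then-xor combiner, for illustration
        uint rol5 = ((uint)h1 << 5) | ((uint)h1 >> 27);
        return ((int)rol5 + h1) ^ h2;
    }
}

public static class HashBenchmarks
{
    public static int ViaValueTuple()
    {
        // builds a ValueTuple<int, int> on the stack, then calls its GetHashCode
        return ValueTuple.Create(10, 20).GetHashCode();
    }

    public static int ViaStaticHelper()
    {
        // with constants, the JIT has a much easier time folding this down
        return HashHelper.Combine(10, 20);
    }
}
```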

Now under an optimizing compiler chances are that there shouldn't be any difference, but in reality there is.

This is the actual machine code for ValueTuple:

So what can be seen here? First we are creating a struct on the stack, then we are calling the actual hash code method.

Now compare it with the use of HashHelper.Combine, which for all intents and purposes could be the actual implementation of Hash.Combine:

I know!!! How cool is that???
But let’s not stop there... let’s use actual parameters:

We use random values here to force the JIT to treat them as parameters, so it is not able to eliminate the code and convert it yet again into a constant.

The good thing is, this is extremely stable. But let’s compare it with the alternative:

The question that you may be asking yourselves is: does that scale? So, let’s go overboard...

And the result is pretty illustrative:

The takeaway of the analysis is simple: even if the holding type is a struct, it doesn't mean it is free :)

Making code faster: Micro optimizations and parallel work

time to read 2 min | 339 words

I really wanted to leave this series of posts alone. Getting 135 times faster should be fast enough for everyone, just like 640KB was.

Unfortunately, performance optimization is addictive. Last time, we left it at 283 ms per run. But we still left some performance on the table. I mean, we had inefficient code like this:

[image]

Just look at it. Analysis showed that it is always called with 2, 4 or 8 only. So we naturally simplified things:

[image]

Forcing the inlining of those methods also helped, and pushed us further toward 240 ms.

Another cost we had was the date diff calculation. We optimized for the case where the day is the same, but in our dataset, we have about 2 million records that cross the day line. So we further optimized for the scenario where the year & month are the same, and just the day is different. That pushed us further toward 220 ms.
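
The idea, as a hypothetical sketch of that fast path:

```csharp
// A hypothetical sketch of the fast path described: when year & month match, the
// difference only depends on the day number and the seconds into each day.
private static int DiffSecondsSameYearAndMonth(int day1, int secondsIntoDay1,
                                               int day2, int secondsIntoDay2)
{
    const int SecondsPerDay = 24 * 60 * 60; // 86,400
    return (day2 - day1) * SecondsPerDay + (secondsIntoDay2 - secondsIntoDay1);
}
```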

At this point the profiler was basically laughing at us, and we had no real avenues to move forward, so I made the code use 4 threads, each processing the file at different locations.

That gave me: 73 ms and allocated 5,640 kb with peak working set of 300,580 kb

  • 527 times faster than the original version.
  • Allocates 1,350 times less memory.
  • 1/3 of the working set.
  • Able to process 3.7 GB / sec.

Note that at this point, we are relying on this being in the file system cache, because if I was reading it from disk, I wouldn’t be able to do more than 100 – 200 MB / sec.

Here is the full code; write code like this at your peril.

Making code faster: Specialization makes it faster still

time to read 3 min | 403 words

Okay, at this point we are really pushing it, but I do wonder if we can get it faster still?

[image]

So we spend a lot of time in the ParseTime call, parsing two dates and then subtracting them. I wonder if we really need to do that?

I wrote two optimizations, one to compare only the time part if the dates are the same, and the second to do the date compare in seconds, instead of ticks. Here is what this looks like:
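
The prefix check is the interesting bit; here is a sketch of the same trick (not the actual code):

```csharp
// A sketch of the trick: treat the first 8 bytes as a long and the next 4 as an int,
// so equality of the 12-byte date prefix costs just two comparisons.
// Assumes 'a' and 'b' each point at a buffer that is at least 12 bytes long.
private static unsafe bool SameDatePrefix(byte* a, byte* b)
{
    return *(long*)a == *(long*)b &&
           *(int*)(a + 8) == *(int*)(b + 8);
}
```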

Note that we compare the first 12 bytes using just 2 instructions (by comparing long & int values), since we don’t care what they are, only that they are equal. The result:

283 ms and allocated 1,296 kb with peak working set of 295,200 kb

So we are now 135 times faster than the original version.

Here is the profiler output:

[image]

And at this point, I think that we are pretty much completely done. We can parse a line in under 75 nanoseconds, and we can process about 1 GB a second on this machine (my year-old-plus laptop).

We can see that skipping the date compare in favor of the time compare, when we can, pays off in about 65% of the cases, so that is probably a nice boost right there. But I really can’t think of anything else that we can do here that would improve matters in any meaningful way.

For comparison purposes.

Original version:

  • 38,478 ms
  • 7,612,741 kb allocated
  • 874,660 kb peak working set
  • 50 lines of code
  • Extremely readable
  • Easy to change

Final version:

  • 283 ms
  • 1,296 kb allocated
  • 295,200 kb peak working set
  • 180 lines of code
  • Highly specific and requires specialized knowledge
  • Hard to change

So yes, that is 135 times faster, but the first version took about 10 minutes to write, then another half an hour of fiddling with it to make it not obviously inefficient. The final version took several days of careful thought, analysis of the data, and careful optimizations.

