Ayende @ Rahien


Overloading the Windows Kernel and locking a machine with RavenDB benchmarks

time to read 4 min | 781 words

While benchmarking RavenDB, we have run into several instances where the entire machine would freeze for a long duration, resulting in utter unresponsiveness.

This has been quite frustrating to us, since a frozen machine makes it kind of hard to figure out what is going on. But we finally figured it out, and all the details are right here in the screenshot.

[Image: benchmark screenshot showing disk activity on drive C (system) and drive D (data)]

What you can see is us running our current benchmark, importing the entire StackOverflow dataset into RavenDB. Drive C is the system drive, and drive D is the data drive that we are using to test RavenDB's performance.

Drive D is actually a throwaway SSD. That is, an SSD that we use purely for benchmarking and not for real work. Given the amount of workout we give the drive, we expect it to die eventually, so we don’t want to trust it with any other data.

At any rate, you can see that due to a different issue entirely, we are now seeing data syncs in excess of 8.5 GB. So basically, we wrote 8.55 GB of data very quickly into a memory mapped file, and then called fsync. At the same time, we started increasing our scratch buffer usage, because calling fsync on 8.55 GB can take a while. Scratch buffers are a really interesting thing. They were born because of Linux's crazy OOM design, and are basically a way for us to avoid paging. Instead of allocating memory on the heap like normal, which would then subject us to paging, we allocate a file on disk (marked as temporary & delete on close) and then we mmap the file. That gives us a way to guarantee that Linux will always have a place to page out any of our memory.
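
The mechanics look something like this in .NET (a minimal sketch of the idea, not the actual Voron code):

    using System.IO;
    using System.IO.MemoryMappedFiles;

    // A scratch buffer: a temporary, delete-on-close file that we memory map,
    // so the OS always has a backing file to page this memory out to.
    public static class ScratchBufferSketch
    {
        public static MemoryMappedFile Create(string path, long sizeInBytes)
        {
            var file = new FileStream(
                path, FileMode.Create, FileAccess.ReadWrite,
                FileShare.ReadWrite | FileShare.Delete,
                4096,                              // stream buffer size, irrelevant to the mapping
                FileOptions.DeleteOnClose);        // temporary: removed when the handle closes

            return MemoryMappedFile.CreateFromFile(
                file, null, sizeInBytes,           // no map name, capacity = scratch size
                MemoryMappedFileAccess.ReadWrite,
                HandleInheritability.None,
                leaveOpen: false);                 // dispose the stream together with the mapping
        }
    }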

This also has the advantage of making it very clear how much scratch memory we are currently using, and on Azure / AWS machines, it is easier to place all of those scratch files on the fast temp local storage for better performance.

So we have a very large fsync going on, a large number of memory mapped files, a lot of activity (that modifies some of those files) and memory pressure.

That forces the Kernel to evict some pages from memory to disk, to free something up. And under normal conditions, it would do just that. But here we run into a wrinkle: the memory we want to evict belongs to a memory mapped file, so the natural thing to do would be to write it back to its original file. This is actually what we expect the Kernel to do for us, and while for scratch files this is usually a waste, for the data file this is exactly the behavior that we want. But that is beside the point.

Look at the image above, we are supposed to be only using drive D, so why is C so busy? I’m not really sure, but I have a hypothesis.

Because we are currently running a very large fsync, I think that drive D is not currently processing any additional write requests. The “write a page to disk” operation has pretty strict runtime requirements; it can’t just wait for the I/O to return whenever that might be. Consider the fact that you can open a memory mapped file over a network drive; I think it is very reasonable that the Kernel will have a timeout mechanism for this kind of I/O. When the pager sees that it can’t write to the original file fast enough, it shrugs and writes those pages to the local page file instead.

This turns an otherwise very bad situation (very long wait / crash) into a manageable one. However, with the amount of work we put on the system, that effectively forces us to do heavy paging (on the order of GBs), and that in turn leads to a machine that appears to be locked up due to all the paging. So the fallback error handling is actually causing this issue by trying to recover, at least that is what I think.

When examining this, I wondered if this can be considered a DoS vulnerability, and after careful consideration, I don’t believe so. This issue involves using a lot of memory to cause enough paging to slow everything down, the fact that we are getting there in a somewhat novel way isn’t opening us to anything that wasn’t there already.

Reducing the cost of occasionally needed information

time to read 3 min | 486 words

Consider the case when you need to call a function and, based on the result of the function and some other information, do some work. That work requires additional information that can only be computed by calling the function.

Sounds complex? Let us look at the actual code. I tried to make it simple, but keep the same spirit.

This code has three different code paths. We might be inserting a value that is already in the tree, so we update it and return. Or we insert a new value and the page has enough space, so we add it and return.

Or, the interesting case for us, we need to split the page, which requires modifying the parent page (which may require splitting that page, and so on). So the question is, how do we get the parent page?
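
To make that concrete, a simplified sketch of such a write path might look like this (all names here are stand-ins, except FindPageFor, which is the actual method discussed below):

    public void Add(string key, byte[] value)
    {
        var page = FindPageFor(key);

        if (page.Contains(key))
        {
            page.Update(key, value);        // path 1: the value is already in the tree, update in place
            return;
        }

        if (page.HasSpaceFor(key, value))
        {
            page.Insert(key, value);        // path 2: enough free space in the page, just add it
            return;
        }

        // path 3: the page must be split, which modifies the parent page
        // (which may need to be split in turn) - and for that we need the parents.
        SplitPageAndPropagate(page, key, value);
    }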

This is important because we already found the parent page; we had to go through it to find the actual page we returned from FindPageFor. So ideally we would just return the list of parent pages whenever we call FindPageFor.

The problem is that read scenarios, which by far outweigh the write operations, never need this value. What is more, even write operations only need it occasionally. On the other hand, when we do need it, we have already done all the work needed to just give it to you, and it would be a waste to do it all over again.

We started the process (this particular story spans 4 years or so) by adding an optional parameter. If you passed a non-null value, we would fill it with the relevant details. So far, so good: reads send it null, but all the writes had to send it, and only a few of them actually had to use it.

The next step was to move the code to use an out lambda. In other words, when you called us, we would also give you a delegate that you could call, which would give you the details you need. This way, you only pay the price when you actually need it.

It looked like this:
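
Roughly, the shape was along these lines (a sketch with assumed names; only FindPageFor comes from this post):

    // The caller gets a delegate it can invoke only when it actually needs the
    // ancestor pages. The catch: capturing 'cursor' allocates a closure on every
    // call, whether or not the delegate is ever invoked.
    public Page FindPageFor(string key, out Func<List<Page>> getAncestors)
    {
        var cursor = new List<Page>();            // pages visited on the way down
        var page = SearchPageTree(key, cursor);   // assumed helper doing the actual descent
        getAncestors = () => cursor;              // the closure allocation happens here
        return page;
    }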

However, that led to a different problem: while we only paid the price of calling the lambda when we needed it, we still paid the price of allocating the lambda itself.

The next step was to create two overloads, one used only for reads and one for writes. The write overload allows us to pass a struct as an out parameter. The key here is that because it is a struct there is no allocation, and the method is written so that the values that were previously captured by the lambda are placed on the struct. Then, if we need to actually generate the value, we do that, but with no additional allocations.
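
Continuing that sketch, the struct based version looks roughly like this (again, the names are assumptions, not the actual RavenDB types):

    // The state the lambda used to capture now lives in a struct, so handing it
    // to the caller allocates nothing. Building the ancestor list is deferred to
    // the rare page split path, and only that path pays for it.
    public readonly struct AncestorsConstructor
    {
        private readonly Tree _tree;
        private readonly string _key;

        public AncestorsConstructor(Tree tree, string key)
        {
            _tree = tree;
            _key = key;
        }

        public List<Page> Build()                 // called only when we actually split a page
        {
            var ancestors = new List<Page>();
            _tree.SearchPageTree(_key, ancestors);
            return ancestors;
        }
    }

    // Read overload: no ancestor information at all.
    public Page FindPageFor(string key)
    {
        return SearchPageTree(key, null);
    }

    // Write overload: hands back the lazily evaluated cursor with zero allocations.
    public Page FindPageFor(string key, out AncestorsConstructor ancestors)
    {
        ancestors = new AncestorsConstructor(this, key);
        return SearchPageTree(key, null);
    }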

And this post is now about six times longer than the actual optimization itself.

Dynamic compression acceleration

time to read 3 min | 523 words

After talking about what specifying LZ4 acceleration does, let us get down and talk about how we use it.

In our benchmark, we ran into a problem. Our benchmark hardware is too fast. I’m testing on a disk that can deliver 250,000 IOPS and can write over 1 GB/sec. On the other hand, if I’m running on Amazon, the maximum IOPS that I can pay for is 20,000. Azure seems to have a limit of 5,000 IOPS per disk. A standard SSD will give you about 75,000 IOPS, while an HDD will typically have IOPS in the low hundreds. I’m using IOPS because it is an easy single metric to compare, but the same is true about disk bandwidth, write speed, etc.

That means that we are actually testing on hardware that is likely to be better than what we’ll typically run on. Which means that we need to be careful about what kind of optimizations we bring in. It would be easy to optimize for a specific machine, at the cost of worse performance in the general case.

Case in point: for many cases, the cost of actually writing to disk on that particular machine is low enough that it isn’t worth putting a lot of effort into compressing the data. That isn’t the case all the time, though, for example if we are applying pressure on the I/O system, or if we have some other stuff going on that will impact our writes.

On that particular machine, however, it got to the point where we were seeing higher compression times than write times, which is pretty ridiculous. We weren’t actually saving anything by doing that. But instead of tailoring a solution to a single machine, we decided to go in a bit of a roundabout way. When we start, our initial assumption is that we are on a machine with higher I/O cost than CPU cost, which will cause us to put more effort into compressing the data to reduce the amount of writes we have to make.

On the other hand, after we start writing, we can start measuring the relative costs, and adjust accordingly. In other words, based on how expensive it is to compress the data versus writing the data to disk, we can dynamically adjust the cost of compression, until we hit the sweet spot for the machine and environment that we are running on. The beauty in this behavior is that it will automatically adjust to pressures, so if someone is running a backup on this disk, and slowing us down, we’ll learn about it and increase the compression rate to compensate. Once the I/O pressure is off, we can relax our compression ratio to reduce the total resources consumed.
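
As an illustration of the idea (not the actual implementation; the class, the doubling step and the cap are all made up), the adjustment can be as simple as comparing the two costs and nudging the LZ4 acceleration factor accordingly:

    using System;

    // If compressing costs more than writing, trade compression ratio for speed;
    // if I/O is the bottleneck, spend more CPU to shrink the writes.
    public sealed class CompressionAccelerationTuner
    {
        public int Acceleration { get; private set; } = 1;    // 1 = best ratio, higher = faster & looser

        public void Record(TimeSpan compressionTime, TimeSpan writeTime)
        {
            if (compressionTime > writeTime && Acceleration < 64)
                Acceleration *= 2;     // compression is the bottleneck: compress less aggressively
            else if (compressionTime < writeTime && Acceleration > 1)
                Acceleration /= 2;     // I/O is the bottleneck: compress harder to write less
        }
    }

Each time a batch is flushed, the measured compression and write times would go into Record, and the resulting Acceleration would be used for the next batch.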

On our benchmark machine, we managed to import the full StackOverflow dataset, just over 18 million documents totaling over 50 GB, in under 8 minutes, reducing our overall benchmark time by almost 10%.

Performance optimizations: One step forward, ten steps back

time to read 3 min | 547 words

As we continuously optimized more and more of our code, we kept seeing faster and faster benchmarks. In fact, the more we optimized, the faster we became. One would think that there is some sort of correlation there.

However, that is a mere theory that can be disproven, as this story will demonstrate.

When optimizing, you eventually expect to get into the land of diminishing returns. But something very strange happened: we made a few changes, each of which should have brought our speed up by a significant percentage; we had the micro benchmarks to prove that this was the case, and we were even able to see that the code was running much faster than before. But the overall benchmark time kept growing, and we started seeing higher and higher stalls in the process.

That… sucked. Especially because we couldn’t figure out what was going on. Every single metric we could see said that we should be seeing higher speed: our disk usage went up, our CPU usage went up a bit, we increased our memory buffers from 32 MB to 1 GB, and every single indication we had told us that we were faster on a per-operation basis. But the entire thing just started slowing down more and more.

Frustratingly, there was nothing we could really sink our teeth into. The system would just go into stalls and do nothing. We got to the point where it looked like we had broken the operating system, but nothing helped, stuff just didn’t want to work. It looked like we were waiting for I/O, but tracing at the syscall level showed that we were getting much faster responses from the hardware than we saw in the application. Somewhere, stuff was getting lost.

Eventually we managed to track it down to the offending line:

[Image: the offending line of code, a blocking wait]

So this is pretty obvious, right? We are waiting, so we are slow. But this line is called from a completely different part of the code, and it isn’t blocking anything else in the code path that is suffering from stalls. The key here is that this line is called from:

[Image: the call site, which queues that work to the thread pool]

Still fine, right? We threw that into the thread pool, so it is fine to wait. But…

[Image: the code that releases our threads when the I/O operation completes, which also runs on the thread pool]

The line above is responsible for releasing our threads when the I/O operation has completed. Note that it needs to run on the thread pool as well, but because we are now much faster, we now have a lot of threads that are stuck in the call to SyncEnvironment. That overloaded the thread pool, and meant that the notification that we can proceed would come very late. We missed it in all of our profiling because we didn’t look at that code path at all, since it was obviously unrelated to the issue at hand.
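
To see the mechanism in isolation, here is a small self-contained sketch (illustrative only, not our actual code) of how blocking work items can crowd out the very continuation that would release them:

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    // Many thread pool work items block on a signal, while the code that raises
    // the signal also needs a pool thread, so it only runs once the pool has
    // slowly injected enough extra threads.
    public static class ThreadPoolStarvationSketch
    {
        public static void Main()
        {
            var ioCompleted = new SemaphoreSlim(0);
            var blockers = Environment.ProcessorCount * 2;

            for (var i = 0; i < blockers; i++)
                Task.Run(() => ioCompleted.Wait());      // stand-in for the blocking SyncEnvironment calls

            var sw = System.Diagnostics.Stopwatch.StartNew();
            var release = Task.Run(() => ioCompleted.Release(blockers));  // stand-in for the I/O completion

            release.Wait();
            Console.WriteLine($"Release ran after {sw.Elapsed} - far later than the I/O itself took.");
        }
    }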

The cost of allocating memory and cheating like crazy for best performance

time to read 5 min | 813 words

Allocating memory is expensive. It is expensive in managed code, and it is expensive in native code; it is just expensive. There are many ways to alleviate those costs, but at some point, any generic allocation scheme is going to run aground on the harsh reality of the cost of bookkeeping.

Many managed programs are actually much faster than their native counterparts precisely because of that. Managed runtimes have more information about the allocations, and can defer / batch and optimize memory usage accordingly. For example, in .NET the amount of code that you would typically run for an allocation is minuscule. Effectively a pointer bump, and that is it. And the garbage collector can do really good work collecting those allocations if they are short lived, and the generational nature means that we typically have a far lower cost of memory bookkeeping than most native applications, and when we do pay it, we can optimize things by doing it all at once.

All of which is great, except that there is still a cost to it, a decidedly non trivial one if you are writing high performance software. So what is the answer?

Put simply, you cheat, and you cheat like mad. The issue is with the problem statement, if any generic allocation scheme is problematic, throw the generic portion out the window and specialize like an insect.

With RavenDB, we realized quite early on that the cost of memory allocation was too prohibitive to allow us to use it for most common tasks. So we manage our own memory, but we do that with a pattern. All operations are always done inside a scope, and that scope defines the boundary of the operation. An operation can be something like writing a single document, or answering a query, etc. The idea is that this is (usually) short lived and well scoped. And that turns out to be key.

We use the notion of a context and a context pool. A context is just a holder for a bunch of (native) memory that we get from the OS and hold on to. When we start an operation, we grab a context from the pool and start using it. Whenever we need to grab some memory, we go to the context and ask it for some.

The code that typically runs in this case is:
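
In spirit, it is a bump pointer allocation, something along these lines (AllocatedMemoryData and _ptrCurrent are the names used below in this post; the rest of the sketch is assumed):

    // A sketch of the context's allocator, not the actual RavenDB code.
    public unsafe class JsonContextSketch
    {
        private byte* _ptrCurrent, _ptrEnd;              // the current native block we got from the OS

        public class AllocatedMemoryData
        {
            public byte* Address;
            public int SizeInBytes;
        }

        public AllocatedMemoryData Allocate(int size)
        {
            size = (size + 7) & ~7;                      // keep everything 8-byte aligned
            if (_ptrCurrent + size > _ptrEnd)
                GrowArena(size);                         // assumed: grab a larger block from the OS

            var allocation = new AllocatedMemoryData     // the managed part of the allocation
            {
                Address = _ptrCurrent,
                SizeInBytes = size
            };
            _ptrCurrent += size;                         // the unmanaged "allocation" is just a pointer bump
            return allocation;
        }

        private void GrowArena(int minSize) { /* assumed: allocate and switch to a bigger block */ }
    }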

Amusingly enough, we are actually doing a managed allocation as part of our unmanaged allocation. This is fine, because we typically allocate things in the ~KB range, so we don’t have a lot of individual AllocatedMemoryData instances to manage.

But the key here is that when the operation scope is over, we return the context to the pool, and when we do that, we reset _ptrCurrent to its initial value. So all the memory we used during that operation is available to be used again very cheaply.

Now, this is actually almost the entire story. As it turns out, this works very well for a lot of scenarios, but in some cases we have a context open for a long while (bulk insert, indexing, etc). That means that an operation would go through quite a lot of memory if we didn’t have some way to recover it on the fly. And we do. When we are done with the memory, we return it to the context. Now, this is usually where things start to get hairy. But we made sure that there is an O(1) cost for both allocation & deallocation in the context.

When you return memory to the context, we check if this is the top-most section, and if so, we just reduce _ptrCurrent by the amount of memory that was used. But there are times when we return memory that isn’t at the top of the context; what then? Instead of leaving this memory unused until the next context reset, we add it to a linked list. But this linked list is actually stored in the memory that was freed. This looks like this:
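
Here is a sketch of that return path (continuing the allocator sketch above; only _ptrCurrent and _freed are names from this post):

    // The freed block itself holds the pointer to the previously freed block of
    // the same size, so the free list costs no extra memory at all.
    private readonly byte*[] _freed = new byte*[32];     // heads of the free lists, bucketed by size

    public void Return(AllocatedMemoryData allocation)
    {
        if (allocation.Address + allocation.SizeInBytes == _ptrCurrent)
        {
            _ptrCurrent -= allocation.SizeInBytes;       // top of the arena: just rewind the pointer
            return;
        }

        var bucket = allocation.SizeInBytes / 8;         // everything is 8-byte aligned
        if (bucket >= _freed.Length)
            return;                                      // oversized blocks just wait for the context reset
        *(byte**)allocation.Address = _freed[bucket];    // write the "next" link into the freed memory
        _freed[bucket] = allocation.Address;             // the freed block becomes the new list head
    }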

The _freed array holds the most recently freed block for each size. So when we allocate, we can do:
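
Again, a sketch rather than the real code:

    // Pop from the per-size free list if we can, otherwise the caller falls
    // through to the bump allocation shown earlier.
    private byte* TryAllocateFromFreeList(int size)
    {
        var bucket = size / 8;
        if (bucket >= _freed.Length || _freed[bucket] == null)
            return null;                                 // nothing has been returned at this size yet

        var head = _freed[bucket];
        _freed[bucket] = *(byte**)head;                  // the "next" link lives inside the freed block
        return head;
    }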

And if we don’t find returned memory, we’ll go to the cheap-alloc code above and bump the pointer. So during the lifetime of the context, we get pretty cheap memory reuse, and we don’t worry too much about it because the context reset will clear everything anyway.

A side benefit of this is that because the context goes back to the pool, we are not going to release that memory, so when the next request comes, we’ll have relatively hot memory right there and available. And, of course, all of it is going to have high locality.

Deleting highly performant code

time to read 9 min | 1779 words

Five years ago, we introduced bulk insert support to RavenDB. The idea was to give users a good way to insert data into RavenDB as fast as possible. In order to do that, we created special code paths that were super optimized for loading large amounts of data into the database as fast as possible.

Basically, we bypassed a lot of the stuff along the way, put some constraints on the user when calling this, and smoothed everything we could. Internally, this was implemented using a pretty complex system which split work among multiple threads to handle serializing, sending over the network, parsing and writing to disk.

When we started working on 4.0, one of the very first things we did was build bulk insert from scratch. We built it on top of TCP, instead of HTTP, and we decided that for the sake of performance, we could throw pretty much anything at this problem. The overall design looked something like this:

  • One thread at the client side, getting entities and serializing them to our blittable binary format.
  • One thread at the client side for sending the data over the network.

In order to save memory, those two threads have a common buffer pool that they reuse all the time.

  • On the server side, we have a dedicated thread reading from the network, constructing our blittable instances, validating their format and putting them in a queue.
  • On the server side, we have a dedicated thread pulling from the queue and batching requests, then saving them in a single transaction.

Like the client, we are saving memory by having a common shared buffer pool.

Initially, having a dedicated code path for bulk insert was awesome, because we could get really high performance from it. But it also introduced problems. Consider the scenario above: the server thread reading from the network is reading as fast as it can, and sending the data to the queue. If the consuming thread isn’t fast enough (for example, we just hit an I/O bump), we might start accumulating stuff in the queue, and if this goes on for long enough, we might run out of usable memory.

We had to implement our own backpressure and congestion control because of that, and that led to more interesting issues. Because we were starting to read very quickly, and only ran into the I/O problem later, we had incidents in which we weren’t fast enough with the “let me catch my breath” notification and overloaded the TCP connection, resulting in timeouts on the client. This all sounds very complicated, and it was, but we managed to solve all of those issues, and this piece of code had very few changes for a long time.

And this week we deleted all of it in favor of a radically different implementation. Basically, the new implementation just opens a standard HTTP request and writes JSON strings to the network. This is about as simple an implementation as we can get. And we deleted all that high performance, extremely tuned and carefully crafted code.

Why?

You might have noticed that we are putting a major emphasis on performance, so why would we throw all that code away?

There are several reasons:

  • Backpressure, congestion control and other fun factors are not things that we want to deal with; instead, we want to handle them at the network layer, where there is a lot more information about them.
  • Complex code is costly, and it requires us to modify other pieces of code to ensure that this still works.
  • We used to get impressive numbers from this special casing, but we are seeing similar numbers without all the hoopla.

The trigger for this was a set of optimizations we did for standard write patterns. Your usual OLTP workload, basically. As we made that faster and faster, we started to wonder whether it made sense to let the bulk insert code remain a special snowflake. For example, the bulk insert code didn’t use our transaction merging capabilities; instead, it would talk directly to the storage. But that meant that it lost out on a lot of the improvements we made to the transaction merging code. It would also cause bulk inserts to fight concurrent write loads (including other bulk inserts), instead of cooperating.

The decision to go with a TCP connection directly was made because we wanted absolute performance & control, but we had taken too much upon ourselves, and we were concerned about firewall issues. Forcing admins to open another port can be tricky, and we want to reduce the cost of deploying RavenDB instances as much as possible.

Finally, we needed something that was a lot more approachable. While using our own binary format over the wire meant that we could do a lot less work, it also put up a major stumbling block if you wanted to do bulk insert via anything but our .NET code. If we wanted to do bulk insert from node.js, that would require… a non-trivial amount of effort, to say the least.

The new wire format looks like this:
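
The field names below are illustrative rather than exact, but the shape is simply a JSON array of commands streamed over a single HTTP request:

    [
        { "Id": "users/1", "Type": "PUT", "Document": { "Name": "John", "@metadata": { "@collection": "Users" } } },
        { "Id": "users/2", "Type": "PUT", "Document": { "Name": "Jane", "@metadata": { "@collection": "Users" } } },
        ...
    ]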

This is trivial to integrate with regardless of platform. The reason we use this particular format? This is actually identical to the format we use for SaveChanges, and that piece of code has been through multiple optimization rounds. Here is a small example from that code path:

This is pretty fast (and gnarly) code, so we get to reuse that and benefit from any future optimizations.

The major difference here is that this is a streaming format; that is, we are going to read from the stream and process it immediately. With SaveChanges, we have to read the whole thing into memory and apply it as a single transaction. In the case of our new bulk insert code, we’re not going to do the whole thing as a single transaction, but as a series of them, based on the actual workload.

The fun thing here is that we can dramatically reduce the overall complexity while maintaining a lot of the desired behavior. In fact, the entire bulk insert implementation is under 50 lines of code now. Small enough that I can just include it in this post in its entirety.

Note that this code relies on a lot of other code, but none of that other code is bulk insert specific. There are a few interesting bits here that are worth exploring.

BatchRequestParser.ReadMany allows us to stream-read the data; it reads each individual command and returns it. The code in line 17 ( var task = await parser.MoveNext() ) is pretty strange. MoveNext is returning a Task<Task>. The idea is that we are going to read a command from the network and parse it immediately. However, if there isn’t a full command already buffered, we’ll return an async task for completing it. (The Task<Task> is here because we might need to wait for the initial command, or for the final ] of the array. We’ll probably optimize that once we are done with all the taxes on this feature, to avoid the Task allocation.)
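
In outline, the loop looks something like this (a sketch, not the actual sub-50-line implementation; BatchRequestParser.ReadMany, MoveNext and the transaction merger are real names from this post, while the command type and the Enqueue call are assumed):

    // Keep parsing while commands are already buffered, and flush the batch to
    // the transaction merger the moment we would otherwise block on the network.
    var batch = new List<BatchRequestParser.CommandData>();          // assumed command type
    using (var parser = BatchRequestParser.ReadMany(requestBodyStream))
    {
        while (true)
        {
            var task = await parser.MoveNext();       // Task<Task<...>>: the outer await handles '[' and ']'
            if (task == null)
                break;                                // reached the final ']' of the array

            if (task.IsCompleted == false && batch.Count > 0)
            {
                // about to wait on the network, so write what we already have to disk
                await txMerger.Enqueue(new MergedBulkInsertCommand(batch));   // assumed merger API
                batch = new List<BatchRequestParser.CommandData>();
            }

            batch.Add(await task);                    // wait (if needed) for the next parsed command
        }

        if (batch.Count > 0)
            await txMerger.Enqueue(new MergedBulkInsertCommand(batch));
    }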

The basic idea is that we’ll read & parse from the network as long as there is information available, and when we start waiting for the network, we’ll go into the if statement and send the current batch to the transaction merger. That gives us three very important properties.

  • We parallelize the work: while we are waiting for the next document to arrive over the network, we are concurrently writing the current batch to disk. Note that we don’t have any such thing as a batch size defined. This is all controlled by the network speed. We do have a 16 MB limit, to prevent a really fast network from causing excess memory utilization on the server side.
  • We only buffer a relatively small amount of data (by default, however much the network can give us without waiting), so we get a pretty consistent read pattern: “read a buffered document” –> “write to disk”. This has the effect of teaching the connection how fast we can read & process the results, giving us much better behavior as far as the network stack is concerned.
  • We play well with the rest of the system. By utilizing the transaction merger, we’ll work well with concurrent work on the server, instead of fighting for the same resources.

The code still needs some work (resource handling, better error management, etc), but it is a net win in terms of simplicity and manageability.

But what about performance? Previously we spread the load across many threads, on both the client & server ends. That gave us faster overall performance at the cost of much higher complexity.

As it turns out, quite a lot of the benefit of bulk insert is related to the fact that we have just a single network connection going on, and we still have that. We get some parallelism between writing the documents and reading from the network, but that is intentionally limited (to avoid fooling the other side into thinking that we can keep up with that rate of reads if we hit an I/O stall).

But let us assume that you really need to insert a lot of data into the database, fast enough and in large enough volume that you want us to be even faster.

That is actually quite easy. Remember how many times I said that this new system plays well with other operations? This is important, because if you want faster performance for bulk insert… just run parallel bulk inserts. This way, you’ll have multiple threads on the server side reading & parsing, and all of that work will go (together) to the transaction merger, giving us much better overall performance.

In our benchmarks, we actually had issues with data generation on the client side causing a major slowdown. We had to pre-generate all the data upfront, and only then start the bulk insert process; otherwise the calls to Random used to generate the data would have had a noticeable impact on the overall benchmark results. I don’t think that you’ll be able to generate data fast enough to saturate a single bulk insert connection easily, but given how easy concurrency has become, if you can, the scale-up option is incredibly easy.

Why we aren’t publishing benchmarks for RavenDB 4.0 yet

time to read 3 min | 436 words

You might have noticed something very strange in all my posts about performance work. I’m always talking about performance of either specific small pieces, or showing the difference between old and new code in percentages.

This is quite intentional; we are not ready yet to publish benchmark results. But the reason isn’t that we are too slow. Across the board, we are looking at a minimum of an order of magnitude improvement, and in many cases, several orders of magnitude improvement. The reason that we don’t publish benchmarks is actually very simple: we aren’t done yet.

A few days ago I published some of the changes that we were making around JSON parsing. As you can imagine, for a JSON document database, JSON parsing is a crucial part of our workload, and any minuscule difference there can have a widespread impact on the whole system.

When we started with RavenDB 4.0, one of the first things we did was write our own unmanaged JSON parser. In Jan 2016, we ran a benchmark that gave us a parsing cost of 170 μs per document using JSON.Net, while our code could parse it in 120 μs, giving us a 40% benefit over JSON.NET. Those numbers actually lie, because JSON.Net also does managed allocations, while our parser doesn’t, giving us a far lower actual cost overall.

Getting a 40% performance increase (and no future GC debt) is amazing, but in the year since, we have made a lot more changes. Here are some numbers from the performance team for the recent spate of work in that area. You can also see the details in the image on the right.

  • T – 3 weeks (our current baseline, with a lot of performance work since Jan 2016): 80 μs per document, and this time the documents are much bigger than in the previous benchmark we ran.
  • T – 1 week: 30 μs per document.
  • T: 20 μs per document.

And at that point, we were done, and moved on to optimize other parts.

  • T + 2 days: 17 μs per document.

We weren’t even looking at that part, but optimizations in different areas meant that we were able to be even more efficient. That is about 17% better throughput.

And we still have about a month before we go into code freeze for the beta release.

Once that is done, we already have a plan to do a few rounds of benchmark / optimization / benchmark, and then publish the full results. I’m really excited about this.

Getting 600 times better performance

time to read 1 min | 73 words

During the design process for the next release of RavenDB, we set ourselves a pretty crazy goal. We wanted to get a tenfold performance improvement across the board…

This is how my article about gaining a 600x performance improvement in the new DZone Guide - Performance: Optimization & Monitoring starts. For the rest, head there and read it all.

What did all this optimization give us?

time to read 2 min | 332 words

I’ve been writing a lot about performance and optimizations, and mostly I’m giving out percentages, because it is useful to compare to before the optimizations.

But when you start looking at the raw numbers, you see a whole different picture.

On the left, we have RavenDB 4.0 doing work (import & indexing) over about 4.5 million documents. On the right, you have RavenDB 3.5, doing the same exact work.

We are tracking allocations here, and this is part of a work we have been doing to measure our relative change in costs. In particular, we focused on the cost of using strings.

A typical application will use about 30% of memory just for strings, and you can see that RavenDB 3.5 (on the right) is no different.

[Image: memory used for strings, RavenDB 4.0 vs RavenDB 3.5]

On the other hand, RavenDB 4.0 is using just 2.4% of its memory for strings. But what is even more interesting is to look at the total allocations. RavenDB 3.5 allocated about 300 GB to deal with the workload, while RavenDB 4.0 allocated about 32 GB.

[Image: total allocations, RavenDB 4.0 (~32 GB) vs RavenDB 3.5 (~300 GB)]

Note that those are allocations, not total memory used, but the picture is the same on just about every metric. Take a look at those numbers:

[Image: GC time comparison between RavenDB 4.0 and RavenDB 3.5]

RavenDB 4.0 is spending less time overall in GC than RavenDB 3.5 will spend just on blocking collections.

Amusingly enough, here are the saved profile runs:

[Image: the saved profiler runs]
