RavenDB 3.5 Release Candidate 3 (and final) is out
All our latest fixes are in it, and this is likely the final release. You can get it here.
Please take it for a spin, we plan to do RTM release next week, and RC3 is supported to run in production.
All our latest fixes are in it, and this is likely the final release. You can get it here.
Please take it for a spin, we plan to do RTM release next week, and RC3 is supported to run in production.
This is a small part from a larger benchmark that we run:
The index in question is using a DateTime field, and as you can see, quite a lot of time is spent in translating that. 50% of our time, in fact. That is… not so nice.
The question now is why we do it? Well, let us look at the code:
Here we can see several things, first, there is the small issue with us allocating the string to check if it is a date, but that isn’t where the money is. That is located in the TryParseExact.
This method is actually quite impressive. Given a pattern, it parses the pattern, then it parse the provided string. And if we weren’t calling it hundreds of thousands of times, I’m sure that it wouldn’t be an issue. But we are, so we are left with writing our own routine to do this in a hard coded manner.
I built the following benchmark to test this out:
As you can see, this is pretty much identical to our code, and should tell us how good we are. Here are the benchmark results:
In my next post, I’ll show what I came up with that can beat this.
System administrators like to see graphs with server utilizations sitting at the very low end of the scale. That means that they don’t need to worry about spikes, capacity or anything much, they are way over provisioned, and that means less waking up at night.
That works very well, until you hit a real spike, hit some sort of limit, and then have to scramble to upgrade your system while under fire, so to speak. [I have plenty of issues with the production team behavior as described in this post, but that isn’t the topic for this post.]
So one of the things that we regularly test is a system being asked to do something that is beyond its limits. For example, have 6 indexes running at the same time, indexing at different speeds, a dataset that is 57GB in size.
The idea is that we will force the system to do a lot of work that it could typically avoid by having everything in memory. Instead, we’ll be forced to page data, and we need to see how we behave under such a scenario.
Here is what this looks like from the global system behavior.
If you’ll show this to most admins, they will feel faint. That usually means Something Is About To Break Badly.
But things are slightly better when we look at the details:
So what do we have here? We have a process that (at the time of running it, has mapped about 67 GB of files, and has allocated 8.5 GB of RAM). However, only about 4.5 GB of that is actively used, and the rest of the working set is actually the memory mapped files. That lead to an interesting observation, if most of your work is local and transient (so you scan through sections of the file, like we do during indexing), the operating system will load those pages from disk, and keep them around until there is memory pressure, at which point it will look at all of those nice pages that are just sitting them, unmodified and with a source on disk.
That means that the operating can immediately discard them without having to page them out. So that means that they are very cheap. Oh, we’ll still need to load the data from disk into them, but we’ll have to do that anyway, since we can’t fit the entire dataset into memory.
So that means that our allocation strategy basically goes something like this:
The private working set is what goes into the page file, it mostly consists of managed memory and whatever unmanaged allocations we have to do during the indexing. So by comparing the two, we can tell how much of the used memory is actually used by memory mapped files. We are careful to ensure that we leave about 25% of the system memory to the memory mapped files (otherwise we’ll do a lot of paging), but that still gives us leave to use quite a lot of memory to speed things up, and we can negotiate between the threads to see who is faster (and thus deserve more memory).
In my preview post, I mentioned that removing artificial batch limits has caused us to double our performance. But what are those artificial batch limits?
Well, anything that doesn’t involve actual system resources. For example, limit batch size by time or by document count is artificial. We used to have to do that as a correlation to the amount of managed memory we use, and because it allowed us to parallelize I/O and computation work. Now, each index is actually working on its own, so if one index is stalling because it need to fetch data, other indexes will use the available core, and every one will be happy.
Effectively, an indexing batch stopped being a global database event that we had to fetch data for specifically and became something much smaller. That fact alone gave us leeway to remove drastic amounts of code to handle things like prefetching, I/O / memory / time / CPU balancing and a whole bunch of really crazy stuff that we had to do.
So all of that went away, and we learned that anything that would artificially reduce a batch size is bad, that we should make the batch size as big as possible to benefit from economy of scale effects.
But wait, what about non artificial limits? For example, running an indexing batch take some memory. We can now track it much better, and most of it is in unmanaged memory anyway, so we don’t worry about keeping it around for a long time. We do worry about running out of it, though.
If we have six indexes all running at the same time, each trying to use as much of the system resources as it possible could. Of course, if we actually let them to that they would allocate enough memory to push us into the page file, resulting in all our beautiful code spending all its time just paging in and out from disk, and our performance looking like it was hit in the face repeatedly with the hard disk needle.
So we have a budget. In fact, we have a pretty complete heuristics system in place.
* Enough memory available is actually a really complex idea, enough so that I’ll dedicate the next post to it.
So that leads us to all indexes competing with one another to get more memory, until we hit the predefined limit (which is supposed to allow us memory to do other work as well). At that point, we hit a real limit, and we stop the batch, complete our work and carry on. After the batch is completed, we could release all of that memory and start from scratch, but that would probably be a waste, we already know that we haven’t gone too badly over budget, so why release all that precious memory just to immediately require it again?
So that is what we did, and we run our benchmarks again. And the performance was not nice to us.
It took a while to figure out what happened, but you can see this on the following graph.
We started allocating memory, and as you can see, we have some indexes that have high memory requirement. At some point, we have hit the memory ceiling we specified, and started completing batches so we won’t use too much memory.
All well and good. Except that the act of completing the batch will also (sometimes) release memory. This is typically done because we have found the ideal sizes we need for processing, so we discard everything that is too small. But the allocator is free to release memory if it thinks that this is the best for the system.
Unfortunately, we didn’t adjust the budget in this case. Consider the case of indexes C & F, both of which released significant amount of memory after the batch was completed. Index B, which was forced to make do with whatever memory it managed to grab, suddenly finds itself in a position to grab more memory, and it will slowly increase its budgets and allocations.
At the same time, indexes C & F are also going to allocate more memory, after all, they are well within their budget, since we didn’t account for the released memory that was gobbled up by index B. The fact that this starts happening only about 45 minutes into the batch, and it actually shows up as higher memory utilization about 4 hours after that is really quite annoying when you need to debug it.
We’ve been running all sort of benchmarks on RavenDB, ranging from micro optimizations to a single heavily used routine that focused on re-ordering the instructions so the CPU can execute them in parallel to large scale load testing and data processing.
This tale is about the later. We have defined 6 indexes over a data set of over 18 million documents with a total size of 57GB. We then run it on a system that had 16GB of RAM (and typically had a lot less actually available).
When we started, we very quickly run out of memory, in fact, our allocations exceeded the database size (so we allocated over 60 GB) before we decided to focus on fixing that. The problem was the scope of the batch, it was too long, and we didn’t reuse the memory inside the same batch. Once that was fixed, we were able to run the performance test successfully.
Total time to run it? Over 9 hours. Granted, this is talking about 6 indexes each needing to go over the entire dataset, so we expect it to take a while, and it is much faster than in previous versions, but that is still far too much.
I should probably explain that the whole point of doing something like that is to see the interference effects. What happen when you have a bunch of competing indexes over the same resources (CPU, memory, disk)?
Turn out, there is a lot going there, and you can’t really get good profiling results from this kind of run (to start with, the profiling overhead would push this into a multi day effort), and while we captured some profiling run of shorter stats, we weren’t really able to pinpoint the blame.
So we added the tooling we needed to figure it out, and then put that in the studio. The end result was… not so good.
Yes, that is the studio trying to display the debug information for that benchmark. It… did not go so well for us. In other words, before we could fix the benchmark, we had to optimize the code that would tell us where we are spending all that time in the benchmark .
As it turned out, the problem was that our code was optimizing for problems we no loner had. Basically, in RavenDB 3.x we have to take into account our memory usage, and pretty much all of it is managed memory. That lead to some interesting problems, to whit, we don’t know how much memory we use. The fact that a document is size so and so when serialized to disk means exactly squat with regards to how much it is actually going to take in memory. And because too much allocations lead to higher GC costs down the road, we put term limits on the size of the indexing batches we’ll process.
Effectively, we will take a certain amount of documents to process in a batch, and then complete the batch. All the memory we used in the batch will go away, and hopefully we didn’t push too much of it into higher generations. But the real costs that we have in RavenDB 4.x are very different. To start with, we use a lot less managed operations, and we use Voron for pretty much everything. That means that our costs has now shifted, instead of worrying about releasing the memory as quickly as possible, we need to worry about reducing the number of times we go to disk.
As it turns out, artificially reducing the batch size results in us processing more batches, which require us to hit the disk a lot more. The same logic applies to RavenDB 3.x (and we have users who have configured RavenDB to have very long batches for exactly that reason), but that come at GC cost that simply does not exist in RavenDB 4.0.
The immediate solution was to remove all the batch limits and see what would happen. Overall performance had doubled. We were able to process the same amount of information in about half the time. And that is before we did deep dive with a profiler to seek inefficiencies.
I outlined the scenario in my previous post, and I wanted to focus on each scenario one at a time. We’ll start with the following simple map index:
In RavenDB 4.0, we have done a lot of work to make this scenario as fast as possible. In fact, this has been the focus of a lot of architectural optimizations. When running a simple index, we keep very few things in managed memory, and even those are relatively transient. Most of our data is in memory mapped files. That means, no high GC cost because we have very few objects getting pushed to Gen1/Gen2. Mostly, this is telling Lucene “open wide, please” and shoving as much data inside as we can.
So we are seeing much better performance overall. But that isn’t to say that we don’t profile (you always do). So here are the results of profiling a big index run:
And this is funny, really funny. What you are seeing there is that about 50% of our runtime is going into those two functions.
What you don’t know is why they are here. Here is the actual code for this:
The problem is that the write.Delete() can be very expensive, especially in the common case of needing to index new documents. So instead of just calling for it all the time, we first check if we have previously indexed this document, and only then we’ll delete it.
As it turns out, those hugely costly calls are still a very big perf improvement, but we’ll probably replace them with a bloom filter that can do the same job, but with a lot less cost.
That said, notice that the runtime cost of those two functions together is 0.4 ms under profiler. So while I expect bloom filter to be much better, we’ll certainly need to double check that, and again, only the profiler can tell.
In this post, we are going to look at the relevant costs we have when testing out different indexes on the stack overflow data dump. In this case, we have the following two collections, and sample documents from each.
The first index we’ll look at is a simple full text index, simply mapping the users and asking to search over their data.
This index is pretty simple, both in how it works and what it would cause RavenDB to do. We do a simple scan over all the Users documents, and just index those two properties. Nothing much to see here, let us move on.
Here we have a map/reduce index, over users, to see how many signups StackOverflow had per month.
This gets more interesting as a performance analysis goes. While a map only index just needs to spill everything to Lucene, and let it do its work (more on that later), a map/reduce index has to keep track of quite a lot of internal state. And as it runs out, the costs associated with the index vary according to what it does, and the access pattern it has.
In this case, we’re going to be scanning the users in roughly the order they were inserted into the database, and that would match pretty closely to the aggregation by the registration month. That means that this is likely to be a pretty cheap index, since we scan the data and it is mostly going to be naturally grouped along the same lines that we are going to group it. It means that the working set we’ll need for this index is going to be fairly small. Once we have passed by a particular month, we won’t be accessing it again.
The number of reduce keys (the distinct number of values that we group by) is also going to be fairly small, in the order of 100 keys or so.
Now, let us move to a more complex index:
This one is aggregating information over questions twice, once for the questions, and once for the answers. In terms of sheer amount of data it works on, it is pretty big. However, in terms of actual work that is done, we still somewhat benefit so significantly from the ordering of the data. This is the case since while questions and answers are usually temporally close to one another, that isn’t always the case. That said, for the vast majority of cases, we’re likely to see quite a bit of locality, which will help the index. There is also the fact that we still have very few reduce key, only about 100 or so, which also helps.
And now we are talking about putting the db through its paces. The problem here is that we are scanning all the indexes, and outputting an entry for each tag. There are 46,564 tags, and in total this index outputs 37,122,139 entries. However, there are just 3,364 tags that have more than a thousand questions, and they are responsible for 32,831,961 of the questions, so while there is a significant long tail, it is pretty small, in terms of actual number of questions.
Note that we are still scanning in a manner consistent with the index, so we’ll typically go over a set of questions that all belong to the same month. And each month will have roughly 3 million questions (obviously less the further back in time we go). So complex, but not too badly so. The number of reduce keys will be in the hundreds of thousands, but the number of entries per reduce key will be relatively small.
This one, however, is much more problematic:
On the face of it, it looks like it would be a simpler index than the previous two, but it lacks a very important property, it isn’t actually limited by anything related to the scanning order we have. What this means is that whenever we will scan the data for this index, we are going to find a lot of tags, so each batch will have to touch a lot of reduce keys, and they are going to be big reduce keys (with lots of entries), so that means that we’ll need to re-reduce large number of large reduce keys often.
Effectively, this index is forcing us to scatter values all over the place, then aggregate all the information about each of those values. It is likely to be one of the worst scenarios for us to work with. Now that we have the groundwork laid out, we can talk about how we are actually going to approach performance optimizations here.
As I mentioned, given a 45 GB file containing all the StackOverflow questions since 2008, I wanted to transform items like this:
This turn out to be rather tricky to do. The file is sorted by the row id, but there are tricks here:
Ideally, we could keep a dictionary of the questions and their answers until we have all the answers to the question, they write it all out. The problem here is that a question might have answered years after it was asked, which means that the answer id would place it very far from the question in the file, which means that we have to keep the question in memory for a very long time. In practice, this happens quite frequently, and it plays havoc with trying to limit the memory utilization for this approach.
The answer id is: 2388
The question id is: 44965
That means that if you scan the file sequentially, you are going to find the answers before their questions, somethings a lot before (I’m guessing that this happens if you have edits to a question, so a popular question might get edited long after it was asked & had answers).
The next thing I tried was to group all the questions together, then process them. Because I can’t do that in memory, I tried writing them to disk directly:
The idea is that I can write them out to the file system, and then traverse the file system to gather all of them. The problem is that we are talking about over 32 million questions & answers, that isn’t really going to work for us using most normal file systems.
So I tried writing it all to a zip file, which was successful, but resulted in a zip file that was 21GB in size, and was very good in ensuring that nothing could open it before I run out of patience.
Eventually I just gave us and used Voron to handle it, here are the relevant sections:
Remember that Voron stores everything in a B+Tree, and allows to do ordered operations, so once this is done, all I need to do is traverse the tree, get all the records that have the same parent id, and voila, I have the grouping I wanted.
If I wanted to do it without the use of a database (or Voron), I would probably go with the following manner:
Why separate file? Because XmlReader does buffering, so it will be really hard to just in the file from just reading, by writing it out, we have the proper position to seek to.
Note that this will require a lot of random I/O, but that is pretty much just what it has to be.
We are doing perf testing right now, and we are looking into real world datasets to play around with. Luckily for us, Stackoverflow have regular data dump of size significant enough to be useful for our experiments.
The file that I’m currently looking it is the Posts.xml file, which is about 45GB in size, and looks roughly like this (lot of stuff removed to make the point).
Since Stackoverflow is using relational database, their output is also relational. You can see that each element is a single row, and you have the ParentId in row #7 that points back to row #4.
Basically, row #4 is the question, and row #7 is one of the answers.
What I want it to take all of this data and move it into a more document format. In other words, what I want to have all the answers for a question contained within the question, something like this:
The fun part here is that this is a pretty big file, and we are writing the output into a GzipStream, so we don’t really have the option of saving / modifying midway through. Once we have written something out to the GzipStream, it cannot be changed.
So we need to find a way in which we can group all the answers under their questions, but at the same time, the file size is big, much bigger than the memory I have available, so we can’t just keep it all in memory and write it out in the end.
How would you solve this issue? My attempt is currently sitting at roughly 10GB of RAM used after processing about 30GB of XML, but I have to admit that I have thrown it together rather quickly, since I just needed the data and a quick & dirty solution is just fine here.
We spent some time recently looking into a lot of our old design decisions. Some of them make very little sense today (json vs. blittalbe as a good example), but made perfect sense at the time, and were essential to actually getting the product out.
Some of those design decisions, however, are still something that I very firmly believe in. This series of posts is going to explore those decisions, their background and how they played out in the real world. So, without further ado, let us talk about unbounded result sets.
The design of RavenDB was heavily influenced by my experience as That NHibernate Guy (I got started with NHibernate over a decade ago, if you can believe that), where I saw the same types of error, repeated over and over again. I then read Release It!, and I suddenly discovered that I wasn’t alone fighting those kind of demons. When I designed RavenDB, I set out explicitly to prevent as many of those as I possibly could.
One of the major issues that I wanted to address was Unbounded Result Sets, simply put, this is when you have:
SELECT * FROM OrderLines WHERE OrderID = 1555
And you don’t realize that this order has three million line items (or, which is worst, that most of your orders have a few thousands line items, so you are generating a lot of load on the database, only to throw most of them away).
In order to prevent this type of issue, RavenDB has the notion of mandatory page sizes.
I think that this is one of the more controversial decisions in RavenDB design, and one that got a lot of heated discussion. But I still think that this is a good idea,because I have seen what happens when you don’t do that. And the arguments are mostly about “RavenDB should trust developers to know what they are doing” and a particular irate guy called me while I was out shopping to complain how I broke the sacred contract of Linq with regards to “queries should return all by default, even if this is ten billion results”. I pointed out that this is actually configurable, and if he wanted to set the default to any size he wanted, he could do that, but apparently it is supposed to be “shoot my own foot first, then think” kind of deal.
Even though that I still think that this is a really good idea, we have added some features over the years to make it easy for people to access the entire dataset when they need it. Streaming has been around since 2.5 or so, giving you a dedicated API to stream unbounded results. Streams were built to make it efficient to process large sets of data, and they allow both client & server to process the data in parallel, instead of batching huge responses on the server, then consuming ridiculous amounts of memory on the client before giving you the full result set. Instead, you can get each result as soon as it arrive from server, and you can process it and send it further.
In 4.0, we are going to change the behavior of the paging limits so:
The idea is that we want to reduce the surprise for users, and that can give them the experience to draw upon early on. Another thing that we’ll do is make sure that the operations guys can also change that, likely with an environment variable or something like that. If you need to modify the conventions on the fly, you usually have hard time deploying a new version, and an immediate action is needed.
In this manner, we can help users avoid expensive requests to the server, and they can be explicit with what they need to do.
There are posts all the way to Oct 27, 2016