My talk to the Azure Israel group about Cosmos DB is up (in Hebrew)…
This is part of the same issue as the previous post. I was tracking a performance regression between RavenDB 4.1 and RavenDB 4.2: a roughly 8% performance difference between the two (favoring the older version) had been reported to us. The scenario involved very large and complex documents (hundreds of objects per document, over 16KB of JSON each).
The two code bases have had a lot of (small) changes between them, so it was hard to figure out exactly what the root cause of the regression was. Eventually I found something utterly bizarre. One of the things that we have to do when you save a document is check whether the document has been modified. If so, we need to save it; otherwise, we can skip it. Here is the relevant piece of code in 4.1:
This costs 0.5 ms (for very large documents), which seems perfectly reasonable. But when looking at the same thing in 4.2, we have:
This costs six times as much. What the hell?! To clarify, Blittable is the name of the document format that RavenDB uses. It is a binary JSON format that is highly efficient. You can think of this as comparing two JSON documents, because that is what it is doing.
I mentioned that there are differences between these versions, right? There have been quite a few (thousands of commits' worth), but this particular bit of code hadn't changed in years. I just couldn't figure out what was going on. Then I looked deeper. Here are the costs of these calls. Here is the 4.1 version:
And here is the 4.2 version:
There are a few interesting things here. First, we can see that we are using Enumerable.Contains, and that is where most of the time goes. But much more interesting: in 4.1, we are calling this method a total of 30,000 times. In 4.2, we are calling it 150,000 times!!! Note that CompareBlittable is recursive, so even though we call it on 10,000 documents, we get more calls. But why the difference between these versions?
I compared the code for these two functions, and they were virtually identical. In 4.2, we mostly changed some error message texts, nothing major, but somehow the performance was very different. It took a while to figure out that there was another difference. In 4.1, we checked the changes in the document in the order of the properties on the document, but in 4.2, we optimized things slightly and just used the index of the property. A feature of the blittable format is that properties are lexically sorted.
Here is the document in question. In our test, we are modifying Property6, as you can see here:
There are a total of 40 properties in this document, and a lot of nesting. In this case, in 4.2, we are scanning for changes in the document using the lexical ordering, which means:
The CompareBlittable() function will exit on the first change it detects, and in 4.1, it will get to the changed Property6 very quickly. In 4.2, it will need to scan most of the (big) document before it finds a change. That is a large part of the cost difference between these versions.
Now that I know what the issue is, we have to consider which behavior is better for us. I decided to use the order of inserted properties, instead of the lexical order. The reasoning is simple. If a user cares about that, they can much more easily change the order of properties in the document than the names of the properties. In C#, you can just change the order in which the properties show up in the class definition.
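To make that concrete, here is a hypothetical illustration (the class and property names are invented, not the actual test document): with insertion-order comparison, a property that changes on nearly every save can simply be declared earlier in the class, so the comparison finds the difference sooner.

```csharp
// Hypothetical example of the shape of the idea, not the real document class.
public class BigDocument
{
    // Declared first because it changes on nearly every save; with insertion-order
    // comparison, CompareBlittable() hits the difference here and can exit immediately.
    public string Property6 { get; set; }

    // The remaining ~40 properties (and their nested objects) follow.
    public string Property1 { get; set; }
    public string Property2 { get; set; }
    // ...
}
```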
I have to say, this was much harder to figure out than I expected, because the change happened in a completely different location and was not at all obvious in how it worked.
We were looking at the cost of writing 10,000 documents to RavenDB and found something truly interesting. The documents in question are complex: an object graph that includes over 120 objects belonging to 20 different classes. The serialized JSON is over 16KB in size. It is expected that something like that would take a lot of time.
Here are the results under the profiler:
Given the size involved, taking just under 4.2 ms per document (under the profiler) isn't bad, I thought. In general, we consider the cost of JSON serialization as pretty much fixed and ignore it. This time, I looked a little bit deeper and found this:
Note that this is deep in the call stack. But it is a pretty expensive call, so I decided to look at all the calls that we had for this method, which gave me:
A few things that I would like to note here. The time this single method takes is roughly 50% of the total time it takes to serialize the documents. In fact, I checked what would happen if I could remove this cost:
You can insert your favorite profanity here, I usually go with Pasta!
Looking at the stack trace, it is clear what is going on. RavenDB has a bunch of customizations for the default way documents are serialized, and JSON.Net will call the CanConvert() method on the converters for each new object that it is about to convert.
Given the level of complexity in these documents (over 450 values), that means that we would call this a lot. And the cost of finding out whether we need to convert the value or not completely dominated this kind of operation.
I wrote the following converter, instead:
As you can see, the core of the fix is in the CanConvertInternal() method, where I'm checking the internal converters and caching the results. Note that I'm intentionally not using any thread-safe operations here, even though this is meant to be a shared instance. The idea is that on each new item, we'll create a new copy of the cache, and we can afford to waste the memory / work to compute the cache if we can reduce the common cost of verifying the types on each call.
We are actually expecting to lose the race conditions here, which is fine. After a very short while, things will stabilize and we won't have to pay any additional costs.
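The converter itself isn't reproduced above, but a minimal sketch of the approach described (copy-on-write caching of the CanConvert() answer per type) could look something like this; the class name, fields, and the way the inner converters are wrapped are my assumptions, not the actual RavenDB code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json;

public class CachingCompositeConverter : JsonConverter
{
    private readonly JsonConverter[] _inner;

    // Intentionally not thread safe: on a cache miss we copy the dictionary and swap it in.
    // Losing a race just means recomputing a cheap entry one more time.
    private Dictionary<Type, JsonConverter> _cache = new Dictionary<Type, JsonConverter>();

    public CachingCompositeConverter(params JsonConverter[] inner) => _inner = inner;

    private JsonConverter CanConvertInternal(Type type)
    {
        if (_cache.TryGetValue(type, out var match))
            return match;

        match = _inner.FirstOrDefault(c => c.CanConvert(type));
        // Copy-on-write: build a new dictionary with the extra entry and publish it.
        _cache = new Dictionary<Type, JsonConverter>(_cache) { [type] = match };
        return match;
    }

    public override bool CanConvert(Type objectType) => CanConvertInternal(objectType) != null;

    public override void WriteJson(JsonWriter writer, object value, JsonSerializer serializer) =>
        CanConvertInternal(value.GetType()).WriteJson(writer, value, serializer);

    public override object ReadJson(JsonReader reader, Type objectType, object existingValue, JsonSerializer serializer) =>
        CanConvertInternal(objectType).ReadJson(reader, objectType, existingValue, serializer);
}
```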
Here is the result; you can see that we are taking about half of the previous run time.
And if we look at just the GetMatchingConverter() call:
I actually ended up with two such converters, because I have some converters that are used only for writes. Given the huge performance improvement, we just created two converters to handle the caching.
Note that without a profiler, I would have never been able to figure something like this out, and even with the profiler, I had to dig deep and it took a while to narrow down where the costs actually were. I’m very surprised.
At the beginning of the year, we ran into a problematic query. The issue was the use of an IN clause vs. a series of OR clauses. You can see the previous investigation results here. We were able to pinpoint the issue pretty well, very deep in the guts of Lucene, our query engine.
| Fast Query | Slow Query |
| --- | --- |
| Time: 1 – 2 ms | Time: 60 – 90 ms |
The key issue for this query was simple. There are over 600,000 orders with the relevant statuses, but there are no orders for CustomerId “customers/100”. In the OR case, we would evaluate the query lazily: first checking the CustomerId and, given that there are no results, short circuiting the process and doing no real work for the rest of the query. The IN query, on the other hand, would do things eagerly. It would build a data structure holding all 600K+ documents that match the query, and then throw it all away because no one actually needed it.
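To make the difference in evaluation order concrete, here is a conceptual sketch (nothing like Lucene's actual code): a lazy AND that can short circuit on the empty customer clause, versus an eager one that materializes every IN match first. The method and parameter names are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class InVsOrSketch
{
    // Lazy AND: evaluate the cheap clause first and short circuit when it is empty.
    public static IEnumerable<int> AndLazy(Func<IEnumerable<int>> customerClause,
                                           Func<IEnumerable<int>> statusClause)
    {
        var left = customerClause().ToList();
        if (left.Count == 0)
            return left;                        // "customers/100" has no orders: stop here
        return left.Intersect(statusClause());  // only now pay for the 600K+ status matches
    }

    // Eager AND: the old IN path materialized every document matching the IN clause
    // before intersecting, even when the other side of the AND was already empty.
    public static IEnumerable<int> AndEager(Func<IEnumerable<int>> customerClause,
                                            Func<IEnumerable<int>> statusClause)
    {
        var right = new HashSet<int>(statusClause());
        return customerClause().Where(right.Contains);
    }
}
```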
In order to resolve that, I have to explain a bit about the internals of Lucene. At its core, you can think of Lucene in terms of sorted lists inside dictionaries. I wrote a series of posts on the topic, but the gist of it is:
Note that the ids of documents containing a particular term are sorted. That is important for a lot of optimizations in Lucene, and it is also a major problem for the IN query. Each component in the query pipeline needs to maintain this invariant, but when we use an IN query, we need to go over potentially many terms and then hand the results to the calling code in the proper order. I implemented a tiered approach. If we are using an IN clause with a small number of terms (under 128), we will use a heap to manage all the terms and effectively do a merge sort on the results.
When we have more than 128 terms, that stops being very useful, however. Instead, we’ll create a bitmap for the possible results and scan through all the terms, filling the bitmap. That can be expensive, of course, so I made sure that this is done lazily by RavenDB.
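Here is a rough sketch of that tiered strategy, assuming each term's postings are already a sorted array of document ids. This is only an illustration of the approach, not the code that went into RavenDB, and the names are invented.

```csharp
using System.Collections;
using System.Collections.Generic;

static class InClauseSketch
{
    const int HeapThreshold = 128;

    // Yields matching document ids in sorted order, as the rest of the query pipeline expects.
    public static IEnumerable<int> Match(List<int[]> postingsPerTerm, int maxDocId)
    {
        if (postingsPerTerm.Count <= HeapThreshold)
        {
            // Few terms: a k-way merge over the sorted postings, driven by a heap.
            var heap = new SortedSet<(int docId, int term, int index)>();
            for (int t = 0; t < postingsPerTerm.Count; t++)
                if (postingsPerTerm[t].Length > 0)
                    heap.Add((postingsPerTerm[t][0], t, 0));

            int last = -1;
            while (heap.Count > 0)
            {
                var (docId, term, index) = heap.Min;
                heap.Remove(heap.Min);
                if (docId != last)              // the same document may match several terms
                {
                    last = docId;
                    yield return docId;
                }
                var postings = postingsPerTerm[term];
                if (index + 1 < postings.Length)
                    heap.Add((postings[index + 1], term, index + 1));
            }
        }
        else
        {
            // Many terms: fill a bitmap once, then walk it to emit the ids in order.
            var bitmap = new BitArray(maxDocId + 1);
            foreach (var postings in postingsPerTerm)
                foreach (var docId in postings)
                    bitmap[docId] = true;

            for (int docId = 0; docId <= maxDocId; docId++)
                if (bitmap[docId])
                    yield return docId;
        }
    }
}
```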
The results are in:
| | OR Query | IN Query |
| --- | --- | --- |
| Invalid CustomerId | 1.39 – 1.5 ms | 1.33 – 1.44 ms |
| Valid CustomerId | 17.5 ms | 12.3 ms |
For the first case, this is now pretty much a wash. The numbers are slightly in favor of the IN query, but that is within measurement fluctuations.
For the second case, however, there is a huge performance improvement for the IN query. What's more, the difference is going to be more noticeable the more terms you have in the IN query.
I’m really happy about this optimization, it ended up being quite elegant.
I have written an article about the uses of the profiler and the benefits it brings. And here is the 30-second video:
You can watch the full demo in the Cosmos DB webinar.
This post is here because we recently had to add this code to RavenDB:
Yes, we added a sleep to RavenDB, and we did it to increase performance.
The story started out with a reported performance regression. On a previous version of RavenDB, the user was able to insert 32,000 documents per second. Same code, same machine, new version of RavenDB, but the performance is 13,000 documents per second.
That is, as we call it internally, an Issue. More specifically, issue RavenDB-14777.
Deeper investigation revealed that the problem was that we were too fast, and therefore we were too slow. Adding a sleep fixed the being-too-fast thing, so we were fast again.
You might need to read the previous paragraph a few times to make sense of it; I'm particularly proud of it. Here is what actually happened. Our bulk insert code is reading from the network, and as soon as we have some data, we start parallelizing the write to disk and the read from the network. The idea is that we want to reduce the time the user spends waiting, so we maximize the amount of work we do. This is a fairly standard optimization for us and has paid many dividends in performance. The way it works is that we read from the network until there is nothing available and we have to wait for I/O, at which point we start writing to disk and wait for the network I/O to resume the operation.
However, the issue is that the versions the user was trying also included a runtime change. The old version ran on .NET Core 2.2 and the new version runs on .NET Core 3.1. There have been many optimizations as a result of this change, and it seems that the read-from-network path has benefited from them.
As a result, we would be able to read the data from the buffer more quickly, which meant that we would end up waiting for network I/O sooner. And that meant that we would do a lot more disk writes, because we were better at reading from the network. And that, in turn, slowed down the whole thing enough to be noticeable.
Our change means that we'll only queue a new disk operation if there have been 5 milliseconds with no new network traffic (or a bunch of other conditions that you don't really care about). This way, we retain the parallel work without saturating the disk with small writes.
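Here is a much simplified sketch of that heuristic. The real code has many more conditions and entirely different plumbing; the method and parameter names here are made up for illustration.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

static class BulkInsertSketch
{
    static readonly TimeSpan IdleBeforeFlush = TimeSpan.FromMilliseconds(5);

    // Keeps reading while the network keeps delivering data; only once the network has been
    // quiet for ~5ms (or the stream ends) does it hand the accumulated batch to the disk writer.
    public static async Task PumpAsync(Stream network, Func<ArraySegment<byte>, Task> writeToDisk)
    {
        var buffer = new byte[64 * 1024];
        var pending = new MemoryStream();

        while (true)
        {
            var readTask = network.ReadAsync(buffer, 0, buffer.Length);
            if (await Task.WhenAny(readTask, Task.Delay(IdleBeforeFlush)) != readTask &&
                pending.Length > 0)
            {
                // The network went quiet: flush what we have, then go back to waiting on the read.
                await writeToDisk(new ArraySegment<byte>(pending.GetBuffer(), 0, (int)pending.Length));
                pending.SetLength(0);
            }

            var read = await readTask;
            if (read == 0)
                break; // end of the bulk insert stream
            pending.Write(buffer, 0, read);
        }

        if (pending.Length > 0)
            await writeToDisk(new ArraySegment<byte>(pending.GetBuffer(), 0, (int)pending.Length));
    }
}
```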
As I said earlier, we had to pump the brakes to get to really high speed.
The very first product of Hibernating Rhinos was a profiler for NHibernate, to let you figure out exactly what is going on between your database and your application. Now I'm proud to present our latest product: the Cosmos DB Profiler.
If you are using Azure, you are likely familiar with Cosmos DB. Cosmos DB is not a traditional relational database. It is marketed by Microsoft as a multi-model database and it is widely known in the world of distributed databases. The first part is important enough to bear repeating: Cosmos DB is not a relational database, even if there is a tendency to treat it as such.
We have gathered everything we know about optimal database usage, mixed in all the experience we have gained from seeing users bump into issues working with distributed systems, and then looked into all the best practices published about successful Cosmos DB applications. After we had all of that, we looked into patterns, things that we can do for you, automatically, that would prevent you from messing up. Thus, the Cosmos DB Profiler was born.
Here is how it looks, profiling an application locally:
As you can see, it gives you context for the interaction between your application and the database. It allows you to see exactly what is going on behind the scenes. This is important, since most Cosmos DB applications aren't trivial; we are usually talking about big applications with a lot of data and moving pieces. It can be hard to understand what is actually going on when you run a particular action.
Furthermore, the profiler is able to give you concrete suggestions that will improve your performance and reduce your cloud bills.
The pricing model for Cosmos DB is based on provisioned capacity, and it is very easy to get into a state where you need to provision a lot more than what you expected to need. The profiler is able to detect such issues, provide you with concrete recommendations on how to fix them and show you the savings, immediately.
I’m doing a webinar on the Cosmos DB profiler on Tuesday and I would love to see you there.
Recently JetBrains announced that dotTrace and dotMemory support profiling applications on Linux. We have been testing these out for production usage, analyzing production scenarios, and have been finding (and fixing) issues with about an order of magnitude less work and hassle. I have been a fan of JetBrains for over 15 years and they have been providing excellent software for a long time. So why this post? (And no, this isn't an ad; we pay for a JetBrains subscription for all our devs.)
I’m super excited about being able to profile on Linux, because this is where most of our customers are now running. Before, we had to use arcane tooling and try to figure things out. With better tools, we are able to pinpoint issues almost immediately.
Here is the result of a single profiling session:
Here, we created a new List and then called ToArray() on it. All we had to do was create the array upfront, without going through resizes and additional allocations.
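As an illustration of that class of fix (hypothetical names and types, not the actual RavenDB code):

```csharp
using System.Collections.Generic;
using System.Linq;

static class PreallocateSketch
{
    // Before: the list grows through several internal resizes, then ToArray() copies it once more.
    public static long[] CollectSlow(IReadOnlyList<long> source)
    {
        var ids = new List<long>();
        foreach (var id in source)
            ids.Add(id);
        return ids.ToArray();
    }

    // After: the final size is known up front, so allocate the array once and fill it in place.
    public static long[] CollectFast(IReadOnlyList<long> source)
    {
        var ids = new long[source.Count];
        for (int i = 0; i < source.Count; i++)
            ids[i] = source[i];
        return ids;
    }
}
```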
Here we had to do a much more involved operation, remembering values across transactions to reduce load. Note how much time we saved!
And of course, just as important:
We tried to optimize an operation. It looks like it didn't work out exactly as expected, so we need to take another look at that.
These are about performance, but even more important here is the ability to track memory and its sources. We were able to identify multiple places where we were using too much memory. In some cases, this is about configuration options. In one such case, limiting the amount of memory that we speculatively hold back resulted in a reduction of memory utilization over time from 3.5 GB to 800 MB. These issues are much harder to figure out, because it is very hard to see the full picture.
Another factor that became obvious is that RavenDB's current infrastructure was written in 2015 / 2016. The lowest pieces of the code were written on what was then called DNX. They were written to be high performance and stable, and we haven't had to touch them much ever since. Profiling memory utilization showed us that we hold a large number of values in the large object heap. This is because we need a byte array that we can also hand to native code, so we need to pin it. That was because we needed to call methods such as Stream.Read(byte[], int, int) and then process the data using raw pointers.
When the code was written, that was the only choice: allocate it on the large object heap (to reduce fragmentation) and run with it. Now, however, we have Memory<byte> and Span<byte>, which are much easier to work with for this scenario, so we can discard this specific approach and optimize our memory utilization further.
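Roughly, the shift looks like this. This is a sketch under the assumption of a simple read-and-parse path; the Parse methods and buffer sizes are invented stand-ins for the real processing.

```csharp
using System;
using System.Buffers;
using System.IO;

static class PinningSketch
{
    // Old approach: a large buffer allocated straight on the large object heap (to limit the
    // fragmentation that pinning would otherwise cause), processed through raw pointers.
    public static unsafe void ReadOld(Stream stream, byte[] lohBuffer) // lohBuffer is > 85,000 bytes
    {
        int read = stream.Read(lohBuffer, 0, lohBuffer.Length);
        fixed (byte* ptr = lohBuffer)
        {
            ParsePointer(ptr, read);             // hypothetical pointer-based parser
        }
    }

    // New approach: Span<byte> provides the same raw view without keeping pinned LOH buffers
    // around, so ordinary pooled arrays can be rented and returned when done.
    public static void ReadNew(Stream stream)
    {
        byte[] pooled = ArrayPool<byte>.Shared.Rent(4096);
        try
        {
            int read = stream.Read(pooled, 0, pooled.Length);
            ParseSpan(pooled.AsSpan(0, read));   // hypothetical Span<byte>-based parser
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(pooled);
        }
    }

    static unsafe void ParsePointer(byte* data, int length) { /* ... */ }
    static void ParseSpan(ReadOnlySpan<byte> data) { /* ... */ }
}
```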
And here is the state:
And the mechanism there to clean them up will not normally be triggered, so we have all of this hanging around needlessly. After fixing the cache eviction logic, we get:
Looking at the byte arrays here, we can track this down further and see that the root cause is the pinning buffer I spoke about earlier. We're handling that fix in a different branch, so this is a pretty good sign that we are done with this kind of benchmark.
The downsides of dotTrace and dotMemory do exist. On Linux, dotTrace only supports sampling, not tracing. That sounds like a minor issue, but it means that when I'm running the same benchmark, I'm missing something that I find critical: the number of times a method was run. If a method is taking 50% of the runtime but is invoked hundreds of millions of times, I need a very different approach from something that happens three times. The lack of a tracing option, with the number of times each method was called, is a real issue for my usual workflow.
For dotMemory, if you have a highly connected graph (for example, a doubly linked list) and a lot of objects, the amount of resources that it takes to analyze the process is huge. I deployed a cloud machine with 128 GB of RAM and 32 cores to try to analyze an 18 GB process dump. That wasn’t sufficient to give all the details, sadly. It was still able to provide me with enough so I could figure out what the issue was, though.
Another issue is that the visuals are amazing, but when I have 120 million links to a node, trying to represent that in the UI is a hopeless cause. I wish there was a way to dump the same information that dotMemory provides to text. That would allow analyzing a ridiculous amount of information in a structured manner.
I care a lot about the performance of RavenDB, as you might have noticed from this blog. A few years ago we had a major architecture shift that increased our performance by a factor of ten, which was awesome. But appetite comes with eating, and we are always looking for better performance.
One of the things we did with RavenDB is build things so we'll have the seams in place to change the internal behavior without users noticing how things work behind the scenes. We have used this capability a bunch of times to improve the performance of RavenDB. This post is about one such scenario that we recently implemented; it will go into RavenDB 5.0.
Consider the following query:
As you can see, we are doing a range based query on a date field. Now, the source collection in this case has just over 5.9 million entries and there are a lot of unique elements in the specified range. Let’s consider how RavenDB will handle this query in version 4.2:
- First, find all the unique CreatedAt values between those ranges (there can be tens to hundreds of thousands).
- Then, for each one of those unique values, find all the matching documents (usually, only one).
This is expensive, and the problem almost always shows up when doing date range queries over non-trivial ranges, because that combines the two elements of many unique terms and very few results per term.
The general recommendation was to avoid running the query above and instead use:
This allows RavenDB to use a different method for range queries, based on numeric values rather than distinct string values. The performance difference is huge.
But the second query is ugly and far less readable. I don't like such a solution, even if it can serve as a temporary workaround. Because of that, we implemented a better system in RavenDB 5.0. Behind the scenes, RavenDB now translates the first query into the second one. You don't have to do anything to make it happen (when migrating from 4.2 instances, you'll need to reset the index to get this behavior). You just use dates as you would normally expect to, and RavenDB will do the right thing and optimize it for you.
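The queries themselves aren't reproduced here, but to give a sense of the shape of the change, here is a hedged sketch using the .NET client; the Order class, the index name, the ticks field name, and the dates are all invented for the example.

```csharp
// 'session' is an open IDocumentSession; everything else below is illustrative only.
DateTime start = new DateTime(2020, 1, 1);
DateTime end = new DateTime(2020, 2, 1);

// The readable form: a range over the DateTime field itself. In 4.2 this meant walking
// every distinct CreatedAt term inside the range.
var readable = session.Query<Order>()
    .Where(o => o.CreatedAt >= start && o.CreatedAt < end)
    .ToList();

// The workaround form: a numeric range over ticks, which skips the per-term work.
// RavenDB 5.0 effectively performs this translation for you behind the scenes.
var workaround = session.Advanced.DocumentQuery<Order>("Orders/ByCreatedAtTicks")
    .WhereBetween("CreatedAt_Ticks", start.Ticks, end.Ticks)
    .ToList();
```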
To give you a sense of the difference in performance, the query above on a data set of 5.9 million records has the following performance:
- RavenDB 4.2 - 7,523 ms
- RavenDB 5.0 – 134 ms
As you can imagine, I’m really happy about this kind of performance boost.
Last week we upgraded our internal production servers to run RavenDB under .NET Core 3.1. Here are the CPU metrics over the past two weeks.
It isn’t the only change in the deployed version, but none of the others were about performance.