I have wrote an article about the uses of the profiler and the benefits it brings. And here is the 30 seconds video:
You can watch the full demo in the Cosmos DB webinar.
I have wrote an article about the uses of the profiler and the benefits it brings. And here is the 30 seconds video:
You can watch the full demo in the Cosmos DB webinar.
This post is here because we recently had to add this code to RavenDB:
Yes, we added a sleep to RavenDB, and we did it to increase performance.
The story started out with a reported performance regression. On a previous version of RavenDB, the user was able to insert 32,000 documents per second. Same code, same machine, new version of RavenDB, but the performance is 13,000 documents per second.
That is, as we call it internally, and Issue. More specifically issue: RavenDB-14777.
Deeper investigation revealed that the problem was that we are too fast, therefor we are too slow. Adding a sleep fixed the being too fast thing, so we were faster again.
You might need to read the previous paragraph a few times to make sense of it, I’m particularly proud of it. Here is what actually happened. Our bulk insert code is reading from the network and as soon as we have some data, we start parallelizing the write to disk and the read from the network. The idea is that we want to be reduce the user time, so we maximize the amount of work we do. This is a fairly standard optimization for us and has paid many dividends in performance. The way it works, we read from the network until there is nothing available in memory and we have to wait for I/O, at which point we start writing to the disk and wait for the network I/O to resume the operation.
However, the issue is that the versions that the user was trying also included a runtime change. The old version run on .NET Core 2.2 and the new version run on .NET Core 3.1. There has been many optimizations as a result of this change, and it seems that the read from network path has benefited from these.
As a result, we would be able to read the data from the buffer more quickly, which meant that we would end up faster with waiting for network I/O. And that meant that we would do a lot more disk writes because we were better in reading from the network. And that, in turn, slowed down the whole thing enough to be noticeable.
Our change means that we’ll only queue a new disk operation if there has been 5 milliseconds with no new network traffic (or a bunch of other conditions that you don’t really care about). This way, we retain the parallel work and not saturate the disk with small writes.
As I said earlier, we had to pump the brakes to get into real high speed.
The very first product of Hibernating Rhinos was a profiler for NHibernate, to allow you to figure out exactly what is going between your database and application. Now I’m proud to present our latest product: the Cosmos DB Profiler.
If you are using Azure, you are likely familiar with Cosmos DB. Cosmos DB is not a traditional relational database. It is marketed by Microsoft as a multi model database and it is widely known in the world of distributed databases. The first part is important enough to bear repeating. Cosmos DB is not a relational database, even if there is a tendency to treat it as such.
We have gathered everything we know about optimal database usage, mixed in all the experience we run into seeing users bump into issue working with distributed systems and then looked into all the best practices published about successful Cosmos DB applications. After we had all of that, we looked into patterns, things that we can do for you, automatically, that would prevent you from messing up. Thus, the Cosmos DB profiler was born.
Here is how it looks like, profiling an application locally:
As you can see, it give you context to the interaction between your application and the database. It allows you to see exactly what is going on behind the scenes. This is important, since most Cosmos DB applications aren’t trivial, we are usually talking about big applications with a lot of data and moving pieces. It can be hard to understand what is actually going on when you run a particular action.
Furthermore, the profiler is able to give you concrete suggestions that will improve your performance and reduce you cloud bills.
The pricing model for Cosmos DB is based on provisioned capacity, and it is very easy to get into a state where you need to provision a lot more than what you expected to need. The profiler is able to detect such issues, provide you with concrete recommendations on how to fix them and show you the savings, immediately.
I’m doing a webinar on the Cosmos DB profiler on Tuesday and I would love to see you there.
Recently JetBrains announce that dotTrace and dotMemory supports profiling applications on Linux. We have been testing these out for production usage, analyzing production scenarios and have been finding (and fixing) issues with about an order of magnitude less work and hassle. I have been a fan of JetBrains for over 15 years and they have been providing excellent software for a long time. So why this post (and no, this isn’t an ad and we are paying for a subscription to JetBrains for all our devs).
I’m super excited about being able to profile on Linux, because this is where most of our customers are now running. Before, we had to use arcane tooling and try to figure things out. With better tools, we are able to pinpoint issues almost immediately.
Here is the result of a single profiling session:
Here, we create a new List and then called ToArray() on it. All we had to do here was just create the array upfront, without going through resize and additional allocations.
Here we had to do a much move involved operation, remembering values across transactions to reduce load. Note how much time we saved!
And of course, just as important:
We tried to optimize an operation. It looks like it didn’t work out exactly as expected, so we need to take additional look at that.
These are about performance, but even more important here is the ability to track memory and its sources. We were able to identify multiple places where we are using too much memory. In some cases, this is about configuration options. In one such case, limiting the amount of memory that we speculatively hold back resulted in a reduction of memory utilization over time from 3.5 GB to 800 MB. And these are much harder to figure out, because it is very hard to see the full picture.
Another factor that became obvious is that RavenDB’s current infrastructure was written in 2015 / 2016. The lowest pieces of the code were written on what was then called DNX. It was written to be high performance and stable. We didn’t have to touch it much ever since. Profiling memory utilization showed us that we hold a large number of values in the large object heap. This is because we need to have a byte that we can also send to native code, so we need to pin it. That was because we needed to call some methods, such as Stream.Read(byte, int,int() and then process it using raw pointers.
When the code was written, that was the only choice. Allocate it on the large object heap ( to reduce fragmentation) and run with it. Now, however, we have Memory<byte> and Span<byte>, which are much easier to work with for this scenario, so we can discard this specific feature and optimize our memory utilization further.
And here is the state:
And the mechanism there to clean them up will not be triggered normally. So we have this needlessly around. After fixing the cache eviction logic, we get:
Looking at the byte arrays here, we can track this down further, to see that the root cause is the pinning buffer I spoke about earlier. We’re handling that fix in a different branch, so this is pretty good sign that we are done with this kind of benchmark.
The down sides dotTrace and dotMemory, exist. On Linux, dotTrace only supports sampling, not tracing. That sounds like a minor issue, but it means that when I’m running the same benchmark, I’m missing something that I find critical. The number of times a method was run. If a method is taking 50% of the runtime, but it is invoked hundred millions times, I need a very different approach from something that happens three times. The lack on tracing option with number of times the methods were called is a real issue for my usual workflow.
For dotMemory, if you have a highly connected graph (for example, a doubly linked list) and a lot of objects, the amount of resources that it takes to analyze the process is huge. I deployed a cloud machine with 128 GB of RAM and 32 cores to try to analyze an 18 GB process dump. That wasn’t sufficient to give all the details, sadly. It was still able to provide me with enough so I could figure out what the issue was, though.
Another issue is that the visuals are amazing, but when I have a 120 millions links in a node, trying to represent that in the UI is a hopeless cause. I wish that there was a way to dump to text the same information that dotMemory provides. That would allow to analyze ridiculous amount of information in a structured manner.
I can a lot about the performance of RavenDB, a you might have noticed from this blog. A few years ago we had a major architecture shift that increased our performance by a factor of ten, which was awesome. But with the meal, you get appetite, and we are always looking for better performance.
One of the things we did with RavenDB is build things so we’ll have the seams in place to change the internal behavior without users noticing how things are working behind the scenes. We have used this capability a bunch of time to improve the performance of RavenDB. This post if about one such scenario that we recently implemented and will go into RavenDB 5.0.
Consider the following query:
As you can see, we are doing a range based query on a date field. Now, the source collection in this case has just over 5.9 million entries and there are a lot of unique elements in the specified range. Let’s consider how RavenDB will handle this query in version 4.2:
This is expensive and the problem almost always shows up when doing date range queries over non trivial ranges because that combine the two elements of many unique terms and very few results per term.
The general recommendation was to avoid running the query above and instead use:
This allows RavenDB to use a different method for range query, based on numeric values, not distinct string values. The performance different is huge.
But the second query is ugly and far less readable. I don’t like such a solution, even if it can serve as a temporary workaround. Because of that, we implemented a better system in RavenDB 5.0. Behind the scenes, RavenDB now translate the first query into the second one. You don’t have to do anything to make it happen (when migrating from 4.2 instances, you’ll need to reset the index to get this behavior). You just use dates as you would normally expect them to be used and RavenDB will do the right thing and optimize it for you.
To give you a sense of the different in performance, the query above on a data set of 5.9 million records will have the following performance:
As you can imagine, I’m really happy about this kind of performance boost.
Last week we upgraded our internal production servers to run RavenDB under .NET Core 3.1. Here are the CPU metrics over the past two weeks.
It isn’t the only change in the deployed version, but none of the others were about performance .
The last production postmortem that I blogged about in real time was almost a year ago. This is something that makes me very happy, considering the uptick we see in RavenDB usage. All the efforts we put into making RavenDB more stable, predictable and robust has been paying off. The “downside” of that is that I have less interesting stories to tell, of course, but I’ll live with that.
Today’s story, however, is about the nastiest of problems. An occasional slow down in production that cause RavenDB to halt for about 5 seconds. The killer is that this is something that would only reproduce after several weeks of running, and it isn’t consistent. Once in a while, without any consistency, RavenDB would appear to stop processing requests for a period of a few seconds, and then resume normally. Those kind of bugs are the worst, because it is very hard to narrow down exactly what is going on, even before we get to trying to figure out the root cause.
We quickly ruled out the usual suspects. There was no high CPU, swapping to disk or slow I/O that can explain it. We tested the underlying hardware and it seemed fine as well. The problem would usually be quickly fixed if you restarted RavenDB, but sometimes that wasn’t enough. Restarting the whole server was required to get back to the baseline performance. Note that usually, RavenDB performed just fine, it is just that occasionally it would pause.
This naturally made us suspect that we have some issue with the GC causing pauses, but it didn’t make sense. Our allocation rates weren’t high and we didn’t have that big of a managed heap. In short, pretty much all avenues of investigation looked like they were closed to us.
We took several dumps of the process state and inspected what was going on there. Pretty much all indications pointed to there being an issue with the GC, but we couldn’t figure out why. Then we started to analyze the dump file in more detail, here is everything in the dump that was over 100MB:
The total size of the managed heap was just over 8GB, in a system with 64GB of RAM. So nothing really that interesting. The number of strings was high, I’ll admin, much higher than what we’ll usually find in a RavenDB process, but this database instance was doing heavy indexing, so that was probably the reason for this.
But pay very close attention to the second item from the end. That is about 800 MB (!!) of ThreadLocal<WeakReference>.LinkedSlotVolatile array. And that was suspicious. We looked into this a bit more and discovered that we had this tidbit:
00007f8203682ce0 50062 2002480 System.Threading.ThreadLocal`1[[System.WeakReference, System.Private.CoreLib]]
To start with, that isn’t too bad. We have 2MB or so of ThreadLocal<WeakReference> instances, no big deal. But look at the instance count (which is the second column). We had over 50,000 of those. And that didn’t seem right at all.
We started to investigate how ThreadLocal<T> works, and we found that it is really interesting. Here is the in memory structure of a ThreadLocal<T>. The internal structure is quite interesting:
Each ThreadLocal<T> instance has an id, which is generated sequentially. For each thread, there is a static thread local array that is allocated to store the values for this thread. The id of the ThreadLocal instance is used to index into this array. The array is for the local thread, but all the values across all threads for a particular ThreadLocal are held together as a doubly linked list.
Note that ThreadLocal has a trackAllValues constructor parameter that does not affect this behavior at all. It simply control whatever you are allowed to call the Values property, not whatever the thread local instance will track all the values.
Due to reasons that I’ll get to later, we created a lot of ThreadLocal instances. That means that we had instance ids in the high tens of thousands. When allocating the thread static array, the ThreadLocal will allocate an array that can hold its id (to the next power of two). So if we have a ThreadLocal with id of 50,062, it will allocate an array with 65,536 elements. That would explain the amount of memory that we saw in the memory dump and is interesting all on its own.
It did not explain the problem with the GC. At least, not yet. As we looked further into this issue, we noticed that this problem only occurred on very large database instances. Ones that had dozens of databases and many indexes. One of the ways that RavenDB ensure isolation of components is to have them each run in a different thread. In those machines,we have had processes that run with thousands of threads, usually in the range of 3000 to 6000.
Combine what we know about ThreadLocal and the number of threads, and you might start to see the problem. Not all ThreadLocals are used in all threads, but when they do, we need to allocate an array that is 65,536 elements in each of the threads. That translate to a total size that is measured in hundreds of millions.
That explains the size, again, but what about the GC speed? I wrote a small isolated test to see what this looks like and I was able to reproduce this on its own. That was really interesting, but I didn’t think that the issue was with ThreadLocal directly. Rather, the problem was with the lattice like structure that we have here. Because of this, I decided to check what it would cost for the GC to run on such a system without dealing with intermediaries.
Here is what this looks like:
On my machine, this code results in GC taking over 200ms each time on a heap that is less than 0.5 GB in size. Given how the GC works, it makes sense. And that means that the accidental lattice structure that we create using ThreadLocal is at the root of our troubles. The question is why do we have so many of them.
Internally, inside Lucene, there is a ThreadLocal<WeakReference> that is being used when you use a particular feature. This is used once per segment, so it isn’t too bad. However, consider what happens over time in a process that have thousands of indexes and is constantly busy?
Each indexing run will create a segment, and each one of them will have a ThreadLocal instance. At the same time, we also have a lot of threads, which create this exact scenario. The problem slowly accumulate over time. As you have more and more indexing runs, you’ll have more and more such instances and you’ll get to bigger and bigger arrays on each thread. This explains why we are able to see the issue only on instances that have been running for weeks, and then, only on those instances that run a particular set of queries that make use of this feature.
We reported the issue to the .NET team and I’m very curious about what the end result will be here. On our end, we are going to have to revamp how we are handling this type of situation. We have a plan of action already and we’ll see over the next week or so how it plays out in production load.
One of our developers recently got a new machine, and we were excited to see what kind of performance we can get out of it. It is an AMD Ryzen 9, 12 cores @ 3.79 Ghz with 32 GB of RAM. The disk used was Samsung SSD 970 EVO Plus 500 GB.
This isn’t an official benchmark, to be fair. This is us testing on how fast the machine is. As such, this is a plain vanilla Windows 10 machine, with no effort to perform any optimizations. Our typical benchmark involves loading all of stack overflow into RavenDB, so we’ll have enough data to work with. Here is what things looked like midway through:
As you can see, the write speed we are able to get is impressive.
We were able to insert all of stack overflow, a bit over 52GB in 3 and a half minutes, at a rate of about 300 MB / sec sustained.
Then we tested indexing.
Remember, these numbers are for indexing everything for the first time. It is worth noting that RavenDB dedicates a single thread per index, to avoid hammering the system with too much work. That means that this indexes were building concurrently with one another.
Here is the system utilization while this was going on:
Finally, we tested some other key scenarios (caching disabled in all of them):
These two are really interesting numbers. The first one, we generate queries to specific documents over an over (with no caching). That means that RavenDB is able to answer them from memory directly. The idea is to simulate a common scenario of a working set that can fit entirely in memory.
The second one is different. The data size on disk is 52 GB and we have 32 GB available for us. We generate random queries here, for different documents each time. We ensure that the queries cannot be served directly from memory and that RavenDB will have to hit the disk. As you can see, even under this scenario, we are doing fairly well. As an aside, it helps that the disk is good. We tried running this on HDD once. The results were… not nice.
The final test we did was for writes, writing a small document to RavenDB. We got 118,000 writes/sec on a sustained basis, with about 32MB / sec in data throughput. Note that we can do more, but playing with the system configuration, but we are already at high enough rate that it probably wouldn’t matter.
All in all, that is a pretty nice machine.
In the previous post, I wrote about how I changed the structure of the hash leaf page to increase data density. I managed to get it down to 32MB range when I’m using random keys. That is a pretty great number, for memory usage, but what is the cost in terms of performance?
Well, let’s figure it out, shall we?
I added some tracing code and got the first result:
3.124000 us/op with 32.007813 MB
That is not to shabby, right? Let’s see where we are spending most of our time, shall we? I opened the profiler and got:
Okay, that is a good point, isn’t it? Changing to release mode gives us:
1.471000 us/op with 32.007813 MB
that is much nicer, but still, profiler please…
As a side note, it actually takes less time to run the profiler than for it to analyze its output. I was looking at this for a while.
The result was… stunning:
What is this thing? And why did it take almost 50% of my runtime?
As it turns out, I was compiling for x86, and I’m using a lot of shifts on 64 bits numbers. This _allshl seems to be part of the x86 runtime. That means that what I expected to be a cheap instruction on a register was actually a method call.
That is interesting, but easy to fix. When running in Release/x64, we get the following results:
0.723 us/op with 32.007813 MB
Okay, so we are under a microsecond per op, and very reasonable memory, good to go, right?
Well, remember that I did absolutely zero optimizations so far? What does the profiler tell us now? Here is an interesting hotspot:
That is reasonable, we are benching this method, after all. But inside that method, we see:
This is the part where we scan an existing piece to see if the value is inside it or not. This tell us if we need to add a new value or update an existing one. It make sense this will be hot, we have to do it on each put to the data related to the piece where we want to put the new key.
There are a few ways to deal with this, we can try to move from the simple varint mode to a more complex (and performant) system. StreamVByte would probably be a good solution, in term of raw performance. But it is meant for 32 bits numbers and doesn’t play nice with being able to remove and add values from the stream easily.
I could also try to play games, instead of calling this function twice, call it once and pass both k and v. However, that is almost assuredly a false play. The varint method is small enough that it doesn’t really matter, the compiler can inline it and play its own optimizations. Also, I tried it and there was no noticeable performance change, so that’s down.
Another way to deal with it is to reduce the number of times we call this function. And here is where things get interesting. Why is this called so much? Because during the put process, we find a page to put a value, then in that page, we find a piece (a 64 byte range) that we will put the key and value in. When we get to the piece, we need to check the already existing data if the key is there or not. So far, so good, but there is another factor to consider, overflows.
A piece may overflow and spill into consecutive pieces. After all, that is what allowed us to reduce the memory usage from 147MB to just 32MB in the random integers scenario. However, that also means that we may need to scan much larger piece of the page. That explains why we are seeing so much usage of the decoding function.
Let’s look at the previous behavior, where we have no overflow at all?
0.551000 us/op with 147.320313 MB
That is a much cheaper cost, but much higher memory. It looks like the typical compute vs. memory cycle, but let’s look at the actual costs?
You’ll notice that we spend most of our time on increasing the hash table size, allocating and moving memory, etc. So even though we are faster, that isn’t a good option for us.
One thing to note, we are looking for the same key, and decoding all the data to find it. But we don’t actually need to do that, we already have the key, and encoded it to its varint form. We can do a search on the raw encoded data to find it. It won’t be good enough for the positive case (we may have a value that was encoded to the same form), but it should help for the common case of inserting a new value. If we find something with memmem(), we still need to decode the data itself and see if the pattern we found is a key or a value, but that should help.
I tested it using GCC’s implementation, and the performance dropped by almost 50%, it took 1.3 us/op! Maybe if I was using a SIMD optimized implementation, that would help, but given the kind of data we are looking for, it didn’t pan out.
Another option is to reduce the number of times we’ll try to overflow a value. Right now, if we can’t put a value in its proper place, we’ll try putting it in any of the other locations. That means that we may probe as many as 127 pieces. It also means that during put, we have to scan overflow chains. As we saw in the previous post, that can add up to scanning up to 1.8 KB of data for a single put. What happens if we limit the overflow amount?
Let’s see if we limit the overflow to 32 probes. Now it only takes 0.403 us/op, which is a huge improvement. But what about the memory size? It’s easier to look things up as a table:
Max chain Overall Time (sec) us/op Size (MB) 1 0.545000 0.545000 147.320313 2 0.359000 0.359000 75.156250 4 0.372000 0.372000 55.523438 8 0.322000 0.322000 36.882813 16 0.336000 0.336000 32.226563 32 0.448000 0.448000 32.007813 64 0.596000 0.596000 32.007813 128 0.770000 0.770000 32.007813
These numbers are interesting, but let’s look at them as a graph, shall we?
We can see that the size drops sharply as the performance is best between 8 and 16 probe attempts, and all we are left choosing is the memory cost.
If we go with 8 probe attempts, we’ll pay with additional 4.875 MB, but with 16 probe attempts, we’ll use just 224KB more with a cost of 0.044 us/op more than the optimal value.
We could go to 32, of course, which gives us optimal size, with about 60% of the cost of doing the full scan. However, by paying just 224KB more, we get down to 43% of the initial cost. And that certainly seems like it is worth it.
You can find the full source code (a little bit cleaned up) here.
We got a few requests for some guidance on how to optimize RavenDB insert rate. Our current benchmark is standing at 135,000 inserts/sec on a sustained basis, on a machine that cost less than a 1,000$. However, some users tried to write their own benchmarks and got far less (about 50,000 writes / sec). Therefor, this post, in which I’m going to do a bunch of things and see if I can make RavenDB write really fast.
I’m sorry, this is likely to be a long post. I’m going to be writing this as I’m building the benchmark and testing things out. So you’ll get a stream of consciousness. Hopefully it will make sense.
Because of the size of this post, I decided to move most of the code snippets out. I created a repository just for this post, and I’m showing my steps as I go along.
Rules for this post:
The first thing to do is to figure out what we are going to write.
The test machine is: t3a.xlarge with 4 cores, 16 GB RAM. This seemed like a fairly reasonable machine to test with. I’m using Ubuntu 18.04 LTS as the operating system.
The machine has an 8GB drive that I’m using to host RavenDB and a separate volume for the data itself. I create a 512GB gp2 volume (with 1536 IOPS) to start with. Here what this looked like from inside the machine:
I’m including the setup script here for completeness, as you can see, there isn’t really anything here that matters.
Do note that I’m going the quick & dirty mode here without security, this is mostly so I can see what the impact of TLS on the benchmark is at a later point.
We are now pretty much ready, I think. So let’s take a look at the first version I tried. Writing 100,000 random user documents like the following:
As you can see, that isn’t too big and shouldn’t really be too hard on RavenDB. Unfortunately, I discovered a problem, the write speed was horrible.
Oh wait, the problem exists between keyboard and chair, I was running that from my laptop, so we actually had to go about 10,000 KM from client to server. That… is not a good thing.
Writing the data took almost 12 minutes. But at least this is easy to fix. I setup a couple of client machines on the same AZ and tried again. I’m using spot instances, so I got a t3.large instance and a m5d.large instance.
That gave me a much nicer number, although still far from what I wanted to have.
On the cloud machines, this takes about 23 - 25 seconds. Better than 12 minutes, but nothing to write home about.
One of the reasons that I wanted to write this blog post is specifically to go through this process, because there are a lot of things that matter, and it sometimes can be hard to figure out what does.
Let’s make a small change in my code, like so:
What this does is to remove the call to RavenDB entirely. The only cost we have here is the cost of generating the from the Bogus library. This time, the code completes in 13 seconds.
But wait, there are no RavenDB calls here, why does it take so long? Well, as it turns out, the fake data generation library has a non trivial cost to it, which impact the whole test. I changed things so that we’ll generate 10,000 users and then use bulk insert to send them over and over again. That means that the time that we measure is just the cost of sending the data over. With these changes, I got much nicer numbers:
While this is going on, by the way, we have an interesting observation about the node while I’m doing this.
You can see that while we have two machines trying to push data in as fast as them can, we have a lot of spare capacity. This is key, actually. The issue is what the bottleneck, and we already saw that the problem is probably on the client. We improved our performance by over 300% by simply reducing the cost of generating the data, not writing to RavenDB. As it turns out, we are also leaving a lot of performance on the table because we are doing this single threaded. A lot of the time is actually spent on the client side, doing serialization, etc.
I changed the client code to use multiple threads and tried it again. By the way, you might notice that the client code is… brute forced, in a way. I intentionally did everything in the most obvious way possible, caring non at all about the structure of the code. I just want it to work, so no error handling, nothing sophisticated at all here.
This is with both client machines setup to use 4 threads each to send the data. It’s time to dig a bit deeper and see what is actually going on here. The t3.large machine has 2 cores, and looking into what it is doing while it has 4 threads sending data is… instructive…
The m5d.large instance also have two cores, and is in a similar state:
Leaving aside exactly what is going on here (I’ll discuss this in more depth later in this post), it is fairly obvious that the issue here is on the client side, we are completely saturating the machine’s capabilities.
I created another machine to serve as a client, this time a c5.9xlarge, an instance that has 36 cores and is running a much faster CPU that the previous instances. This time, I a single machine and I used just a single thread, and I got the following results:
And at the same time, the server resources utilization was:
Note that this is when we have a single thread doing the work… what happens when we increase the load?
Given the disparity between the client (36 cores) and the server (just 4), I decided to start slow and told the client to use just 12 threads to bulk insert the data. The result:
Now we are talking, but what about the server’s resources?
We got ourselves some spare capacity to throw around, it seems.
At this point, I decided to go all in and see what happens when I’m using all 36 cores for this. As it runs out, we can get faster, which is great, but the rise isn’t linear, unfortunately.
At this point, I mostly hit the limits. Regardless of how much load I put on the client, it wasn’t able to hit any higher than this. I decided to look at what the server is doing. Write speed for RavenDB is almost absolutely determined by the ACID nature of the database, we have to wait for the disk to confirm the write. Because this is such an important factor of our performance, we surface all of that information to you. In the database’s stats page, you can go into the IO Stats section, like so:
The first glace might be a bit confusing, I’ll admit. We tried to pack a lot of data into a single view.
The colors are important. Blue are writes to the journal, which are the thing that would usually hold up the transaction commit. The green (data write / flush) and red (sync) are types of disk operations, and they are shown here to allow you to see if there are any correlation. For example, if you have a big sync operation, it may suck all the I/O bandwidth, and your journal writes will be slow. With this view, you can clearly correlate that information. The brighter the color, the bigger the write, the wider the write, the more time it took. I hope that this is enough to understand the gist of it.
Now, let’s zoom in. Here you can see a single write, for 124KB, that took 200ms.
Here is another one:
These are problematic for us, because we are stalling. We can’t really do a lot while we are waiting for the disk (actually, we can, we start processing the next tx, but there is a limit to that as well). That is likely causing us to wait when we read from the network and in likely the culprit. You might have noticed that both slow writes happened in conjunction with the sync (the red square below), that indicate that we might have latency because both operations go to the same location at the same time.
On the other hand, here is another section, where we have two writes very near one another and they both very slow, without a concurrent sync. So the interference from the sync is a theory, not a proven fact.
We can go and change the gp2 drive we have to an io1 drive with provisioned IOPS (1536, same as the gp2). That would cost me 3 times as much, so let’s see if we can avoid this. Journals aren’t meant to be forever. They are used to maintain durability of the data, once we synced the data to disk, we can discard them.
I created an 8 GB io2 drive with 400 IOPS and attached it to the server instance and then set it up:
Here is what this ended up as:
Now, I’m going to setup the journals’ directory for this database to point to the new drive, like so:
And now we have a better separation of the journals and the data, let’s see what this will give us? Not much, it seems, I’m seeing roughly the same performance as before, and the IO stats tells the same story.
Okay, time to see what we can do when we change instance types. As a reminder, so far, my server instance was t3a.xlarge (4 cores, 16 GB). I launched a r5d.large instance (2 cores, 16 GB) and set it up with the same configuration as before.
Here is what I got when I started hammering the machine:
This is interesting, because you can see a few discrepancies:
That seems pretty promising, so I decided to switch instances again. This time to i3en.xlarge instance (4 cores, 30GB, 2 TB NVMe drive). To be honest, I’m mostly interested in the NVMe drive .
Here are the results:
As you can see, we are running pretty smoothly with 90K – 100K writes per second sustained.
On the same i3en.xlarge system, I attached the two volumes (512GB gp2 and 8GB io2) with the same setup (journals on the io2 volume), and I’m getting some really nice numbers as well:
And now, the hour is nearing 4AM, and while I had a lot of fun, I think this is the time to close this post. The factor in write performance for RavenDB is the disk, but we care a lot more about latency than throughput for these kind of operations.
A single t3a.xlarge machine was able to hit peak at 77K writes/second and by changing the instance type and getting better IO, we were able to push that to 100K writes/sec. Our current benchmark is sitting at 138,000 writes/second, by the way, but it isn’t running on virtual machine but on physical hardware. Probably the most important part of that machine is the fact that is has an NVMe drive (latency, again).
However, there is one question that still remains. Why did we have to spend so much compute power on generating the bulk insert operations? We had to hit the server from multiple machines or use 36 concurrent threads just to be able to push enough data so the server will sweat it.
To answer this, I’m going to do the Right Thing and look at the profiler results. The problem is in the client side, so let’s profile the client and see what is taking so much computation horse power. Here are the results:
The cost here is serialization is the major factor here. That is why we need to parallelize the work, otherwise, as we saw, RavenDB is basically going to sit idle.
The reason for this "issue" is that JSON.Net is a powerful library with many features, but it does have a cost. For bulk insert scenarios, you typically have a very well defined set of documents, and you don't need all this power. For this reason, RavenDB exposes an API that allow you to fully control how serialization works for bulk insert:
You can use that to significantly speed up your insert processes.
No future posts left, oh my!