Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:

oren@ravendb.net

+972 52-548-6969

Posts: 7,162 | Comments: 50,148

Privacy Policy Terms
filter by tags archive
time to read 4 min | 660 words

image (2)We care a lot about the performance of RavenDB.

Aside from putting a lot of time an effort into ensuring that RavenDB uses optimal code, we also have a monitoring system in place to alert us if we can observe a performance degradation. Usually those are fairly subtle issues, but we got an alert on the following scenario. As you can see, we are seeing a big degradation of this test.

The actual test in question is doing a low level manipulation of Voron (RavenDB’s storage engine), and as such, stand at the core of our performance hotspots to watch for.

Looking at the commits around that time frame, we quickly narrow the fault down to the following changes:

image

A really important observation here, however, is that this method is not called in the test. So we were looking at whatever this change caused a regression in code generation. I couldn’t believe that this is the case, to be honest.

Indeed, looking at the generated assembly, there was no difference between the two versions. But something cause the performance to degrade significantly enough for this test that it raised all sorts of alarm bells.

We started looking into a lot of details about the system, the usual things like checking for thermal throttling, etc.

We struck gold on this command: sudo smartctl --all /dev/nvme0n1

Take a look at the results:

    SMART overall-health self-assessment test result: FAILED!
    - NVM subsystem reliability has been degraded
    SMART/Health Information (NVMe Log 0x02, NSID 0x1)
    Critical Warning:                   0x04
    Temperature:                        35 Celsius
    Available Spare:                    100%
    Available Spare Threshold:          10%
    Percentage Used:                    115%
    Data Units Read:                    462,613,897 [236 TB]
    Data Units Written:                 2,100,668,468 [1.07 PB]
    Host Read Commands:                 10,355,495,581
    Host Write Commands:                9,695,954,131
    Controller Busy Time:               70,777

In other words, the disk is literally crying at us. This tells us that the drive has been in action for ~50 days of actual activity and that it has gone beyond is design specs.

In particular, you can see that we wrote over a petabyte of data to the disk as part of our test case executions. This is a 500GB drive, which means that we fill it to capacity over 2,000 times before we hit this issue.

Once we hit this capacity (Percentage Used is > 100%), the drive needs to do a lot more work, so we are seeing longer test times.

First time that I closed a bug by sending a PO to get new hardware, I got to admit.

time to read 4 min | 663 words

This is a tale of two options that we took for an exhaustive test. Amazon recently came out with a new disk type on the cloud. As a database vendor, that is of immediate interest to me, so we took a deep look into that.

GP3 disks are about 20% cheaper than their GP2 equivalent. What is more, they come with a guarantee level of performance even before you purchase additional IOPS. Consider the following two disks:

  Size IOPS MB/S Price
GP2 512GB 1,536 250 51.2 USD
GP3 512GB 3,000 125 40.9 USD
GP3 512GB 4,075 250 51.2 USD

In other words, for the same disk, we can get a much better baseline performance at a cheaper price. What isn’t there not to like?

The major difference between GP2 and GP3, however, is their latency. In practice, we see an additional 1 – 2 milliseconds in response times from the GP3 disk vs. the GP2 disk. In other words, GP3 disks are somewhat slower, even if they are able to run more IOPS, their latency is higher.

A really key observation from us, however, is that GP3 does not offer burst I/O capabilities. And that means that I can breath a huge sigh of relief.

RavenDB as a database is meant to run on anything from an SD card to HDD to SSD to NVMe drives. We are used to account for the I/O being the slowest thing around and have already mostly coded around that. An additional millisecond in disk latency doesn’t matter that much in the grand scheme of things.

However… the fact that this doesn’t provide I/O burst is a huge plus for us. RavenDB can easily deal with slow I/O, what it find it very hard to deal with is an environment that very rapidly change its operational characteristics.

Let’s assume that we have a 100 GB GP2 disk, which means that we have a baseline of around 300 IOPS and 75MB / sec of throughput. RavenDB is under some high load, and it is using the maximum capabilities of the hardware. However, because of burstiness, we are actually able to utilize 3,000 IOPS and 250MB/sec for a while.

Until all the I/O credits are gone and we are forced into a screeching halt. That means, for example, that we read from the network at a rate of 250MB/sec, but we are unable to write to the disk at this level. There is a negative balance of 125MB/sec that needs to be stored some where. We can buffer that in memory, of course, but that only work for so long. That means that we have to put a huge break all of a sudden, which the rest of the eco system isn’t happy with. For example, the other side that is sending us data at 250MB /sec, they are likely not going to be able to respond in time to the shift is our behavior. It is very likely that the network connection would congest and break in this case.

All of the internal optimizations inside of RavenDB will also be skewed for a while, until we are used to the new level of speed. If this was gradual, we could adjust a lot more easily, but this is basically like hitting the brakes at speed. You will slow down, sure, but you are also likely to cause an accident.

As a simple example, RavenDB can compress the data that it writes to disk, and it balances the compression ratio vs. the cost to write to the disk. If we know that the disk is slow, we can spend more time trying to reduce the amount of data we write. If this changes rapidly, we are operating under the old assumptions and may create a true traffic jam

The fact that GP3 disks have a predictable performance profile means that we are much better suited to run on them. A more predictable platform from which to operate gives me a much better opportunity to handle optimizations.

time to read 2 min | 220 words

imageWe just published a white paper on RavenDB performance vs. Couchbase performance in a real customer scenario.

I had to check the results three times before I believed them. RavenDB is pretty awesome, but I had no idea it was that awesome.

The data set was reasonably big, 1.35 billion docs and the scenario we present is a real world one based on production load.

Some of the interesting details:

  • RavenDB uses 1/3 of the disk space that Couchbase uses, but stores 3 times as much data.
  • Operationally, RavenDB just worked, Couchbase needed 6 times the hardware to just scrape by. A single failure in Couchbase meant at least 15 – 45 minutes for the node to recover. Inducing failures in RavenDB brought the node back up in a few seconds.
  • For queries, we pitted a Couchbase cluster with 96 cores and 384 GB RAM against single RavenDB node running on a Raspberry PI. RavenDB on the Pi was able to sustain better latencies at the 99 percentile handling twice as much load as Couchbase is able.

There are all sort of other goodies in the white paper and we went pretty deep into the overall architecture and impact of the difference design decisions.

As usual, we welcome your feedback.

time to read 3 min | 472 words

One of the “fun” aspects of running in the cloud is the fact that certain assumptions that you take for granted are broken, sometimes seriously so. Today post is about an issue a customer run into in the cloud. They were seeing some cases of high latency of operations from RavenDB. In the cloud, the usual answer is to provision more resources, but we generally recommend that only when we can show that the load is much higher than expected to be handled on the hardware.

The customer was running on a cluster with disk that were provisioned with 1,000 IOPS and 120 MB /sec, that isn’t a huge amount, but it is certainly respectable. Looking at the load, we can see fairly constant writes and the number of indexes is around 30. Looking at the disk, we can see that we are stalling there, the queue length is very high and the disk latency has a user visible impact.

All told, we would expect to see a significant amount of I/O operations as a result of that, but the fact that we hit the limits of the provisioned IOPS was worth a second look. We started pulling at the details and it became clear that there was something that we could do about it. During indexing, we create some temporary files to store the Lucene segments before we commit them to the index. Each indexing run can create between four and six such files. When we create them, we do that with the flag DeleteOnClose, this is a flag that exists on Windows, but not on Linux. On Linux, we are running on ext4 with journaling enabled, which means that each file system metadata modification requires a journal write at the file system level. Those temporary files live for a very short amount of time, however. We delete them on close, after all, and the indexing run is very short.

6 files per index times 30 indexes means 180 files. Each one of those will be created and destroyed (generating a journal event each time) and there is a constant low volume of writes. That means that there are 360 IOPS at the file system level just because of this issue.

The fix for that was two folds. First, for small files, under 128KB, we never hit the disk. We can keep them completely in memory. For larger file, we want to avoid using too much memory, so we spill them to disk, but instead of creating new files each time, we’ll reuse them between indexing run.

The end result is that we are issuing fewer I/O operations, reducing the amount of trivial IOPS we consume and can get a lot more done with the same hardware. The actual fix is fairly small and targeted, but the impact is felt across the entire system.

time to read 4 min | 666 words

I recently got my hands on a the Raspberry PI 400 (the one that comes in a keyboard form). That is an amazing idea and it make the Raspberry a lot more approachable for consumer cases.

At any rate, one of my first actions was to put RavenDB on it and see how well it performs. You can see the results in the image below.

image

In this case, we are running 1,500 queries per second on the system. It has 4 GB of RAM and the database we are using has 450 GB (!) worth of data. I actually just took the nearest external disk I had available and plugged that into the PI. This is a generic hard disk and I can get a maximum of about 30 MB / sec from it.

This is important because my queries are covering more data than can fit in memory. Each query asks for a random (different) document, so there is little chance for optimizations by having a hot working set. We are going to see some I/O to the (pretty poor) disk impacting the outcome. Here are the results:

image

You can see that the for 95% of the queries, we got a result in under 125 milliseconds and that for 99% of the requests, RavenDB on a Raspberry PI is able to answer in about half a second.  And even with some of the requests having to hit the disk, the maximum number of time to wait for a request is just above a second. All of that when we are facing 1,500 queries per second, which is respectable even for big applications running on much more massive hardware.

Of particular interest to me is the state of the server when we are running this benchmark. You can see that both in terms of CPU utilization and in the number of queries processed, we are nearly absolutely flat. There aren’t any hiccups in the load, there haven’t been a GC pause that stopped the world and the system just runs at top speed for as long as we’ll let it. In this case, the benchmark lasted over 5 minutes, so more than enough time to run through all the usual suspects.

Note also the number of documents involved here. We are looking at 882 million documents. And we are requesting close to half a million of them. I run the benchmark long enough to ensure that we will cover more documents than can be fit into memory, so we are seeing I/O work here (on a fairly poor disk, I might add, but that is what I had available at the moment).

The actual size of disk is a bit of a cheat, I’m using documents compression here to pack the data more tightly. The actual data size, without using RavenDB data compression is around 750GB. That also helps a lot with the amount of I/O we have to deal with, but it increase the CPU consumption. Given the difference in relative costs, that is a task that is paying dividends in spades.

I also decided to see what we can look at when we are running a query that touches just a small part of the documents. Instead of working through nearly half a million, I chose to run it on about 100,000 documents. That is small enough that it should mostly all fit in memory. It also represent a far more likely scenario, to be frank.

image

And here we can see that we get all requests, under 1,500 queries per second on a Raspberry PI in under 150 ms, with the 99.999% (!!) percentile in about 50 milliseconds.

And that makes me very happy, because it shows the result of all the work we put into optimizing RavenDB.

time to read 3 min | 420 words

We have been working on a big benchmark of RavenDB recently. The data size that we are working on is beyond the TB range and we are dealing with over a billion documents. Working with such data sizes can be frustrating, because it takes quite a bit of time for certain things to complete. Since I had downtime while I was waiting for the data to load, I reached to a new toy I just got, a Raspberry PI 400. Basically, a Raspberry Pi 4 that is embedded inside a keyboard. It is a pretty cool machine and an awesome thing to play around with:

image

Naturally, I had to try it out with RavenDB. We have had support on running on ARM devices for a long while now, and we have dome some performance work on the Raspberry PI 3. There are actually a whole bunch of customers that are using RavenDB in production on the Pi. These range from embedding RavenDB in industrial robots, using RavenDB to store traffic analysis data on vehicles and deploying Raspberry PI servers to manage fleets of IoT sensors in remote locations.

The Raspberry PI 4 is supposedly much more powerful and I got the model with 4GB of RAM to play around with. And since I already had a data set that hit the TB ranges lying around, I decided to see what we could do with both of those.

I scrounged an external hard disk that we had lying around that had sufficient capacity and started the import process. This is where we are after a few minutes:

image

A couple of things to notice about this. At this point the import process is running for about two and half minutes and imported about 4 million documents. I want to emphasize that this is running on an HDD (and a fairly old one at that). Currently I can feel its vibrations on the table, so we are definitely I/O limited there.

Once I’ll be done with the data load (which I expect to take a couple of days), we’ll be testing this with queries. Should be quite fun to compare the costs of this to a cloud instance. Given typical cloud machines, we can probably cover the costs of the PI in a few daysSmile.

time to read 3 min | 563 words

During benchmark run on a large dataset, I started to notice that longer benchmarks were showing decidedly worse numbers than short ones. In other words, a benchmark that is run for 1 minute is orders of magnitude higher latencies than a benchmark that is run for 30 seconds. And the longer the benchmark, the worst things off.

That raised a lot of red flags, and spawn a pretty serious investigation. We take performance very seriously and the observed behavior was that we were getting slower over time. We suspected a leak, high number of GC calls, or memory fragmentation. The scenario under test was a web application using the RavenDB API to talk to RavenDB. We run both the web application and the server under profilers and found a few hot spots, but nothing really major. There was no smoking gun.

Then we noticed that the load testing  machine was sitting there with 100% CPU. I initially thought that this is us generating too much load for the machine, but that wasn’t it. We are using wrk2, which is capable of handling million of requests per seconds.

We were generating the requests dynamically using a Lua script, and in one of the scenarios under test, we have code like this:

path = "/orders/user/" .. page * pageSize .. "/" .. pageSize .. "/?userId=" .. item.id .. "&deep=y"

That isn’t the most optimal way to do things, I’ll admit. We can do better by using something like table.concat(), but the problem was that regardless of how you build the string, this is supposed to be fairly cheap. The wrk2 project is using LuaJIT, which has a reputation as a really scripting system. I never really thought that this would be a problem. Sure, it is a little wasteful, but it isn’t too bad, a few string temporaries and maybe some realloc() calls, but nothing major.

Instead, this resulted in us getting far worse results over time. It took a while to actually figure out why, but the root cause is in the way LuaJIT handles string hashing.

a = lj_getu32(str);
h ^= lj_getu32(str+len-4); b = lj_getu32(str+(len>>1)-2); h ^= b; h -= lj_rol(b, 14); b += lj_getu32(str+(len>>2)-1);

Strings in Lua are interned, which means that there is just a single copy of a string per value. That means that hashing is important, but the way it does hashing is to take the first 4 bytes, the last 4 bytes and the 4 bytes in the middle and use that for a hash. And that is it.

If you have a bunch of strings where those 3 locations match… well, welcome to hash collisions. At which point, what is supposed to be a O(1) call becomes an O(N) call. And creating strings will turn the operations into an O(N^2) operation!

Here is the reproduction code:

Change the prefix to be an empty string for a major performance boost. The actual bug is well known (5 or 6 years), but it was only recently fixed and not on the version that wrk2 is using.

We had to toss out the entire benchmarking set and start over because of this.

We were generating requests with random data, so some of them would hit this problem hard, and some would avoid it by magic. I was not expecting to debug hash collision in Lua code while trying to get some performance numbers from overloading RavenDB, quite random, literally.

time to read 4 min | 742 words

This is part of the same issue as the previous post. I was tracking a performance regression between RavenDB 4.1 and RavenDB 4.2, there was a roughly 8% performance difference between the two (favoring the older version) which was reported to us. The scenario was very large and complex documents (hundreds of objects in a document, > 16KB of JSON each).

The two code bases have had a lot of (small) changes between them, so it was hard to figure out exactly what was the root cause for the regression. Eventually I found something utterly bizarre. One of the things that we have to do when you save a document is check if the document has been modified. If so, we need to save it, otherwise, we can skip it. Here is the relevant piece of code in 4.1:

image

So this costs 0.5 ms (for very large documents), seems perfectly reasonable. But when looking at this on 4.2, we have:

image

This cost six times as much, what the hell?! To clarify, Blittable is the name of the document format that RavenDB uses. It is a binary JSON format that is highly efficient. You can think about this as comparing two JSON documents, because this is what it is doing.

I mentioned that there are differences between these versions? There have been quite a few  (thousands of commits worth), but this particular bit of code hadn’t changed in years. I just couldn’t figure out what was going on. Then I looked deeper. Here are the cost of these calls. Here is the 4.1 version:

image

And here is the 4.2 version:

image

There are a few interesting things here. First, we can see that we are using Enumerable.Contains and that is where most of the time goes. But much more interesting, in 4.1, we are calling this method a total of 30,000 times. In 4.2, we are calling it 150,000 times!!! Note that CompareBlittable is recursive, so even though we call it on 10,000 documents, we get more calls. But why the difference between these version?

I compared the code for these two functions, and they were virtually identical. In 4.2, we mostly change some error message texts, nothing major, but somehow the performance was so different. It took a while to figure out that there was another difference. In 4.1, we checked the changes in the documents in the order of the properties on the document, but on 4.2, we optimized things slightly and just used the index of the property. A feature of the blittable format is that properties are lexically sorted.

Here is the document in question, in our test, we are modifying Property6, as you can see here:

image

There are a total of 40 properties in this document. And much nesting. In this case, in 4.2, we are scanning for changes in the document using the lexical sorting, which means:

image

The CompareBlittable() function will exit on the first change it detect, and in 4.1, it will get to the changed Property6 very quickly. On 4.2, it will need to scan most of the (big) document before it find a change. That is a large part of the cost difference between these versions.

Now that I know what the issue is, we have to consider whatever behavior is better for us. I decided to use the order of inserted properties, instead of the lexical order. The reasoning is simple. If a user care about that, they can much more easily change the order of properties in the document than the names of the properties. In C#, you can just change the order the properties shows up in the class definition.

I have to say, this was much harder to figure out than I expected, because the change happened in a completely different location and was very much none obvious in how it worked.

time to read 4 min | 605 words

We were looking at the cost of writing 10,000 documents to RavenDB and found out something truly interesting. The documents in question are complex, this is an object graph that includes over 120 objects belonging to 20 different classes. The serialized JSON is over 16KB in size. It is expected that something like that would take a lot of time.

Here is the results under the profiler:

image

Given the size involved, taking (under the profiler), just under 4.2 ms per document isn’t bad, I thought. In general, we consider the cost of JSON serialization as pretty fixed and ignore it. This time, I looked little bit deeper and found this:

image

Note that this is deep in the call stack. But it is a pretty expensive call, I decided to look at all the calls that we had for this method, which gave me:

image

A few things that I would like to note here. The time this single method takes is roughly 50% of the total time it takes to serialize the documents. In fact, I checked what would happen if I could remove this cost:

image

You can insert your favorite profanity here, I usually go with Pasta!

Looking at the stack trace, it is clear what is going on. RavenDB has a bunch of customizations for the default way documents are serialized. And JSON.Net will call the CanConvet() method on the converters on each new object that it is about to covert.

Given the level of complexity in these documents (over 450 values), that means that we would call this a lot. And the cost of finding out if we need to check convert the value or not dominated this kind of operation completely.

I wrote the following converter, instead:

As you can see, the core of the fix is in the CanConvertInternal() method, there I’m checking the internal converters and caching the results. Note that I’m intentionally not using any thread safe operations here, even though this is meant to be a shared instance. The idea is that on each new item, we’ll create a new copy of the cache, and we can afford to waste the memory / work to compute the cache if we can reduce the common cost of verifying the types on each call.

We are actually expecting to lose the race conditions here, which is fine. After a very short while, we are going to be stabilized and not have to pay any additional costs.

Here is the result, you can see that we are taking about half of the previous run time.

image

And if we look at just the GetMatchingConverter() call:

image

I actually had to end up with two such calls, because I have some converters that are only for writes. Given the huge performance improvement, we just created two converters to handle the caching.

Note that without a profiler, I would have never been able to figure something like this out, and even with the profiler, I had to dig deep and it took a while to narrow down where the costs actually were. I’m very surprised.

FUTURE POSTS

  1. The cost of the cloud - about one day from now
  2. Installing RavenDB on a Ubuntu machine - 3 days from now

There are posts all the way to Jun 22, 2021

RECENT SERIES

  1. Challenge (58):
    16 Jun 2021 - Detecting livelihood in a distributed cluster
  2. Webinar (4):
    11 Jun 2021 - Machine Learning and Time Series in RavenDB with Live Examples
  3. Webinar recording (13):
    24 May 2021 - The Rewards of Escaping the Relational Mindset
  4. Building a phone book (3):
    02 Apr 2021 - Part III
  5. Building a social media platform without going bankrupt (10):
    05 Feb 2021 - Part X–Optimizing for whales
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats