time to read 5 min | 945 words

A common question that is raised by customers is how to determine what kind of hardware you need to run RavenDB on. I’m sorry, but the answer is: it depends. There are a lot of variables to juggle, so in this post, I’m going to try to give some insight into the sort of things you should consider when sizing your instances.

In general, you have three axes that you can work with: CPU, memory and I/O. In terms of the best bang for the buck, optimizing I/O is usually the way to go and will pay the most dividends. This is because most of the time, RavenDB will be bottlenecked on I/O. This is especially true when you are running on the cloud, where 500 IOPS is a fairly common default (which is basically zilch to a database).

To give a more concrete answer we’ll need more details. Let’s say that you have an application with a database per customer (common for multi-tenant scenarios). The structure of each database is the same, but each one contains data belonging to a single customer. A database has 20 indexes in total: 15 map / full-text search indexes as well as 5 map-reduce / facets indexes. There are also a few ETL tasks and a couple of subscriptions per database for background work.

Let’s break down the load on a single server in this setup, shall we?

  • 100 databases (meaning 100 tx merger threads for I/O).
  • 2,000 indexes - 20 indexes x 100 databases (meaning 2,000 indexing threads).

Across the cluster, we also have:

  • 500 ETL tasks – 5 per database x 100
  • 200 subscriptions – 2 per database x 100, you get the idea

The latter items are spread fairly evenly among all the nodes that you have, but the first two are present on every node in the cluster.
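
To make the arithmetic concrete, here is a quick back-of-the-envelope tally, as a Python sketch using the numbers above (the three node cluster size is my assumption, just to show the spread):

    # Back-of-the-envelope tally for the hypothetical deployment described above.
    databases = 100                # databases hosted on each node
    indexes_per_db = 20            # 15 map / full-text + 5 map-reduce / facets
    etl_per_db = 5                 # ETL tasks per database (cluster-wide total)
    subscriptions_per_db = 2       # subscriptions per database (cluster-wide total)
    nodes = 3                      # assumption, just to show the spread

    tx_merger_threads = databases                      # one tx merger thread per database
    indexing_threads = databases * indexes_per_db      # one thread per index
    print(tx_merger_threads + indexing_threads)        # 2,100 threads on every node

    etl_tasks = databases * etl_per_db                 # 500, spread across the cluster
    subscriptions = databases * subscriptions_per_db   # 200, spread across the cluster
    print((etl_tasks + subscriptions) // nodes)        # ~233 background tasks per node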

What does this mean? That we have 2,100 threads active at any given point in time? Well, that is where things get a bit complex. We need to know more than just the raw numbers; we need to understand usage.

How many of those databases are active at any given point in time? In a multi tenant system, it is common to have many customers using the system sporadically, which can allow you to pack a lot more instances into the same hardware resources.

Of more interest, however, is usually the rate of writes. Here we need to ask ourselves what the write rate is as well. In general, for reads RavenDB will load all the relevant items into memory and serve directly from there. For writes, given its durable nature, RavenDB must hit the disk. And the question now becomes: how many databases are being written to at the same time?

This is important, because 10 writes per second to a single database are far better than 10 writes / second spread across 10 databases. This is because RavenDB is able to batch I/O for a single database, but not across databases. Let’s consider a scenario where each write impacts 5 indexes in the database. What is going to happen when we have 10 writes / sec to a single database?

  • 1 – 5 writes to the disk for the actual document writes (this depends on a lot of factors, and assumes that we are talking about concurrent requests here).
  • 5 – 10 index updates: 1 – 2 index updates x 5 relevant indexes (in most cases, we are able to batch index writes even better than document writes).

Total number of writes to disk: 6 – 15 writes.

However, what if we take the same scenario, but now run it across 10 databases, each receiving a single write? There is no way for us to batch updates, so we’ll have:

  • 10 databases x (1 document write + 5 index updates) = 60 writes to disk.

If the number of relevant indexes is high, or if there are more databases involved, it is easy to hit the limits of I/O, especially on the cloud.
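
To put numbers on this, here is a small sketch (Python, using the ranges from the example above) that contrasts the two cases:

    # Rough model of the batching effect described above; the ranges are the
    # ones from the example, the exact numbers depend on concurrency and timing.
    relevant_indexes = 5
    writes_per_second = 10

    # Case 1: all 10 writes hit a single database, so RavenDB can batch them.
    doc_writes_low, doc_writes_high = 1, 5              # document journal writes
    index_low = 1 * relevant_indexes                    # 1 - 2 updates per relevant index
    index_high = 2 * relevant_indexes
    print(doc_writes_low + index_low, doc_writes_high + index_high)   # 6 15

    # Case 2: the same 10 writes spread across 10 databases, one write each.
    # No batching is possible, so every database pays the full price.
    per_database = 1 + relevant_indexes                 # 1 document write + 5 index updates
    print(writes_per_second * per_database)             # 60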

I’m actually painting a somewhat bleak picture; in most cases you don’t have to worry too much about those details, RavenDB will take care of that for you. However, when you need to consider sizing, you want to be aware of the possible load that you’ll have. Ironically enough, if you have enough load, RavenDB is able to really optimize things; it is when you have sporadic operations spread across many locations that we start putting a lot of load on the underlying system.

So far, I was talking about I/O only, but there are other factors as well. Let’s assume that you are running 100 databases with 20 indexes each on a system with 4 cores. How is RavenDB going to split the load across the system?

The first priority is going to be given to processing requests, and only then will we start running indexes. That is by design, to ensure that we won’t overwhelm the underlying system by issuing too much work all at once. That means that we’ll round-robin the work across all the indexes that want to run, while keeping enough capacity to process user requests. In this case, more cores will allow us a higher degree of parallelism, but if you have an unbalanced system (a lot of CPU but slow I/O), you’re going to see stalls because we’ll spend a lot of time waiting for I/O.
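
Here is a very rough sketch of that scheduling idea, just to illustrate the shape of it (this is not RavenDB’s actual code, and the reserved capacity is my assumption):

    from collections import deque

    cores = 4
    reserved_for_requests = 2      # assumed: cores kept free for user requests

    # Every index that has pending work gets in line; we run a few at a time
    # and send them to the back of the line, round-robin style.
    pending = deque(f"db-{d}/index-{i}" for d in range(3) for i in range(20))

    def indexing_round():
        budget = cores - reserved_for_requests
        batch = []
        for _ in range(min(budget, len(pending))):
            index = pending.popleft()
            batch.append(index)
            pending.append(index)  # back of the line until its next turn
        return batch

    print(indexing_round())   # ['db-0/index-0', 'db-0/index-1']
    print(indexing_round())   # ['db-0/index-2', 'db-0/index-3']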

In short, you need to have a fair idea about how your system is going to be used. If you don’t have at least a good guess on the topic, you are probably better off getting more I/O bandwidth than anything else. RavenDB continuously monitors itself and will alert you if there are resource issues. You are then able to shore up whatever is lacking to get the best system performance.

time to read 1 min | 88 words

I posted a few weeks ago about a performance regression in our metrics that we tracked down to the disk being exhausted.

We replaced the hard disk with a new one; can you see what the results were?

[image: the metrics after the disk was replaced]

This is mostly because we were pretty sure that this was the problem, but we couldn’t rule out that it was something else. Good to know that we were on track.

time to read 4 min | 660 words

We care a lot about the performance of RavenDB.

Aside from putting a lot of time and effort into ensuring that RavenDB uses optimal code, we also have a monitoring system in place to alert us if we observe a performance degradation. Usually those are fairly subtle issues, but in this case we got an alert showing a big degradation in this test.

The actual test in question does low level manipulation of Voron (RavenDB’s storage engine), and as such, it sits at the core of the performance hotspots we watch for.

Looking at the commits around that time frame, we quickly narrowed the fault down to the following changes:

[image: the suspect commit]

A really important observation here, however, is that this method is not called in the test. So we were looking at whether this change caused a regression in code generation. I couldn’t believe that this was the case, to be honest.

Indeed, looking at the generated assembly, there was no difference between the two versions. But something caused the performance to degrade significantly enough in this test that it raised all sorts of alarm bells.

We started looking into a lot of details about the system, the usual things like checking for thermal throttling, etc.

We struck gold on this command: sudo smartctl --all /dev/nvme0n1

Take a look at the results:

    SMART overall-health self-assessment test result: FAILED!
    - NVM subsystem reliability has been degraded
    SMART/Health Information (NVMe Log 0x02, NSID 0x1)
    Critical Warning:                   0x04
    Temperature:                        35 Celsius
    Available Spare:                    100%
    Available Spare Threshold:          10%
    Percentage Used:                    115%
    Data Units Read:                    462,613,897 [236 TB]
    Data Units Written:                 2,100,668,468 [1.07 PB]
    Host Read Commands:                 10,355,495,581
    Host Write Commands:                9,695,954,131
    Controller Busy Time:               70,777

In other words, the disk is literally crying at us. This tells us that the drive has been in action for ~50 days of actual activity and that it has gone beyond its design specs.

In particular, you can see that we wrote over a petabyte of data to the disk as part of our test case executions. This is a 500GB drive, which means that we filled it to capacity over 2,000 times before we hit this issue.

Once we hit this capacity (Percentage Used is > 100%), the drive needs to do a lot more work, so we are seeing longer test times.
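
The arithmetic is straightforward; here is the back-of-the-envelope version (NVMe “data units” are reported in thousands of 512-byte sectors, and the controller busy time is in minutes):

    # Numbers taken from the smartctl output above.
    data_units_written = 2_100_668_468
    bytes_written = data_units_written * 512_000          # data units = 1,000 x 512 bytes
    print(bytes_written / 10**15)                         # ~1.07 PB, matching the smartctl output

    drive_capacity_bytes = 500 * 10**9                    # a 500GB drive
    print(bytes_written / drive_capacity_bytes)           # filled ~2,150 times

    controller_busy_minutes = 70_777
    print(controller_busy_minutes / 60 / 24)              # ~49 days of actual activity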

I’ve got to admit, this is the first time that I closed a bug by sending a PO for new hardware.

time to read 4 min | 663 words

This is a tale of two options that we took for an exhaustive test. Amazon recently came out with a new disk type on the cloud. As a database vendor, that is of immediate interest to me, so we took a deep look into that.

GP3 disks are about 20% cheaper than their GP2 equivalents. What is more, they come with a guaranteed level of performance even before you purchase additional IOPS. Consider the following two disks:

           Size     IOPS     MB/sec   Price
    GP2    512GB    1,536    250      51.2 USD
    GP3    512GB    3,000    125      40.9 USD
    GP3    512GB    4,075    250      51.2 USD

In other words, for the same disk size, we can get much better baseline performance at a cheaper price. What’s not to like?

The major difference between GP2 and GP3, however, is their latency. In practice, we see an additional 1 – 2 milliseconds in response times from the GP3 disk vs. the GP2 disk. In other words, GP3 disks are somewhat slower: even though they are able to run more IOPS, their latency is higher.

A really key observation for us, however, is that GP3 does not offer burst I/O capabilities. And that means that I can breathe a huge sigh of relief.

RavenDB as a database is meant to run on anything from an SD card to HDD to SSD to NVMe drives. We are used to accounting for I/O being the slowest thing around and have already mostly coded around that. An additional millisecond of disk latency doesn’t matter that much in the grand scheme of things.

However… the fact that GP3 doesn’t provide I/O bursts is a huge plus for us. RavenDB can easily deal with slow I/O; what it finds very hard to deal with is an environment that rapidly changes its operational characteristics.

Let’s assume that we have a 100 GB GP2 disk, which means that we have a baseline of around 300 IOPS and 75MB / sec of throughput. RavenDB is under some high load, and it is using the maximum capabilities of the hardware. However, because of burstiness, we are actually able to utilize 3,000 IOPS and 250MB/sec for a while.

Until all the I/O credits are gone and we are forced to a screeching halt. That means, for example, that we are reading from the network at a rate of 250MB/sec, but we are unable to write to the disk at that level. There is a negative balance of 125MB/sec that needs to be stored somewhere. We can buffer that in memory, of course, but that only works for so long. That means that we have to hit the brakes all of a sudden, which the rest of the ecosystem isn’t happy with. The other side that is sending us data at 250MB/sec, for example, is likely not going to be able to respond in time to the shift in our behavior. It is very likely that the network connection would congest and break in this case.
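
Here is a simple way to see how quickly this turns into a problem, using the numbers above (the memory budget is an assumption, just to show how little time buffering in RAM buys you):

    incoming_mb_per_sec = 250      # data still arriving over the network
    deficit_mb_per_sec = 125       # what we can no longer flush once the credits are gone
    memory_budget_mb = 2_048       # assumed RAM we are willing to spend on buffering

    seconds_of_grace = memory_budget_mb / deficit_mb_per_sec
    print(seconds_of_grace)        # ~16 seconds before we must stall the sender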

All of the internal optimizations inside of RavenDB will also be skewed for a while, until we are used to the new level of speed. If this was gradual, we could adjust a lot more easily, but this is basically like hitting the brakes at speed. You will slow down, sure, but you are also likely to cause an accident.

As a simple example, RavenDB can compress the data that it writes to disk, and it balances the compression ratio vs. the cost of writing to the disk. If we know that the disk is slow, we can spend more time trying to reduce the amount of data we write. If this changes rapidly, we are operating under the old assumptions and may create a real traffic jam.
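
As an illustration of that kind of trade-off (the thresholds and levels here are made up; this isn’t RavenDB’s actual logic), a decision like this only works well if the disk’s behavior is stable:

    def pick_compression_level(observed_disk_mb_per_sec: float) -> int:
        # Spend more CPU shrinking the data when the disk is the bottleneck.
        if observed_disk_mb_per_sec < 50:
            return 9    # slow disk: every byte saved matters
        if observed_disk_mb_per_sec < 200:
            return 5    # balanced
        return 1        # fast disk: keep the CPU for serving requests

    print(pick_compression_level(75))    # 5 - tuned for the baseline speed
    print(pick_compression_level(250))   # 1 - what we *should* use during a burst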

The fact that GP3 disks have a predictable performance profile means that we are much better suited to run on them. A more predictable platform from which to operate gives me a much better opportunity to handle optimizations.

time to read 2 min | 220 words

We just published a white paper on RavenDB performance vs. Couchbase performance in a real customer scenario.

I had to check the results three times before I believed them. RavenDB is pretty awesome, but I had no idea it was that awesome.

The data set was reasonably big: 1.35 billion docs, and the scenario we present is a real-world one based on production load.

Some of the interesting details:

  • RavenDB uses 1/3 of the disk space that Couchbase uses, but stores 3 times as much data.
  • Operationally, RavenDB just worked, while Couchbase needed 6 times the hardware to just scrape by. A single failure in Couchbase meant at least 15 – 45 minutes for the node to recover. Inducing failures in RavenDB brought the node back up in a few seconds.
  • For queries, we pitted a Couchbase cluster with 96 cores and 384 GB RAM against a single RavenDB node running on a Raspberry PI. RavenDB on the PI was able to sustain better latencies at the 99th percentile while handling twice as much load as Couchbase could.

There are all sorts of other goodies in the white paper, and we went pretty deep into the overall architecture and the impact of the different design decisions.

As usual, we welcome your feedback.

time to read 3 min | 472 words

One of the “fun” aspects of running in the cloud is the fact that certain assumptions that you take for granted are broken, sometimes seriously so. Today’s post is about an issue a customer ran into in the cloud. They were seeing some cases of high latency of operations from RavenDB. In the cloud, the usual answer is to provision more resources, but we generally recommend that only when we can show that the load is higher than the hardware should be expected to handle.

The customer was running on a cluster with disks that were provisioned with 1,000 IOPS and 120 MB/sec; that isn’t a huge amount, but it is certainly respectable. Looking at the load, we can see fairly constant writes, and the number of indexes is around 30. Looking at the disk, we can see that we are stalling there: the queue length is very high and the disk latency has a user-visible impact.

All told, we would expect to see a significant amount of I/O operations as a result of that, but the fact that we hit the limits of the provisioned IOPS was worth a second look. We started pulling at the details and it became clear that there was something that we could do about it. During indexing, we create some temporary files to store the Lucene segments before we commit them to the index. Each indexing run can create between four and six such files. When we create them, we do that with the DeleteOnClose flag; this is a flag that exists on Windows, but not on Linux. On Linux, we are running on ext4 with journaling enabled, which means that each file system metadata modification requires a journal write at the file system level. Those temporary files live for a very short amount of time, however. We delete them on close, after all, and the indexing run is very short.

6 files per index times 30 indexes means 180 files. Each one of those will be created and destroyed (generating a journal event each time) and there is a constant low volume of writes. That means that there are 360 IOPS at the file system level just because of this issue.

The fix for that was twofold. First, for small files, under 128KB, we never hit the disk; we can keep them completely in memory. For larger files, we want to avoid using too much memory, so we spill them to disk, but instead of creating new files each time, we reuse them between indexing runs.
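
Conceptually, the approach looks something like the sketch below (a simplified illustration, not the actual RavenDB code):

    import io

    SMALL_FILE_LIMIT = 128 * 1024      # 128KB, as described above

    class TempOutput:
        """Keep small outputs in memory; spill big ones into a reusable scratch file."""
        def __init__(self, scratch_path="index.scratch"):
            self.scratch_path = scratch_path   # reused between indexing runs
            self.buffer = io.BytesIO()
            self.spilled = None

        def write(self, data: bytes):
            if self.spilled is None and self.buffer.tell() + len(data) > SMALL_FILE_LIMIT:
                # Too big for memory: reuse the scratch file instead of
                # creating (and later deleting) a brand new temporary file.
                self.spilled = open(self.scratch_path, "wb")
                self.spilled.write(self.buffer.getvalue())
            (self.spilled or self.buffer).write(data)

        def close(self):
            if self.spilled:
                self.spilled.close()           # the file sticks around for the next run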

The end result is that we are issuing fewer I/O operations, reducing the amount of trivial IOPS we consume and can get a lot more done with the same hardware. The actual fix is fairly small and targeted, but the impact is felt across the entire system.

time to read 4 min | 666 words

I recently got my hands on the Raspberry PI 400 (the one that comes in keyboard form). That is an amazing idea and it makes the Raspberry PI a lot more approachable for consumer use cases.

At any rate, one of my first actions was to put RavenDB on it and see how well it performs. You can see the results in the image below.

[image: RavenDB benchmark results on the Raspberry PI]

In this case, we are running 1,500 queries per second on the system. It has 4 GB of RAM and the database we are using has 450 GB (!) worth of data. I actually just took the nearest external disk I had available and plugged that into the PI. This is a generic hard disk and I can get a maximum of about 30 MB / sec from it.

This is important because my queries are covering more data than can fit in memory. Each query asks for a random (different) document, so there is little chance for optimizations by having a hot working set. We are going to see some I/O to the (pretty poor) disk impacting the outcome. Here are the results:

[image: query latency percentiles]

You can see that for 95% of the queries, we got a result in under 125 milliseconds, and that for 99% of the requests, RavenDB on a Raspberry PI is able to answer in about half a second. And even with some of the requests having to hit the disk, the maximum time to wait for a request is just above a second. All of that while we are facing 1,500 queries per second, which is respectable even for big applications running on much more massive hardware.

Of particular interest to me is the state of the server when we are running this benchmark. You can see that both in terms of CPU utilization and in the number of queries processed, we are nearly absolutely flat. There aren’t any hiccups in the load, there hasn’t been a GC pause that stopped the world, and the system just runs at top speed for as long as we’ll let it. In this case, the benchmark lasted over 5 minutes, so more than enough time to run through all the usual suspects.

Note also the number of documents involved here. We are looking at 882 million documents, and we are requesting close to half a million of them. I ran the benchmark long enough to ensure that we would cover more documents than can fit into memory, so we are seeing I/O work here (on a fairly poor disk, I might add, but that is what I had available at the moment).

The actual size on disk is a bit of a cheat; I’m using document compression here to pack the data more tightly. The actual data size, without RavenDB’s data compression, is around 750GB. That also helps a lot with the amount of I/O we have to deal with, but it increases the CPU consumption. Given the difference in relative costs, that is a trade that is paying dividends in spades.

I also decided to see what happens when we are running a query that touches just a small part of the documents. Instead of working through nearly half a million, I chose to run it on about 100,000 documents. That is small enough that it should mostly all fit in memory. It also represents a far more likely scenario, to be frank.

[image: latency percentiles for the smaller query]

And here we can see that, while handling 1,500 queries per second on a Raspberry PI, we answer all requests in under 150 ms, with the 99.999% (!!) percentile at about 50 milliseconds.

And that makes me very happy, because it shows the result of all the work we put into optimizing RavenDB.

time to read 3 min | 420 words

We have been working on a big benchmark of RavenDB recently. The data size that we are working on is beyond the TB range and we are dealing with over a billion documents. Working with such data sizes can be frustrating, because it takes quite a bit of time for certain things to complete. Since I had downtime while I was waiting for the data to load, I reached for a new toy I just got, a Raspberry PI 400. Basically, a Raspberry PI 4 that is embedded inside a keyboard. It is a pretty cool machine and an awesome thing to play around with:

[image: the Raspberry PI 400]

Naturally, I had to try it out with RavenDB. We have had support for running on ARM devices for a long while now, and we have done some performance work on the Raspberry PI 3. There are actually a whole bunch of customers that are using RavenDB in production on the PI. These range from embedding RavenDB in industrial robots, to using RavenDB to store traffic analysis data on vehicles, to deploying Raspberry PI servers to manage fleets of IoT sensors in remote locations.

The Raspberry PI 4 is supposedly much more powerful and I got the model with 4GB of RAM to play around with. And since I already had a data set that hit the TB ranges lying around, I decided to see what we could do with both of those.

I scrounged an external hard disk that we had lying around that had sufficient capacity and started the import process. This is where we are after a few minutes:

[image: the import progress after a few minutes]

A couple of things to notice about this. At this point the import process has been running for about two and a half minutes and has imported about 4 million documents. I want to emphasize that this is running on an HDD (and a fairly old one at that). Right now I can feel its vibrations on the table, so we are definitely I/O limited there.

Once I’m done with the data load (which I expect to take a couple of days), we’ll be testing this with queries. It should be quite fun to compare the costs of this to a cloud instance. Given typical cloud machines, we can probably cover the cost of the PI in a few days.

time to read 3 min | 563 words

During a benchmark run on a large dataset, I started to notice that longer benchmarks were showing decidedly worse numbers than short ones. In other words, a benchmark run for 1 minute showed orders of magnitude higher latencies than a benchmark run for 30 seconds. And the longer the benchmark, the worse things got.

That raised a lot of red flags and spawned a pretty serious investigation. We take performance very seriously, and the observed behavior was that we were getting slower over time. We suspected a leak, a high number of GC calls, or memory fragmentation. The scenario under test was a web application using the RavenDB API to talk to RavenDB. We ran both the web application and the server under profilers and found a few hot spots, but nothing really major. There was no smoking gun.

Then we noticed that the load testing machine was sitting there with 100% CPU. I initially thought that this was us generating too much load for the machine, but that wasn’t it. We are using wrk2, which is capable of handling millions of requests per second.

We were generating the requests dynamically using a Lua script, and in one of the scenarios under test, we had code like this:

path = "/orders/user/" .. page * pageSize .. "/" .. pageSize .. "/?userId=" .. item.id .. "&deep=y"

That isn’t the most optimal way to do things, I’ll admit. We could do better by using something like table.concat(), but the problem is that regardless of how you build the string, this is supposed to be fairly cheap. The wrk2 project is using LuaJIT, which has a reputation as a really fast scripting system. I never really thought that this would be a problem. Sure, it is a little wasteful, but it isn’t too bad: a few string temporaries and maybe some realloc() calls, but nothing major.

Instead, this resulted in us getting far worse results over time. It took a while to actually figure out why, but the root cause is in the way LuaJIT handles string hashing.

a = lj_getu32(str);
h ^= lj_getu32(str+len-4);
b = lj_getu32(str+(len>>1)-2);
h ^= b; h -= lj_rol(b, 14);
b += lj_getu32(str+(len>>2)-1);

Strings in Lua are interned, which means that there is just a single copy of a string per value. That makes hashing important, but the way LuaJIT does hashing is to take the first 4 bytes, the last 4 bytes and the 4 bytes in the middle and use those for the hash. And that is it.

If you have a bunch of strings where those 3 locations match… well, welcome to hash collisions. At that point, what is supposed to be an O(1) call becomes an O(N) call. And creating N such strings turns the whole operation into an O(N^2) one!
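
To see how bad this can get, here is a small illustration (in Python, with a deliberately simplified stand-in for the hash, not LuaJIT’s exact function) of what happens when only the first, middle and last few bytes are sampled:

    def sloppy_hash(s: bytes) -> int:
        # Stand-in for a hash that only samples the first, middle and last 4 bytes.
        n = len(s)
        return hash((s[:4], s[n // 2 - 2 : n // 2 + 2], s[-4:]))

    # Generated request paths that differ only in digits the hash never looks at.
    urls = [b"/orders/user/" + b"0" * 20 + str(i).zfill(4).encode() + b"0" * 20 + b"/?deep=y"
            for i in range(1000)]
    print(len({sloppy_hash(u) for u in urls}))   # 1 - every string lands in the same bucket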

Here is the reproduction code:

Change the prefix to an empty string for a major performance boost. The actual bug has been well known for a while (5 or 6 years), but it was only fixed recently, and not in the version that wrk2 is using.

We had to toss out the entire benchmarking set and start over because of this.

We were generating requests with random data, so some of them would hit this problem hard, and some would avoid it by luck. I was not expecting to debug hash collisions in Lua code while trying to get some performance numbers from overloading RavenDB. Quite random, literally.
