The performance regression odyssey

time to read 4 min | 660 words

We care a lot about the performance of RavenDB.

Aside from putting a lot of time and effort into ensuring that RavenDB uses optimal code, we also have a monitoring system in place that alerts us when it observes a performance degradation. Usually those alerts are about fairly subtle issues, but this time the alert showed a big degradation in a particular test.

The actual test in question does low-level manipulation of Voron (RavenDB’s storage engine), and as such, it sits at the core of the performance hotspots we watch for.

Looking at the commits around that time frame, we quickly narrowed the fault down to the following changes:

(screenshot of the suspect commit diff)

A really important observation here, however, is that this method is not called in the test at all. So we were looking at whether this change caused a regression in code generation. To be honest, I couldn’t believe that this was the case.

Indeed, looking at the generated assembly, there was no difference between the two versions. But something caused the performance to degrade significantly enough on this test to raise all sorts of alarm bells.

We started looking into a lot of details about the system, the usual things like checking for thermal throttling, etc.

We struck gold with this command: sudo smartctl --all /dev/nvme0n1

Take a look at the results:

    SMART overall-health self-assessment test result: FAILED!
    - NVM subsystem reliability has been degraded
    SMART/Health Information (NVMe Log 0x02, NSID 0x1)
    Critical Warning:                   0x04
    Temperature:                        35 Celsius
    Available Spare:                    100%
    Available Spare Threshold:          10%
    Percentage Used:                    115%
    Data Units Read:                    462,613,897 [236 TB]
    Data Units Written:                 2,100,668,468 [1.07 PB]
    Host Read Commands:                 10,355,495,581
    Host Write Commands:                9,695,954,131
    Controller Busy Time:               70,777

In other words, the disk is literally crying out at us. This tells us that the drive has been in action for roughly 50 days of actual busy time (the Controller Busy Time figure is in minutes) and that it has gone beyond its design specs.

In particular, you can see that we wrote over a petabyte of data to the disk as part of our test case executions. This is a 500GB drive, which means that we filled it to capacity over 2,000 times (1.07 PB / 500 GB ≈ 2,140) before we hit this issue.

Once we went past the drive’s rated write endurance (Percentage Used > 100%), the drive has to work a lot harder to service each write, which is why we are seeing longer test times.
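Had we been watching the drive’s wear level alongside the test timings, we would have caught this much earlier. Here is a minimal sketch (not from the original investigation) of the kind of check that could run on the benchmark machines, assuming smartmontools 7.0+ for the --json flag; the device path, threshold, and alerting are illustrative assumptions.

    # Hedged sketch: flag an NVMe drive whose wear level is getting close to its
    # rated endurance, using smartctl's JSON output (smartmontools >= 7.0).
    import json
    import subprocess
    import sys

    DEVICE = "/dev/nvme0n1"   # assumption: the same device as in the post
    WEAR_THRESHOLD = 90       # alert before Percentage Used reaches 100%

    def read_health(device: str) -> dict:
        # smartctl returns a non-zero exit code when SMART reports problems,
        # which is exactly the case we care about, so don't raise on it.
        out = subprocess.run(
            ["sudo", "smartctl", "--json", "--all", device],
            capture_output=True, text=True, check=False,
        )
        return json.loads(out.stdout)

    def main() -> int:
        health = read_health(DEVICE).get("nvme_smart_health_information_log", {})
        used = health.get("percentage_used")
        warning = health.get("critical_warning")
        if used is None:
            print(f"could not read wear level for {DEVICE}", file=sys.stderr)
            return 2
        if used >= WEAR_THRESHOLD or warning:
            print(f"{DEVICE}: percentage used = {used}%, critical warning = {warning}")
            return 1   # non-zero exit so cron / CI can raise an alert
        return 0

    if __name__ == "__main__":
        sys.exit(main())

The non-zero exit code is the whole point of the sketch: wired into the same monitoring that flagged the test regression, it would have pointed at the hardware instead of the commit history.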

I have to admit, this is the first time I ever closed a bug by sending out a purchase order for new hardware.