The performance regression odyssey

time to read 4 min | 660 words

We care a lot about the performance of RavenDB.

Aside from putting a lot of time and effort into ensuring that RavenDB uses optimal code, we also have a monitoring system in place that alerts us when it observes a performance degradation. Usually those alerts are about fairly subtle issues, but this time the alert showed a big degradation in a particular test.

The actual test in question does low-level manipulation of Voron (RavenDB’s storage engine), and as such, it sits at the core of the performance hotspots we watch for.

Looking at the commits around that time frame, we quickly narrowed the fault down to the following changes:

(screenshot of the suspect commit diff)

A really important observation here, however, is that this method is not called in the test at all. So we were looking at whether this change caused a regression in code generation. To be honest, I couldn’t believe that this was the case.

Indeed, looking at the generated assembly, there was no difference between the two versions. But something caused the performance to degrade significantly enough on this test to raise all sorts of alarm bells.

We started looking into a lot of details about the system, the usual things like checking for thermal throttling, etc.

We struck gold with this command: sudo smartctl --all /dev/nvme0n1

Take a look at the results:

    SMART overall-health self-assessment test result: FAILED!
    - NVM subsystem reliability has been degraded
    SMART/Health Information (NVMe Log 0x02, NSID 0x1)
    Critical Warning:                   0x04
    Temperature:                        35 Celsius
    Available Spare:                    100%
    Available Spare Threshold:          10%
    Percentage Used:                    115%
    Data Units Read:                    462,613,897 [236 TB]
    Data Units Written:                 2,100,668,468 [1.07 PB]
    Host Read Commands:                 10,355,495,581
    Host Write Commands:                9,695,954,131
    Controller Busy Time:               70,777

In other words, the disk is literally crying out at us. This tells us that the drive has been in action for roughly 50 days of actual busy time (the Controller Busy Time figure is in minutes) and that it has gone beyond its design specs.

In particular, you can see that we wrote over a petabyte of data to the disk as part of our test case executions. This is a 500GB drive, which means that we filled it to capacity over 2,000 times (1.07 PB / 500 GB ≈ 2,140) before we hit this issue.

Once we went past the drive’s rated write endurance (Percentage Used > 100%), the drive has to work a lot harder to service each write, which is why we are seeing longer test times.
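Had we been watching the drive’s wear level alongside the test timings, we would have caught this much earlier. Here is a minimal sketch (not from the original investigation) of the kind of check that could run on the benchmark machines, assuming smartmontools 7.0+ for the --json flag; the device path, threshold, and alerting are illustrative assumptions.

    # Hedged sketch: flag an NVMe drive whose wear level is getting close to its
    # rated endurance, using smartctl's JSON output (smartmontools >= 7.0).
    import json
    import subprocess
    import sys

    DEVICE = "/dev/nvme0n1"   # assumption: the same device as in the post
    WEAR_THRESHOLD = 90       # alert before Percentage Used reaches 100%

    def read_health(device: str) -> dict:
        # smartctl returns a non-zero exit code when SMART reports problems,
        # which is exactly the case we care about, so don't raise on it.
        out = subprocess.run(
            ["sudo", "smartctl", "--json", "--all", device],
            capture_output=True, text=True, check=False,
        )
        return json.loads(out.stdout)

    def main() -> int:
        health = read_health(DEVICE).get("nvme_smart_health_information_log", {})
        used = health.get("percentage_used")
        warning = health.get("critical_warning")
        if used is None:
            print(f"could not read wear level for {DEVICE}", file=sys.stderr)
            return 2
        if used >= WEAR_THRESHOLD or warning:
            print(f"{DEVICE}: percentage used = {used}%, critical warning = {warning}")
            return 1   # non-zero exit so cron / CI can raise an alert
        return 0

    if __name__ == "__main__":
        sys.exit(main())

The non-zero exit code is the whole point of the sketch: wired into the same monitoring that flagged the test regression, it would have pointed at the hardware instead of the commit history.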

I have to admit, this is the first time I ever closed a bug by sending out a purchase order for new hardware.