We got a pretty nasty bug at a customer site a few months ago. Every now and then, the server running RavenDB will go into a high CPU mode, use 100% CPU and stay there for extended period of time. After a while, it will just return back to normal.
Looking at the details, there was nothing really that should cause this scenario. The amount of load of the system didn’t justify this load, we are talking about a maximum load of under 500 requests / second. RavenDB can handle that much on a Raspberry PI without straining itself, after all.
The problem was that we couldn’t figure out what was going on… none of the usual metrics were relevant. Typically, when we see high CPU utilization, the fault is either our code or the GC is working hard. In this case, however, while the RavenDB process was responsible for the CPU usage, there was no indication that it was any of the usual suspects. Here is what the spike looked like:
The customer has increased the size of the machine several time, trying to accommodate the load, but the situation was not getting better. In fact, the situation appeared to be getting worse. This is on a server running Windows 2016, and all the nodes in the cluster would experience this behavior, effectively taking the system down. The did not do that on an synchronized schedule, but one would go into a high CPU load, and the clients would fail to the other nodes, which will usually (but not always) trigger the situation. After a short while, it would get back down to normal rates, but that was obviously not a good situation.
After a while, we found something that was absolutely crazy. Looking at the Task Manager, we added all the possible columns and looked at them. Take a look at the following screen shot:
What you see here is the Page Fault Delta, basically, how many page faults happened in the system in the past second for this process. A high number on this column is when you see hundreds of page faults. Thousands of page faults usually means that you are swapping badly. Hundreds of thousands? I have never seen such a scenario and I couldn’t imagine what that actually is.
What is crazier is that this number should be physically impossible. At a glance, we are looking at a lot of reads from the disk, but looking at the disk metrics, we could see that we had very little activity there.
That is when we discovered that there is another important metric, Hard Page Faults / sec. That metric, on the other hand, typically ranged in the single digits or very low double digits, nothing close to what we were seeing. So what was going on here?
In addition to hard page faults (reading the data from disk), there is also the concept of soft page faults. Those page faults can happen if the OS can find the data it needs in RAM (the page cache, for example). But if it is in RAM, why do we even have a page fault in the first place? The answer is that while the memory may be in RAM, it may not have mapped to the process in question.
Consider the following image, we have two processed that mapped the same file to memory. On both processes, the first page is mapped to the same physical page. But the 3rd page on the second process (5678) is not mapped. What do you think will happen if the process will access this page?
At this point, the CPU will trigger a page fault, which the OS needs to handle. How is it going to do that? It can fetch the data from disk, but it doesn’t actually need to do so. What it needs to do is just update the page mapping to point to the already loaded page in memory. That is what is called a soft page fault (with hard page fault requiring us to go to disk).
Note that the CPU utilization above shows that the vast majority of the time is actually spent on system time, not user time. That means that the kernel is somehow doing a lot of work, but what is going on?
The issue with page fault delta was the key to understanding what exactly is going on here. When we looked deeper using ETW, we were able to capture the following trace:
What you can see in the image is that the vast majority of the time is spent in handling the page fault on the kernel side, as expected given the information that we have so far. However, the reason for this issue is that we are contending on an exclusive lock? What lock is that?
We worked with Microsoft to figure out what exactly is going on and we found that in order to modify the process mapping table, Windows needs to take a lock. That make sense, since you need to avoid concurrent modification to the page table. However, on Windows 2016, that is a process wide lock. Consider the impact of that. If you have a scenario where a lot of threads want to access pages that aren’t mapped to the process, what will happen?
On each thread, we’ll have a page fault and handle that. If the page fault is a hard page fault, we will issue a read and put the thread to sleep until this happen. But what if this is a soft page fault? Then we just need to take the process lock, update the mapping table and return. But what if I have a high degree of concurrency? Like 64 concurrent cores that all contend on the same exact lock? You are going to end up with the exact situation above. There is going to be a hotly contested lock and you’ll spend all your time at the kernel level.
The question now was, why did this happen? The design of RavenDB relies heavily on memory mapped I/O and it is something that we have been using for over a decade. What can cause us to have so many soft page faults?
The answer came from looking even more deeply into the ETW traces we took. Take a look at the following stack:
When we call FlushFileBuffers, as we need to do to ensure that the data is consistent on disk, there is a lot that is actually going on. However, one of the key aspects that seems to be happening is that Windows will remove pages that were written by FlushFileBuffers from the working set of the process. That will lead to page faults (soft ones). We confirmed with Microsoft that this is the expected behavior, calling FlushFileBuffers (fsync) will trim the modified pages from the process mapping table. This is done to improve coherency between the memory mapped pages and the page cache, I believe.
To reproduce this scenario, you’ll need to do something similar to:
- Map a large number of pages (in this case, hundreds of GB)
- Modify the data in those pages (in our case, write documents, indexes, etc)
- Call FlushFileBuffers on the data
- From many threads, access the recently flushed data (each thread ideally accessing a different page)
On Windows 2016, you’ll hit a spin lock contention issue and spend most of your time contending inside the kernel. The recommendation from Microsoft has been to move to Windows 2019, where the memory lock granularity has been increased, so they won’t all contend on the same lock. Indeed, testing on Windows 2019 we weren’t able to reproduce the problem.
The really strange thing here is that we have using the exact same code and approach in RavenDB for many years, and only recently did we see a shift with most of our customers running on Linux. That particular behavior is how we are used to be running, and I would expect it to be triggered often.
The annoying thing about this is that this is actually the case of too much of a good thing. Usually RavenDB will scale linearly with the number of cores for reads, the customer in question moved from RavenDB 3.5 to RavenDB 5.2, and they used the same size machine in both cases. RavenDB 5.2 is far more efficient, however. It was able to utilize the cores a lot better and trigger this behavior on a consistent basis. Using RavenDB 3.5, on the other hand, a lot of the CPU time was spent on doing other things, so we didn’t trigger this issue. Indeed, a workaround to improve performance was to reduce the number of cores on the system. That reduced the contention and made the whole system more stable.
The actual solution, however, was to run on Windows 2019, but that was a hard problem to solve. We tested pretty much any scenario that we could think to see what can help us here. And yes, we tested this on Linux, and didn’t see any indication of a similar problem.
More posts in "Production postmortem" series:
- (13 Dec 2021) The memory leak that only happened on Linux
- (17 Sep 2021) The Guinness record for page faults & high CPU
- (07 Jan 2021) The file system limitation
- (23 Mar 2020) high CPU when there is little work to be done
- (21 Feb 2020) The self signed certificate that couldn’t
- (31 Jan 2020) The slow slowdown of large systems
- (07 Jun 2019) Printer out of paper and the RavenDB hang
- (18 Feb 2019) This data corruption bug requires 3 simultaneous race conditions
- (25 Dec 2018) Handled errors and the curse of recursive error handling
- (23 Nov 2018) The ARM is killing me
- (22 Feb 2018) The unavailable Linux server
- (06 Dec 2017) data corruption, a view from INSIDE the sausage
- (01 Dec 2017) The random high CPU
- (07 Aug 2017) 30% boost with a single line change
- (04 Aug 2017) The case of 99.99% percentile
- (02 Aug 2017) The lightly loaded trashing server
- (23 Aug 2016) The insidious cost of managed memory
- (05 Feb 2016) A null reference in our abstraction
- (27 Jan 2016) The Razor Suicide
- (13 Nov 2015) The case of the “it is slow on that machine (only)”
- (21 Oct 2015) The case of the slow index rebuild
- (22 Sep 2015) The case of the Unicode Poo
- (03 Sep 2015) The industry at large
- (01 Sep 2015) The case of the lying configuration file
- (31 Aug 2015) The case of the memory eater and high load
- (14 Aug 2015) The case of the man in the middle
- (05 Aug 2015) Reading the errors
- (29 Jul 2015) The evil licensing code
- (23 Jul 2015) The case of the native memory leak
- (16 Jul 2015) The case of the intransigent new database
- (13 Jul 2015) The case of the hung over server
- (09 Jul 2015) The case of the infected cluster