A RavenDB user called us with a very strange issue. They are running on RavenDB 3.5 and have run out of disk space. That is expected, since they are storing a lot of data. Instead of simply increasing the disk size, they decided to take the time and simply increase the machine overall capacity. They moved from a 16 cores machine to a 48 cores machine with a much larger disk.
After the move, they found out something worrying. RavenDB now used a lot more CPU. If previous the average load was around 60% CPU utilization, now they were looking at 100% utilization on a much more powerful machine. That didn’t make sense to us, so we set out to figure out what was going on. A couple of mini dumps and we were able to figure out what was going on.
It got really strange because there was the following interesting observation:
- Under minimal load / idle – no CPU at all
- Under high load – CPU utilization in the 40%
- Under medium load – CPU utilization at 100%
That was strange. When there isn’t enough load, we are at a 100%? What gives?
The culprit was simple: BlockingCollection.
“Huh”, I can hear you say. “How can that be?”
A BlockingCollection should not be the cause of high CPU, right? It is in the name, it is blocking. Here is what happened. That blocking collection is used to manage tasks, and by default we are spawning threads to handle that at twice the number of available cores. All of these threads are sitting in a loop, calling Take() on the blocking collection.
The blocking collection internally is implemented as using a SemaphoreSlim, which call Wait() and Release() on the values as needed. Here is the Release() method notifying waiters:
What you can see is that if we have more than a single waiter, we’ll update all of them. The system in question had 48 cores, so we had 96 threads waiting for work. When we add an item to the collection, all of them will wake and try to pull an item from the collection. Once of them will succeed, and then rest will not.
Here is the relevant code:
As you can imagine, that means that we have 96 threads waking up and spending a full cycle just spinning. That is the cause of our high CPU.
If we have a lot of work, then the threads are busy actually doing work, but if there is just enough work to wake the threads, but not enough to give them something to do, they’ll set forth to see how hot they can make the server room.
The fix was to reduce the number of threads waiting on this queue to a more reasonable number.
The actual problem was fixed in .NET Core, where the SemaphoreSlim will only wake as many threads as it has items to free, which will avoid the spin storm that this code generates.