Self flagellation and the barbarians are at the gate
We got a report from a user about severe issues with RavenDB. The server would report resource exhaustion while plenty of resources were still available, and once that happened, it would refuse to even restart itself, forcing a process kill.
As you can imagine, that was a pretty big deal for us, so we set out to investigate. And we found some interesting results.
One of the things that we like to keep in mind with RavenDB is that it is a safe choice. Whenever we need to make a decision between various tradeoffs, we’ll always choose the safe option. That means, among other things, that we are pretty careful about the way that we approach external input. And in this case, we are actively protecting ourselves from the outside world. One of the ways we do that is by limiting the number of requests that we will process concurrently.
The idea is that it is better to flat out reject requests than to put such a load on the system that it will eventually crash. Indeed, that has been such a successful tactic that to this day, there have been exactly zero production issues with it. To my knowledge, it has never even been noticed by any of our users.
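To make the "reject rather than queue" idea concrete, here is a minimal sketch in Python. It is not RavenDB's actual code; the class name, the status codes, and the tuple return value are all assumptions made for illustration. The only point it shows is that a request which cannot get a slot is turned away immediately instead of waiting.

```python
import threading


class RequestLimiter:
    """Hypothetical limiter that rejects work once the concurrency cap is hit."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_run(self, handler):
        # Non-blocking acquire: if no slot is free, reject right away
        # instead of queuing the request and piling load on the server.
        if not self._slots.acquire(blocking=False):
            return (503, None)  # illustrative "service unavailable" status
        try:
            return (200, handler())
        finally:
            # Always give the slot back, even if the handler raised.
            self._slots.release()
```

With a cap of 1, a second request arriving while the first is still running gets the rejection status, and the slot is available again once the first one finishes.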
The actual issue is that we have an internal limit that is set by default to 256 concurrent transactions. And by default, we will accept up to 192 concurrent requests. Then I looked at the actual logs, and I found:
And that explains much, but not nearly all. We had this in our code base for roughly 8 months. There are still other things that protect us from those issues, not the least of which is that it is actually hard to generate that number of requests against us (you really have to try very hard, usually from multiple machines). But there was one scenario that we didn’t consider for the purpose of protecting ourselves from the barbarians at the gate. Multi Get requests.
Multi Get requests allow you to package multiple requests to RavenDB into a single physical request. Those requests are going to cost you a single round trip to the server, and you can run as many of those as you want. In the dump we received, we could see 17 pending Multi Get requests and about 400 queries being executed, each of them requiring its own session. No wonder we got out-of-session errors.
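The arithmetic behind this can be sketched in a few lines of Python. This is a toy model, not our code: the function name is made up, and the figure of 24 queries per Multi Get request is an assumption chosen only so that 17 requests add up to roughly the 400 queries we saw in the dump. The request limit counts physical requests, while the session pool is consumed per inner query, so the first limit does nothing to protect the second.

```python
def check_capacity(queries_per_request, max_requests, max_sessions):
    """Toy model: admission is by physical request, sessions are per inner query.

    queries_per_request: one entry per physical request, giving how many
    queries it carries (1 for a plain request, more for a Multi Get).
    Returns (admitted_requests, sessions_needed, fits_in_session_pool).
    """
    # The request limiter only sees physical requests.
    admitted = queries_per_request[:max_requests]
    # Every inner query needs its own session.
    sessions_needed = sum(admitted)
    return len(admitted), sessions_needed, sessions_needed <= max_sessions


# 17 Multi Get requests, assumed to carry 24 queries each (about 400 total),
# checked against the old defaults of 192 requests and 256 sessions.
print(check_capacity([24] * 17, max_requests=192, max_sessions=256))
```

Seventeen requests is nowhere near the request limit, yet under these assumptions they need far more sessions than the pool holds.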
Final note: for what it is worth, I changed our limits to 1,024 concurrent sessions and 512 concurrent requests, which is more reasonable considering the kind of hardware we usually run on. Multi Get has another 192 sessions that it can utilize, and the rest are dedicated to background processes.
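For the curious, here is how that budget works out, as a small Python sketch. The function name is hypothetical, and it assumes that each regular request holds exactly one session, which is the simplest reading of the numbers above rather than a statement about the implementation.

```python
def background_sessions(total_sessions, max_requests, multi_get_sessions):
    """Sessions left for background work after requests and Multi Get.

    Assumption: every admitted regular request holds exactly one session.
    """
    remaining = total_sessions - max_requests - multi_get_sessions
    if remaining < 0:
        raise ValueError("limits exceed the total session budget")
    return remaining


# New defaults from the post: 1,024 sessions, 512 requests, 192 for Multi Get.
print(background_sessions(1024, 512, 192))  # prints 320
```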