Getting fatal out of memory errors because we are managing memory too well
We got a serious situation on one of our test cases. We put the system through a lot, pushing it to the breaking point and beyond. And it worked, in fact, it worked beautifully. Up until the point that we started to use too many resources and crashed. While normally that would be expected, it really bugged us, we had provisions in place to protect us against that. Bulkheads were supposed to be blocked, operations rolled back, etc. We were supposed to react properly, reduce costs of operations, prefer being up to being fast, the works.
That did not happen. From the outside, what happened is that we go to the point where we would trigger the “sky about the fall, let’s conserve everything we can”, but we didn’t see the reaction that we expected from the system. Oh, we were started to use a lot less resources, but the resources that we weren’t using? They weren’t going back to the OS, they were still held.
It’s easiest to talk about memory in this regard. We hold buffers in place to handle requests, and in order to avoid fragmentation, we typically make them large buffers, that are resident on the large object heap.
When RavenDB detects that there is a low memory situation, it starts to scale back. It releases any held buffers, completes ongoing works and starts working on much smaller batches, etc. We saw that behavior, and we certainly saw the slow down as RavenDB was willing to take less upon itself. But what we didn’t see is the actual release of resources as a result of this behavior.
And as it turned out, that was because we were too good about managing ourselves. A large part of the design of RavenDB 4.0 was around reducing the cost of garbage collections by reducing allocations as much as possible. This means that we are running very few GCs. In fact, GC Gen 2 collections are rare on our environment. However, we need these Gen 2 collections to be able to clean up stuff that is in the finalizer queue. In fact, we typically need two such runs before the GC can be certain that the memory is not in use and actually collect it.
In this particular situation, we were careful to code so we will get very few GC collections running, and that led us to crash because we would run out of resources before the GC could realize that we are actually not really using them at this point.
The solution, by the way, was to change the way we respond to low memory conditions. We’ll be less good about keeping all the memory around and if it isn’t being used, we’ll start discarding it a lot sooner, so the GC has better chance to actually realize that is isn’t being used and recover the memory. An instead of throwing the buffers away all at once when we have low memory and hope that the GC will be fast enough in collecting them, we’ll keep them around and reuse them, avoiding the additional allocations that processing more requests would have required.
Since the GC isn’t likely to be able to actually free them in time, we aren’t affecting the total memory consumed in this scenario but are able to reduce allocations by serving the buffers that are already allocated. This two actions, being less rigorous about policing our memory and not freeing things when we get low memory are confusingly enough to get both reduce the chance of getting into low memory and reduce the chance of actually using too much memory in such a case.