We had a serious situation with one of our test cases. We put the system through a lot, pushing it to the breaking point and beyond. And it worked; in fact, it worked beautifully. Right up until the point where we started to use too many resources and crashed. While normally that would be expected, it really bugged us, because we had provisions in place to protect us against exactly that. Bulkheads were supposed to close, operations to roll back, etc. We were supposed to react properly, reduce the cost of operations, prefer being up to being fast, the works.
That did not happen. From the outside, what happened is that we got to the point where we would trigger the “sky is about to fall, let’s conserve everything we can” mode, but we didn’t see the reaction that we expected from the system. Oh, we started to use a lot fewer resources, but the resources that we weren’t using? They weren’t going back to the OS; they were still held.
It’s easiest to talk about memory in this regard. We hold buffers in place to handle requests, and in order to avoid fragmentation, we typically make them large buffers that are resident on the large object heap.
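To make that concrete, here is a minimal sketch (not RavenDB’s actual code): in .NET, arrays of 85,000 bytes or more are allocated on the large object heap, which is only collected as part of a Gen 2 collection and is not compacted by default.

```csharp
using System;

class LohDemo
{
    static void Main()
    {
        // Arrays of 85,000 bytes or more land on the large object
        // heap, which is only collected together with Gen 2.
        var requestBuffer = new byte[256 * 1024];

        // LOH objects report as generation 2 from the moment they
        // are allocated.
        Console.WriteLine(GC.GetGeneration(requestBuffer)); // prints 2
    }
}
```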
When RavenDB detects that there is a low memory situation, it starts to scale back. It releases any held buffers, completes ongoing work, starts working on much smaller batches, etc. We saw that behavior, and we certainly saw the slowdown as RavenDB was willing to take less upon itself. But what we didn’t see was the actual release of resources as a result of this behavior.
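The shape of that scale-back logic is roughly the following hypothetical sketch; the ILowMemoryHandler interface and the batch sizes are illustrative assumptions, not RavenDB’s actual code:

```csharp
using System.Collections.Concurrent;

// Hypothetical sketch of the scale-back behavior, not RavenDB's
// actual implementation; names and numbers are assumptions.
interface ILowMemoryHandler
{
    void LowMemory();
}

class RequestProcessor : ILowMemoryHandler
{
    private readonly ConcurrentBag<byte[]> _heldBuffers = new ConcurrentBag<byte[]>();

    public int BatchSize { get; private set; } = 1024;

    public void LowMemory()
    {
        // Drop every buffer we are holding on to. Note that this only
        // makes the memory *collectible*; nothing goes back to the OS
        // until the GC actually runs.
        while (_heldBuffers.TryTake(out _))
        {
        }

        // Prefer being up to being fast: take on much smaller batches.
        BatchSize = 64;
    }
}
```

The comment in LowMemory is the crux of the story: dropping a reference and reclaiming the memory are two separate events, and the gap between them is where we crashed.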
And as it turned out, that was because we were too good at managing ourselves. A large part of the design of RavenDB 4.0 was around reducing the cost of garbage collections by reducing allocations as much as possible. This means that we run very few GCs. In fact, GC Gen 2 collections are rare in our environment. However, we need these Gen 2 collections to be able to clean up stuff that is in the finalizer queue. In fact, we typically need two such runs before the GC can be certain that the memory is not in use and can actually collect it.
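You can see that two-pass dance in a small self-contained example: the first collection only discovers that the object is dead and queues its finalizer; the memory itself is only reclaimable on a later collection, after the finalizer has run.

```csharp
using System;

class NativeBuffer
{
    // Anything with a finalizer survives the collection that first
    // finds it dead; it is merely moved to the finalizer queue.
    ~NativeBuffer() => Console.WriteLine("finalizer ran");
}

class Program
{
    static void Allocate()
    {
        var buffer = new NativeBuffer();
        GC.KeepAlive(buffer);
        // buffer becomes unreachable once this method returns
    }

    static void Main()
    {
        Allocate();

        GC.Collect();                  // pass 1: object found dead, finalizer queued
        GC.WaitForPendingFinalizers(); // the finalizer thread prints "finalizer ran"
        GC.Collect();                  // pass 2: only now can the memory be reclaimed
    }
}
```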
In this particular situation, we were careful to code things so that we would get very few GC collections, and that led us to crash: we would run out of resources before the GC could realize that we weren’t actually using them anymore.
The solution, by the way, was to change the way we respond to low memory conditions. We’ll be less diligent about keeping all the memory around; if it isn’t being used, we’ll start discarding it a lot sooner, so the GC has a better chance to realize that it isn’t being used and recover the memory. And instead of throwing the buffers away all at once when we hit low memory and hoping that the GC will be fast enough in collecting them, we’ll keep them around and reuse them, avoiding the additional allocations that processing more requests would have required.
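In other words, the low memory response becomes a pooling strategy. A simplified sketch of the idea (not RavenDB’s actual buffer manager; the names and sizes are assumptions):

```csharp
using System.Collections.Concurrent;

// Simplified sketch of the reuse strategy, not RavenDB's actual
// buffer manager; BufferPool and BufferSize are assumptions.
class BufferPool
{
    private const int BufferSize = 256 * 1024; // illustrative size

    private readonly ConcurrentBag<byte[]> _pool = new ConcurrentBag<byte[]>();

    public byte[] Rent()
    {
        // Hand back an existing buffer when we have one; allocate only
        // when the pool is empty, so serving more requests under low
        // memory adds no new allocations.
        return _pool.TryTake(out var buffer) ? buffer : new byte[BufferSize];
    }

    public void Return(byte[] buffer) => _pool.Add(buffer);
}
```

In modern .NET, System.Buffers.ArrayPool&lt;byte&gt;.Shared gives you this kind of pooling out of the box.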
Since the GC isn’t likely to be able to actually free them in time, we aren’t affecting the total memory consumed in this scenario, but we are able to reduce allocations by serving the buffers that are already allocated. These two actions, being less rigorous about policing our memory and not freeing things when we get to low memory, confusingly enough, both reduce the chance of getting into a low memory situation and reduce the chance of actually using too much memory when we do.