RavenDB 2.0 StopShip bug: Memory is nice, let us eat it all.
In the past few days, it has sometimes felt like RavenDB is a naughty boy who wants to eat all of the cake and leave none for the others.
The issue is that under a certain set of circumstances, RavenDB's memory usage would spike until it consumed all of the memory on the machine. We were pretty sure we knew the root cause: the prefetching of data was killing us, as proven by the fact that when we disabled it, we seemed to operate fine. We found quite a few such issues, and we got them fixed.
And still the problem persists… (picture hair being torn out and heads being banged now).
To make things worse, we couldn't see this problem in our standard load tests. It was our dogfooding tests that actually caught it, and only after a relatively long time in production. That sucked, a lot.
The good news is that I eventually sat down and wrote a test harness that could pretty reliably reproduce this issue. That narrowed things down considerably. The issue is related to map/reduce and to prefetching, but we are still investigating.
Here are the details:
- Run RavenDB on a machine that has at least 2 GB of free RAM.
- Run Raven.SimulatedWorkLoad; it will start writing documents and creating indexes (a rough sketch of the shape of such a harness follows this list).
- After about 50,000–80,000 documents have been imported, you'll see memory usage rise rapidly, consuming as much free memory as the machine has.
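For reference, here is a minimal sketch of the general shape of such a workload driver. This is not the actual Raven.SimulatedWorkLoad code; it assumes the standard RavenDB client API, and the Order type, batch size, and endpoint URL are made up for illustration. The real tool also defines map/reduce indexes, which are omitted here.

```csharp
// Hypothetical workload driver: writes documents in batches, forever.
// Assumes a RavenDB server listening on http://localhost:8080.
using System;
using Raven.Client.Document;

public class Order
{
    public string Customer { get; set; }
    public decimal Total { get; set; }
}

public static class SimulatedWorkLoad
{
    public static void Main()
    {
        using (var store = new DocumentStore { Url = "http://localhost:8080" }.Initialize())
        {
            for (var batch = 0; ; batch++)
            {
                // Write a small batch of documents per session, like an import would.
                using (var session = store.OpenSession())
                {
                    for (var i = 0; i < 128; i++)
                    {
                        session.Store(new Order
                        {
                            Customer = "customers/" + (batch % 1000),
                            Total = batch + i
                        });
                    }
                    session.SaveChanges();
                }

                if (batch % 100 == 0)
                    Console.WriteLine("Wrote {0:N0} documents so far", (batch + 1) * 128);
            }
        }
    }
}
```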
On my machine, it got to 6 GB before I had to kill it. I took a dump of the process memory at around 4.3 GB, and we are analyzing it now. The frustrating thing is that the act of taking the memory dump dropped the memory usage to 1.2 GB.
I wonder if we aren't just creating so much memory garbage that the GC simply lets us consume all available memory. The problem with that theory is that it gets so bad that we start paging, and I don't think the GC should allow that.
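One way to sanity check that theory, for what it's worth, is to force a full collection and see how much of the managed heap actually survives. If most of it is reclaimed, we are just generating garbage faster than the GC bothers to collect it; if it sticks around, something is genuinely rooted. A minimal sketch (plain .NET, nothing RavenDB specific; the GcProbe name is made up):

```csharp
using System;

public static class GcProbe
{
    // Force a full, blocking collection and report how much of the
    // managed heap survives. A big drop points at collectible garbage;
    // a small drop points at rooted objects, i.e. a real leak.
    public static void Report()
    {
        var before = GC.GetTotalMemory(forceFullCollection: false);
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect(); // catch anything resurrected by finalizers
        var after = GC.GetTotalMemory(forceFullCollection: false);
        Console.WriteLine("Managed heap: {0:N0} -> {1:N0} bytes", before, after);
    }
}
```

Note that GC.GetTotalMemory only sees the managed heap; if the process size is much larger than what it reports, the memory is being held by unmanaged code instead.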
The dump file can be found here (160MB compressed), if you feel like taking a stab at it. Now, if you'll excuse me, I need to open WinDBG and see what I can find.
Comments
Did you try explicitly calling the GC when a block of (20k/40k) documents has been imported?
mm, Yes, we have a way to call the GC directly, and no, it doesn't help.
So if you call GC.Collect() every N documents during the test, the memory still continues to increase?
mm, That doesn't matter; if we call GC.Collect() when the process is 4 GB in size, it should clear 4 GB of waste on its own. If it doesn't, it means that something else is wrong.
Yeah, sorry for my bad English. What I meant was: does the GC work or not, so as to know whether it's a memory leak problem.
I had a very similar issue using Lucene in the past. The problem was that I had upgraded the project to .NET 4.0 while Lucene.Net was built against an older version. Upgrading to Lucene.Net 2.9.4g and all projects to .NET 4.0 fixed my issue. I feel your pain.
To track down my issue, I just started deleting functionality until the problem stopped. WinDBG wasn't that helpful in my case.
Have you tried using a memory profiler like ANTS Memory Profiler or .NET Memory Profiler (http://memprofiler.com/)?
Can this have anything to do with it? http://nikosbaxevanis.com/2010/10/20/adventures-using-rhino-servicebus/
Rennie, As a matter of fact, no, that was another issue. See the next few posts.