RavenDB 2.0 StopShip bug: Memory is nice, let us eat it all.
In the past few days, it has sometimes felt like RavenDB is a naughty boy who wants to eat all of the cake and leave none for the others.
The issue is that under a certain set of circumstances, RavenDB's memory usage would spike until it consumed all of the memory on the machine. We were pretty sure we knew the root cause: the prefetching of data was killing us, as proven by the fact that when we disabled it, we seemed to operate fine. We found quite a few such issues, and we got them fixed.
And still the problem persists… (picture hair being torn out and heads being banged now).
To make things worse, we couldn't see this problem in our standard load tests. It was our dogfooding tests that actually caught it, and only after a relatively long time in production. That sucked, a lot.
The good news is that I eventually sat down and wrote a test harness that could pretty reliably reproduce this issue. That narrowed things down considerably. The issue is related to map/reduce and to prefetching, but we are still investigating.
Here are the details:
- Run RavenDB on a machine that has at least 2 GB of free RAM.
- Run Raven.SimulatedWorkLoad; it will start writing documents and creating indexes (a rough sketch of the shape of such a harness follows this list).
- After about 50,000–80,000 documents have been imported, you'll see memory usage rise rapidly, consuming as much free memory as the machine has.
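For reference, here is a minimal sketch of the general shape of such a workload driver. This is not the actual Raven.SimulatedWorkLoad code; it assumes the standard RavenDB client API, and the Order type, batch size, and endpoint URL are made up for illustration. The real tool also defines map/reduce indexes, which are omitted here.

```csharp
// Hypothetical workload driver: writes documents in batches, forever.
// Assumes a RavenDB server listening on http://localhost:8080.
using System;
using Raven.Client.Document;

public class Order
{
    public string Customer { get; set; }
    public decimal Total { get; set; }
}

public static class SimulatedWorkLoad
{
    public static void Main()
    {
        using (var store = new DocumentStore { Url = "http://localhost:8080" }.Initialize())
        {
            for (var batch = 0; ; batch++)
            {
                // Write a small batch of documents per session, like an import would.
                using (var session = store.OpenSession())
                {
                    for (var i = 0; i < 128; i++)
                    {
                        session.Store(new Order
                        {
                            Customer = "customers/" + (batch % 1000),
                            Total = batch + i
                        });
                    }
                    session.SaveChanges();
                }

                if (batch % 100 == 0)
                    Console.WriteLine("Wrote {0:N0} documents so far", (batch + 1) * 128);
            }
        }
    }
}
```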
On my machine, it got to 6 GB before I had to kill it. I took a dump of the process memory at around 4.3 GB, and we are analyzing it now. The frustrating thing is that the act of taking the memory dump dropped the memory usage to 1.2 GB.
I wonder if we aren't just creating so much memory garbage that the GC simply lets us consume all available memory. The problem with that theory is that it gets so bad that we start paging, and I don't think the GC should allow that.
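One way to sanity check that theory, for what it's worth, is to force a full collection and see how much of the managed heap actually survives. If most of it is reclaimed, we are just generating garbage faster than the GC bothers to collect it; if it sticks around, something is genuinely rooted. A minimal sketch (plain .NET, nothing RavenDB specific; the GcProbe name is made up):

```csharp
using System;

public static class GcProbe
{
    // Force a full, blocking collection and report how much of the
    // managed heap survives. A big drop points at collectible garbage;
    // a small drop points at rooted objects, i.e. a real leak.
    public static void Report()
    {
        var before = GC.GetTotalMemory(forceFullCollection: false);
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect(); // catch anything resurrected by finalizers
        var after = GC.GetTotalMemory(forceFullCollection: false);
        Console.WriteLine("Managed heap: {0:N0} -> {1:N0} bytes", before, after);
    }
}
```

Note that GC.GetTotalMemory only sees the managed heap; if the process size is much larger than what it reports, the memory is being held by unmanaged code instead.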
The dump file can be found here (160MB compressed), if you feel like taking a stab at it. Now, if you'll excuse me, I need to open WinDBG and see what I can find.
Comments
Did you try explicitly calling the GC when a block of (20k/40k) documents has been imported?
mm, Yes, we have a way to call the GC directly, and no, it doesn't help.
So if you call GC.Collect() every N documents during the test, the memory still continues to increase?
mm, That doesn't matter; if we call GC.Collect() when the process is 4 GB in size, it should clear 4 GB of waste on its own. If it doesn't, it means that something else is wrong.
Yeah, sorry for my bad English. What I meant was: does the GC work or not, so as to know whether it's a memory leak problem.
I had a very similar issue using Lucene in the past. The problem was that I had upgraded the project to .NET 4.0 while Lucene.Net was built against an older version. Upgrading to Lucene.Net 2.9.4g and all projects to .NET 4.0 fixed my issue. I feel your pain.
To track down my issue, I just started deleting functionality until the problem stopped. WinDBG wasn't that helpful in my case.
Have you tried using a memory profiler like ANTS Memory Profiler or .NET Memory Profiler (http://memprofiler.com/)?
Can this have anything to do with it? http://nikosbaxevanis.com/2010/10/20/adventures-using-rhino-servicebus/
Rennie, As a matter of fact, no, that was another issue. See the next few posts.