One of the really annoying things about production readiness testing is that you often run into the same bug over and over again. In our case, we have had to fix memory obesity issues repeatedly.
Just recently, we had the following major issues that we had to deal with:
- Optimizations gone wild, O(N!) memory leaks
- Debugging memory issues with RavenDB using WinDBG
- RavenDB 2.0 StopShip bug: Memory is nice, let us eat it all.
- RavenDB Memory Issue, The Process
Overall, not fun.
But the saga ain’t over yet. We had a test case, we figured out what was going on, and we fixed it, damn it. And then we went to prod and discovered that we didn’t fix it after all. I’ll spare you the investigative story; suffice it to say that we finally ended up figuring out that we were to blame for optimizing for a specific scenario.
In this case, we had done a lot of work to optimize for very large batches (import scenario), and we set the Lucene merge factor at a very high level (way too high, as it turned out). That was perfect for batching scenarios, but not so good for non-batching scenarios. It resulted in us having to hold a lot of Lucene segments in memory. Segments aren’t expensive, but they each have their own data structures. That works, sure, but when you start having tens of thousands of them, we are back in the previous story, where relatively small objects come together in unexpected ways to kill us in nasty ways. Reducing the merge factor meant that we would keep only a very small number of segments, and avoided the problem entirely.
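To get a feel for why a high merge factor keeps so many segments alive, here is a rough sketch of how a Lucene-style logarithmic merge policy behaves. This is an illustration only, not RavenDB’s or Lucene’s actual code; `simulate_segments` and its parameters are made-up names for the sketch:

```python
from collections import Counter

def simulate_segments(num_flushes, merge_factor):
    """Simulate a Lucene-style logarithmic merge policy.

    Each flush creates a new segment of size 1. Whenever merge_factor
    segments of the same size accumulate, they merge into a single
    segment merge_factor times larger. Returns the number of live
    segments left after all flushes.
    """
    segments = []  # sizes of the live segments
    for _ in range(num_flushes):
        segments.append(1)  # a flush produces one small segment
        merged = True
        while merged:  # cascade merges up through the levels
            merged = False
            for size, count in Counter(segments).items():
                if count >= merge_factor:
                    for _ in range(merge_factor):
                        segments.remove(size)
                    segments.append(size * merge_factor)
                    merged = True
                    break
    return len(segments)

# For the same number of flushes, a high merge factor leaves far
# more segments alive:
print(simulate_segments(9999, 10))    # 36 segments
print(simulate_segments(9999, 1000))  # 1008 segments
```

In this model the live segment count is the sum of the “digits” of the flush count written in base merge_factor, so each level can hold up to merge_factor - 1 unmerged segments. A huge merge factor barely matters when you write in enormous batches, but with many small flushes it lets thousands of small segments pile up, each dragging its own data structures along.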
The best thing about this? I had to chase a bunch of false leads and ended up fixing what would have been a separate memory leak, one that would otherwise have gone unnoticed.
And now, let us see if stopping work at a quarter to six in the morning is conducive to proper rest. Excuse me, I am off to bed.