Well, we got it. Dear DB, get your hands OFF my memory (unless you really need it, of course).
The actual issue was so hard to figure out because it was not a memory leak. It exhibited all of the signs of one, sure, but it was not.
Luckily for RavenDB, we have a really great team, and the guy who provided the final lead is Arek, from AIS.PL, who does a really great job. Arek managed to capture the state in a way that showed that a lot of the memory was held by the OptimizedIndexReader class: about 2.45GB of it, to be accurate. That made absolutely no sense, since OIR is a relatively cheap class, and we don’t expect to have many of them.
Here is the entire interesting part of the class:
2: public class OptimizedIndexReader<T> where T : class
3: {
4:     private readonly List<Key> primaryKeyIndexes;
5:     private readonly byte[] bookmarkBuffer;
6:     private readonly JET_SESID session;
7:     private readonly JET_TABLEID table;
8:     private Func<T, bool> filter;
10:     public OptimizedIndexReader(JET_SESID session, JET_TABLEID table, int size)
11:     {
12:         primaryKeyIndexes = new List<Key>(size);
13:         this.table = table;
14:         this.session = session;
15:         bookmarkBuffer = new byte[SystemParameters.BookmarkMost];
16:     }
As you can see, this isn’t something that looks like it can hold 2.5GB. Sure, it has a collection, but the collection isn’t really going to be that big. It may get to a few thousand items, but it is capped at around 131,072 or so. And the Key class is also small. So that can’t be it.
There was a… misunderstanding in the way I grokked the code. We did not have one OIR with a collection of 131,072 items. No, the situation was a lot more involved. When using map/reduce indexes, we would have as many readers as (keys times buckets). When talking about large map/reduce indexes, that meant that we might need tens of thousands of readers to process a single batch. Now, each of those readers would usually contain just one or two items, so that wasn’t deemed to be a problem.
Except that we have this thing on line 15. BookmarkMost is actually 1,001 bytes. With the rest of the reader, let us call this an even 1KB. And we had up to 131,072 of those around, per index. Now, we weren’t going to hang on to those guys for a long while, just until we were done indexing. Except… Since this took up a lot of memory, this also meant that we would create a lot of garbage memory for the GC to work on. That would slow everything down, and result in us needing to process larger and larger batches. As the size of the batches would increase, we would use more and more memory. And eventually we would start paging.
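To put a number on that, here is a minimal back-of-the-envelope sketch. The 1KB per reader and the 131,072 readers per index come straight from the figures above; the ReaderMemoryEstimate class and its BytesPerIndex helper are hypothetical names made up for this illustration and are not part of RavenDB.

```csharp
using System;

// Hypothetical helper for illustration only; not part of RavenDB.
public static class ReaderMemoryEstimate
{
    // BookmarkMost is 1,001 bytes; with the rest of the reader's fields
    // we round up to "an even 1KB" per reader, as in the text.
    public const long ApproxBytesPerReader = 1024;

    // Upper bound on live readers per index, as quoted in the text.
    public const long MaxReadersPerIndex = 131072;

    // Total bytes held by the readers of a single index.
    public static long BytesPerIndex(long readers)
    {
        return readers * ApproxBytesPerReader;
    }

    public static void Main()
    {
        long bytes = BytesPerIndex(MaxReadersPerIndex);
        long megabytes = bytes / (1024 * 1024);
        Console.WriteLine("{0} bytes, or {1} MB, per index", bytes, megabytes);
    }
}
```

That works out to 128MB per index at the cap, almost all of it in bookmark buffers attached to readers that usually hold only one or two items each.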
Once we did that, we were basically in slowville, carrying around a lot of memory that we didn’t really need. If we were able to complete the batch, all of that memory would instantly turn to garbage, and we could move on. But if we had another batch with just as much work to do…
And what about prefetching? Well, as it turned out, we had our own problems with prefetching, but they weren’t related to this. Prefetching simply made things so fast that it served data to the map/reduce index at a rate high enough to expose this issue. Ouch!
We probably still need to go over some things, but this looks good.