Production postmortem: The case of the native memory leak


This one is a pretty recent one. A customer complained about high memory usage in RavenDB under moderate usage. That was a cause for concern, since we care a lot about our memory utilization.

So we started investigating, and it turned out that we were wrong: the problem wasn’t with RavenDB, it was with the RavenDB Client Library. The customer had a scenario where, 100% of the time, after issuing a small number of requests (fewer than ten), the client process would be using hundreds of MB for no apparent purpose. The customer had already turned off caching, profiling and pretty much anything else that either they or we could think of.

We got a process dump from them and looked at it, and everything seemed to be fine. The size of the managed heap was good, and there didn’t appear to be any memory being leaked. Our assumption at that point was that there was some sort of native memory leak in their application.

To continue the investigation, NDAs were required, but we managed to get through that and we finally had a small repro that we could look at ourselves. The fact that the customer was able to create such a thing is really appreciated, because very often we have to work with a lot of missing information. Of course, when we ran this on our own system, everything was just fine & dandy. There was no issue. We got back to the customer and they told us that the problem would only reproduce in IIS.

And indeed, once we tested inside IIS (rather than IIS Express), the problem was immediately obvious. That was encouraging (at least from my perspective). RavenDB doesn’t contain anything that would differentiate between IIS and IIS Express, so if the problem showed up only in IIS, it was not likely to be something that we did.

Once I had a repro of the problem, I sat down to observe what was actually going on. The customer was reading 1,024 documents from the server, and then sending them to the browser. That meant that each of the requests we tested was roughly 5MB in size. Note that this means that we have to:

  • Read the response from the server
  • Decompress the data
  • Materialize it to objects
  • Serialize the data back to send it over the network

That is a lot of memory being used here. Still, we had never run into any such issue before.
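To make the cost of those four steps concrete, here is a minimal sketch of the same pipeline. It is not the customer's code or the RavenDB client: the `Doc` type, the use of gzip, and the use of `System.Text.Json` are all assumptions made for illustration. It only shows that each step holds its own multi-megabyte buffer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text.Json;

public static class RequestPipelineSketch
{
    // Hypothetical document shape; the customer's real type is not shown in the post.
    public sealed class Doc
    {
        public int Id { get; set; }
        public string Payload { get; set; }
    }

    // Builds a gzip-compressed JSON array that stands in for the server response.
    public static byte[] BuildCompressedResponse(int count, int payloadChars)
    {
        var docs = new List<Doc>(count);
        for (int i = 0; i < count; i++)
            docs.Add(new Doc { Id = i, Payload = new string('x', payloadChars) });

        byte[] json = JsonSerializer.SerializeToUtf8Bytes(docs);
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
                gzip.Write(json, 0, json.Length);
            return output.ToArray(); // ToArray still works after the stream is closed
        }
    }

    // Runs the four steps and reports the size of each intermediate result.
    public static (int Compressed, int Decompressed, int Documents, int Reserialized) Run(byte[] response)
    {
        // 1. Read the response: here it is already in memory as one buffer.
        // 2. Decompress it into a second, larger buffer.
        byte[] raw;
        using (var input = new MemoryStream(response))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var buffer = new MemoryStream())
        {
            gzip.CopyTo(buffer);
            raw = buffer.ToArray();
        }

        // 3. Materialize the JSON into objects.
        List<Doc> docs = JsonSerializer.Deserialize<List<Doc>>(raw);

        // 4. Serialize the objects again to send them over the network.
        byte[] outgoing = JsonSerializer.SerializeToUtf8Bytes(docs);

        return (response.Length, raw.Length, docs.Count, outgoing.Length);
    }
}
```

Calling `RequestPipelineSketch.Run(RequestPipelineSketch.BuildCompressedResponse(1024, 5000))` approximates the 1,024 documents and roughly 5MB per request described above. The decompressed and re-serialized buffers are each around that size, on top of the materialized objects.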

I had a hunch, and I created the following:

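using System;
using System.Net;
using System.Net.Http;
using System.Runtime;
using System.Web.Http;

[RoutePrefix("api/gc")]
public class GcController : ApiController
{
    [Route]
    public HttpResponseMessage Get()
    {
        // Ask the runtime to compact the large object heap on the next blocking gen 2 collection.
        GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
        // Force a full (generation 2) collection right now.
        GC.Collect(2);
        return Request.CreateResponse(HttpStatusCode.OK);
    }
}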

This allows me to manually invoke a rude GC, and indeed, when running this, memory utilization dropped quite nicely. Of course, that isn’t something that you want to run in your production systems, but it is a very important diagnostic tool.
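If you want numbers rather than eyeballing a process monitor, a small helper can record memory before and after the forced collection. This is only a sketch: the `GcDiagnostics` class and its method name are made up for illustration, and it uses `GC.GetTotalMemory` and `Process.WorkingSet64` as rough indicators, not as an exact accounting of fragmentation.

```csharp
using System;
using System.Diagnostics;
using System.Runtime;

public static class GcDiagnostics
{
    // Forces a blocking, compacting full GC and reports managed heap bytes
    // and process working set before and after it.
    public static (long ManagedBefore, long ManagedAfter, long WorkingSetBefore, long WorkingSetAfter) CollectAndMeasure()
    {
        long managedBefore = GC.GetTotalMemory(forceFullCollection: false);
        long workingSetBefore = ReadWorkingSet();

        // Same idea as the controller above: compact the large object heap once,
        // then run a forced, blocking generation 2 collection.
        GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect(2, GCCollectionMode.Forced, blocking: true);

        long managedAfter = GC.GetTotalMemory(forceFullCollection: false);
        long workingSetAfter = ReadWorkingSet();

        return (managedBefore, managedAfter, workingSetBefore, workingSetAfter);
    }

    private static long ReadWorkingSet()
    {
        // A fresh Process object reflects the current values.
        using (var process = Process.GetCurrentProcess())
            return process.WorkingSet64;
    }
}
```

Logging the four values from an endpoint like the one above makes it easier to compare IIS and IIS Express runs side by side. As with the controller, this is a diagnostic aid, not something to call routinely.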

Next, I tried to do the following:

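[Route]
public HttpResponseMessage Get()
{
    // Read the whole ~5MB file into one string; a string this size lands on the large object heap.
    var file = File.ReadAllText(@"D:\5mb.json");
    // Materialize it with Json.NET (Item is the repro's own type, not shown here).
    var deserializeObject = JsonConvert.DeserializeObject<Item>(file);

    // Returning the object makes Web API serialize it again for the response.
    return this.Request.CreateResponse(HttpStatusCode.OK, new { deserializeObject.item, deserializeObject.children });
}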

This involves no RavenDB code, but it has roughly the same memory pressure semantics as the customer code. Indeed, the problem reproduced quite nicely there as well.

So the issue is a request that uses a lot of memory (including at least one big buffer), likely causing some fragmentation in the heap that keeps memory utilization high. When the system has had enough, it will reclaim all of that, but unless there is active memory pressure, I’m guessing that it would rather leave things as they are until it has to pay that price.
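One piece of that guess is easy to check: big buffers do not live where small objects do. With default settings, arrays of roughly 85,000 bytes or more go to the large object heap, which the runtime reports as generation 2 and does not compact unless asked to. The sketch below only demonstrates that placement; the class and method names are hypothetical, and it does not prove that fragmentation was the cause in the customer's case.

```csharp
using System;
using System.Runtime;

public static class LargeObjectHeapSketch
{
    // Allocates a buffer and returns the generation the GC reports for it.
    // Small arrays start in generation 0; large object heap allocations report 2.
    public static int GenerationOfNewBuffer(int sizeInBytes)
    {
        var buffer = new byte[sizeInBytes];
        return GC.GetGeneration(buffer);
    }

    // Returns the current large object heap compaction setting.
    // Unless changed, it is Default, meaning the large object heap is not compacted.
    public static GCLargeObjectHeapCompactionMode CurrentCompactionMode()
    {
        return GCSettings.LargeObjectHeapCompactionMode;
    }
}
```

For example, `LargeObjectHeapSketch.GenerationOfNewBuffer(5 * 1024 * 1024)` reports generation 2 for a buffer the size of one of these responses, while a 1KB buffer reports generation 0. That is why a handful of 5MB requests can leave memory looking very different from many small ones.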
