Production postmortem: The case of the native memory leak

This one is a pretty recent one. A customer complained about high memory usage in RavenDB under moderate usage. That was a cause for concern, since we care a lot about our memory utilization.

So we started investigating, and it turned out that we were wrong: the problem wasn’t with the RavenDB server, it was with the RavenDB Client Library. The customer had a scenario where, 100% of the time, after issuing a small number of requests (fewer than ten), the client process would be using hundreds of MB for no apparent purpose. The customer had already turned off caching, profiling and pretty much anything else that either they or we could think of.

We got a process dump from them and looked at it, and everything seemed to be fine. The size of the heap was reasonable, and there didn’t appear to be any memory being leaked. Our assumption at that point was that there was some sort of native memory leak in their application.

To continue the investigation, NDAs were required, but we managed to get through that and finally had a small repro that we could look at ourselves. The fact that the customer was able to create such a thing is really appreciated, because very often we have to work with a lot of missing information. Of course, when we ran this on our own systems, everything was just fine and dandy. There was no issue. We got back to the customer and they told us that the problem would only reproduce in IIS.

And indeed, once we tested inside IIS (as opposed to IIS Express), the problem was immediately obvious. That was encouraging (at least from my perspective). RavenDB doesn’t contain anything that would differentiate between IIS and IIS Express, so if the problem only showed up in IIS, it was not likely to be something that we did.

Once I had a repro of the problem, I sat down to observe what was actually going on. The customer was reading 1,024 documents from the server, and then sending them to the browser. That meant that each of the requests we tested was roughly 5MB in size. Note that this means that we have to:

  • Read the response from the server
  • Decompress the data
  • Materialize it to objects
  • Serialize the data back to send it over the network

That is a lot of memory being used here, but we had never run into any such issue before.
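To make the memory cost of those steps a bit more concrete, here is a minimal sketch (not the customer's code, and not RavenDB code). It assumes the default CLR behavior, where arrays of roughly 85,000 bytes or more are allocated on the large object heap, and that such objects are reported as belonging to the highest generation. The class and method names are made up for the illustration.

```csharp
using System;

public static class LargeBufferDemo
{
    // Allocates a buffer of the given size and returns the GC generation
    // the runtime reports for it right after allocation.
    public static int GenerationOfNewBuffer(int sizeInBytes)
    {
        var buffer = new byte[sizeInBytes];
        return GC.GetGeneration(buffer);
    }

    public static void Main()
    {
        // A small buffer starts its life in the youngest generation.
        Console.WriteLine("1 KB buffer generation: {0}",
            GenerationOfNewBuffer(1024));

        // A 5MB buffer, about the size of one response in this scenario,
        // is expected to land on the large object heap, which is only
        // collected as part of a full (gen 2) collection.
        Console.WriteLine("5 MB buffer generation: {0}",
            GenerationOfNewBuffer(5 * 1024 * 1024));
    }
}
```

Each of the four steps above can produce at least one buffer of that kind per request, so a handful of requests is enough to leave a fair amount of memory waiting for the next full collection.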

I had a hunch, and I created the following:

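using System;
using System.Net;
using System.Net.Http;
using System.Runtime;
using System.Web.Http;

[RoutePrefix("api/gc")]
public class GcController : ApiController
{
    [Route]
    public HttpResponseMessage Get()
    {
        // Ask the next blocking gen 2 collection to also compact the large object heap
        GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
        // Force a full collection (diagnostics only)
        GC.Collect(2);
        return Request.CreateResponse(HttpStatusCode.OK);
    }
}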

This allows me to manually invoke a rude GC, and indeed, when running this, memory utilization dropped quite nicely. Of course, that isn’t something that you want to run in your production systems, but it is a very important diagnostic tool.
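If you want numbers rather than eyeballing Task Manager, a sketch along the following lines can help. It is only an illustration: the `GcDiagnostics` and `Snapshot` names are hypothetical, and it assumes that comparing the managed heap size with the private bytes of the process, before and after a forced compacting collection, is a good enough signal for this kind of problem.

```csharp
using System;
using System.Diagnostics;
using System.Runtime;

public static class GcDiagnostics
{
    public sealed class Snapshot
    {
        public long ManagedBytes;  // what the GC thinks is currently allocated
        public long PrivateBytes;  // what the OS has committed to the process
    }

    public static Snapshot Take()
    {
        // A fresh Process instance is used so the counters are not stale.
        using (var process = Process.GetCurrentProcess())
        {
            return new Snapshot
            {
                ManagedBytes = GC.GetTotalMemory(false),
                PrivateBytes = process.PrivateMemorySize64
            };
        }
    }

    // Same idea as the controller above: full, blocking, LOH-compacting GC.
    public static void ForceCompactingCollect()
    {
        GCSettings.LargeObjectHeapCompactionMode =
            GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced, blocking: true);
    }

    public static void Main()
    {
        // Simulate a few large requests, then drop the references.
        var garbage = new byte[10][];
        for (var i = 0; i < garbage.Length; i++)
            garbage[i] = new byte[5 * 1024 * 1024];
        garbage = null;

        var before = Take();
        ForceCompactingCollect();
        var after = Take();

        Console.WriteLine("Managed bytes: {0:N0} -> {1:N0}",
            before.ManagedBytes, after.ManagedBytes);
        Console.WriteLine("Private bytes: {0:N0} -> {1:N0}",
            before.PrivateBytes, after.PrivateBytes);
    }
}
```

A large drop after the forced collection suggests the memory was garbage the GC simply had not bothered to reclaim yet, rather than a real leak. The exact figures will vary by runtime, GC mode and machine.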

Next, I tried to do the following:

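[Route]
public HttpResponseMessage Get()
{
    // Read a ~5MB JSON file from disk (no RavenDB involved)
    var file = File.ReadAllText(@"D:\5mb.json");
    // Materialize it to objects (Item is the repro's own type, not shown here)
    var deserializeObject = JsonConvert.DeserializeObject<Item>(file);

    // Serialize the data back out as the response
    return this.Request.CreateResponse(HttpStatusCode.OK, new { deserializeObject.item, deserializeObject.children });
}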

This involves no RavenDB code, but it has roughly the same memory pressure semantics as the customer code. Indeed, the problem reproduced quite nicely there as well.

So the issue is a request that uses a lot of memory (including at least one big buffer), which likely fragments the heap and keeps memory utilization high. When the system has had enough, it reclaims all of that, but absent active memory pressure, I’m guessing that it would rather leave things as they are until it has to pay that price.
