Production postmortem: The case of the native memory leak
This one is pretty recent. A customer complained about high memory usage in RavenDB under moderate load. That was a cause for concern, since we care a lot about our memory utilization.
So we started investigating, and it turned out that we were wrong: the problem wasn’t with RavenDB, it was with the RavenDB Client Library. The customer had a scenario where, 100% of the time, after issuing a small number of requests (fewer than ten), the client process would be using hundreds of MB for no apparent purpose. The customer had already turned off caching, profiling and pretty much anything else that either they or we could think of.
We got a process dump from them and looked at it, and everything seemed to be fine. The size of the heap was good, and there didn’t appear to be any memory being leaked. Our assumption at that point was that there was some sort of native memory leak in their application.
To continue the investigation, NDAs were required, but we managed to get through that and we finally had a small repro that we could look at ourselves. The fact that the customer was able to create such a thing is really appreciated, because very often we have to work with a lot of missing information. Of course, when we ran this on our own system, everything was just fine and dandy. There was no issue. We got back to the customer and they told us that the problem would only reproduce in IIS.
And indeed, once we tested inside IIS (rather than IIS Express), the problem was immediately obvious. That was encouraging (at least from my perspective). RavenDB doesn’t contain anything that would differentiate between IIS and IIS Express. So if the problem showed up only in IIS, it was not likely to be something that we did.
Once I had a repro of the problem, I sat down to observe what was actually going on. The customer was reading 1,024 documents from the server, and then sending them to the browser. That meant that each of the requests we tested was roughly 5MB in size. Note that this means that we have to:
- Read the response from the server
- Decompress the data
- Materialize it to objects
- Serialize the data back to send it over the network
That is a lot of memory being used here. And we had never run into any such issues before.
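To make the four steps above concrete, here is a minimal console sketch of the same kind of request. It is not the customer's code or the RavenDB client: `Doc`, `FakeServerResponse` and `Run` are hypothetical names, the gzip-compressed payload is generated locally, and System.Text.Json stands in for the serializer (that part assumes .NET Core 3.0 or later). The point is only that every step produces another multi-megabyte buffer or object graph.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text.Json;

// Hypothetical document shape, for illustration only.
public sealed class Doc
{
    public int Id { get; set; }
    public string Body { get; set; } = "";
}

public static class RequestPipelineSketch
{
    // Stand-in for the server: 1,024 documents as gzip-compressed JSON (~5MB uncompressed).
    public static byte[] FakeServerResponse(int count = 1024, int bodyChars = 5000)
    {
        var docs = Enumerable.Range(0, count)
            .Select(i => new Doc { Id = i, Body = new string((char)('a' + i % 26), bodyChars) })
            .ToList();
        byte[] json = JsonSerializer.SerializeToUtf8Bytes(docs);

        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                gzip.Write(json, 0, json.Length);
            }
            // ToArray still works after the GZipStream has closed the MemoryStream.
            return buffer.ToArray();
        }
    }

    public static (int Compressed, int Decompressed, int Documents, int Reserialized) Run(byte[] response)
    {
        // 1. Read the response from the server (here it is already in memory).
        // 2. Decompress the data into a second, much larger buffer.
        byte[] json;
        using (var input = new MemoryStream(response))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output);
            json = output.ToArray();
        }

        // 3. Materialize it to objects (a third copy of the data, as an object graph).
        List<Doc> docs = JsonSerializer.Deserialize<List<Doc>>(json);

        // 4. Serialize the data back to send it over the network (a fourth copy).
        byte[] outgoing = JsonSerializer.SerializeToUtf8Bytes(docs);

        return (response.Length, json.Length, docs.Count, outgoing.Length);
    }
}
```

Calling `RequestPipelineSketch.Run(RequestPipelineSketch.FakeServerResponse())` returns the size at each stage. Because the fake documents are highly repetitive, the compressed payload is far smaller than a real response would be; the decompressed and re-serialized sizes are the ones that matter here.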
I had a hunch, and I created the following:
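    using System;
    using System.Net;
    using System.Net.Http;
    using System.Runtime;
    using System.Web.Http;

    [RoutePrefix("api/gc")]
    public class GcController : ApiController
    {
        [Route]
        public HttpResponseMessage Get()
        {
            // Ask the next blocking gen 2 collection to also compact the large object heap.
            GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
            GC.Collect(2);
            return Request.CreateResponse(HttpStatusCode.OK);
        }
    }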
This allows me to manually invoke a rude GC, and indeed, when running this, memory utilization dropped quite nicely. Of course, that isn’t something that you want to run in your production systems, but it is a very important diagnostic tool.
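If you want numbers rather than eyeballing Task Manager, the same rude GC can be wrapped in a small probe. This is a sketch of my own, not something from the customer's application: `RudeGcProbe`, `Snapshot`, `Measure` and `RudeCollect` are hypothetical names, and the only runtime calls used are `GC.GetTotalMemory`, `GC.Collect`, `GCSettings.LargeObjectHeapCompactionMode` and `Process.WorkingSet64`.

```csharp
using System;
using System.Diagnostics;
using System.Runtime;

public static class RudeGcProbe
{
    public struct Snapshot
    {
        public long ManagedBytes;     // what the GC thinks is currently allocated
        public long WorkingSetBytes;  // what the OS reports for the whole process
    }

    public static Snapshot Measure()
    {
        using (var process = Process.GetCurrentProcess())
        {
            return new Snapshot
            {
                ManagedBytes = GC.GetTotalMemory(false),
                WorkingSetBytes = process.WorkingSet64
            };
        }
    }

    // Forced, blocking, full collection that also compacts the large object heap once.
    public static void RudeCollect()
    {
        GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced, blocking: true);
    }
}
```

Take a `Measure()` before and after `RudeCollect()` and compare. If the numbers drop sharply, the memory was reclaimable garbage that the GC simply had not bothered to collect yet, which is a very different situation from a true leak.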
Next, I tried to do the following:
    [Route]
    public HttpResponseMessage Get()
    {
        var file = File.ReadAllText(@"D:\5mb.json");
        var deserializeObject = JsonConvert.DeserializeObject<Item>(file);
        return this.Request.CreateResponse(HttpStatusCode.OK,
            new { deserializeObject.item, deserializeObject.children });
    }
This involves no RavenDB code, but it has roughly the same memory pressure semantics as the customer code. Indeed, the problem reproduced quite nicely there as well.
So the issue is about a request that uses a lot of memory (including at least one big buffer), likely causing some fragmentation in the heap that drives memory utilization up. When the system has had enough, it will reclaim all of that, but unless there is active memory pressure, I’m guessing that it would rather leave things as they are until it has to pay that price.
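The "big buffer" part is easy to see in isolation. In .NET, arrays of 85,000 bytes or more are allocated on the large object heap, which the runtime reports as generation 2, so they are only reclaimed by a full collection. The sketch below shows just that mechanism; it does not prove that this is what happened in the customer's process, and `LargeBufferProbe` is a hypothetical name.

```csharp
using System;

public static class LargeBufferProbe
{
    // Allocates a buffer of the given size and reports which GC generation it starts in.
    public static int GenerationOfNewBuffer(int sizeInBytes)
    {
        var buffer = new byte[sizeInBytes];
        return GC.GetGeneration(buffer);
    }
}
```

`LargeBufferProbe.GenerationOfNewBuffer(1024)` returns 0, while `LargeBufferProbe.GenerationOfNewBuffer(5 * 1024 * 1024)` returns 2. A 5MB response buffer therefore skips the cheap gen 0 collections entirely and sits there until something triggers a gen 2 collection.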
More posts in "Production postmortem" series:
- (12 Dec 2023) The Spawn of Denial of Service
- (24 Jul 2023) The dog ate my request
- (03 Jul 2023) ENOMEM when trying to free memory
- (27 Jan 2023) The server ate all my memory
- (23 Jan 2023) The big server that couldn’t handle the load
- (16 Jan 2023) The heisenbug server
- (03 Oct 2022) Do you trust this server?
- (15 Sep 2022) The missed indexing reference
- (05 Aug 2022) The allocating query
- (22 Jul 2022) Efficiency all the way to Out of Memory error
- (18 Jul 2022) Broken networks and compressed streams
- (13 Jul 2022) Your math is wrong, recursion doesn’t work this way
- (12 Jul 2022) The data corruption in the node.js stack
- (11 Jul 2022) Out of memory on a clear sky
- (29 Apr 2022) Deduplicating replication speed
- (25 Apr 2022) The network latency and the I/O spikes
- (22 Apr 2022) The encrypted database that was too big to replicate
- (20 Apr 2022) Misleading security and other production snafus
- (03 Jan 2022) An error on the first act will lead to data corruption on the second act…
- (13 Dec 2021) The memory leak that only happened on Linux
- (17 Sep 2021) The Guinness record for page faults & high CPU
- (07 Jan 2021) The file system limitation
- (23 Mar 2020) high CPU when there is little work to be done
- (21 Feb 2020) The self signed certificate that couldn’t
- (31 Jan 2020) The slow slowdown of large systems
- (07 Jun 2019) Printer out of paper and the RavenDB hang
- (18 Feb 2019) This data corruption bug requires 3 simultaneous race conditions
- (25 Dec 2018) Handled errors and the curse of recursive error handling
- (23 Nov 2018) The ARM is killing me
- (22 Feb 2018) The unavailable Linux server
- (06 Dec 2017) data corruption, a view from INSIDE the sausage
- (01 Dec 2017) The random high CPU
- (07 Aug 2017) 30% boost with a single line change
- (04 Aug 2017) The case of 99.99% percentile
- (02 Aug 2017) The lightly loaded trashing server
- (23 Aug 2016) The insidious cost of managed memory
- (05 Feb 2016) A null reference in our abstraction
- (27 Jan 2016) The Razor Suicide
- (13 Nov 2015) The case of the “it is slow on that machine (only)”
- (21 Oct 2015) The case of the slow index rebuild
- (22 Sep 2015) The case of the Unicode Poo
- (03 Sep 2015) The industry at large
- (01 Sep 2015) The case of the lying configuration file
- (31 Aug 2015) The case of the memory eater and high load
- (14 Aug 2015) The case of the man in the middle
- (05 Aug 2015) Reading the errors
- (29 Jul 2015) The evil licensing code
- (23 Jul 2015) The case of the native memory leak
- (16 Jul 2015) The case of the intransigent new database
- (13 Jul 2015) The case of the hung over server
- (09 Jul 2015) The case of the infected cluster
Comments
So why would this show up in IIS, and not in IIS express?
soums, I don't actually know. I assume that this is related to the way IIS is actually sending data over the network, or how it manages the GC.
Interesting post :)
Same question as soums... Was it maybe that the IIS Express version being used was the 32-bit version, but when running under IIS it was running 64-bits instead? I personally use 64-bit IIS Express just to avoid another difference between my environment and production but VS tends to default to 32-bit and most devs don't change it (or know it can be changed) it seems.
It may not be the most relevant post, but reading your blog I appreciate your opinion. I came across Stackify a .NET APM, log management and error tracking tool and I was wondering if you had any experience with it, or have any opinion on it.
John, I'm not familiar with it.