Unlike previous posts in this series, this is actually something that happened to our own production server today. It resulted in our website being inaccessible for a a couple of hours, and like most such stories, its root cause is tremedously simple, and through a series of unfortunate accidents, it had escalated to a major issue.
First, this post is dedicted to this book, which should be read by any self respecting developer whose code is expected to hit production.
This post is also written a few hours only after the incident was resolved. Before we actually implemented anything except a temporary workaround. I’ll probably have another post in a couple of days to talk about the steps we are going to take to alleviate a repeat of this incident.
The incident started innocently enough, when one of the guys on the team discovered that the startup time of a certain instance jumped by a lot. Investigating into why he realized that the issue was extremely slow responses from our server. That was a cause of triple concern, actually. First, why are we accepting such slow responses instead of time limiting them? Second, why are we making a remote syncronous call during startup? And third, why on Earth is our server so slow?
Logging into the server, it didn’t take long to see what the problem was. The www.RavenDB.net website (the code that runs the RavenDB website, not RavenDB itself) was consuming a lot of CPU and quite a bit of memory. In a bit to restore the other services which reside on the same box, we reset the process. Our main concern at the time was to restore service as soon as possible, and we planned on investigating further through the logs.
However, in a few minutes, the www.RavenDB.net website started consuming more and more resources. At that point, we started considering a DoS attach of some sort and looked a the logs. The logs did show a very high number of requests, much more than I would expect. But looking further into them, it looked like they were mostly bots of various kinds indexing our site.
Considering that this might be the case of Google hammering us, we configured a robots.txt on the site and waited to see if this would have an impact. It didn’t.
The next step was to take a process dump of the process, and then analyze it. During this period, we had to shut down www.RavenDB.net because it was killing all other services running on the server.
Looking at the dump in WinDBG, we started with the obvious commands.
- !runaway – to find out the thread cpu times
- switching to the busiest threads
- !clrstack - to see what it is doing
Honestly, this is a much nicer way of looking at this, though:
As you can see, the threads are currently actually doing parsing of a Razor template, and seems to be doing that on a fairly continous basis, consuming all system resources.
At that point, I started getting concerned to the well being of the poor guy’s inbox as a result of this code. That was the point where we actually did what should have probably been our first action, and looked at the error log of the website.
Previously, we looked at the live metrics and the request log, but didn’t consider looking into the error log for the system. The error log for the website, for today only, was 6GB in size, and was pretty full of errors such as:
- error: (221, 88) 'HibernatingRhinos.Orders.Common.EmailProcessing.EmailTemplates.RavenDBWebsite.Models.DownloadQuestionMailInput' does not contain a definition for 'Unsubscribe' and no extension method 'Unsubscribe' accepting a first argument of type
And at that point, we had enough to point a suspicious finger. We have an email that we send out, and we used to have a valid template. At some point, the code was changed and the Unsubscribe was removed. Nothing broke because the template is just a file, not actually compiling code. However, in production, when we tried to send the email, Razor would parse the text, fail compilation because of the missing member, and basically thorw a hissy fit.
Update: We investigated further, and it looks like the following was the actual “solution” to the outage:
The “solution” is in quotes, becasue this fixes the problem, but we need to still implement steps to ensure that something like that doesn’t repeat.
Unforuntately, at that point, we would consider this email as failing, and move on to the next one. That next one would also fail, and so would the next one, etc. Because all of them failed, they would get picked up again next time this run.
Once we knew where the problem was. The workaround was to deploy a version with no email sending. For this weekend, that will work. But come Sunday, someone is going to go over this piece of code with a veyr fine comb. I’ll post more about it once this actually roll around.
More posts in "Production Postmortem" series:
- (23 Mar 2020) high CPU when there is little work to be done
- (21 Feb 2020) The self signed certificate that couldn’t
- (31 Jan 2020) The slow slowdown of large systems
- (07 Jun 2019) Printer out of paper and the RavenDB hang
- (18 Feb 2019) This data corruption bug requires 3 simultaneous race conditions
- (25 Dec 2018) Handled errors and the curse of recursive error handling
- (23 Nov 2018) The ARM is killing me
- (22 Feb 2018) The unavailable Linux server
- (06 Dec 2017) data corruption, a view from INSIDE the sausage
- (01 Dec 2017) The random high CPU
- (07 Aug 2017) 30% boost with a single line change
- (04 Aug 2017) The case of 99.99% percentile
- (02 Aug 2017) The lightly loaded trashing server
- (23 Aug 2016) The insidious cost of managed memory
- (05 Feb 2016) A null reference in our abstraction
- (27 Jan 2016) The Razor Suicide
- (13 Nov 2015) The case of the “it is slow on that machine (only)”
- (21 Oct 2015) The case of the slow index rebuild
- (22 Sep 2015) The case of the Unicode Poo
- (03 Sep 2015) The industry at large
- (01 Sep 2015) The case of the lying configuration file
- (31 Aug 2015) The case of the memory eater and high load
- (14 Aug 2015) The case of the man in the middle
- (05 Aug 2015) Reading the errors
- (29 Jul 2015) The evil licensing code
- (23 Jul 2015) The case of the native memory leak
- (16 Jul 2015) The case of the intransigent new database
- (13 Jul 2015) The case of the hung over server
- (09 Jul 2015) The case of the infected cluster