An epic bug story
This is a fictitious example of such a bug, but it will probably sound familiar.
Jane was sleeping soundly around 3 AM when the phone started ringing off the hook. Dazed and not quite in full possession of her faculties, she picked up the phone and heard Tom from Operations screaming that the Northwind System was gone. It wasn't just one application that died, it was all of them. Operations had no idea what was going on, every CxO in the company was pissed, and they were bleeding money and reputation by the second.
While driving to work, Jane took part in a conference call trying to figure out what happened. The following things became clear:
- There had been no changes to the production environment in the last week.
- The last change was in the reporting system, which had been humming along for the last two weeks or so.
- Database servers were pegged by a ridiculous number of queries from all the applications.
- Cache servers were up and functioning.
- The production logs' verbosity was turned up, and it seemed that every single query was a cache miss.
It became clear that something had caused every single application to skip the cache and go directly to the database, but no one knew what.
Upon arriving at the office, they ran into more puzzlement, frustration, and anger. The problem was definitely related to the cache, but there was absolutely no explanation for it. The cache code had been written several years earlier; it was a pretty complex piece of code, and it had been tested in production for almost as long.
The first suspicion fell on some sort of communication breakage between the application servers and the cache servers, but that was quickly ruled out. Looking at the logs, it was clear that the problem was 100% cache misses causing every single query to hit the database, which responded by throwing up its hands and sulking, refusing to talk to anyone.
The problem was absolutely not reproducible in the test or development environments.
The team decided to appropriate a single production server and run their tests on it. In production, it was absolutely reproducible, but no one could figure out why. Network traces from that server showed that it never even tried to connect to the cache servers for reading queries; however, it would happily connect to the cache servers for writing them.
The root problem had been found, then: somehow, there was a problem in reading data from the cache servers. Three developers and at least one IT guy were willing to swear on the religion of your choice that this was absolutely not possible, but the evidence suggested otherwise.
Installing a kernel-level network trace was suggested. After installing the trace and rebooting the machine, the team hovered over the admin's head while they started up the application again and watched the network trace from the kernel. This time, however, they could see it reading from the cache servers.
Not only that, but it looked like the application was functioning normally! Installing the kernel trace had fixed the problem! Since production was down, they tried the same method on a second server, and that one came back up normally as well. Every single IT guy in the building was suddenly busy installing a kernel network trace, and lo and behold, the entire system came back up.
By that time, it was close to 4 PM the following day, and just about everyone was exhausted. The CTO sent everyone home and scheduled a meeting at 9 AM to discuss what had happened.
At 9 AM the following day, no one could really suggest a working hypothesis for what had actually happened. The Administration Module Team Lead suggested that the system had run into some threading bug in the network stack and that installing the kernel-level trace had changed the timing just enough to make the problem go away. The Reporting Module Team Lead snapped that it was not possible to have a network bug that only affected reading from a particular socket. The IT group insisted that there had been no modifications to the production environment; all the firewall rule changes in the last month were reviewed, and none of them could have caused this.
The CTO closed the meeting by quoting the damage this downtime had done to the company and directed everyone to investigate further. No actual smoking gun could be found, and eventually the matter was dropped. One of the results of the incident was an added emphasis on the production worthiness of all parts of the system. Traces and logs were enhanced, circuit breakers were introduced, and in general the team thought they were pretty safe. The kernel trace had fixed the problem, of course, but next time they would have a lot more information about what was going on.
Four months later…
By this time, the production issue that had brought down the site was almost forgotten. The system had been running along happily for months, the changes to make it more robust in production were in place, and everyone was happy.
Until the 3 AM call on Thursday night. While driving to work on empty roads, Jane had a strong sense of déjà vu. Again, the problem was that the system was hammering the database; again, the logs showed 100% cache misses. Again, there had been no significant change to the production environment.
This time, the team already knew who the culprit was, and the caching system was given a thorough review. No one could really find anything wrong with it, but even more logging was added, and the change was pushed to production, where it didn't seem to do any good at all.
Trying to ignore the CxO screaming match in the nearby conference room, the Infrastructure team tried to figure out exactly what was wrong. Jane thought that the problem was with the cache invalidation logic. The caching system used a fairly complex scheme to decide whether a value in the cache was fresh enough or not.
Disabling that piece of the system risked stale data scenarios, but stale data was better than no data. They commented that part out and pushed the change to production. Unfortunately, it didn't change one iota; the system kept producing cache misses.
That was patently impossible, of course, but there it was. The IT group, in the meantime, came to the rescue. Trevor, the new guy on the team, decided to try deploying the application to a brand new server that was waiting to be provisioned. On the new server, the system worked, and that server did not have the kernel network trace installed.
Acting on instinct, they tried uninstalling the kernel network trace on one of the existing production servers. Like magic, that server came back to life and started working normally seconds after the reboot completed. A rolling uninstallation of the magical kernel trace fixed the issue.
But this was the second failure this year, and both could be traced to the same component. The development team was at a loss to explain what was going on. For the next month or so, no developer dared walk by the executives' part of the building. External consultants were brought in to do an in-depth study of the code base, trying to find the problem. The consultants couldn't really find anything to blame, but they didn't trust the existing home-grown cache servers. The company bought a commercial cache package, to the tune of $2,000 per CPU, and replaced the caching backend.
For a while, it seemed like that would do it…
Jane woke up a few seconds before 3 AM, two months after the previous incident. It was Thursday again, and the phone was ringing. After briefly considering running away to the Sahara, she picked up the phone. It was the exact same scenario.
After arriving at the office, Jane didn't bother to join the already violent row in the conference room. She sat down at her desk, brought up the Caching Module, and added a logging statement after each line in the codebase. Then she called Operations and had them push the new caching module to production.
That change didn’t fix the problem, but it gave Jane an exception. And that exception just wasn’t possible.
System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at System.Array.InternalGetReference(Void* elemRef, Int32 rank, Int32* pIndices)
   at System.Array.GetValue(Int32 index)
   at Northwind.CachingModule.CacheSelector.GetServer()
   at Northwind.CachingModule.Cache.Get(string key)
That didn’t make any sort of sense, but at least she had a single method to check for.
private CacheServer GetServer()
{
    if (aliveAndWellServers.Count == 0)
        return null;

    // Pick a server based on the current tick count, to spread the load around.
    var server = aliveAndWellServers[Environment.TickCount % aliveAndWellServers.Count];
    if(server.IsAlive == false)
    {
        aliveAndWellServers.Remove(server);
        server.Dispose();
    }
    return server;
}
The idea was to spread the load between the cache servers, so that no single cache server would be hit repeatedly. Somehow, that method was failing, and when it failed, the caching system's generic error handler treated it as a downed cache server and produced a cache miss.
Jane pushed another update to production, this time adding pervasive logging to the GetServer method. The output said it all.
Alive and well servers: 4
Tick count: -137538094
Server index: -2
Looking at the documentation for Environment.TickCount, it all became clear. That property returns a negative number once the system has been running for roughly 25 days (the counter wraps after Int32.MaxValue milliseconds). When that happened, the cache implementation would fail, leading to an overload on the database and bringing the entire system down.
The reason the error showed up only sporadically was that the servers were occasionally rebooted, so you had to wait until enough time had passed since the last reboot. Installing and uninstalling the kernel trace required a reboot, which temporarily fixed the problem.
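For completeness, here is a minimal sketch of one possible fix, assuming the same aliveAndWellServers list as above: masking off the sign bit keeps the index non-negative even after TickCount wraps, and a dead server is no longer handed back to the caller.
private CacheServer GetServer()
{
    if (aliveAndWellServers.Count == 0)
        return null;

    // Environment.TickCount goes negative after roughly 25 days of uptime;
    // masking off the sign bit keeps the index inside [0, Count).
    int tick = Environment.TickCount & int.MaxValue;
    var server = aliveAndWellServers[tick % aliveAndWellServers.Count];

    if (server.IsAlive == false)
    {
        aliveAndWellServers.Remove(server);
        server.Dispose();
        // Pick another server instead of returning the one just disposed.
        return GetServer();
    }

    return server;
}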
An interesting final tidbit about this error. No one mentioned to the executives that all source control systems come with a blame feature.
Comments
Probable typo in the first paragraph: I guess you meant to say the system was gone, not done.
Why this story is fictitious: no one would have the courage to wake up the female member of the team in the middle of the night with a problem like that. Think of the repercussions. :)
For a moment I thought I was reading this at codinghorror.com
Good bug catch, but I wonder why they didn't reboot the machines before changing the kernel; at least, that is the universal solution in IT.
That's why one of the first things I used to do when I was flown in for these kinds of issues was to connect to the app with WinDBG and monitor for First Chance Exceptions. In fact, a quick story...
We had a similar call to what you had above. The CTO of a major company and the CEO of their development consulting firm were on the phone, along with their account manager from MSFT and myself as the tech lead. They were having a major issue with errors on their e-commerce site that were causing them to lose a lot of money. I suspected it was an application error, but the CTO was blaming us and the vendor, and the vendor was blaming us.
While we were on the conference call, I asked them to send me what they had deployed and when the issue started occurring. They sent it and told me. While they were yelling back and forth, I cracked open the code and saw at the top of one of the methods:
//MJF: Modified 10/2/2009 to handle XYZ scenario
10/2/2009 happened to be when the issue started occurring, and also happened to be the only comment with such a date. I broke into the conversation and asked the vendor if they had a developer with the initials MJF, because the code I saw could cause an error condition under a certain set of scenarios - which wasn't being handled. There was dead silence on the vendor side, and within 2 minutes they came back and said they had found the issue and would be resolving it.
It was one of the best feelings of my life, to find a perfect smoking gun. In this case, there wasn't the option to protect the vendor (I tried to be as gentle as possible), but at least they were able to find the issue and get it back up.
So, great story about knowing your framework, and about finding the root cause. I was amazed they didn't start looking for patterns during the second outage, but at least someone eventually did.
if(server.IsAlive == false)
first dispose server then return a reference to it?
Have you tried turning it off and on again?
That's why you should never restart a development machine. If you restart, you miss obvious bugs that happen on machines running for over 25 days ;)
Hmm.
Still not sure what Environment.TickCount has to do with the size of a scrollbar in a textbox, but sure is a neat story!
=)
Once we found that our development web server had run out of space: all the space had been eaten by a constantly growing event log. The log was full of exception messages about invalid web service calls.
We only found this bug because one developer was out sick, but his computer stayed running and one tricky page stayed open in his browser. In the normal situation, this page makes a request to a web service every minute. But then we changed the signature of the web service method, and the page went mad: if a request to the service failed, it would retry after one second, and it repeated this infinitely, mounting a DoS attack on our own server. After that, we changed the page to retry only after 5 minutes once a few attempts had failed.
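As a rough illustration of the policy that comment describes (a few quick retries, then a long pause), here is a small C# sketch with hypothetical names; the original page was client-side code, so this only shows the general shape of the idea.
using System;
using System.Threading.Tasks;

static class RetryPolicy
{
    // Hypothetical helper: a few quick retries, then a long back-off,
    // so a broken endpoint is not hit once per second forever.
    public static async Task<T> CallWithBackoffAsync<T>(Func<Task<T>> call)
    {
        const int quickAttempts = 3;
        var quickDelay = TimeSpan.FromSeconds(1);
        var longDelay = TimeSpan.FromMinutes(5);

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await call();
            }
            catch (Exception)
            {
                // First few failures: retry quickly. After that, wait a long
                // time before trying again.
                await Task.Delay(attempt < quickAttempts ? quickDelay : longDelay);
            }
        }
    }
}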
Cute story, but anyone knows that rebooting the server is the FIRST thing they would have tried. OK, maybe the second, after resetting IIS. So the whole thing is kinda disappointing when I got to the end.
Yup. Burn-in testing is a must for enterprise systems. It's not enough for the test team to verify the functionality delivered from the requirements; the entire system has to keep running for longer than a simple regression cycle.
PS. Oren, yes, I know I still owe you answers from Alex James - I haven't forgotten. Just in the middle of proposal land right now and it's killing me a little inside every day =D
In reality, once they figured out that rebooting solved the problem, they would have set up a schedule to reboot the servers every week and moved on. I know of large organisations that reboot their servers every 48 hours because of bugs like these.
At my university they repeatedly reboot the RADIUS servers controlling the Eduroam wi-fi, because apparently they leak some kind of handle, and once a handle is lost it cannot be reclaimed. The problem is that there is only a fixed number of handles, so after a certain number of lost handles, the servers refuse to let new users log on to the network.
It is pretty dumb, but it is one of those cases where we are using generic software, so the vendor cannot make a bug fix.
There are many horror stories regarding home-grown pseudo-random number generators (like TickCount % n, or something very similar); I wish people would use Random() instead of inventing their own pseudo-random generators or pseudo-ID generators.
I guess this is where the true difference between developers shows: those who try to write clever code and those who try to write solid code. I wonder whether all the microseconds shaved by using the tick count ever added up to pay for those downtimes.
This is not an epic bug, this is an epic fail. There are so many TickCount horror stories out there that any non-ignorant developer (especially one writing server applications) would see big warning signs over code that uses it.
The first thing: what happens if the system timer resolution changes (a new piece of hardware, a software update, or something else)? Then you've got a terribly biased pseudo-random number generator. And what is the distribution like in the first place? I wonder how anyone can think to use such a mechanism at all...
The second thing: such a micro-optimization won't have a measurable impact on the final application (an application that executes queries over the network), no matter how you try to measure it. Those microseconds are entirely shadowed by network latency.
Man, I hate clever hacks so much, I've been bitten by them quite a few times (hacks written by others).
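For what it's worth, a minimal sketch of the Random()-based selection suggested above, assuming the same list of healthy servers as in the post; a single shared Random is guarded by a lock because Random instances are not thread-safe.
// Sketch: pick a cache server with Random instead of TickCount % n.
private static readonly Random random = new Random();
private static readonly object randomLock = new object();

private CacheServer GetServer()
{
    if (aliveAndWellServers.Count == 0)
        return null;

    int index;
    lock (randomLock)
    {
        // Random.Next(maxValue) returns a value in [0, maxValue),
        // so the index is always inside the list bounds.
        index = random.Next(aliveAndWellServers.Count);
    }
    return aliveAndWellServers[index];
}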
I wonder why they didn't just try rebooting the production servers. The problem would have been solved immediately.
Moral of the story:
When in doubt, get Jane to look at the code.
These sorts of bugs are made by the same sorts of people who write code like this to prevent users from resizing forms:
//Handle resize event.
void Resize()
{
}
rather than just setting the form's border property to one of the fixed-size values and letting Windows handle it.
It's really just a lack of knowledge about the environment or framework being used to code the solution.
Re: Why didn't they just restart the Production Servers.
I've never worked anywhere where IT would reboot a Production Box unless it was a last resort.
So many people were asking for a reboot ... a reboot does not "fix" or help find the bug.
Reminds me of a joke ...
Nail Biting. I read through this story in one single sitting. ;-)
so the fix is to make a scheduled task which reboots the system at midnight every couple of days?
;)