Ayende @ Rahien

http://ayende.com
Copyright (C) Ayende Rahien 2004 - 2021 (c) 2026

MF commented on An epic bug story
so the fix is to make a scheduled task which reboots the system at midnight every couple of days? ;)
http://ayende.com/4270/an-epic-bug-story#comment20
Wed, 04 Nov 2009 07:55:04 GMT

HereBeDragon commented on An epic bug story
Nail Biting. I read through this story in one single sitting. ;-)
http://ayende.com/4270/an-epic-bug-story#comment19
Mon, 02 Nov 2009 12:42:34 GMT

Pop Catalin commented on An epic bug story
So many people were asking for a reboot ... a reboot does not "fix" or help find the bug. Reminds me of a joke ...
http://ayende.com/4270/an-epic-bug-story#comment18
Fri, 30 Oct 2009 02:37:20 GMT

Andrew commented on An epic bug story
Re: Why didn't they just restart the Production Servers. I've never worked anywhere where IT would reboot a Production Box unless it was a last resort.
http://ayende.com/4270/an-epic-bug-story#comment17
Thu, 29 Oct 2009 17:24:50 GMT

Rob commented on An epic bug story
These sorts of bugs are made by the same sorts of people who write code like this to prevent users from resizing forms.
    //Handle resize event.
    void Resize()
    {
        this.Width = 200;
        this.Height = 200;
    }
rather than just setting the form's border property to one of the fixed-size values and letting Windows handle it. It's really just a lack of knowledge about the environment or framework being used to code the solution.
http://ayende.com/4270/an-epic-bug-story#comment16
Thu, 29 Oct 2009 16:59:31 GMT

Joe commented on An epic bug story
Moral of the story: When in doubt, get Jane to look at the code.
http://ayende.com/4270/an-epic-bug-story#comment15
Thu, 29 Oct 2009 16:40:46 GMT

tobi commented on An epic bug story
I wonder why they didn't just try rebooting the production servers. The problem would have been solved immediately.
http://ayende.com/4270/an-epic-bug-story#comment14
Thu, 29 Oct 2009 14:51:44 GMT

Pop Catalin commented on An epic bug story
There are many horror stories regarding pseudo random number generators (like TickCount % n, or very similar). I wish people would use Random() instead of inventing their own pseudo random generators or pseudo ID generators. I guess this is where the true difference between developers is made: developers that try to write clever code and those that try to write solid code. I wonder whether all those microseconds shaved by using the tick count, added up, paid for those downtimes.

This is not an epic bug, this is an epic fail. There are so many TickCount horror stories out there that any non-ignorant developer (especially those writing server applications) would always see big warning signs over code that uses it.

The 1st thing: what happens if the system timer resolution changes? (like a new piece of hardware, a software update or something else) Then you've got a terribly biased pseudo random number generator. Or what's the distribution like in the first place? I wonder how anyone can think of using such a mechanism in the first place ...

The 2nd thing: such a micro-optimization won't have a measurable impact in the final application (an application that executes queries over the network) no matter how you try to measure it. Those microseconds will be entirely overshadowed by the network latency.

Man, I hate clever hacks so much, I've been bitten by them quite a few times (hacks written by others).
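A minimal sketch of the failure mode the comment above warns about, assuming the counter is a signed 32-bit millisecond count (as .NET's Environment.TickCount is) that wraps to a negative value after 2^31 ms, roughly 24.9 days of uptime. It is written in Python with the wraparound and the C#-style remainder simulated; tick_count, truncated_mod, naive_pick and random_pick are hypothetical names for illustration, not code from the original post.

```python
import random

MS_PER_DAY = 24 * 60 * 60 * 1000
DAYS_UNTIL_WRAP = 2**31 / MS_PER_DAY  # a little under 25 days


def tick_count(uptime_ms: int) -> int:
    """Simulate a signed 32-bit millisecond counter that wraps on overflow."""
    return (uptime_ms + 2**31) % 2**32 - 2**31


def truncated_mod(a: int, n: int) -> int:
    """Remainder whose sign follows the dividend, as in C-family languages.

    Python's own % always returns a non-negative result for positive n,
    so the C#-style behaviour has to be simulated.
    """
    r = abs(a) % n
    return -r if a < 0 else r


def naive_pick(servers: list, uptime_ms: int) -> str:
    """The 'clever' selection: tick count modulo the number of servers."""
    index = truncated_mod(tick_count(uptime_ms), len(servers))
    if index < 0:
        # A .NET list indexer would throw here; raise explicitly because
        # Python would silently accept a negative index.
        raise IndexError(f"negative index {index}")
    return servers[index]


def random_pick(servers: list, rng: random.Random) -> str:
    """The boring alternative: ask a real PRNG for an index in range."""
    return servers[rng.randrange(len(servers))]


if __name__ == "__main__":
    servers = ["db1", "db2", "db3"]
    print(f"counter wraps after {DAYS_UNTIL_WRAP:.2f} days")
    print(naive_pick(servers, 5))        # fine on a freshly booted machine
    print(random_pick(servers, random.Random()))
    try:
        naive_pick(servers, 2**31)       # first tick after the wrap
    except IndexError as exc:
        print("after the wrap:", exc)
```

With three servers, not every tick after the wrap fails, only those whose remainder is non-zero, which is one reason a bug like this can look intermittent.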
http://ayende.com/4270/an-epic-bug-story#comment13
Thu, 29 Oct 2009 10:32:12 GMT

Philip Løventoft commented on An epic bug story
At my university they repeatedly reboot the RADIUS servers that control the Eduroam Wi-Fi, because apparently the software leaks some kind of handle, and once it has lost a handle it cannot get a new one. The problem is that there is only a fixed number of handles, so after a certain number have been lost, it refuses to let new users log on to the network. It is pretty dumb, but it is one of those cases where we are using generic software so the vendor cannot make a bug fix.
http://ayende.com/4270/an-epic-bug-story#comment12
Thu, 29 Oct 2009 09:21:08 GMT

Liam McLennan commented on An epic bug story
In reality, once they figured out that rebooting solved the problem they would have set up a schedule to reboot the servers every week and moved on. I know of large organisations that reboot their servers every 48 hours because of bugs like these.
http://ayende.com/4270/an-epic-bug-story#comment11
Wed, 28 Oct 2009 20:10:34 GMT

Jimmy Zimmerman commented on An epic bug story
>That's why you should never restart a development machine. If you restart, you miss obvious bugs that happen on machines running over 25 days ;)
Yup. Burn-in testing is a must for enterprise systems. It's not enough for the test team to verify the functionality delivered against the requirements; the entire system also has to keep running for longer than a simple regression cycle.
PS. Oren, yes I know I still owe you answers from Alex James - I haven't forgotten.
Just in the middle of proposal land right now and it's killing me a little inside every day =D
http://ayende.com/4270/an-epic-bug-story#comment10
Wed, 28 Oct 2009 18:20:39 GMT

Reboot commented on An epic bug story
Cute story, but anyone knows that rebooting the server is the FIRST thing they would have tried. OK, maybe the second, after resetting IIS. So the whole thing was kinda disappointing when I got to the end.
http://ayende.com/4270/an-epic-bug-story#comment9
Wed, 28 Oct 2009 17:22:33 GMT

Andrey Titov commented on An epic bug story
Once we found that our development web server had run out of disk space, all of it eaten by a constantly growing event log. The log was full of exception messages about invalid web service calls. We found the bug only because one developer was out sick while his computer stayed running with one tricky page left open in his browser. Normally this page sends a request to the web service every minute. But at some point we changed the signature of a web service method, and the page went mad: when a request to the service failed, it retried after one second, and it repeated this indefinitely, effectively a DoS attack on the server. After that we changed the page to retry only after 5 minutes once a few attempts have failed.
http://ayende.com/4270/an-epic-bug-story#comment8
Wed, 28 Oct 2009 14:49:40 GMT

Nicholas Piasecki commented on An epic bug story
Hmm. Still not sure what Environment.TickCount has to do with the size of a scrollbar in a textbox, but sure is a neat story! =)
http://ayende.com/4270/an-epic-bug-story#comment7
Wed, 28 Oct 2009 14:34:58 GMT

configurator commented on An epic bug story
That's why you should never restart a development machine.
If you restart, you miss obvious bugs that happen on machines running over 25 days ;)
http://ayende.com/4270/an-epic-bug-story#comment6
Wed, 28 Oct 2009 13:37:10 GMT

Roy commented on An epic bug story
Have you tried turning it off and on again?
http://ayende.com/4270/an-epic-bug-story#comment5
Wed, 28 Oct 2009 12:24:13 GMT

Dan commented on An epic bug story
    if(server.IsAlive == false)
    {
        aliveAndWellServes.Remove(server);
        server.Dispose();
    }
    return server;
first dispose server then return a reference to it?
http://ayende.com/4270/an-epic-bug-story#comment4
Wed, 28 Oct 2009 12:02:35 GMT

Cory Foy commented on An epic bug story
That's why one of the first things I used to do when I was flown in for these kinds of issues was to connect to the app with WinDBG and monitor for First Chance Exceptions.

In fact, a quick story... We had a similar call to what you had above. The CTO for a major company and the CEO of their development consultancy were on the phone, along with their account manager from MSFT and myself as the tech lead. They were having a major issue with errors on their e-commerce site that was causing them to lose a lot of money. I suspected it was an application error, but the CTO was blaming us and the vendor, and the vendor was blaming us.

While we were on the conference call, I asked them to send me what they had deployed and to tell me when the issue started occurring. They sent it and told me. While they were yelling back and forth, I cracked open the code and saw at the top of one of the methods:
    //MJF: Modified 10/2/2009 to handle XYZ scenario
10/2/2009 happened to be when the issue started occurring, and also happened to be the only comment with such a date.
I broke into the conversation and asked the vendor if they had a developer with the initials MJF, because the code I saw could cause an error condition under a certain set of scenarios - which wasn't being handled. There was dead silence on the vendor side, and within 2 minutes they came back and said they had found the issue and would be resolving it.

It was one of the best feelings of my life, to find a perfect smoking gun. In this case, there wasn't the option to protect the vendor (I tried to be as gentle as possible), but at least they were able to find the issue and get it back up.

So, great story about knowing your framework, and about finding the root cause. I was amazed they didn't start looking for patterns during the second outage, but at least someone eventually did.
http://ayende.com/4270/an-epic-bug-story#comment3
Wed, 28 Oct 2009 11:25:59 GMT

jokin commented on An epic bug story
For a moment I thought I was reading this at codinghorror.com. Good bug catch, but I wonder why they didn't reboot the machines before changing the kernel; at least, that is the universal solution in IT.
http://ayende.com/4270/an-epic-bug-story#comment2
Wed, 28 Oct 2009 11:17:42 GMT

PeterFox commented on An epic bug story
Probable typo in the first paragraph: I guess you meant to say the system was gone, not done.
Why this story is fictitious: no one would have the courage to wake up the *female* member of the team in the middle of the night with a problem like that. Think of the repercussions. :)
http://ayende.com/4270/an-epic-bug-story#comment1
Wed, 28 Oct 2009 11:12:16 GMT