The reverse correlation between size of change and length of investigation
Something that I have noticed is that there is a strong reverse correlation between how long it takes to resolve a problem and the size of the change. In other words, the more time you spend on investigating an issue, the less code will be required to fix the issue.
Case in point, we just closed an issue that took one of the best guys in the team almost a month of investigation to fix. The size of the change? 3 lines of code. My personal best is 15 man-weeks over a 3-week period, with 5 people heads down trying to resolve a problem that ended up being a missing ToList() call.
This is usually when there are race conditions, hardware or very long test cycles involved. In this case, this was a problem that could only be reproduced on ARM devices with slow I/O and a particular race condition after we created a very big database.
Thinking about this, it makes sense. The more time the investigation takes, the more things you rule out, so eventually you end up with something subtle that doesn’t work. It makes sense, but it can be frustrating for the developer: “I spent all this time, and that is the result?”
What is your best bug story?
Comments
Earlier this year, I was documenting a large SSIS package that has been largely ignored for far too long. Took about a month to document. About a week in, I noticed it was performing a particular calculation wrongly. Badly wrongly. As I continued to work through the package, I kept being needled by the knowledge of this bug and wondering "why has nobody reported this?". Until I reached the end and found that all of the interim work was thrown away and the data was re-extracted and the correct calculation then performed and used.
So, about 3 weeks of tracking a bug and it resulted in 0 changes being made (other than a recommendation that, were it not for the fact we plan to replace this package in the medium term, a lot of the code there can be removed). And at least the data flow is now documented.
I remember a C++ project we were working on in the mid 90s. It was crashing sometimes, sometimes not at all, sometimes immediately, really frustrating ... We hunted the bug for a week: reading, re-reading, dissecting, bisecting, debugging the code, but none of us could pinpoint the problem. I do not recall how we eventually found it, but there was a for loop that ran one element too far ...
You can imagine my excitement when I heard that C# would include a foreach loop. C# just had to be much better than C++! 😜
// Ryan
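The code from that project is not shown, so the following is only a sketch of the kind of mistake described, assuming the classic `<=` where `<` was meant. The names `iterations_with_off_by_one` and `sum` are hypothetical. The first function just counts iterations instead of touching memory, so it is safe to run; the second uses a range-based for, the C++11 counterpart of foreach, where there is no bound to get wrong.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: counts how many times the loop body runs when the
// bound is written with <= instead of <.
std::size_t iterations_with_off_by_one(std::size_t size) {
    std::size_t count = 0;
    for (std::size_t i = 0; i <= size; ++i) {  // bug: should be i < size
        ++count;  // a real body would access element i here, one past the end
    }
    return count;
}

// Range-based for: the loop visits each element exactly once, with no index.
int sum(const std::vector<int>& values) {
    int total = 0;
    for (int v : values) {
        total += v;
    }
    return total;
}
```

For a container of 3 elements the first loop runs 4 times. Reading one element past the end is undefined behavior, which would fit a crash that shows up only sometimes.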
In some cases, it's because the investigation was long and arduous that the fix can be small. I'm sure every developer's encountered bugs where the quick fix involves modifying various things to try to quench the breakage, but deeper investigation eventually reveals some subtle error which requires only a one-line (or one-character!) fix. Usually, that subtle error had a bunch of other subtler side-effects that the aforementioned quick fix would've done nothing to prevent!
I'm usually suspicious of large, far-reaching bugfixes. They often indicate that the real problem has not been understood yet. Bugfixes should be minimal, specific, correct, and testable. Armouring the system against similar issues in future should really be a separate commit...
In 2002, in a programming course for beginners, a student wrote a loop like the one below.
The instructor had a very hard time finding out why this loop was only printing "10, ". After all these years this one is still my favorite.
Sorry, of course "0, ".
Was the variable i scoped to the for block in C/C++? I guess it was not, so it should print "10, " 😁
// Ryan
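The student's loop is not shown, so this is a hypothetical reconstruction rather than the original code. It assumes the bug was a stray semicolon after the for header and that the counter was declared before the loop, which is what the scoping question above turns on. The function name `loop_with_stray_semicolon` is made up for the example, and it returns the text instead of printing it so the result is easy to check.

```cpp
#include <string>

// Hypothetical reconstruction; the actual code from the course is not shown.
std::string loop_with_stray_semicolon() {
    std::string out;
    int i;  // assumed: declared outside the loop, so it is still visible after it
    for (i = 0; i < 10; i++);  // the stray ';' is the entire loop body
    {
        // This block is not part of the loop. It runs once, after i reaches 10.
        out += std::to_string(i) + ", ";
    }
    return out;
}
```

Under those assumptions the function returns "10, ". Had the counter been declared inside the for header, standard C++ would reject the block because i would be out of scope there.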
Maybe not my best but the freshest in my mind because I'm stuck in such a situation RIGHT NOW! I've spent 7 days so far on a bug where running our app in Japanese will fail to show the IME (Input Method Editor) UI in one small area of editing a script using some editor code derived from Scintilla. It's a WPF application and I've narrowed it down to this: it only shows the bad behavior if focus was previously on a push button or a toggle (check/radio) button. Focusing the editor from any other control? The IME works fine. So, it seems to be an issue with ButtonBase and I can see code in the WPF ButtonBase which forces the IME to be hidden when focus is in these controls, yet it affects things when focus moves to the NEXT control (my user control). So the spelunking and trial and error continue through FocusScopes, InputMethod, etc...
I'm at that stage where I can craft a workaround, but the workaround is nearly worse than the original issue.
Karsten, I missed the error _twice_. But this should be a compilation error, no?
It's not a compilation error in C++; it's used in code like:
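The commenter's snippet is not shown. As one illustration of a deliberately empty loop body, here is a sketch of a hand-written string length function in which all the work happens in the for header. The name `length_of` is hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: length of a NUL-terminated string.
std::size_t length_of(const char* text) {
    const char* end = text;
    for (; *end != '\0'; ++end);  // intentionally empty body: the header does the scan
    return static_cast<std::size_t>(end - text);
}
```

Because this form is legal and useful, the compiler has to accept the accidental version too.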
In D, contrariwise, you must indicate an empty block with {} instead of just ; so it is a compilation error.
My best bug story is almost unbelievable, but I swear it's true and accurate.
I joined a small company providing mail house print archiving software, and was immediately tasked with fixing a rare server crash which they had been trying to fix for almost 2 years.
It took me 18 months, and I fixed dozens of other bugs and performance bottlenecks along the way, before eventually settling into a routine of load testing the software every weekend and then spending most of Monday looking for patterns that predicted the crash (which normally occurred after 30-40 hours of high load).
Finally, after 18 months of solid searching, I fixed the bug by deleting 207 lines of mostly assembly code and replacing them with nothing.
You read that correctly - it took me 18 months to discover that deleting 207 lines of code permanently fixed a 3-year-old bug!
After that, my weekly assignment in the development meetings was to find more code to delete!
P.S. The code was a custom written thread dispatcher that had been required with an earlier version of Delphi, but was no longer required.
Back in college I worked for the College in IT. A good fraction of the staff were students working part time.
After about a year another older student graduated and was handing off unfinished work before leaving. She handed me a thick folder that documented a bug. She told me that she had been given it 2 years ago when another student had handed off to her. That student had worked on it for over a year.
The bug was a rare miscalculation of grade point average. It was rare and noticeable when it happened so manual checks had been put in place.
By pure accident I found it almost 2 years later when I worked on a program that I hadn't looked at before and saw the grade point average calculation code duplicated but not quite the same. It was only a few characters different. I would have factored it out to a shared implementation but the system was slated for replacement the next year.