We recently had to deal with a stress test that was failing (very) occasionally. We looked into that, and we figured out that this was actually exposing a high severity bug. So we looked into it pretty seriously.
And we keep hitting a wall.
There are a few reasons why you would have a non-reproducible error. The most common issue is if you have a race. For example, if two threads are modifying the same variable without proper synchronization, this will cause those kind of symptoms. We have a lot of experience in dealing with that, and all the signs were there. But we still hit a wall.
You see, the problem was that the entire section of the code we were looking at was protected by a lock. There was no possibility of an error because of threading, since this whole part of the code just couldn’t run concurrently anyway.
So why was it failing only occasionally? If it is single threaded, it should be predictable. In fact, the reason there was a lock there, instead of the more complex merge operations we used to have, was specifically to support reproducibility. The other kind of issue that can create this sort is I/O (which has its own timing issues), but the error happened in a part of the code that was operating on internal data structures, so it wasn’t that.
Eventually we figured it out. This was a long background operation, and because we needed to have the lock in place, we had a piece of code similar to this:
Because this is a long running operation, under lock, we do this in stages, and make sure that other things can run while we do that. But this was exactly what introduced the variability in the test results, and that made it so random and annoying. Once we figured that this was the cause for the difference, all we had to do was write the proper log of operations, and execute it again.
The rest was just finding out which of the 200,000 transactions executed actually caused the problem, mere detail work .