There are at least 3 puns in the title of this blog post. I’m sorry, but I’m writing this after several days of tracking an impossible bug. I’m actually writing a set of posts to wind down from this hunt, so you’ll have to suffer through my more prosaic prose.
This bug is the kind that leaves you questioning your sanity after days of pursuit, the kind that I’m sure I’ll look back on and blame for any future grey hair I have. I’m going to have another post talking about the bug since it is such a doozy. In this post, I want to talk about the general approach I take when dealing with something like this.
Beware, this process involves a lot of hair-pulling. I’m saving that for when the real nasties show up.
The bug in question was a race condition that defied easy reproduction. It didn’t show up consistently—sometimes it surfaced, sometimes it didn’t. The only “reliable” way to catch it was by running a full test suite, which took anywhere from 8 to 12 minutes per run. If the suite hung, we knew we had a hit. But that left us with a narrow window to investigate before the test timed out or crashed entirely. To make matters worse, the bug was in new C code called from a .NET application.
New C code is a scary concept. New C code that does multithreading is an even scarier concept. Race conditions there are almost expected, right?
That means the feedback cycle is long. Any attempt we make to fix it is going to be unreliable - "Did I fix it, or did it just not happen this run?" - and each run gives us very little information to go on. The first challenge was figuring out how to detect the bug reliably.
Using Visual Studio as the debugger was useless here—it only reproduced in release mode, and even with native debugging enabled, Visual Studio wouldn’t show the unmanaged code properly. That left us blind to the C library where the bug resided. I’m fairly certain that there are ways around that, but I was more interested in actually getting things done than fighting the debugger.
We got a lot of experience with WinDbg, a low-level debugger and a real powerhouse. It is also about as friendly as a monkey with a sore tooth and an alcohol addiction. The initial process was all about trying to reproduce the hang and then attach WinDbg to it.
Turns out that we never actually generated PDBs for the C library. So we had to figure out how to generate them, then how to carry them all the way from the build to the NuGet package to the deployment for testing - just to maybe reproduce the bug again. Only then could we see what area of the code we were even in.
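For reference, the MSVC side of this is mostly two switches: compile with /Zi so the compiler emits debug information, and link with /DEBUG so a .pdb is produced even for an optimized release build. A rough sketch (the file names are made up):

cl /O2 /Zi /c ring.c
link /DLL /DEBUG ring.obj

From there it is plumbing: ship the .pdb alongside the .dll in the NuGet package, and point WinDbg's .sympath at wherever it ends up on the test machine.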
Getting WinDbg attached is just the start; we need to sift through the hundreds of threads running in the system. That is where we actually started applying the proper process for this.
This piece of code is stupidly simple, but it is sufficient to reduce “what thread should I be looking at” from 1 - 2 minutes to 5 seconds.
// Give this thread a name we can spot instantly in the debugger's thread list
SetThreadDescription(GetCurrentThread(), L"Rvn.Ring.Wrkr");
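For context, that call sits at the very top of the worker's entry point. A minimal sketch, where the entry point's name and body are my assumption rather than the actual code:

#include <windows.h>

// Sketch only: name the thread as the very first thing it does, so a hang in
// this code shows up under a recognizable label in the debugger's thread list.
static DWORD WINAPI ring_worker_entry(LPVOID arg)
{
    (void)arg; // unused in this sketch

    SetThreadDescription(GetCurrentThread(), L"Rvn.Ring.Wrkr");

    // ... the actual worker loop goes here ...
    return 0;
}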
I had the thread that was hanging, and I could start inspecting its state. This was a complex piece of code, so I had no idea what was going on or what the cause was. This is when we pulled the next tool from our toolbox.
// Make the failure impossible to miss: beep forever until someone notices and attaches a debugger
void alert() {
    while (1) {
        Beep(800, 200); // 800 Hz tone for 200 ms
        Sleep(200);
    }
}
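A small variant that can save a round trip (my addition, not the original code): if a debugger happens to be attached already, break into it on the spot instead of waiting for a human to hear the beeping.

#include <windows.h>

// Variant of alert() - an assumption on my part, not the actual code:
// break straight into an attached debugger, otherwise fall back to beeping.
void alert_or_break(void) {
    while (1) {
        if (IsDebuggerPresent())
            DebugBreak(); // stop right here, with the broken state still intact
        Beep(800, 200);
        Sleep(200);
    }
}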
This isn’t a joke; it is a genuinely important part of the process. In WinDbg, we noticed some signs in the data that the code was working on, indicating that something wasn’t right. It didn’t make any sort of sense, but it was there. Here is an example:
enum state
{
    red,
    yellow,
    green
};

enum state _currentState;
And when we look at it in the debugger, we get:
0:000> dt _currentState
Local var @ 0x50b431f614 Type state
17 ( banana_split )
That is beyond a bug; that is a truly invalid scenario. But it also meant that I had something I could act on. I started adding things like this:
if (_currentState != red &&
    _currentState != yellow &&
    _currentState != green) {
    alert();
}
The end result of this is that instead of having to wait & guess, I would now:
- Be immediately notified when the issue happened.
- Inspect the problematic state earlier.
- Hopefully glean some additional insight so I can add more of those things.
With this in place, we iterated. Each time we spotted a new behavior hinting at the bug’s imminent trigger, we put another call to the alert function to catch it earlier. It was crude but effective—progressively tightening the noose around the problem.
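To make that concrete, the checks we kept adding were all variations on the same pattern. Here is a sketch of it as a macro - the macro itself is my formulation, not the actual code:

#include <stdio.h>

// Hypothetical helper capturing the pattern described above: if an invariant
// we "know" must hold is violated, report where it broke and start alerting.
#define ALERT_UNLESS(cond)                                      \
    do {                                                        \
        if (!(cond)) {                                          \
            printf("invariant broken: %s at %s:%d\n",           \
                   #cond, __FILE__, __LINE__);                  \
            alert();                                            \
        }                                                       \
    } while (0)

// Example usage, mirroring the state check shown earlier:
// ALERT_UNLESS(_currentState == red ||
//              _currentState == yellow ||
//              _currentState == green);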
Race conditions are annoyingly sensitive; any change to the system—like adding debug code—alters its behavior. We hit this hard. For example, we’d set a breakpoint in WinDbg, and the alert function would trigger as expected. The system would beep, we’d break in, and start inspecting the state. But because this was an optimized release build, the debugging experience was a mess. Variables were often optimized away into registers or were outright missing, leaving us to guess what was happening.
I resorted to outright hacks like this function:
__declspec(noinline) void spill(void* ptr) {
    volatile void* dummy = ptr;
    dummy; // Ensure dummy isn't flagged as unused
}
The purpose of this function is to force the compiler to assign an address to a value. Consider the following code:
if (work->completed != 0) {
    printf("old_global_state : %p, current state: %p\n",
           old_global_state, handle_ptr->global_state);
    alert();
    spill(&work);
}
Because we are passing a pointer to the work variable to the spill function, the compiler cannot keep it only in a register and must give it a slot on the stack. That makes it much easier to inspect in the debugger, of course.
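A related knob, which I mention as an option rather than something we used here, is MSVC's optimize pragma: it turns optimization off for just the functions you are inspecting, so locals keep their stack slots and show up properly in the debugger.

// Assumption: a related trick, not what we actually did in this hunt.
// Disable optimization only for the function under the microscope.
#pragma optimize("", off)
static void process_work(struct work_item* work) // hypothetical function
{
    // ... the code being debugged, compiled without optimizations ...
}
#pragma optimize("", on)

It carries the same risk as any other modification in this hunt: changing the generated code changes the timing.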
Unfortunately, adding those spill calls led to the problem being “fixed” - we could no longer reproduce it. Far more annoyingly, any time we added any sort of additional code to try to narrow down where this was happening, we had a good chance of either moving the behavior somewhere completely different or masking it entirely.
Here are some of our efforts to narrow it down, if you want to see what the gory details look like.
At this stage, the process became a grind. We’d hypothesize about the bug’s root cause, tweak the code, and test again. Each change risked shifting the race condition’s timing, so we’d often see the bug vanish, only to reappear later in a slightly different form. The code quality suffered—spaghetti logic crept in as we layered hacks on top of hacks. But when you’re chasing a bug like this, clean code takes a back seat to results. The goal is to understand the failure, not to win a style award.
Bug hunting at this level is less about elegance and more about pragmatism. As the bug gets more elusive, code quality and any other structured approach to your project fall by the wayside. The only thing on your mind is: how do I narrow it down? How do I get this chase to end?
Next time, I’ll dig into the specifics of this particular bug. For now, this is the high-level process: detect, iterate, hack, and repeat. No fluff, just the reality of the chase. The key with any of these bugs is to keep narrowing the reproduction down to something you can trigger in a reasonable amount of time.
Once that happens, once you can hit F5 and get results, that is when you can start actually figuring out what is going on.