Throw away the tests, I’m debugging this in production mode

time to read 2 min | 218 words

Prismatic-Sketched-Brain-Silhouette-300pxToday I had a really strange revelation. We had an issue that related to race conditions in a distributed group, and we could just not figure out what was going on from the tests.

Then we switch to a production mode, where each node was a separate process, and we debugged one of them. And it was easy, and it was obvious, and everything just worked.  That was a profound revelation.

What happened was that we build our system so we can service them in production, that includes the internal design and exposed instrumentation data that we have. When we were trying to figure things out from the tests, we were running everything in a single process, and the act of trying to debug a single thread would cause all other threads to stop. That would trigger cascade behaviors (since timeout would be fired, and the cluster would move into recovery mode).

With us debugging just a single node in the cluster, the rest of the cluster just thought it was slow, it didn’t have any impact, but allow us to observe the behavior of the system very easily. The solution was quite obvious once we got into that stage.