Then we switch to a production mode, where each node was a separate process, and we debugged one of them. And it was easy, and it was obvious, and everything just worked. That was a profound revelation.
What happened was that we build our system so we can service them in production, that includes the internal design and exposed instrumentation data that we have. When we were trying to figure things out from the tests, we were running everything in a single process, and the act of trying to debug a single thread would cause all other threads to stop. That would trigger cascade behaviors (since timeout would be fired, and the cluster would move into recovery mode).
With us debugging just a single node in the cluster, the rest of the cluster just thought it was slow, it didn’t have any impact, but allow us to observe the behavior of the system very easily. The solution was quite obvious once we got into that stage.