I’m doing a pretty major refactoring inside of RavenDB right now. I was able to finish a bunch of work and submitted things to the CI server for testing. RavenDB has several layers of tests, which we run depending on context.
During development, we’ll usually run the FastTests. About 2,300 tests are being run to validate various behaviors for RavenDB, and on my machine, they take just over 3 minutes to complete. The next tier is the SlowTests, which run for about 3 hours on the CI server and run about 26,000 tests. Beyond that we actually have a few more layers, like tests that are being run only on the nightly builds and stress tests, which can take several minutes each to complete.
In short, the usual process is that you write the code and write the relevant tests. You also validate that you didn’t break anything by running the FastTests locally. Then we let CI pick up the rest of the work. At the last count, we had about 9 dedicated machines as CI agents and given our workload, an actual full test run of a PR may complete the next day.
I’m mentioning all of that to explain that when I reviewed the build log for my PR, I found that there were a bunch of tests that failed. That was reasonable, given the scope of my changes. I sat down to grind through them, fixing them one at a time. Some of them were quite important things that I didn’t take into account, after all. For example, one of the tests failed because I didn’t account for sorting on a dynamic numeric field. Sorting on a numeric field worked, and a dynamic text field also worked. But dynamic numeric field didn’t. It’s the sort of thing that I would never think of, but we got the tests to cover us.
But when I moved to the next test, it didn’t fail. I ran it again, and it still didn’t fail. I ran it in a loop, and it failed on the 5th iteration. That… sucked. Because it meant that I had a race condition in there somewhere. I ran the loop again, and it failed again on the 5th. In fact, in every iteration I tried, it would only fail on the 5th iteration.
When trying to isolate a test failure like that, I usually run that in a loop, and hope that with enough iterations, I’ll get it to reproduce. Having it happen constantly on the 5th iteration was… really strange. I tried figuring out what was going on, and I realized that the test was generating 1000 documents using a random. The fact that I’m using Random is the reason it is non-deterministic, of course, except that this is the code inside my test base class:
So this is properly initialized with a seed, so it will be consistent.
Read the code again, do you see the problem?
That is a static value. So there are two problems here:
- I’m getting the bad values on the fifth run in a consistent manner because that is the set of results that reproduce the error.
- This is a shared instance that may be called from multiple tests at once, leading to the wrong result because Random is not thread safe.
Before fixing this issue so it would run properly, I decided to use an ancient debugging technique. It’s called printf().
In this case, I wrote out all the values that were generated by the test and wrote a new test to replay them. That one failed consistently.
The problem was that it was still too big in scope. I iterated over this approach, trying to end up with a smaller section of the codebase that I could invoke to repeat this issue. That took most of the day. But the end result is a test like this:
As you can see, in terms of the amount of code that it invokes, it is pretty minimal. Which is pretty awesome, since that allowed me to figure out what the problem was:
I’ve been developing software professionally for over two decades at this point. I still get caught up with things like that, sigh.