Putting Claude up against our test suite


I’m convinced that in hell, there is a special place dedicated to making engineers fix flaky tests.

Not broken tests. Not tests covering a real bug. Flaky tests. Tests that pass 999 times out of 1,000 and fail on the 1,000th run for no reason you can explain with a clean conscience.

If you've ever shipped a reasonably complex distributed system, you know exactly what I'm talking about. RavenDB has, at last count, over 32,000 tests that are run continuously on our CI infrastructure. I just checked, and in the past month, we’ve had hundreds of full test runs.

That is actually a problem at our scale, because with that many tests and that many runs, the law of large numbers starts to apply. Assume each test is 99.999% reliable: that still means 1 out of every 100,000 executions will fail. And we run tens of millions of test executions in a month.

In a given week, somewhere between ten and twenty of those tests will fail. As a percentage of the total runs, that is an excellent number. But each such failure means that we have to investigate it.
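To make the arithmetic concrete, here is a back-of-the-envelope sketch. The figures are the rough round numbers from above, assumed for illustration rather than measured:

```python
# Back-of-the-envelope: expected flaky failures, using assumed round numbers.
tests_in_suite = 32_000        # "over 32,000 tests"
full_runs_per_month = 300      # "hundreds of full test runs" (assumed)
reliability = 0.99999          # assumed per-execution pass rate

executions = tests_in_suite * full_runs_per_month        # 9,600,000
failures_per_month = executions * (1 - reliability)      # ~96

print(f"{executions:,} test executions per month")
print(f"~{failures_per_month:.0f} flaky failures per month")
print(f"~{failures_per_month / 4:.0f} flaky failures per week")
```

Even at five nines of per-test reliability, the sheer volume guarantees a steady drip of failures, week after week.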

Those test failures are expensive. Every ticket is a developer staring at logs, trying to figure out whether this is a genuine bug in the product, a bug in the test itself, or something broken in the environment. In almost all cases, the problem is with the test itself, but we have to investigate.

A test that consistently fails is easy to fix. A test that occasionally fails is the worst.

With a flaky test, you don't just fix something and move on. You spend two days isolating it. Reproducing it. Building a mental model of a race condition that only manifests under specific timing, load, and cosmic alignment.

The tests that do this are almost always the integration tests. The ones that test complex distributed behavior across many parts of the system simultaneously. By definition, they are also the hardest to reason about.

The fact that, in most cases, those test failures add nothing to the product (i.e., they didn’t actually discover a real bug) is just crushed glass on top of the sewer smoothie. You spend a lot of time trying to find and fix the issue, and there is no real value except that the test now consistently passes.

We have a script that runs weekly, collects all test failures, and dumps them into our issue tracker. This is routine maintenance hygiene, to make sure we stay in good shape.
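As a rough sketch, such a script can be as small as this. The two endpoints are hypothetical placeholders, not our actual infrastructure; any CI system and issue tracker with an HTTP API would slot in the same way:

```python
# Weekly job: pull the last week's test failures from CI and file issues.
# Both URLs below are hypothetical placeholders, not real infrastructure.
from collections import Counter
import requests

CI_FAILURES_URL = "https://ci.example.com/api/failures?days=7"
TRACKER_URL = "https://tracker.example.com/api/issues"

failures = requests.get(CI_FAILURES_URL).json()

# Group by test name so a test that failed five times gets one issue.
counts = Counter(f["test_name"] for f in failures)

for test_name, count in counts.items():
    requests.post(TRACKER_URL, json={
        "title": f"Flaky test: {test_name} ({count} failures this week)",
        "labels": ["flaky-test"],
    })
```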

I was looking at the issue tracker when the script ran, and the entire screen lit up with new issues.

Just looking at that list of new annoyances was enough to ruin my mood.

And then, without much deliberate planning, I did something dumb and impulsive: I copy-pasted all of those fresh issues into Claude and told it to fix them. Then I went and did other things. I had very low expectations about this, but there was not much to lose.

A few hours later, I got a notification about a pull request. To be honest, I expected Claude to mark the flaky tests as skipped, or remove the assertions to make them pass.

To my shock, I got an actual pull request with real fixes. Some of them were fixes to the test logic. Some were fixes in the underlying code itself.

And then there was this one that stopped me cold. Claude had identified that in one of our test cases, we were waiting on the wrong resource. Not wrong in an obvious way — wrong in the kind of way that works perfectly 99.9998% of the time and silently fails 0.0002% of the time.

The (test) code looked right. We were waiting for something to happen; we just happened to wait on the wrong thing, and usually the value we asserted on was already set by the time we were done waiting.
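Our codebase is .NET, but the shape of the bug is easier to show in a small, self-contained sketch. This is a hypothetical reconstruction of the pattern, with invented names; only the failure mode is the point:

```python
# Hypothetical reconstruction of the failure mode (not the actual test).
# The test waits on *a* signal, just not the one that guards the value
# it asserts on. Usually the write wins the race anyway, so it passes.
import threading

index_updated = threading.Event()      # fires early (invented name)
replication_done = threading.Event()   # what we should have waited for
result = {}

def background_work():
    index_updated.set()                # signal #1
    result["value"] = 42               # the value the assert reads
    replication_done.set()             # signal #2, after the write

threading.Thread(target=background_work).start()

index_updated.wait(timeout=5)          # BUG: waiting on the wrong resource
# Almost always, result["value"] is already set by now, so this passes.
# On the unlucky run, the assert executes before the write lands.
assert result.get("value") == 42

# The fix is a single line: replication_done.wait(timeout=5)
```

The wait succeeds, so the test feels ordered, but the assertion is actually racing the write. The wait just happens to be late enough, almost every time.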

Claude found it. In one pass. For the price of a subscription I was already paying. For reference, that single “let me throw Claude at it” decision probably saved enough engineering time to cover the cost of Claude for the entire team for that month.

Let me be precise about what happened and what didn't. Claude did not fix everything. Some of the "fixes" it produced were pretty bad: surface-level patches that didn't address the real cause, or changes that were legitimately out of scope.

You still need an engineer reviewing the output. And you still need judgment.

But it got things fixed, quickly, without needing two days to context-switch into the problem space. And the things it did fix well, it fixed really well.

The work it compressed would have realistically taken one developer a week or two to grind through — and that's assuming you could get a developer to focus on it for that long in the first place. Flaky test investigation is the kind of work that quietly kills team morale.

Engineers start dreading CI. They start treating red builds as background noise. That's how quality degrades silently. Leaving aside new features or higher velocity, being able to offload the most annoying parts of the job to a machine is… wow.

Based on this experience, we're building Claude into our actual workflow, as an integral part of how we handle test maintenance. Failures are collected and routed to Claude, which takes a first pass at triage and repair. We then create an issue in the bug tracker with either an actual fix or a summary of Claude's findings.
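In sketch form, the loop looks roughly like this. Every helper here is a stand-in stub for real infrastructure (the CI query, the agent invocation, the tracker API), and the test name is invented:

```python
# Sketch of the triage loop; every helper is a stub for real infrastructure.
from dataclasses import dataclass

@dataclass
class Triage:
    has_fix: bool
    details: str   # PR link when has_fix is True, findings summary otherwise

def collect_failures(days: int) -> list[str]:
    # Placeholder: in reality, query CI for the last week's failures.
    return ["ClusterTests.SampleFlakyTest"]   # invented test name

def claude_triage(test_name: str) -> Triage:
    # Placeholder: hand the failing test and its logs to Claude, get back
    # either a proposed fix (as a pull request) or a written analysis.
    return Triage(has_fix=False, details="test waits on the wrong resource")

def file_issue(test_name: str, triage: Triage) -> None:
    kind = "proposed fix" if triage.has_fix else "Claude's findings"
    print(f"Issue for {test_name}: {kind}: {triage.details}")

for failure in collect_failures(days=7):
    file_issue(failure, claude_triage(failure))
```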

By the time a human reviews this, significant progress has already been made.

It doesn't replace the engineer. But it means the engineer is doing the interesting part of the work: judgment, review, architectural reasoning. Skipping the part that requires staring at race condition logs until your vision blurs.

This isn’t the most exciting use of a coding agent, I’m aware. But it may be one of the best in terms of quality of life.