We measured our LLM debugging agent against 32 real bugs and decided not to build RAG

A bug report shows up in our team chat. Not a stack trace, not a reproduction, just a person typing something like: “customer says their order is stuck, can’t move it forward, been like this since this morning.” That’s the raw material. At a previous job, at a large e-commerce marketplace, my team ran an LLM debugging agent right there in the channel. It read that message, decided on its own what to look at, and came back with a verdict.

The agent had read-only access to a Postgres replica, our application logs, the issue tracker, and our GitLab. Given a free-form report, it would query the replica to find the order, search logs for the relevant errors, check whether a ticket already existed, then post one of three verdicts: user_error, code_bug, or unknown. It posted its evidence alongside the verdict: the rows it saw, the log lines, the ticket link. A human still made the call, but the agent did the grepping that used to eat the tedious opening of every triage.

It worked well enough that people started trusting it, which is exactly when you should get nervous.

The feature everyone wanted

The next item on the roadmap wrote itself: retrieval. We had years of solved bug threads sitting in chat history. Someone reports a stuck order; surely there’s a near-identical thread from three months ago with the fix already written out. Index the old threads, embed them, retrieve the most similar ones at triage time, feed them to the agent as context. RAG. It’s the default move, and half the room could already picture the demo.

We almost built it. What stopped us was a boring question: how would we know it helped? The agent is nondeterministic. It already gave good answers most of the time. If we bolted on retrieval and the next few reports went well, that would tell us nothing, because the next few reports would probably have gone well anyway. We had no way to see a real change against the noise.

So we built the eval first, and the eval is the actual story here.

Thirty-two cards

We built a corpus of case cards. Each card is one real historical bug report with a known resolution: what the report said, what the correct verdict turned out to be, what evidence a good triage should surface, and a link back to the thread where it was actually resolved. Thirty-two of them.

Two decisions about the corpus mattered more than the count.

First, we mined it from threads that had verified resolutions. Not “this looks like a bug,” but threads where someone had confirmed the root cause and closed it. That gives you a real answer key instead of a vote on vibes.

Second, and this is the part I’d push anyone to copy: we did not build the corpus ourselves. The same people who tune an agent are the worst people to write its exam, because you unconsciously write questions you already know it passes. So we handed the mining to a separate coding agent (Codex, running autonomously), pointed it at the historical threads and the live replica and logs, and had it reconstruct each card and verify that the evidence still checked out against the real data. The agent under evaluation and the agent that wrote the test had no shared authorship.

A card looked roughly like this (invented, with neutral fields):

id: card-014
report: >
  Contractor says the app won't let them submit the completed
  estimate. Button is greyed out. Started after lunch.
expected_verdict: user_error
expected_evidence:
  - order is in a state that blocks estimate submission
  - status history shows a required approval step was skipped
  - no matching error signature in application logs
resolution: thread/2f9c1   # confirmed: operator moved the order back a step
notes: >
  The greyed button is correct behavior for this state. A wrong
  agent will go hunting for a code bug that isn't there.

Note what card-014 is really testing: whether the agent can tell “the system correctly refused” from “the system broke.” That distinction is where most of our failures lived, and no amount of retrieval was going to help with it.

N=3, and why one run is a lie

You cannot grade a nondeterministic agent on a single run. Same card, same data, run it three times and you can get a clean pass, a hedge, and a confident wrong answer. So every card was graded over three live runs against the real replica and logs. Three is a floor, not a comfortable number, but it was enough to stop us from mistaking a lucky run for a capability or an unlucky one for a defect.

Then, instead of collapsing everything into one pass rate, we tagged every failure with a type. Six buckets:

over_claim: the agent asserts more than its evidence supports. It found a suggestive log line and declared a code bug.
under_claim: the agent had the evidence in hand and still hedged to unknown.
wrong_table: it queried the wrong place, so it reasoned over the wrong rows.
wrong_verdict: right evidence, wrong conclusion.
no_ground: a verdict with no evidence attached at all.
format: the answer was right but not in the shape the channel expected.

The taxonomy is the thing. A single pass rate tells you how often you’re wrong; it tells you nothing about how to get less wrong. The distribution tells you where to spend the next month. When we tallied ours, the failures clustered hard in two buckets, and neither of them was the kind of thing a similar old thread fixes.

The RAG-addressable number

Here’s the calculation that killed the feature. For each failed run we asked one question: would having the most similar past bug thread in context have plausibly fixed this run? If a failure was “the agent hedged to unknown when a near-identical solved thread would have shown it the answer,” that’s RAG-addressable. If a failure was “the agent queried the wrong table and confidently reasoned over the wrong order,” a retrieved old thread does nothing, because the problem is grounding, not recall.

We called that share the RAG-addressable failure signal. It was small.

Most of our failures were over-claims and wrong queries. Those are grounding and calibration problems. And retrieval doesn’t just fail to fix them; it can make them worse. Picture the agent over-claiming already, and now you hand it a confidently retrieved old thread that looks similar to this report. That isn’t a correction, that’s fuel. A plausible-looking neighbor is exactly the thing that shoves a shaky agent from unknown to a wrong, confident code_bug. We would have been paying engineering time to amplify our single most common failure mode.

So we didn’t build RAG. We spent the effort on the failures the data actually showed: tightening the tool prompts so the agent picked the right table, adding a give-up rule so a thin evidence set produced unknown instead of a guess, and reworking the verdict step to force it to cite the specific rows and log lines behind any non-unknown call. Unglamorous work, aimed at real numbers.

A note on the harness

The eval is the star, but the agent underneath it took a few tries to get right, and the shape of those tries is worth a paragraph.

The first version wrapped a general coding-agent sidecar. It had a sub-agent tool we couldn’t cleanly turn off, and on real reports it would spawn sub-agents and sit there until it timed out. Twenty-minute triages, which is worse than useless in a live channel. We moved to a leaner runtime we could drive over a simple protocol, with a fixed set of read-only tools and a hard budget on tool calls and wall-clock time.

The direction of every iteration was toward less agent, not more. We read the public work that was landing at the time: HolmesGPT’s on-call approach, and Cognition’s argument for not building multi-agent systems when a single loop will do. That matched what our own logs kept telling us. Every time we added autonomy, we added ways to be confidently wrong. A single loop with a tight tool budget and a give-up rule beat everything cleverer we tried.

What I’d do differently

Thirty-two cards is small. It was enough to see a clear two-bucket clustering and to make the RAG call with a straight face, but it’s too coarse to compare two decent prompt revisions that differ by a few points. If I did it again I’d push for a few hundred cards, and I’d let the separate mining agent run longer to get there, since the corpus build was the cheap part of all this.

N=3 is a floor, and I treated it like one, but I sometimes wanted it to carry more weight than three runs can. Cards that failed once out of three sat in an uncomfortable middle where I couldn’t tell flaky from broken. Five runs would have paid for itself on exactly those cards.

The grading rubric had real ambiguity, especially at the over_claim versus wrong_verdict border. If the agent reached the right verdict for a slightly wrong reason, was that a pass with a note or a wrong_verdict? We made per-case calls and mostly stayed consistent, but “mostly” isn’t a rubric. I’d write the edge cases down before grading, not during.

The thing I actually got wrong: we didn’t log full tool traces from day one. Early on we kept the final verdicts and a short summary, and when we went back to categorize failures we sometimes couldn’t tell why a run went wrong, only that it did. We ended up re-running cards just to watch the tool calls. Capture the whole trace from the first run. Storage is cheap; a failure you can’t explain is expensive.

And the deliverable that’s easiest to undersell: “no” was the output. We spent a couple of weeks to decide not to build a feature, and that was the right trade. The eval didn’t just answer the RAG question. It became the thing we ran against every prompt change afterward, which meant we could finally tell improvement from luck. A feature we didn’t build, and a ruler we now can’t work without.