Three months after a RAG system launches, the eval suite is usually broken. Tests fail for unclear reasons, no one updates the golden set, and the team quietly stops looking at the dashboard. The system is still in production. It’s still answering customers. Nobody knows if it’s getting better or worse.
This pattern is common enough to be worth naming. It happens because the eval suite was built as a quality artifact rather than as a decision artifact. It measures things, but it doesn’t answer a question anyone is about to act on.
The fix is structural, not technical
The most reliable way to keep evals alive is to wire them directly into a release decision. Every capability — retrieval over policy documents, summarization of clinical notes, structured extraction from contracts — gets a release gate. The gate is a threshold on a specific eval metric. The eval is what tells you whether you can ship.
When the eval is the gate, the team has no choice but to maintain it. Skipping the eval means skipping the release. That’s a forcing function the dashboard alone never produces.
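A minimal sketch of what "the eval is the gate" can look like in code. Everything here is hypothetical: the `Gate` class, the capability names, and the thresholds are illustrative, not taken from a real system. It covers only higher-is-better metrics, and it treats a missing score as a failure so that skipping the eval blocks the release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """A release gate: one capability, one metric, one threshold."""
    capability: str
    metric: str
    minimum: float  # hypothetical threshold, chosen per capability


def failed_gates(gates, scores):
    """Return the gates whose score is missing or below the minimum.

    `scores` maps (capability, metric) -> latest eval score in [0, 1].
    A missing score counts as a failure: no eval run, no release.
    """
    failures = []
    for gate in gates:
        score = scores.get((gate.capability, gate.metric))
        if score is None or score < gate.minimum:
            failures.append(gate)
    return failures


def can_ship(gates, scores):
    """True only when every gate has a score at or above its minimum."""
    return not failed_gates(gates, scores)


# Illustrative values only.
GATES = [
    Gate("policy_retrieval", "recall", 0.90),
    Gate("clinical_summarization", "groundedness", 0.95),
]
scores = {("policy_retrieval", "recall"): 0.94}  # summarization eval not run
blocked = failed_gates(GATES, scores)
```

In a CI job, a non-empty `blocked` list would typically translate into a non-zero exit code, which is what makes the gate binding rather than advisory.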
The four signals we score for RAG
For RAG systems specifically, four metrics carry most of the load:
- Groundedness — does the answer cite source material that actually supports the claim? Scored by an LLM judge against the retrieved context, calibrated against a small human-labeled set.
- Recall — when there is a correct answer in the corpus, does retrieval find it? Scored as a hit/miss against a golden set of question + expected document IDs.
- Precision — when retrieval returns chunks, are they relevant? Scored as the fraction of returned chunks rated useful by a judge or by humans.
- Latency — p50 and p95 retrieval-to-answer time. Often the first metric to drift; often the last one anyone notices.
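Three of these four metrics are simple enough to compute without a framework. The sketch below is one possible reading, not a reference implementation: it assumes a "hit" means at least one expected document ID appears in the retrieved list, that precision labels arrive as booleans from a judge or a human, and that latency percentiles use the nearest-rank method. All function names are hypothetical. Groundedness is left out because it depends on an LLM judge.

```python
import math


def recall_hit_rate(golden, retrieved):
    """Fraction of golden questions where retrieval found an expected document.

    golden:    {question_id: set of expected document IDs}
    retrieved: {question_id: list of retrieved document IDs}
    Assumption: one expected document in the results counts as a hit.
    """
    if not golden:
        raise ValueError("golden set is empty")
    hits = sum(
        1
        for qid, expected in golden.items()
        if expected & set(retrieved.get(qid, []))
    )
    return hits / len(golden)


def precision(relevance_labels):
    """Fraction of returned chunks rated useful (booleans from a judge or human)."""
    if not relevance_labels:
        return 0.0
    return sum(relevance_labels) / len(relevance_labels)


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]
```

Calling `percentile(latencies_ms, 50)` and `percentile(latencies_ms, 95)` on the per-question retrieval-to-answer times gives the p50 and p95 figures.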
The week-over-week change matters more than the absolute numbers. A 5% drop in groundedness from one week to the next usually signals a real problem: a model upgrade, corpus drift, or a prompt change someone forgot to flag. The team that catches it within 24 hours is the team running the eval on every PR.
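A per-PR check for this kind of drop can be very small. The sketch below assumes the 5% figure is a relative drop against a stored baseline score; the text does not say whether it is relative or absolute, so treat that reading, the function names, and the default threshold as assumptions.

```python
def relative_drop(baseline, current):
    """Relative decrease from baseline to current; negative means improvement."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) / baseline


def regressed(baseline, current, max_drop=0.05):
    """True when the metric fell by at least max_drop relative to the baseline."""
    return relative_drop(baseline, current) >= max_drop
```

For example, `regressed(0.92, 0.86)` flags a regression because the score fell by about 6.5% of its baseline, while `regressed(0.92, 0.90)` does not.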
What “50 examples” buys you
A common failure mode is over-investing in golden set size. Teams build 2,000-example sets, then discover that nobody on the team is willing to maintain them as the corpus grows. The set goes stale. The eval becomes theater.
Fifty examples per capability — chosen by a domain expert, with thought given to edge cases — turns out to be enough to detect most regressions worth catching. It’s also small enough that a single hour of maintenance per quarter keeps it current. We’ve consistently seen 50-example sets outperform 2,000-example sets on the metric that actually matters: did the team know within a day that the system regressed.
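It helps to know what a set of this size can and cannot resolve. The sketch below is back-of-the-envelope arithmetic, assuming a pass/fail metric and treating the examples as independent draws; the function names are hypothetical.

```python
import math


def resolution(n):
    """Smallest possible change in a pass rate over n examples (one example flipping)."""
    return 1 / n


def standard_error(pass_rate, n):
    """Binomial standard error of a pass rate measured on n independent examples."""
    return math.sqrt(pass_rate * (1 - pass_rate) / n)
```

With 50 examples, one flipped example moves the score by 2 points, and the standard error around a 90% pass rate is roughly 4 points, against well under 1 point for 2,000 examples. Rerunning the same fixed set from one PR to the next is a paired comparison, so run-to-run differences are less noisy than that figure suggests, but the arithmetic is a reminder that a small set is best at catching large regressions.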
The hand-off
The other thing that keeps evals alive is a clear owner. Not the engineer who built the system. Not the data scientist who labeled the first set. Someone whose job description includes “decide if this can ship today”, usually a product manager or domain lead. They don’t have to write the eval; they have to understand its verdict.
When the verdict-owner and the eval are aligned, evals become an asset. When they aren’t, evals become a graveyard.