How to tell if your AI is quietly getting worse

The AI tool worked the day you launched it. You watched it answer the test questions, it got them right, and you rolled it out. The harder question is whether it still works today — and most teams have no real way to answer it.

This is the quiet failure of AI. Unlike ordinary software, an AI tool can get worse without anything breaking: no error, no alarm, no crash. The provider updates the model underneath it, or your own documents change, or a setting drifts, and the answers slowly get less accurate. The system keeps running and keeps answering customers. The first sign of trouble is usually a client noticing, not a dashboard.

Why most quality checks quietly die

Plenty of teams do set up some kind of quality check at launch. A month later it has often been abandoned: nobody updates it, nobody looks at it, and it slowly stops reflecting reality. The reason is almost never technical. It is that the check does not feed a decision anyone actually makes.

A number on a dashboard that changes nothing gets ignored. A check that decides something — do we keep trusting this tool, do we go back to the vendor, do we keep paying for it — gets maintained, because skipping the check means skipping the decision. If you want a quality check to survive, tie it to a real decision and put it in front of the person who makes that call.

The four ways an AI answer goes wrong

Most business AI tools work by answering from your own material — a policy manual, a set of contracts, patient records, a knowledge base. When one of those answers is wrong, it is usually wrong in one of four plain ways, and you do not need to be technical to check for any of them:

It makes things up. The answer sounds confident, but your documents do not actually say it. This is the most dangerous one, because it is the hardest to catch.
It misses. The answer was sitting in your files, and the tool did not find it — so the user is told there is nothing when there is something.
It pulls the wrong material. It reaches for the wrong document and reasons from it, so a confident answer rests on the wrong source.
It slows down. The answer is right, but it takes long enough that people quietly stop using the tool.
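
For teams that want to record these four checks the same way each time, here is a minimal Python sketch. It is an illustration, not a prescribed method: the field names, the `failures` function, and the 10-second speed limit are all assumptions, and the first check (whether your documents actually support the answer) still relies on a person's judgment, recorded here as true or false.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One reviewed answer. All field names are illustrative assumptions."""
    supported_by_docs: bool  # a reviewer confirmed the documents actually say this
    expected_found: bool     # the known-correct answer appears in the reply
    source_used: str         # the document the tool drew from
    expected_source: str     # the document the answer should have come from
    seconds: float           # how long the tool took to answer


def failures(r: Result, max_seconds: float = 10.0) -> list[str]:
    """Return which of the four failure modes apply to this answer."""
    found = []
    if not r.supported_by_docs:
        found.append("made things up")
    if not r.expected_found:
        found.append("missed")
    if r.source_used != r.expected_source:
        found.append("wrong material")
    if r.seconds > max_seconds:  # assumed limit; set it to what your users tolerate
        found.append("too slow")
    return found
```

An answer with an empty list passed all four checks. Anything else tells the owner which kind of problem to raise.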

Watch the change more than the score. Answer quality that drops noticeably from one month to the next almost always means something specific happened — a model update, a batch of new documents, a setting someone changed. Catching that within days, instead of at the next client complaint, is the entire point of checking.
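
The month-to-month comparison can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: the `drops` function name, the 5-point tolerance, and the sample pass rates are made up for the example, and the real threshold should be whatever the owner of the check considers a noticeable drop.

```python
def drops(history: list[tuple[str, float]], tolerance: float = 0.05) -> list[str]:
    """Return the labels of runs whose pass rate fell by more than
    `tolerance` compared with the run just before them."""
    flagged = []
    for (_, previous), (label, current) in zip(history, history[1:]):
        if previous - current > tolerance:
            flagged.append(label)
    return flagged


# Made-up pass rates, for illustration only.
history = [("Jan", 0.92), ("Feb", 0.91), ("Mar", 0.80), ("Apr", 0.81)]
print(drops(history))  # ['Mar']
```

Note that April is not flagged even though its score is still low: the sketch reports the change, not the level, so March is the month to investigate.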

Keep the check small enough to survive

The instinct is to build a big, thorough test — hundreds or thousands of examples. Resist it. A giant test that nobody maintains goes stale within months and quietly stops meaning anything. It feels rigorous and tells you nothing.

A few dozen real examples (actual questions paired with the answers you know are right, chosen by someone who understands the work) catch most of the problems worth catching. And the set is small enough that keeping it current takes an hour or two now and then, which means it actually happens. The check that gets run beats the thorough check that gets abandoned, every time.
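
To show how little machinery a set like this needs, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the two sample cases, the `pass_rate` and `stand_in_tool` names, and above all the scoring rule, which simply looks for the expected text inside the answer. That rule is crude, and many teams will prefer to have a person read the answers instead.

```python
from typing import Callable

# Made-up cases for illustration. A real set would hold a few dozen,
# written by someone who knows the work.
CASES = [
    {"question": "How much notice is needed to cancel?", "must_contain": "30 days"},
    {"question": "What are the payment terms?", "must_contain": "net 45"},
]


def pass_rate(ask: Callable[[str], str], cases: list[dict]) -> float:
    """Share of cases whose answer contains the known-correct text."""
    if not cases:
        raise ValueError("the example set is empty")
    passed = 0
    for case in cases:
        answer = ask(case["question"])
        # Crude assumption: a case passes if the expected text appears in the answer.
        if case["must_contain"].lower() in answer.lower():
            passed += 1
    return passed / len(cases)


def stand_in_tool(question: str) -> str:
    """Placeholder for the real tool, so the sketch runs on its own."""
    if "cancel" in question:
        return "You must give 30 days of written notice."
    return "I could not find that in the documents."
```

Calling `pass_rate(stand_in_tool, CASES)` scores the placeholder tool against the two sample cases. In practice `stand_in_tool` would be replaced by whatever sends a question to your actual tool and returns its answer.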

Someone has to own the answer

The last thing that keeps a check alive is a clear owner — and not the person who built the tool. It should be someone whose job already includes deciding whether the work is good enough to rely on: an operations lead, a practice manager, a department head. They do not need to understand how the tool works underneath. They need to own one question — “is this still good enough to trust this week?” — and be willing to act when the answer changes.

When the person who owns that decision and the check that informs it are part of the same loop, the check becomes an asset. When they are disconnected, it becomes a graveyard of charts nobody reads.

Whether you bought it or built it

How you put this in place depends on where the tool came from.

If you had the tool built, in-house or by a partner, set the check up directly: a small set of real cases, checked against the four failure modes above, with a named owner, run on a schedule and again after any change to the system or its content.
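
The rule "on a schedule and again after any change" is simple enough to write down as a short Python sketch. The `check_is_due` name and its parameters are assumptions for illustration, and how a team learns that something changed (a model update notice, a new batch of documents) is left outside the sketch.

```python
from datetime import date, timedelta


def check_is_due(
    last_run: date,
    today: date,
    changed_since_last_run: bool,
    every_days: int,
) -> bool:
    """True when the check should be re-run: either something changed
    since the last run, or the scheduled interval has passed."""
    if changed_since_last_run:
        return True
    return (today - last_run) >= timedelta(days=every_days)
```

The interval is deliberately left for the owner to choose. A quarterly spot-check, for example, would use an `every_days` of about 90.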

If you bought it from a vendor, you will not run their internal testing, and you should not expect to. But you can ask how they catch their own problems before an update reaches you, and you can keep your own small spot-check on your real cases: the same few dozen examples, run once a quarter. It costs you very little, and it means you learn that the vendor’s product slipped before your clients do. That same spot-check is also one of the most useful things you can run before you buy: it is the heart of evaluating an AI vendor.

The AI tool that is quietly wrong is more dangerous than the one that is obviously broken, because no one is looking for the problem. Keeping it honest is not a major technical program. It is a small check, tied to a real decision, owned by someone who will act on it — more a matter of discipline than technology. If you want that check designed and handed to your team so it actually survives, that is what LLM and RAG evaluation is for.

Common questions

How do I know if my AI tool is getting worse?

Keep a small set of real questions with answers you know are right, and re-check them on a schedule and after any change. Watch whether the answers are still accurate, complete, drawn from the right source, and fast. A noticeable drop from one month to the next — not the absolute number — is the signal that something changed.

How big does the test need to be?

A few dozen real examples per task is usually enough to catch the problems worth catching, and small enough that someone will actually keep it current. A giant test nobody maintains goes stale and tells you nothing — the check that gets run beats the thorough one that gets abandoned.

We bought our AI tool from a vendor — what can we do?

You won't run the vendor's internal testing, and you shouldn't expect to. But you can ask how they catch their own regressions before an update reaches you, and you can keep your own small spot-check on your real cases, run about once a quarter, so you find out their product slipped before your clients do.

Whose job is this?

It belongs to one person who owns the decision “is this still good enough to rely on?”: usually an operations or practice lead, not necessarily someone technical. They don't need to understand how the tool works; they need to own the call and act when the quality slips.
