How to evaluate legal AI before it touches client work

TL;DR

Firms run real diligence on most vendors and then evaluate legal AI on a demo. The demo is built to hide what matters.
The fabricated citation is the failure you can catch. The dangerous one is the confident, complete-looking answer that quietly omits the case that mattered.
Test four things on your own matters: grounded citations, what it leaves out, confidentiality under retrieval, and how long an answer takes a partner to verify.
Decide what "good enough to rely on" means first, then test for it. Fifty questions from your own files beat any vendor benchmark.
The same agentic AI that makes these tools risky can also check them: agents that verify citations, hunt for the missing case, and argue the other side, with a partner holding the last word.

Law firms know how to run diligence. A new practice-management platform gets months of evaluation, reference calls, a security review, and a committee sign-off. The AI research tool that will sit between an associate’s draft and a partner’s signature often gets a demo and a price.

The demo is the problem. It is a sales artifact, built from questions chosen because the tool answers them well, delivered in a calm and sourced voice that is hard to push back on across a conference table. It tells you nothing about how the system behaves on the question an associate actually types at the end of a long day: the one with a bad fact pattern, a document that cuts the other way, or an answer that turns on a case nobody thought to check.

The failure a demo won’t show you

By now the loud failure is familiar. Since the first sanctions in 2023, the list of lawyers caught filing briefs built on cases their AI tool invented has only grown, and the story rarely changes. The tool sounded certain. The lawyer was on a deadline. No one checked. And it is not only careless lawyers who are exposed: a 2024 Stanford study of the leading AI legal-research tools found that even systems built for law, and marketed as curbing hallucinations, still returned incorrect or unsupported answers on a significant share of queries.

But a fabricated citation is at least the failure you can catch, because a case either exists or it does not. The one that should worry a managing partner is quieter. Ask a tool for the firm’s exposure on a non-compete and it can hand back a clean, well-cited summary that happens to rest on authority from before the amendment that made those clauses unenforceable in your state. It reads well. It cites real cases. It is wrong, and nothing on the page says so. The associate moves on. You find out when the other side raises it.

A demo cannot surface that, because you only notice an omission if you already know what should have been there. That is the real point. The question is not whether a legal AI tool is impressive. It is whether you can let someone rely on it, for what, and with how much checking. That is an evaluation, and an evaluation looks nothing like a demo.

What an evaluation actually measures

It begins before anyone opens the software, by writing down what “good enough to rely on” means for your firm. Then you test for that, on your own work rather than the vendor’s. A few things are worth measuring, and most firms measure none of them.

Start with grounding. Every claim and every citation should point to a real document that genuinely supports it, not a real case bent to fit a proposition it never stood for. This is the easiest thing to check and the first thing skipped. Give the tool questions where you already know the controlling sources, and see whether what it cites is real and on point.

Then test for what it leaves out. A system that answers fluently while dropping the adverse case is more dangerous than one that admits it is not sure, because confident omission reads exactly like a complete answer. Ask questions where a specific case or clause ought to surface, and watch whether it does.
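Once a reviewer has written an answer key, the grounding and omission checks come down to simple set comparisons. What follows is a minimal sketch, not a finished harness. The names `EvalQuestion`, `ToolAnswer`, and the authority identifiers are hypothetical, and it assumes a person has already judged whether each citation exists and supports its claim. The code only tallies those judgments.

```python
from dataclasses import dataclass, field


@dataclass
class EvalQuestion:
    """One question from the firm's own files, with a reviewer's answer key."""
    text: str
    # Authorities a sound answer should rest on (hypothetical identifiers).
    expected_sources: set[str] = field(default_factory=set)


@dataclass
class ToolAnswer:
    """What the tool returned, after a reviewer has looked up each citation."""
    # Maps each cited authority to True if the reviewer confirmed it exists
    # AND supports the proposition it is attached to, False otherwise.
    citations: dict[str, bool] = field(default_factory=dict)


def ungrounded_citations(answer: ToolAnswer) -> set[str]:
    """Citations the reviewer marked as fabricated or not supporting the claim."""
    return {cite for cite, supported in answer.citations.items() if not supported}


def omitted_sources(question: EvalQuestion, answer: ToolAnswer) -> set[str]:
    """Expected authorities that never surfaced in the answer."""
    return question.expected_sources - set(answer.citations)


question = EvalQuestion(
    text="Is the non-compete in the employment agreement enforceable?",
    expected_sources={"statute-2024-amendment", "case-A"},
)
answer = ToolAnswer(citations={"case-A": True, "case-B": False})

print(ungrounded_citations(answer))        # {'case-B'}
print(omitted_sources(question, answer))   # {'statute-2024-amendment'}
```

The second result is the one a demo never shows: the answer cited a real, on-point case and still missed the authority the question turned on.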

Test the walls. Inside a firm, retrieval reaches across matters and clients, and a tool that draws on everything it can see can pull one client’s material into another client’s answer. The permissions that look clean in a settings panel are not the ones that hold when the model is scavenging for context. Ask something whose best answer sits behind an ethical wall, and confirm the wall stays up.

Then time it. The point of these tools is to save senior people time, and that has a number: how long it takes a qualified reviewer to verify what the tool produced. If checking an answer takes as long as writing it would have, you have gained nothing and added a place for a mistake to hide. Almost no one measures this, which is why so many tools that “work” never save a real hour.

Here is the whole test on one page. For every tool you are weighing, on your own matters, score four things:

Grounding. Does every citation resolve to a real document that supports the point it is attached to?
Omission. When a controlling case or an adverse clause should surface, does it?
Confidentiality. Does the answer stay inside the right matter and client walls?
Review cost. How long does it take a qualified person to verify the answer?

Fail the first two and the tool does not belong near client work. Pass them but fail the fourth, and you have added a step, not removed one.
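The four scores and the pass/fail rule above can be rolled up across a test set in a few lines. This is a sketch under stated assumptions. `QuestionResult`, `scorecard`, and `verdict` are hypothetical names. The thresholds (`pass_rate`, `max_review_ratio`) are placeholders for whatever the firm wrote down as "good enough to rely on". Treating any confidentiality breach as disqualifying is also an assumption, since the rule above does not assign it a consequence.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class QuestionResult:
    grounded: bool           # every citation real and on point
    nothing_omitted: bool    # every expected authority surfaced
    walls_held: bool         # no material from another matter or client
    review_minutes: float    # time a qualified reviewer spent verifying
    baseline_minutes: float  # reviewer's estimate to write the answer unaided


def scorecard(results: list[QuestionResult]) -> dict[str, float]:
    """Pass rates for the first three tests, median review/baseline ratio for the fourth."""
    n = len(results)
    return {
        "grounding": sum(r.grounded for r in results) / n,
        "omission": sum(r.nothing_omitted for r in results) / n,
        "confidentiality": sum(r.walls_held for r in results) / n,
        "review_ratio": median(r.review_minutes / r.baseline_minutes for r in results),
    }


def verdict(card: dict[str, float], pass_rate: float = 0.95,
            max_review_ratio: float = 1.0) -> str:
    # Thresholds are assumptions; set them from the firm's written standard.
    if card["grounding"] < pass_rate or card["omission"] < pass_rate:
        return "not near client work"
    if card["confidentiality"] < 1.0:  # assumption: any breach disqualifies
        return "not near client work"
    if card["review_ratio"] >= max_review_ratio:
        return "added a step, not removed one"
    return "candidate for conditional use"


results = [
    QuestionResult(True, True, True, 10, 40),
    QuestionResult(True, True, True, 15, 30),
    QuestionResult(True, False, True, 20, 40),
    QuestionResult(True, True, True, 12, 48),
]
card = scorecard(results)
print(card["omission"], card["review_ratio"])  # 0.75 0.375
print(verdict(card))                           # not near client work
```

In the example the tool verifies quickly and cites cleanly, and it still fails, because one answer in four dropped an authority that should have surfaced.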

Turn the same AI on itself

Here is the part worth being optimistic about. The same agentic advances that make people nervous about legal AI are also the most practical way to check it. You do not have to rely on one model producing one answer and hoping it holds. You can point a set of agents at the draft and have them attack it from every side before a person ever reads it.

One agent does nothing but verify citations: every authority the brief leans on has to resolve to a real document that supports the proposition it is cited for, or it gets flagged. Another hunts for what is missing, searching out the adverse cases and contrary clauses the draft quietly skipped. Another reads as opposing counsel and writes the strongest rebuttal it can, which surfaces the soft spots while there is still time to fix them. Another checks that nothing crossed a matter or client boundary it should not have. Each one is narrow, adversarial, and tireless in a way an associate at midnight is not.
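Structurally, the arrangement described above is a list of independent checkers that each read the same draft and return flags for a human reviewer. The sketch below shows only that shape. Every name in it (`Draft`, `Flag`, `review_packet`, the checker factories) is hypothetical. The checkers are plain functions standing in for model-backed agents. The citation check here tests only whether an authority resolves, not whether it supports the proposition. The opposing-counsel agent is left out because it cannot be meaningfully stubbed without a model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Draft:
    matter_id: str
    text: str
    citations: list[str]
    source_matters: list[str]  # matter IDs of the documents retrieval drew on


@dataclass
class Flag:
    checker: str
    detail: str


Checker = Callable[[Draft], list[Flag]]


def make_citation_checker(known_authorities: set[str]) -> Checker:
    """Flag any citation that does not resolve to a known document."""
    def check(draft: Draft) -> list[Flag]:
        return [Flag("citations", f"cannot resolve: {c}")
                for c in draft.citations if c not in known_authorities]
    return check


def make_omission_checker(must_address: set[str]) -> Checker:
    """Flag adverse or controlling authorities the draft never cites."""
    def check(draft: Draft) -> list[Flag]:
        return [Flag("omissions", f"not addressed: {a}")
                for a in sorted(must_address - set(draft.citations))]
    return check


def boundary_checker(draft: Draft) -> list[Flag]:
    """Flag any source document that belongs to a different matter."""
    return [Flag("boundaries", f"drew on matter {m}")
            for m in draft.source_matters if m != draft.matter_id]


def review_packet(draft: Draft, checkers: list[Checker]) -> list[Flag]:
    """Run every checker independently; the flags go to the partner, not around them."""
    flags: list[Flag] = []
    for check in checkers:
        flags.extend(check(draft))
    return flags


draft = Draft("M-100", "Draft memo text", ["case-A", "case-Z"], ["M-100", "M-200"])
checkers = [
    make_citation_checker({"case-A", "case-B"}),
    make_omission_checker({"case-A", "case-B"}),
    boundary_checker,
]
for flag in review_packet(draft, checkers):
    print(f"[{flag.checker}] {flag.detail}")
```

Keeping each checker narrow and separate makes it possible to test them one at a time against drafts with known defects before anyone relies on the flags.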

This is not magic, and it is exactly where overconfidence creeps back in. A checker built on the same models as the tool it is checking can share the same blind spots, so the stack of agents has to be evaluated with the same discipline as everything else; a verifier you have not tested is just another confident voice. It does not remove the partner either, and it is not meant to: under ABA Formal Opinion 512, the duty to verify the work and guard client confidences stays with the lawyer no matter what produced the draft. What it changes is the economics of the review that was the real bottleneck. It puts the likely failures on the table first, so a partner’s time goes to judgment instead of hunting. The useful way to think about it is not AI you trust, but AI built to distrust itself, with a person holding the last word.

Make it answer a real decision

None of it matters until it answers a decision you are about to make. “Can associates use this for first-draft research memos, with a partner signing every citation?” is a decision. A score on a dashboard that changes nothing about who is allowed to do what is not an evaluation; it is reassurance. Write the decision down, build the test to answer it, and expect an answer that comes with conditions. “Yes for internal knowledge retrieval with a named reviewer, not yet for anything that leaves the building” is a real result, and a defensible one.

The work is cheaper than it sounds. Pull fifty or so questions from your own matters, the real ones with the awkward facts, and note the sources a sound answer should rest on. Run each tool you are weighing against that set before you sign, and score the four things above. You can do it on sanitized or representative matters under ordinary confidentiality terms, so nothing privileged passes through a system you have not cleared. The hard part is judgment, not technology: deciding what good means here, putting it in writing, and declining to let a demo stand in for it.

So before one of these tools goes near client work, ask the vendor to run your questions instead of theirs, including the ones where the honest answer is “that is not in these documents.” A vendor who can show you those numbers is worth a serious conversation. One who cannot has already answered the only question that mattered.

Common questions

Can't we trust the vendor's benchmark?

Vendor benchmarks run on generic questions chosen to look good. Your risk lives in your edge cases: the contradictory document, the close call, the matter behind an ethical wall. Test on questions you supply, drawn from your own work.

What about hallucinated citations?

Treat them as the headline risk. Since 2023 a growing list of lawyers has been sanctioned for filing briefs built on cases an AI tool invented. Require citation-accuracy testing: every cited authority has to be real, retrievable, and actually support the claim attached to it.

How many test questions do we need?

Around fifty matter-representative questions per use case is enough to start. That is large enough to expose real failure patterns and small enough that the people who know the matters will keep it current.

Do we have to put live client data through it to test?

No. The evaluation can run on sanitized or representative matters under project-specific confidentiality terms. You can learn how a tool fails without sending privileged material through it.

