How to run technical due diligence on an AI startup

Every demo is built to win. That is its job. The team picked the inputs, rehearsed the path, and cut anything that breaks. Technical due diligence exists to find what the demo was designed to keep out of frame, and with AI that gap is wider than most buyers expect. A genuinely impressive demo can be assembled in a weekend on top of someone else’s model, with very little underneath that the company actually owns. The work of diligence is to tell a product from a wrapper before the money moves.

Start with what is actually theirs

The first question is not whether it works. It is which parts of it belong to them. Most AI products call a foundation model from OpenAI, Anthropic, or Google, and that is fine, because almost everyone does. It is also not a moat. “We use GPT-4” describes a dependency, not an advantage, because the target’s competitor can make the same API call by Friday.

The defensible part, if there is one, lives somewhere else: proprietary data the model is tuned or grounded on, a workflow that is genuinely hard to rebuild, an evaluation system that keeps quality honest, or distribution the team already owns. Diligence should locate that layer explicitly and size it. If the only thing between the user and a public API is a prompt, you are paying for a prompt.

Ask to see the evals, not the demo

A demo shows you the good case. An evaluation tells you how often the good case happens and what the bad case looks like. Ask the team how they measure quality: what the test set is, who built it, and how they catch a regression when a model updates or the data drifts. A serious AI company has an answer and can show you the harness. A fragile one measures quality by looking at the demo, which means nobody knows whether last month’s change made the product worse.

This matters more than it sounds. Most AI projects that fail do not fail on model quality. A 2024 RAND study traced the leading causes to problems framed wrong, data that was not ready, and infrastructure that could not support the work. An absence of evaluation is how those failures stay invisible until after the deal closes.
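To make the idea of a harness concrete, here is a minimal sketch in Python of what "catching a regression" can look like. Everything in it is an assumption for illustration: the `EvalCase` type, the substring check, the 0.02 tolerance, and the two stand-in model functions are hypothetical, and a real harness would call the actual product and use a grading method suited to the task.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    prompt: str
    expected: str  # substring a correct answer must contain (deliberately simple check)


def pass_rate(model: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Fraction of cases whose output contains the expected substring."""
    if not cases:
        raise ValueError("test set is empty")
    passed = sum(1 for c in cases if c.expected.lower() in model(c.prompt).lower())
    return passed / len(cases)


def has_regressed(baseline: float, current: float, tolerance: float = 0.02) -> bool:
    """True when the current pass rate falls more than `tolerance` below baseline."""
    return (baseline - current) > tolerance


# Hypothetical stand-ins so the sketch runs without any API access.
def model_v1(prompt: str) -> str:
    answers = {"capital of France?": "Paris", "2 + 2?": "4", "largest planet?": "Jupiter"}
    return answers.get(prompt, "unknown")


def model_v2(prompt: str) -> str:
    # Simulates a model update that silently lost one capability.
    answers = {"capital of France?": "Paris", "2 + 2?": "4"}
    return answers.get(prompt, "unknown")


CASES = [
    EvalCase("capital of France?", "paris"),
    EvalCase("2 + 2?", "4"),
    EvalCase("largest planet?", "jupiter"),
    EvalCase("boiling point of water in Celsius?", "100"),
]

baseline = pass_rate(model_v1, CASES)  # 3 of 4 cases pass
current = pass_rate(model_v2, CASES)   # 2 of 4 cases pass
print(f"baseline={baseline:.2f} current={current:.2f} "
      f"regressed={has_regressed(baseline, current)}")
```

The questions to put to the team map onto the pieces of the sketch: who wrote the equivalent of `CASES`, how the pass check is defined, and whether something like `has_regressed` runs automatically when the model or the data changes.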

Follow the data

Data is where AI diligence gets uncomfortable, which is exactly why it earns its place. Where did the training or retrieval data come from, and does the company have the right to use it? A model tuned on data they did not have permission to use is a liability, and it transfers to the buyer.

If the product touches personal, health, or financial information, ask how that data is stored, who can see it, whether it leaves the company’s control when a third-party model is called, and whether that arrangement would survive a privacy review. Plenty of acquisitions look clean until someone asks where the data came from.

Reproduce the demo on your own terms

Take the demo away from the team and run it on inputs they did not choose. Bring your own documents, your own edge cases, the messy real examples the product will actually meet in production. The distance between the rehearsed demo and the same system on your data is the single most informative measurement in AI diligence. A product that holds up can justify what they are asking. A product that only works on the three examples in the deck is a research project with a sales team.
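One way to turn that distance into a number is to score the same system on the team's demo inputs and on your own, then subtract. The sketch below is illustrative only: `toy_classifier`, both case lists, and the exact-match scoring are hypothetical stand-ins, and a real product would need a scoring rule that fits its output.

```python
from typing import Callable, List, Tuple

Case = Tuple[str, str]  # (input text, expected label)


def accuracy(system: Callable[[str], str], cases: List[Case]) -> float:
    """Share of cases where the system's output equals the expected label."""
    if not cases:
        raise ValueError("no cases supplied")
    correct = sum(1 for text, expected in cases if system(text) == expected)
    return correct / len(cases)


def demo_gap(system: Callable[[str], str],
             demo_cases: List[Case],
             own_cases: List[Case]) -> float:
    """Accuracy on the rehearsed inputs minus accuracy on the buyer's inputs."""
    return accuracy(system, demo_cases) - accuracy(system, own_cases)


# Hypothetical stand-in: a keyword matcher that looks good on tidy inputs.
def toy_classifier(text: str) -> str:
    return "invoice" if "invoice" in text.lower() else "other"


demo_cases = [
    ("Invoice #1001 for consulting", "invoice"),
    ("Meeting notes, Tuesday", "other"),
    ("INVOICE: hosting fees", "invoice"),
]

own_cases = [
    ("Inv. 2231, net 30", "invoice"),               # abbreviation the demo never used
    ("Please remit payment for March", "invoice"),  # no keyword at all
    ("Invoice template (blank)", "other"),          # keyword present, wrong class
    ("Lunch menu", "other"),
]

print(f"demo accuracy: {accuracy(toy_classifier, demo_cases):.2f}")
print(f"own-data accuracy: {accuracy(toy_classifier, own_cases):.2f}")
print(f"gap: {demo_gap(toy_classifier, demo_cases, own_cases):.2f}")
```

A small gap on a meaningful sample is evidence the product generalizes. A large one says the demo set was the product.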

Cost, and what happens when the model moves

AI products carry a cost structure traditional software does not, because every request can come with a real inference bill. Ask what it costs to serve a user at the volume the projections assume, and whether the unit economics still work when usage grows tenfold, since inference cost rises with usage instead of shrinking per user as it scales. Then ask about the part the company does not control. If the underlying model provider raises its price, deprecates the version they depend on, or changes its terms, how much of the product survives? A team that has thought about this has a fallback. One that has not is exposed to a vendor it cannot influence.
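The arithmetic behind those questions is simple enough to sketch. The Python below is a rough model under stated assumptions: the token counts, request volume, subscription price, and per-million-token prices are all hypothetical figures chosen for round numbers, not any provider's actual price list, and real costs would also include retries, caching, embeddings, and hosting.

```python
from dataclasses import dataclass


@dataclass
class UsageProfile:
    requests_per_user_month: int
    input_tokens_per_request: int
    output_tokens_per_request: int


@dataclass
class ModelPricing:
    usd_per_million_input: float
    usd_per_million_output: float


def inference_cost_per_user(usage: UsageProfile, pricing: ModelPricing) -> float:
    """Monthly inference cost in USD for one user."""
    per_request_micro_usd = (
        usage.input_tokens_per_request * pricing.usd_per_million_input
        + usage.output_tokens_per_request * pricing.usd_per_million_output
    )
    return usage.requests_per_user_month * per_request_micro_usd / 1_000_000


def gross_margin(price_per_user: float, cost_per_user: float) -> float:
    """Margin as a fraction of price, counting inference cost only."""
    return (price_per_user - cost_per_user) / price_per_user


PRICE = 20.0  # hypothetical flat monthly subscription, USD

usage = UsageProfile(requests_per_user_month=400,
                     input_tokens_per_request=3000,
                     output_tokens_per_request=500)
pricing = ModelPricing(usd_per_million_input=2.50, usd_per_million_output=10.00)

base_cost = inference_cost_per_user(usage, pricing)

# Scenario 1: the provider doubles its prices.
doubled = ModelPricing(pricing.usd_per_million_input * 2,
                       pricing.usd_per_million_output * 2)
shock_cost = inference_cost_per_user(usage, doubled)

# Scenario 2: per-user usage grows tenfold on the same flat subscription.
heavy = UsageProfile(usage.requests_per_user_month * 10,
                     usage.input_tokens_per_request,
                     usage.output_tokens_per_request)
heavy_cost = inference_cost_per_user(heavy, pricing)

for label, cost in [("base", base_cost), ("price x2", shock_cost), ("usage x10", heavy_cost)]:
    print(f"{label}: cost=${cost:.2f} margin={gross_margin(PRICE, cost):.0%}")
```

With these made-up inputs the margin is healthy at baseline, halves to break-even territory faster than expected under a price increase, and goes negative when heavy users multiply requests on a flat plan. The point of the exercise is to rerun it with the target's real numbers and see which scenario the projections quietly assume away.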

The team has to operate it, not just build it

Building a prototype and running a production AI system are different skills. The prototype proves the idea. Production means monitoring, handling the cases the model gets wrong, re-grounding or retraining as the data shifts, and keeping the system working long after launch. Diligence should ask who owns that work and whether the team has done it before or only demoed. The quietest failure in AI is a company that can build an impressive thing once and cannot keep it working.

The diligence test

By the end, a technical diligence report should answer a short list of plain questions:

What part of this is theirs, and what is rented from a model provider?
How do they know it works, and how would they know if it stopped?
Where did the data come from, and do they have the right to use it?
Does it survive contact with our data and our edge cases?
What breaks if the model provider changes the deal?
Can this team operate it, or only build it?

If those answers are clear and they hold up, the AI is real and you can price it with confidence. If they are vague, the demo was the product, and the number should say so.

Technical diligence is not about doubting that the demo worked. It is about learning what the demo was not allowed to show. Done well, it tells an investment committee or an acquirer what is real, what is fragile, and what is missing, in language they can act on before they commit. For a scoped read on a single internal decision rather than an outside product, an AI Decision Memo is the closer fit, and the AI roadmap piece covers the build-versus-buy call from the inside.

Common questions

What is the difference between a demo and real AI due diligence?

A demo shows the cases the team chose. Diligence measures how often those cases hold, what the failures look like, and what the company actually owns versus rents from a model provider. The work is to learn what the demo was built to keep out of frame.

Is "we use GPT-4" a competitive advantage?

No. Calling a foundation model is a dependency almost every AI company shares, not a moat. The defensible layer, if it exists, is proprietary data, a hard-to-rebuild workflow, a real evaluation system, or distribution. Diligence should locate and size that layer.

What is the biggest AI-specific risk in an acquisition?

Two stand out: data the company did not have the right to use, which becomes the buyer's liability, and a dependency on a model provider that can change pricing or deprecate the version the product relies on. Both are easy to miss and expensive to inherit.

How do you test an AI product during diligence?

Run it on your own data and edge cases rather than the demo's. Ask for the evaluation harness and the test set. Check unit economics at projected volume. Confirm the team can operate a production system, not only build a prototype.
