What it takes to run AI agents in production

AI agents have moved from novelty into real work: recurring research, drafting and routing documents, triaging a support queue, keeping a process moving without a person on every step. Getting one to do that reliably, though, turns out to have less to do with the model than most people expect. We learned that while building Packwolf, a workbench that runs a team of agents on a schedule. The clearest way to show what that work involves is to watch a run.

At nine in the morning a scheduled run wakes the research agent to refresh the week’s competitive picture. One of its sources times out and returns nothing. Instead of failing the run, the agent records the empty result, falls back to the sources that answered, and hands a draft to the analyst agent. The analyst notices two numbers that moved and passes them to a reviewer agent with a note. The reviewer is ready to send a summary to the team, but the outbound message stops at an approval gate and waits for a person to read it and click send. By the time anyone looks, the trace already shows the whole chain: the tool that failed and why, what the run cost in tokens, and the point where a human stepped in. No one re-explained the workflow that morning, and no one sat and watched it happen.

Almost nothing in that run is the model doing clever reasoning. It is the parts around the model: the schedule that woke the agent, the memory that held the workflow, the handoff between specialists, the approval that caught the outbound message, and the trace that recorded the cost and the failure. Those parts are what you build when you put agents into production, and they are what the rest of this article is about.

A production agent carries its own memory

Start with memory, because it is what let the morning run happen without you in it. The agent has to hold the playbook, the context from earlier steps, and what it learned last week, rather than being re-briefed each time. The morning the workflow lives in the agent instead of your head is the morning the agent has actually taken the job.
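
One way to picture that memory is a small store the agent reads at the start of every run and writes to as it goes. The sketch below is an illustration under assumptions, not Packwolf's storage format: the AgentMemory class, its method names, and the JSON file layout are all hypothetical.

```python
import json
from pathlib import Path


class AgentMemory:
    """File-backed memory for one agent (hypothetical layout, for illustration)."""

    def __init__(self, path):
        self.path = Path(path)
        if self.path.exists():
            # Reload whatever earlier runs left behind.
            self.state = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self.state = {"playbook": [], "notes": []}

    def set_playbook(self, steps):
        """Store the workflow so nobody has to re-teach it."""
        self.state["playbook"] = list(steps)
        self._save()

    def remember(self, note):
        """Append something learned during this run and persist it immediately."""
        self.state["notes"].append(note)
        self._save()

    def briefing(self, last_n=5):
        """What the agent reads at the start of a run, in place of a human re-brief."""
        return {
            "playbook": self.state["playbook"],
            "recent_notes": self.state["notes"][-last_n:],
        }

    def _save(self):
        self.path.write_text(json.dumps(self.state, indent=2), encoding="utf-8")
```

The point of the design is that a fresh process on Monday can construct AgentMemory with the same path and get last week's playbook and notes back without anyone typing them in again.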

A multi-step process usually wants more than one agent, too. Research, drafting, review, and follow-up are different kinds of work, and a team of specialists that hand off to each other is often easier to operate than one generalist trying to hold all of it in a single thread. In Packwolf each agent keeps its own identity, memory, and brief, and they delegate among themselves, pulling in a person only when a decision needs one.

It runs on a schedule, and recovers when something breaks

The research agent ran because its run was scheduled, not because someone pressed go, and that is the normal case in production: agents work on a cadence while your attention is elsewhere. The engineering that matters most here is the loop, not the reasoning. Steps fail all the time. An API times out, a model returns nonsense, a tool gets rate-limited. A production agent assesses, executes, and recovers, the way the research agent fell back when its source went quiet, and it keeps each run’s outcome and priority on record so the next run knows what to pick up. In Packwolf that scheduled assess-execute-recover cycle is the heartbeat. An agent’s reliability is mostly a measure of how well that loop handles the bad days.
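
The recovery described above can be sketched in a few lines. This is a minimal illustration, not Packwolf's heartbeat: the RunRecord type, the run_once function, and the fetch callable are hypothetical names, and a real loop would also handle retries, backoff, and priorities.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """What one run leaves on record for the next run to pick up."""
    outcome: str = "pending"  # becomes "ok", "partial", or "failed"
    results: dict = field(default_factory=dict)
    failures: dict = field(default_factory=dict)


def run_once(sources, fetch):
    """One assess-execute-recover pass over named sources.

    fetch(name) is assumed to return text, or raise on a timeout or rate limit.
    """
    record = RunRecord()
    for name in sources:
        try:
            text = fetch(name)
        except Exception as exc:
            # Recover: note the failure category and move on to the next source.
            record.failures[name] = type(exc).__name__
            continue
        if not text:
            # An empty answer is recorded rather than treated as success.
            record.failures[name] = "empty_result"
            continue
        record.results[name] = text

    # Assess: the run only fails outright if nothing answered.
    if not record.results:
        record.outcome = "failed"
    elif record.failures:
        record.outcome = "partial"
    else:
        record.outcome = "ok"
    return record
```

A run where one source times out ends as "partial" with the failure named, which is the shape of the morning run: the draft still goes out, and the record says which source went quiet.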

A production agent runs as a team with memory, a schedule, approvals, and a trace.

Build it so you can see everything it does

When the reviewer’s message stopped at the approval gate, you could see exactly why, because the run was observable from the start. Every generation, every tool call, every handoff lands in a trace you can replay, with the cost, the latency, and the failure category attached. That visibility is not paperwork. It is the only way to operate something that runs without you watching. In the first week of real use you tend to find a few unglamorous truths: some fraction of tool calls failing on empty arguments, one task quietly running up tokens, a model upgrade that shifted behavior overnight. A trace turns those from things you discover late into numbers you act on early.
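
To make the trace concrete, here is a rough sketch of a recorder that wraps each step and keeps cost, latency, and failure category together. The TraceEvent and Trace names are hypothetical, the token count is passed in by the caller as an assumption (in practice it would come from the model response), and a production system would write events to durable storage rather than a list.

```python
import time
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceEvent:
    kind: str                 # e.g. "generation", "tool_call", "handoff"
    name: str
    tokens: int
    latency_s: float
    failure: Optional[str] = None


class Trace:
    def __init__(self):
        self.events = []

    def record(self, kind, name, fn, tokens=0):
        """Run fn(), time it, and record any exception as a failure category.

        The exception is swallowed here so the run can continue; the event
        keeps the category so the failure is still visible afterwards.
        """
        start = time.perf_counter()
        result, failure = None, None
        try:
            result = fn()
        except Exception as exc:
            failure = type(exc).__name__
        latency = time.perf_counter() - start
        self.events.append(TraceEvent(kind, name, tokens, latency, failure))
        return result

    def summary(self):
        """Roll the events up into the numbers you would act on."""
        failures = Counter(e.failure for e in self.events if e.failure)
        return {
            "events": len(self.events),
            "tokens": sum(e.tokens for e in self.events),
            "failures": dict(failures),
        }
```

Grouping failures by category in summary() is what turns "some tool calls are failing" into a count per cause that can be tracked from one week to the next.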

Give agents real tools and real boundaries

Agents earn their keep when they can take actions, and stay safe when those actions have edges. Two habits do most of that work. Scope each agent’s tool access to what its job actually needs, so the research agent can read sources but never email a client. And gate the risky actions behind a named human approval, the way the outbound summary waited that morning, with a rollback path designed in wherever the action can be undone. Hand an agent broad permissions because it is convenient, and you eventually discover the blast radius in production.
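
Both habits fit in a small wrapper around tool calls. The sketch below is hypothetical throughout (the ToolBox class, the ApprovalRequired exception, and the tool names are invented for illustration) and leaves out the rollback path, which depends on the action being gated.

```python
class ApprovalRequired(Exception):
    """Raised when a gated tool is called without a named approver."""


class ToolBox:
    """Per-agent tool access: an allowlist plus a set of gated tools."""

    def __init__(self, agent, allowed, gated=()):
        self.agent = agent
        self.allowed = set(allowed)
        self.gated = set(gated)

    def call(self, tool, fn, approved_by=None):
        # Scope: anything outside the agent's job is refused outright.
        if tool not in self.allowed:
            raise PermissionError(f"{self.agent} may not use {tool}")
        # Gate: risky tools wait until a named person has approved.
        if tool in self.gated and not approved_by:
            raise ApprovalRequired(f"{tool} needs a named approver")
        return fn()


# Usage: the research agent can read but not send; the reviewer can send
# only once someone has signed off.
research = ToolBox("research", allowed={"read_source"})
reviewer = ToolBox("reviewer", allowed={"send_summary"}, gated={"send_summary"})
```

Calling reviewer.call("send_summary", ...) without approved_by raises ApprovalRequired, which is the code-level version of the outbound summary waiting at the gate.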

Evaluate agents like any system you depend on

An agent that was right last week can drift, because the model under it changed, the data moved, or the workflow evolved. Evaluation is how you notice before someone downstream does: a test set tied to the work the agent actually does, run on a schedule, so a regression shows up as a number rather than a complaint. Agents need this more than most AI systems, because they act on their conclusions instead of only presenting them.
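
A scheduled evaluation can start very small. The sketch below assumes exact-match scoring against a fixed test set, which suits a task like routing support tickets but not free-form drafting; the function names, the sample cases, and the 0.05 tolerance are illustrative assumptions.

```python
def evaluate(agent_fn, cases):
    """Return the fraction of cases where agent_fn(input) equals the expected output."""
    if not cases:
        raise ValueError("empty test set")
    passed = sum(1 for inp, expected in cases if agent_fn(inp) == expected)
    return passed / len(cases)


def regressed(score, baseline, tolerance=0.05):
    """True when the score has dropped more than `tolerance` below the baseline."""
    return score < baseline - tolerance


# A test set tied to the work the agent actually does (hypothetical examples).
CASES = [
    ("refund request", "billing"),
    ("password reset", "account"),
    ("invoice copy", "billing"),
    ("login loop", "account"),
]
```

Run on a schedule and after every model change, evaluate() gives the number and regressed() gives the alert, so drift shows up in a report before it shows up in someone's inbox.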

What a production-ready agent looks like

A production-ready agent passes a short checklist:

It holds the workflow itself instead of relying on you to re-teach it.
It runs on a schedule and recovers when a step fails.
Every action it takes is visible: what it did, what it cost, and why it failed.
Its risky actions sit behind a named human approval.
Its tool access is scoped to its job.
A regression shows up as a number after a model change, not as a surprise later.

Get those right and you have an agent doing real work. That checklist is most of the build.

The reasoning was never the hard part of that morning run. Everything that made it trustworthy sat around the model rather than inside it, and that surrounding system is most of what we build when a client wants agents they can rely on instead of babysit. We later put numbers to this in a field study of an autonomous agent workforce: across 942 runs about 70% succeeded, and almost every failure was operational (bad configuration, timeouts, abandoned tool calls) rather than the model reasoning wrong. If you are weighing where agents fit in your operation, that is where our AI agents and automation work starts, and the same evaluation discipline runs through our work on keeping retrieval honest after launch.

Common questions

What is the hardest part of putting an AI agent into production?

Not the reasoning. It is the system around the agent: memory so it carries the workflow, a schedule that recovers when steps fail, approvals on risky actions, scoped tool access, traces so you can see cost and failures, and evaluation so you catch regressions. The model can usually do the task; the operational system around it is the work.

Do I need multiple agents or just one?

For a single task, one is fine. For a multi-step process with research, drafting, review, and follow-up, a team of specialists that hand off to each other is usually easier to operate than one generalist holding everything in a single chat. The structure is what lets work move without you re-briefing at every step.

How do you keep an AI agent from doing something harmful?

Scope its tool access so it can only touch what its job requires, gate the risky or irreversible actions behind a named human approval, design a rollback path where the action can be undone, and log every action to a trace you can inspect. Broad permissions for convenience are how teams get burned.

How do you know if a production agent is still working correctly?

Evaluation. A test set tied to the work the agent actually does, run on a schedule, so a regression shows up as a number before a customer reports it. Agents need this more than most AI systems because they act on their conclusions rather than just presenting them.
