AI agents have moved from novelty into real work: recurring research, drafting and routing documents, triaging a support queue, keeping a process moving without a person on every step. Getting one to do that reliably, though, turns out to have less to do with the model than most people expect. We learned that building Packwolf, a workbench that runs a team of agents on a schedule. The clearest way to show the work is to watch a run.
At nine in the morning a scheduled run wakes the research agent to refresh the week’s competitive picture. One of its sources times out and returns nothing. Instead of failing the run, the agent records the empty result, falls back to the sources that answered, and hands a draft to the analyst agent. The analyst notices two numbers that moved and passes them to a reviewer agent with a note. The reviewer is ready to send a summary to the team, but the outbound message stops at an approval gate and waits for a person to read it and click send. By the time anyone looks, the trace already shows the whole chain: the tool that failed and why, what the run cost in tokens, and the point where a human stepped in. No one re-explained the workflow that morning, and no one sat and watched it happen.
Almost nothing in that run is the model doing clever reasoning. It is the parts around the model: the schedule that woke the agent, the memory that held the workflow, the handoff between specialists, the approval that caught the outbound message, and the trace that recorded the cost and the failure. Those parts are what you build when you put agents into production, and they are what the rest of this is about.
A production agent carries its own memory
Start with memory, because it is what let the morning run happen without you in it. The agent has to hold the playbook, the context from earlier steps, and what it learned last week, rather than being re-briefed each time. The morning the workflow lives in the agent instead of your head is the morning the agent has actually taken the job.
A multi-step process usually wants more than one agent, too. Research, drafting, review, and follow-up are different kinds of work, and a team of specialists that hand off to each other is often easier to operate than one generalist trying to hold all of it in a single thread. In Packwolf each agent keeps its own identity, memory, and brief, and they delegate among themselves, pulling in a person only when a decision needs one.
It runs on a schedule, and recovers when something breaks
The research agent ran because its run was scheduled, not because someone pressed go, and that is the normal case in production: agents work on a cadence while your attention is elsewhere. The engineering that matters most here is the loop, not the reasoning. Steps fail all the time. An API times out, a model returns nonsense, a tool gets rate-limited. A production agent assesses, executes, and recovers, the way the research agent fell back when its source went quiet, and it keeps each run’s outcome and priority on record so the next run knows what to pick up. In Packwolf that scheduled assess-execute-recover cycle is the heartbeat. An agent’s reliability is mostly a measure of how well that loop handles the bad days.
Build it so you can see everything it does
When the reviewer’s message stopped at the approval gate, you could see exactly why, because the run was observable from the start. Every generation, every tool call, every handoff lands in a trace you can replay, with the cost, the latency, and the failure category attached. That visibility is not paperwork. It is the only way to operate something that runs without you watching. In the first week of real use you tend to find a few unglamorous truths: some fraction of tool calls failing on empty arguments, one task quietly running up tokens, a model upgrade that shifted behavior overnight. A trace turns those from things you discover late into numbers you act on early.
An agent's reliability is mostly a measure of how well its loop handles the bad days.
Give agents real tools and real boundaries
Agents earn their keep when they can take actions, and stay safe when those actions have edges. Two habits do most of that work. Scope each agent’s tool access to what its job actually needs, so the research agent can read sources but never email a client. And gate the risky actions behind a named human approval, the way the outbound summary waited that morning, with a rollback path designed in wherever the action can be undone. Hand an agent broad permissions because it is convenient, and you eventually discover the blast radius in production.
Evaluate agents like any system you depend on
An agent that was right last week can drift, because the model under it changed, the data moved, or the workflow evolved. Evaluation is how you notice before someone downstream does: a test set tied to the work the agent actually does, run on a schedule, so a regression shows up as a number rather than a complaint. Agents need this more than most AI systems, because they act on their conclusions instead of only presenting them.
What a production-ready agent looks like
A production-ready agent passes a short checklist:
- It holds the workflow itself instead of relying on you to re-teach it.
- It runs on a schedule and recovers when a step fails.
- Every action it takes is visible: what it did, what it cost, and why it failed.
- Its risky actions sit behind a named human approval.
- Its tool access is scoped to its job.
- A regression shows up as a number after a model change, not as a surprise later.
Get those right and you have an agent doing real work. That checklist is most of the build.
The reasoning was never the hard part of that morning run. Everything that made it trustworthy sat around the model rather than inside it, and that surrounding system is most of what we build when a client wants agents they can rely on instead of babysit. If you are weighing where agents fit in your operation, that is where our AI agents and automation work starts, and the same evaluation discipline runs through keeping retrieval honest after launch.