Key findings
- Twenty specialized agents — a product owner, engineering, UX, business, and QA — ran unattended 24/7 on a single local machine (an Apple M3 Ultra), doing real project work with no cloud and for the cost of electricity.
- They co-authored a real product's plan and full technical spec: ~50,000 words across 11 documents (666 tracked versions, 17 contributing agents) and a four-part technical specification — architecture, data model, API, and UI/UX — plus a running decision log. They completed 238 tasks and peer-reviewed each other's work.
- It held up: across 942 scheduled runs, 69.9% succeeded (80–85% on settled days), with work distributed across 19 of the 20 agents.
- Of 260 failures, ~99% were operational — invalid model identifiers, malformed API parameters, local-model timeouts, and abandoned tool calls. Fewer than 1% were the model reasoning wrong.
- Running the whole team cost about $12 in modeled token cost; on local open-weight models the real marginal cost was electricity. The constraint was never compute — it was coordination and control.
Why this study exists
The industry argues about whether AI agents are reliable enough for production, and it argues mostly with surveys and synthetic benchmarks. Benchmarks have their own problems — researchers showed in 2026 that leading agent benchmarks can be gamed to near-perfect scores, and enterprise teams report a large gap between benchmark performance and what they see in deployment. What is scarce is the other thing: an operator running a real autonomous agent workforce on real work, and publishing what actually happened.
This is that. We run Packwolf, a control plane that runs a team of agents on a schedule with shared memory, tool access, human approvals, and a full audit trail. We recorded every run, and this paper reports a three-week window of steady-state operation, grounded in the operational database rather than impressions. It is a single operator studying its own system, so read the limitations section before you generalize — but the patterns are specific, measured, and, we think, useful.
What the workforce built
This was not one assistant answering prompts. It was a standing team of 20 specialized agents — a product owner, technical agents, UX agents, business and market agents, and a QA reviewer — each with its own brief, memory, and tools, running on a schedule around the clock. Their standing assignment was to take a real consumer software product from idea to a buildable plan and specification.
Over the window they produced roughly 50,000 words of cross-functional product documentation, and not boilerplate:
| What the workforce produced | Detail |
|---|---|
| Cross-functional documents | 11 living docs (~193k characters) — strategy, market research, business plan, GTM, marketing, sales, metrics, features, UX, decisions — across 666 tracked versions, 17 contributing agents |
| Technical specification | Four parts — architecture, data model, API, UI/UX — across 123 versions, authored by the engineering and design agents and routed through a QA reviewer |
| Tasks completed | 238, across 19 of the 20 agents |
| Coordination | 601 task assignments · 1,387 status transitions · 160 delegations · 88 agent-to-agent messages |
| Shared memory | 184 entities, 192 relationships, from 24,005 memory events |
| Human plan-version approvals | 12 |
The architecture they specified is coherent enough to build from: a mobile client with an offline cache, a retrieval-and-reranking layer over a vector-enabled database, and a model-routing layer with a primary model and a fallback — drawn as a system diagram, with the data flow written out step by step. The data model defines load-bearing and satellite entities with real constraints. The decision log records product and architecture choices with status, date, rationale, and who approved them. These are the artifacts a competent product team produces in its first weeks, and the agents produced them.
They worked like a team, not a fan-out. Work moved through a shared backlog, and agents handed sub-tasks to each other: in one cycle the product owner reviewed a UI/UX spec and returned it to draft with seven specific findings, and the design agent revised it; in another, one agent resolved an architecture change request by confirming an infrastructure detail with a second agent before updating the spec. They blocked on real dependencies (tasks tagged waiting-on-docs or waiting on a named teammate) and reconciled conflicts rather than ploughing ahead. As they went, they drew on and extended the shared memory: the institutional knowledge the team built about the product.
A person stayed in the loop where it mattered. The owner approved plan versions twelve times in the window; model choices, a launch budget, and licensing were escalated to and signed off by a human; and the permission layer approved or rejected individual tool requests in flight — granting one agent the right to delegate, denying another a tool and telling it to find another way. The agents ran the project; the human steered it. That this ran unattended for weeks on one machine and produced a real, versioned artifact — not a demo — is the point. The rest of this paper is how dependably it held up.
The system and the data
A "run" here is one execution of an agent's loop: the agent wakes on a trigger, assesses what needs doing, optionally executes work through tools, and records the outcome. Runs are triggered by a schedule, by a daily-planning cycle, by backlog triage, by a task being assigned, or by a direct mention. Every run is written to the database with its trigger, status, start and finish time, the number of tasks it checked and executed, an error string if it failed, and token-level cost for every model generation.
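To make the definition concrete, here is a minimal sketch of what one run record could look like. The field names, enum values, and the example row are assumptions made for illustration; the paper lists what is recorded, not the actual schema.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class Trigger(Enum):
    # The five trigger kinds named in the text; the string values are hypothetical.
    SCHEDULED = "scheduled"
    DAILY_PLANNING = "daily-planning"
    BACKLOG_TRIAGE = "backlog-triage"
    TASK_ASSIGNED = "task-assigned"
    MENTION = "mention"


class Status(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    TIMED_OUT = "timed-out"
    CANCELLED = "cancelled"


@dataclass(frozen=True)
class RunRecord:
    """One execution of an agent's loop, as described above (hypothetical shape)."""
    agent: str
    trigger: Trigger
    status: Status
    started_at: datetime
    finished_at: datetime
    tasks_checked: int
    tasks_executed: int
    error: Optional[str] = None      # set only when the run failed
    modeled_cost_usd: float = 0.0    # sum of token-level cost over the run's generations

    @property
    def duration_seconds(self) -> float:
        return (self.finished_at - self.started_at).total_seconds()


# Illustrative row only; these values are made up, not taken from the dataset.
example = RunRecord(
    agent="steel",
    trigger=Trigger.BACKLOG_TRIAGE,
    status=Status.SUCCEEDED,
    started_at=datetime(2026, 4, 20, 9, 0, 0),
    finished_at=datetime(2026, 4, 20, 9, 5, 32),
    tasks_checked=4,
    tasks_executed=1,
)
print(example.duration_seconds)
```

Keeping the record flat and immutable makes aggregates such as success rate by trigger or by agent a simple group-by over rows.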
This study covers the steady-state operating window of April 19 – May 7, 2026: 942 runs across 20 agents — how the system ran once it had stabilized. We exclude an initial build-out period and two days of anomalous automated scheduling, which are not representative of normal operation; the figures below are the settled state.
Every generation ran on local open-weight models — Qwen and MiniMax variants served through LM Studio on a single Apple M3 Ultra workstation with 256 GB of unified memory. There were no hosted-API calls. That detail matters for reading the failures: a self-hosted inference server has its own failure modes a managed endpoint would not, which we flag where it applies.
How dependably it ran
A workforce is only useful if it runs without a babysitter, so the question is how dependably it did. Across the 942 runs in the window, the outcome split was:
| Outcome | Runs | Share |
|---|---|---|
| Succeeded | 658 | 69.9% |
| Failed | 260 | 27.6% |
| Timed out | 23 | 2.4% |
| Cancelled | 1 | 0.1% |
The headline 70% understates the settled state, because the first day in the window was rough. Day one (April 19) succeeded on 9 of 93 runs while a misconfiguration was shaken out; the next several days ran 82–85% (April 20: 85%, April 22: 83%, April 23: 82%). The honest reading is that a freshly changed agent deployment is unreliable for a short break-in period and then settles into the high 70s to mid 80s.
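The shares above, and the effect of the rough first day, can be reproduced from the reported counts. This sketch uses only figures stated in the text; the variable names are ours.

```python
# Outcome counts from the table above.
outcomes = {"Succeeded": 658, "Failed": 260, "Timed out": 23, "Cancelled": 1}
total_runs = sum(outcomes.values())

# Share of each outcome, in percent, rounded to one decimal place.
shares = {name: round(100 * count / total_runs, 1) for name, count in outcomes.items()}

# Day one (April 19): 9 successes out of 93 runs.
day_one_runs, day_one_successes = 93, 9
day_one_rate = round(100 * day_one_successes / day_one_runs, 1)

# Success rate over the rest of the window, with day one removed.
rest_rate = round(
    100 * (outcomes["Succeeded"] - day_one_successes) / (total_runs - day_one_runs), 1
)

print(total_runs, shares, day_one_rate, rest_rate)
```

With day one removed, the remaining 849 runs succeed at about 76%, which is the sense in which the headline figure understates the settled state.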
Reliability also depended on what triggered the run. More tightly scoped triggers succeeded more often than open-ended ones:
| Trigger | Runs | Success rate |
|---|---|---|
| Backlog-triage | 262 | 77.5% |
| Task-assigned | 46 | 76.1% |
| Scheduled | 426 | 69.7% |
| Daily-planning (open-ended) | 201 | 58.2% |
Open-ended daily-planning was the least reliable category by a wide margin. The more a run looked like "decide for yourself what matters today," the more often it failed; the more it looked like "here is a specific item, handle it," the more often it succeeded. That is a practical lever: scope the trigger.
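As a rough cross-check, the per-trigger rates can be turned back into approximate success counts. The counts below are implied by rounding the published percentages, so treat them as estimates rather than values read from the database.

```python
# (runs, success rate in percent) per trigger, from the table above.
triggers = {
    "Backlog-triage": (262, 77.5),
    "Task-assigned": (46, 76.1),
    "Scheduled": (426, 69.7),
    "Daily-planning": (201, 58.2),
}

# Approximate successful runs implied by each published rate.
implied = {name: round(runs * rate / 100) for name, (runs, rate) in triggers.items()}

tabulated_runs = sum(runs for runs, _ in triggers.values())
pooled_rate = round(100 * sum(implied.values()) / tabulated_runs, 1)

# Gap between the most and least reliable trigger, in percentage points.
gap_points = round(triggers["Backlog-triage"][1] - triggers["Daily-planning"][1], 1)

# Runs in the 942-run window that the table does not list.
untabulated_runs = 942 - tabulated_runs

print(implied, pooled_rate, gap_points, untabulated_runs)
```

The four listed triggers account for 935 of the 942 runs, and their pooled rate lands close to the overall 69.9%, so the table and the headline agree with each other.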
Per agent, the team of 20 ran between 1 and 102 runs each, with settled success rates clustered in the 60s and 70s:
| Agent | Runs | Success | Tasks executed |
|---|---|---|---|
| steel | 102 | 75% | 19 |
| nexus | 98 | 73% | 37 |
| canvas | 83 | 72% | 29 |
| forge | 81 | 62% | 25 |
| atlas | 70 | 74% | 24 |
| clarity | 60 | 63% | 19 |
| sentinel | 50 | 70% | 12 |
| + 13 more agents | ≤49 each | 56–75% | — |
Successful runs took about 5.5 minutes on average (332 seconds), but with a long tail: 112 of 658 ran longer than ten minutes, and the longest approached an hour before completing. The 23 timeouts sit at the end of that tail. Long-running agent work is where reliability erodes, which argues for hard per-run time budgets rather than open-ended execution.
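One way to enforce a hard per-run time budget is to wrap the agent loop in a timeout. The sketch below uses Python's `asyncio.wait_for`; the function names, the ten-minute default, and the stand-in agent loop are assumptions for illustration, not Packwolf's implementation.

```python
import asyncio

# Assumption: a ten-minute budget, chosen to match the long-tail threshold in the text.
RUN_BUDGET_SECONDS = 600.0


async def run_with_budget(run_coro, budget_seconds: float = RUN_BUDGET_SECONDS) -> dict:
    """Await one agent run, cancelling it if it exceeds its time budget."""
    try:
        result = await asyncio.wait_for(run_coro, timeout=budget_seconds)
        return {"status": "succeeded", "result": result}
    except asyncio.TimeoutError:
        # wait_for cancels the underlying task before raising.
        return {"status": "timed_out", "result": None}
    except Exception as exc:
        return {"status": "failed", "error": str(exc)}


async def fake_agent_loop(seconds: float) -> str:
    """Stand-in for real agent work; it only sleeps."""
    await asyncio.sleep(seconds)
    return "done"


quick = asyncio.run(run_with_budget(fake_agent_loop(0.01), budget_seconds=1.0))
slow = asyncio.run(run_with_budget(fake_agent_loop(1.0), budget_seconds=0.05))
print(quick, slow)
```

Because the wrapper returns a status instead of raising, a scheduler can record a timed-out run the same way it records any other outcome.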
One pattern is worth stating plainly because it surprises people: most runs did almost nothing. On average a run checked about four backlog items and executed work on roughly a quarter of runs (231 task executions across 942 runs). That is correct behavior, not waste: an agent waking on a schedule should usually look, find nothing actionable, and stand down. Looking and deciding not to act is most of what an agent team does.
What actually failed
This is the finding that matters most. We categorized all 260 steady-state failures by their recorded error. The result is lopsided:
| Failure category | Count | Share |
|---|---|---|
| Configuration / API errors (invalid model identifier, malformed response_format, model reloaded/unloaded) | 135 | 52% |
| Abandoned tool calls (a tool call started and never resolved) | 75 | 29% |
| Local-model timeouts (request or queue wait) | 48 | 18% |
| Actual tool-execution failures | 2 | <1% |
Almost none of the failures were the model producing a wrong answer. They were plumbing: a model identifier that did not match what the inference server had loaded (80 runs), an API parameter the endpoint rejected (49 runs), a local model reloading or timing out under load (about 54 runs), and tool calls the agent opened and never closed (75 runs). The audit log tells the same story from another angle — across the window the system logged 5,438 tool executions, of which 848 (~13.5%) were abandoned and 84 were denied outright by a guardrail before running.
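Categorizing failures by their recorded error string can be sketched as an ordered list of patterns. The paper names the categories but not the exact error text, so every pattern and sample message below is a hypothetical stand-in.

```python
import re
from collections import Counter

# Ordered rules: the first matching pattern wins. All patterns are assumptions.
RULES = [
    ("configuration/API",
     re.compile(r"invalid model identifier|response_format|model (re|un)loaded", re.I)),
    ("local-model timeout", re.compile(r"timed? ?out|queue wait", re.I)),
    ("abandoned tool call", re.compile(r"tool call .*(never resolved|abandoned)", re.I)),
    ("tool-execution failure", re.compile(r"tool .*(failed|raised)", re.I)),
]


def classify(error: str) -> str:
    """Map a recorded error string to a failure category."""
    for category, pattern in RULES:
        if pattern.search(error):
            return category
    return "uncategorized"


# Made-up error strings, for illustration only.
samples = [
    "invalid model identifier: qwen3.6-35b-a3b-mlx",
    "response_format: unsupported schema",
    "request timed out after 120s",
    "tool call read_doc started and never resolved",
    "tool write_file failed: permission denied",
    "something unexpected",
]

counts = Counter(classify(message) for message in samples)
print(counts)
```

Keeping an explicit "uncategorized" bucket makes it visible when a new kind of error appears that the rules do not yet cover.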
The honest caveat: inference ran on a self-hosted stack — LM Studio on a single M3 Ultra — which inflates the configuration and timeout categories. The "invalid model identifier" and "model reloaded/unloaded" errors are LM Studio swapping models in and out of memory; the timeouts are one workstation under concurrent load from 20 agents. A hosted API would erase most of that infrastructure noise. But strip it out and the picture only sharpens: the largest remaining failure class is abandoned tool calls — an orchestration problem that does not go away on a managed endpoint — while the model reasoning wrong accounts for 2 of 260 failures either way. The system around the model fails far more often than the model does.
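The "strip it out" step can be worked through with the counts from the failure table. This sketch assumes that every configuration and timeout failure is infrastructure noise; the text says a hosted API would erase most of it, so this is an upper bound on the effect.

```python
# Failure counts from the table above.
failures = {
    "configuration/API": 135,
    "abandoned tool calls": 75,
    "local-model timeouts": 48,
    "tool-execution failures": 2,
}

# Assumption: both of these categories vanish entirely on a managed endpoint.
infrastructure = {"configuration/API", "local-model timeouts"}

remaining = {name: count for name, count in failures.items() if name not in infrastructure}
remaining_total = sum(remaining.values())
remaining_shares = {
    name: round(100 * count / remaining_total, 1) for name, count in remaining.items()
}

print(remaining_total, remaining_shares)
```

Under that assumption 77 failures remain, and abandoned tool calls account for nearly all of them.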
Economics
Every generation ran locally on one Apple M3 Ultra (256 GB) through LM Studio, so there was no API bill — the real marginal cost was electricity. The system still records a modeled token cost of $12.12 across the window (2.95 million input and 270 thousand output tokens), which is what these tokens would have cost at open-weight API rates: roughly 1.3¢ per run and 7¢ per completed task. The model mix:
| Model | Generations | Cost |
|---|---|---|
| qwen3.6-35b-a3b-turboquant-mlx | 633 | $6.95 |
| qwen3.5-122b-a10b-mlx | 212 | $2.51 |
| qwen3.6-35b-a3b-mlx | 264 | $2.08 |
| other local variants | 47 | $0.58 |
None of the window's generations used prompt caching, a clear and unrealized optimization. But the headline is the order of magnitude: a team of 20 agents ran around the clock for three weeks on a single workstation, for the cost of the electricity to keep it on. When marginal inference cost is effectively zero, the binding constraint stops being spend and becomes operational control — how often you let agents run, how you bound a long-running one, and who reviews the risky actions.
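The per-run figure can be checked against the model-mix table. This sketch uses only the reported generation counts, costs, and the 942-run total; the per-generation figure is our own derived number, not one the paper states.

```python
# (generations, modeled cost in USD) per model, from the table above.
model_mix = {
    "qwen3.6-35b-a3b-turboquant-mlx": (633, 6.95),
    "qwen3.5-122b-a10b-mlx": (212, 2.51),
    "qwen3.6-35b-a3b-mlx": (264, 2.08),
    "other local variants": (47, 0.58),
}

total_cost = round(sum(cost for _, cost in model_mix.values()), 2)
total_generations = sum(generations for generations, _ in model_mix.values())

total_runs = 942
cents_per_run = round(100 * total_cost / total_runs, 2)
cents_per_generation = round(100 * total_cost / total_generations, 2)

print(total_cost, total_generations, cents_per_run, cents_per_generation)
```

The rows sum to the reported $12.12, and dividing by the run count gives the roughly 1.3¢ per run quoted above.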
What this implies for running agents
The study lines up behind two claims. First, an autonomous agent workforce doing real, sustained, collaborative work is feasible now — on hardware you can put under a desk. Second, what makes it dependable is an operations problem, not a model-intelligence problem. Five things follow directly from the numbers.
- Scope the trigger. Specific work (77% success) beats open-ended planning (58%). The more a run knows exactly what it is for, the more often it finishes.
- Budget the long tail. Reliability eroded on long runs — 112 exceeded ten minutes and 23 timed out — and open-ended planning failed most. Hard per-run time limits and run-rate caps are cheap, high-leverage safety controls.
- Track and recover abandoned tool calls. They were the single largest non-infrastructure failure class (29% of failures, ~13.5% of all tool executions). An agent that opens a tool call and walks away needs to be detected and recovered, not left hanging.
- Expect a break-in period. Day one after a change ran at 10% before settling into the 80s. Treat a freshly changed deployment as unreliable until it proves otherwise.
- Keep humans on the risky actions. The guardrail layer denied 84 tool calls and routed actions to human approval. With agents acting on a schedule, the approval gate is what bounds the blast radius.
Almost none of this is about a better prompt. It is the loop, the schedule, the recovery, the budget, and the review — the system around the model. That is where the engineering is, and it is where reliability is won or lost.
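Detecting abandoned tool calls can be as simple as tracking when each call was opened and periodically sweeping for ones past a deadline. The class below is a minimal sketch under that assumption; its name, methods, and the 120-second default are hypothetical and not taken from the system studied.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToolCallTracker:
    """Tracks open tool calls and flags the ones that never resolve."""
    deadline_seconds: float = 120.0  # assumption: not a figure from the study
    _open: Dict[str, float] = field(default_factory=dict)

    def opened(self, call_id: str, now: Optional[float] = None) -> None:
        """Record the moment a tool call starts."""
        self._open[call_id] = time.monotonic() if now is None else now

    def resolved(self, call_id: str) -> None:
        """Mark a tool call as finished, whether it succeeded or failed."""
        self._open.pop(call_id, None)

    def sweep(self, now: Optional[float] = None) -> List[str]:
        """Return and forget every call that has been open past the deadline."""
        now = time.monotonic() if now is None else now
        abandoned = [
            call_id for call_id, started in self._open.items()
            if now - started > self.deadline_seconds
        ]
        for call_id in abandoned:
            del self._open[call_id]
        return abandoned


# Explicit timestamps keep the example deterministic.
tracker = ToolCallTracker(deadline_seconds=60.0)
tracker.opened("call-1", now=0.0)
tracker.opened("call-2", now=10.0)
tracker.resolved("call-1")
stale = tracker.sweep(now=100.0)
print(stale)
```

What happens to a swept call (retry, mark the run failed, escalate to a human) is a policy choice this sketch leaves open.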
Limitations
This is a single operator studying its own system, and it should be read that way. The deployment is our own product running our own work, not a controlled trial or a multi-tenant production service. The product the team worked on is a real, unreleased application we are building; we report the work and its scale but keep the product itself anonymized. Inference ran on a self-hosted stack — LM Studio on a single M3 Ultra — which inflates the configuration and timeout failure categories relative to a hosted API. The window is a three-week steady-state slice; we exclude an initial build-out period and two days of anomalous automated scheduling as non-representative of normal operation. The agents were doing ideius's own operations and product work, not a standardized task suite, so the success rate reflects those tasks, not agent capability in the abstract. Detailed per-run event history is pruned by design, so the analysis rests on the durable run, cost, task, approval, and audit records rather than full traces. We report these aggregates as a field observation. We are not claiming an industry-wide rate, and we would treat any single number here as directional until corroborated by other operators publishing their own.
How to cite this
Ahmed, F. (2026). Agent reliability in production: a field study of a 20-agent team. ideius. https://www.ideius.com/papers/agent-reliability-in-production/
The figures are derived from aggregate operational telemetry; no run content is published. For questions about methodology or definitions, reach out — and if you are putting agents into production and want this kind of instrumentation and reliability discipline on your own system, that is where our AI agents and automation work starts.