“An agent is an intern with a to-do list and a phone. Useful for errands; dangerous with your credit card.”
An agent is just a loop
There is no magic. An agent is: a system prompt, a set of tools, and a loop that runs the model, executes any tool it calls, feeds the result back, and repeats until the model produces a final answer or hits a step limit. That's the entire idea. Everything else is engineering discipline around that loop.
messages = [system_prompt, user_message]
for step in range(MAX_STEPS):
response = model.call(messages, tools=TOOLS)
if response.is_final_answer:
return response.text
result = run_tool(response.tool_call) # your code, with all the guards
messages += [response, result]
raise StepLimitExceeded() # never loop forever
Why agents fall over: compounding error
The reason long agent runs fail isn't that the model is dumb. It's arithmetic. If each step is 95% reliable, a 10-step task succeeds 0.95^10 ≈ 60% of the time. A 20-step task: 36%. Error compounds multiplicatively, and there's no prompt that repeals multiplication.
The honest state of agents: excellent for well-scoped tasks of a few steps ("triage this ticket using these tools," "reconcile these two records"), unreliable for open-ended, long-horizon goals. Scope is the lever.
How to build one that works
Every reliable production agent does the same handful of things. None are exotic; all are discipline.
- Narrow scope — one job, a small toolset, a clear definition of done. The narrower the scope, the more tractable the eval.
- Hard step cap — the loop must always terminate. An agent with no cap is a bill with no ceiling.
- Idempotent tools — every tool safe to retry, because the agent will retry.
- Human-in-the-loop on high-stakes actions — propose, don't execute, when the action is irreversible.
- Log everything — every prompt, response, tool call, and result. You will need it within the week.
And prefer structure over autonomy. A fixed workflow with a model at each decision point is more reliable than a fully autonomous agent deciding its own plan — and far easier to eval. Reach for open-ended autonomy only when the task genuinely demands it, which is rarer than the demos suggest.
Multi-agent systems: usually premature
It's tempting to build a team of specialised agents that talk to each other. Occasionally that's the right design. Usually it multiplies the compounding-error problem — now you have several fragile loops coordinating — while making the system far harder to debug and eval. Make a single agent work first. Add agents only when you can point to the specific limit a single one hit.
In one line each
- An agent is a model in a loop with tools and a step cap. There is no magic — just engineering around the loop.
- Agents fall over because error compounds: 95% per step is ~60% over 10 steps. Multiplication doesn't care about your prompt.
- Build them narrow: one job, small toolset, hard step cap, idempotent tools, human gates, full logging.
- Prefer workflows over autonomy and a single agent over many — add complexity only against a limit you actually hit.
Where to go next