“Shipping a model is not launching a rocket. It's opening a kitchen — the hard part is the lunch rush, every day.”
The request path in production
A served AI feature isn't a model call; it's a pipeline. A request hits a cache, passes an input guardrail, reaches the model, passes an output guardrail, and returns — with a fallback path for when the model is slow or down. Each stage is ordinary infrastructure, and each is where you control cost, safety, and reliability.
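The pipeline described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `serve`, `call_model`, `input_ok`, `output_ok`, and both message constants are hypothetical names introduced here, and the cache is a plain dictionary.

```python
from typing import Callable, Dict

# Hypothetical user-facing messages; real wording is a product decision.
FALLBACK_MESSAGE = "Sorry, please try again shortly."
REFUSAL_MESSAGE = "This request can't be processed."


def serve(
    prompt: str,
    call_model: Callable[[str], str],
    cache: Dict[str, str],
    input_ok: Callable[[str], bool],
    output_ok: Callable[[str], bool],
) -> str:
    """Run one request through cache, input guardrail, model, output guardrail."""
    # 1. Cache: a hit skips the model call entirely.
    if prompt in cache:
        return cache[prompt]

    # 2. Input guardrail: refuse before paying for a model call.
    if not input_ok(prompt):
        return REFUSAL_MESSAGE

    # 3. Model call, with a fallback path when the model is slow or down.
    #    call_model is assumed to enforce its own timeout and raise on failure.
    try:
        answer = call_model(prompt)
    except Exception:
        return FALLBACK_MESSAGE

    # 4. Output guardrail: an answer that fails the check is never returned.
    if not output_ok(answer):
        return FALLBACK_MESSAGE

    # Only answers that passed every stage are cached.
    cache[prompt] = answer
    return answer
```

Every stage other than `call_model` is deterministic code you can unit-test, which is why the guardrails and the cache are passed in as plain functions and data.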
Cost and latency are design parameters
Training is capex; inference is opex — you pay per call, forever. At scale, model cost becomes a real line item, and the choices that control it are architectural, made early. The big levers: model size (use the smallest model that passes your eval), context length (every token in the prompt costs on every call), and caching.
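A back-of-the-envelope calculation shows why context length is a lever. The function below is a sketch that assumes a simple per-token pricing model. The prices, traffic, and token counts in the usage example are made-up illustration values, not any provider's real rates.

```python
def monthly_cost(
    calls_per_day: int,
    prompt_tokens: int,
    output_tokens: int,
    price_in_per_mtok: float,
    price_out_per_mtok: float,
    days: int = 30,
) -> float:
    """Estimate monthly spend, given prices in dollars per million tokens."""
    per_call = (
        prompt_tokens * price_in_per_mtok + output_tokens * price_out_per_mtok
    ) / 1_000_000
    return per_call * calls_per_day * days


# Hypothetical workload: 100,000 calls a day, 300 output tokens per call,
# $1 per million input tokens and $4 per million output tokens.
before = monthly_cost(100_000, 2000, 300, 1.0, 4.0)  # 2,000-token prompt
after = monthly_cost(100_000, 1000, 300, 1.0, 4.0)   # prompt trimmed to 1,000
print(round(before), round(after))
```

Under these assumed numbers, halving the prompt takes the monthly bill from about $9,600 to about $6,600, with no change to the model or the traffic.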
Caching is the highest-leverage cost lever and the most overlooked. Many requests are near-duplicates; an exact-match or semantic cache can answer them without a model call, at almost no cost. Prompt caching, which lets repeated calls reuse the work already done on a long, stable system prompt instead of paying full price for it every time, cuts the bill further.
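An exact-match cache is the simpler of the two kinds and can be sketched as follows. `ExactMatchCache` and its parameters are hypothetical names. The normalisation step (lowercasing and collapsing whitespace) is an assumption that suits case-insensitive questions and would be wrong for prompts where case matters, such as code.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


class ExactMatchCache:
    """In-memory exact-match cache keyed on model name plus normalised prompt."""

    def __init__(
        self,
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # Assumption: case and extra whitespace do not change the meaning.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()

    def get(self, model: str, prompt: str) -> Optional[str]:
        entry = self._store.get(self._key(model, prompt))
        if entry is None:
            return None
        stored_at, answer = entry
        if self._clock() - stored_at > self._ttl:
            return None  # expired entries count as a miss
        return answer

    def put(self, model: str, prompt: str, answer: str) -> None:
        self._store[self._key(model, prompt)] = (self._clock(), answer)
```

Including the model name in the key means a model upgrade does not serve answers produced by the old model. A semantic cache would replace the hash lookup with a nearest-neighbour search over embeddings, which is not shown here.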
Latency is a product decision, not just a number. Streaming hides it: users tolerate a slower answer that starts appearing immediately far better than a faster one that arrives all at once after a silent pause. And agent latency stacks: a 10-step agent at two seconds a step takes twenty seconds, which is a different product from a one-second answer.
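The two effects are easy to put numbers on. The helpers below are a rough model, assuming a fixed time to the first token and a constant per-token generation time. Both function names and the figures in the usage lines are illustrative assumptions.

```python
def perceived_wait(
    first_token_s: float, tokens: int, per_token_s: float, streaming: bool
) -> float:
    """Seconds before the user sees anything on screen."""
    total = first_token_s + tokens * per_token_s
    # With streaming the user starts reading at the first token;
    # without it they stare at a spinner for the whole generation.
    return first_token_s if streaming else total


def agent_latency(steps: int, seconds_per_step: float) -> float:
    """Sequential agent steps add up; nothing overlaps in this simple model."""
    return steps * seconds_per_step


# Hypothetical answer: 0.5 s to first token, 200 tokens at 0.02 s each.
print(perceived_wait(0.5, 200, 0.02, streaming=True))   # wait until first token
print(perceived_wait(0.5, 200, 0.02, streaming=False))  # wait for the full answer
print(agent_latency(10, 2.0))                           # the 10-step agent above
```

In this model the same answer feels like a half-second wait when streamed and a wait of about four and a half seconds when buffered, while the 10-step agent costs twenty seconds either way unless steps are removed or run in parallel.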
Reliability: the model will fail
Providers have outages. Models get rate-limited, time out, and occasionally return garbage. Your feature must degrade, not collapse. The defences are the familiar ones from distributed systems: timeouts, retries with backoff, a fallback (a smaller model, a cached answer, or an honest "try again shortly"), and a circuit breaker so one provider's bad afternoon doesn't take you down with it.
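Retries with backoff, a fallback, and a circuit breaker fit together roughly as sketched below. All names here (`CircuitBreaker`, `call_with_resilience`, `primary`, `fallback`) are hypothetical, and the thresholds are placeholders. The sketch assumes `primary` enforces its own timeout and raises when it fails.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stops calling a failing provider until a cool-down period has passed."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        # After the cool-down, let a trial request through (half-open state).
        return self._clock() - self._opened_at >= self._reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()


def call_with_resilience(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    breaker: CircuitBreaker,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try the primary model with backoff; degrade to the fallback on failure."""
    if breaker.allow():
        for attempt in range(max_attempts):
            try:
                answer = primary(prompt)
                breaker.record_success()
                return answer
            except Exception:  # a real system would catch narrower error types
                breaker.record_failure()
                if not breaker.allow() or attempt == max_attempts - 1:
                    break
                sleep(base_delay_s * 2 ** attempt)  # exponential backoff
    # The fallback might be a smaller model, a cached answer, or an honest message.
    return fallback(prompt)
```

Once the breaker opens, requests go straight to the fallback without waiting on the failing provider, which is what keeps one provider's outage from consuming your own latency budget.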
Changing a live system without breaking it
AI systems change constantly: prompts get tuned, models get upgraded, retrieval gets adjusted, providers deprecate versions out from under you. Every one of those is a chance to silently regress. The safe-change discipline is the same as any production system, applied to a probabilistic component.
- Gate every change on the eval set — no eval pass, no ship (chapter 6).
- Roll out gradually — canary a slice of traffic, watch the metrics, then widen.
- Pin model versions — never let "latest" change your behaviour without your knowledge.
- Keep a rollback — prompts and model choices revert as cleanly as code.
- Watch production, not just evals — the cases users send will surprise your test set.
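Several items on this list reduce to small pieces of deterministic code. The sketch below shows one possible shape for pinned releases, an eval gate, and a canary split. `Release`, `passes_eval_gate`, `pick_release`, the model identifiers, and the regression tolerance are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """A deployable unit: a pinned model version plus a prompt version."""
    model: str           # an explicit version string, never "latest"
    prompt_version: str


def passes_eval_gate(
    candidate_score: float, baseline_score: float, max_regression: float = 0.01
) -> bool:
    """No eval pass, no ship: the candidate may not regress beyond the tolerance."""
    return candidate_score >= baseline_score - max_regression


def pick_release(
    user_id: str, stable: Release, canary: Release, canary_percent: int
) -> Release:
    """Route a fixed slice of users to the canary, deterministically per user."""
    bucket = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return canary if bucket < canary_percent else stable


# Made-up identifiers for illustration only.
stable = Release(model="example-model-2024-01-01", prompt_version="prompt-v7")
canary = Release(model="example-model-2024-06-01", prompt_version="prompt-v8")

if passes_eval_gate(candidate_score=0.93, baseline_score=0.92):
    release = pick_release("user-42", stable, canary, canary_percent=5)
```

Hashing the user ID keeps each user on the same release across requests, so widening the rollout means raising `canary_percent`, and rolling back means setting it to zero.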
Where to go from here
You now have the shape of a real AI system: a thin model in a thick, deterministic shell, fed by retrieval, empowered by tools, kept honest by evals, and operated like any other production service. Two directions deepen it: securing it against the new attack surface all of this opens up, and working through the prompting and RAG guides for hands-on patterns.
In one line each
- A served feature is a pipeline — cache, guardrails, model, fallback — not a bare model call.
- Cost and latency are architectural: right-size the model, cache aggressively, trim context, stream the output.
- The model will fail; degrade gracefully with timeouts, retries with backoff, fallbacks, and a circuit breaker.
- Change a live system safely: gate on evals, roll out gradually, pin versions, keep a rollback, watch production.