Chapter 07 · 12 min

Shipping & operating

The model works in the notebook. Now it has to work for thousands of users, within a budget, without going down, while you keep changing it. This chapter is the operational reality: cost, latency, reliability, and the safe-change discipline that lets you improve a live AI system without breaking it.

Shipping a model is not launching a rocket. It's opening a kitchen — the hard part is the lunch rush, every day.

The request path in production

A served AI feature isn't a model call; it's a pipeline. A request hits a cache, passes an input guardrail, reaches the model, passes an output guardrail, and returns — with a fallback path for when the model is slow or down. Each stage is ordinary infrastructure, and each is where you control cost, safety, and reliability.

[Figure: A request's path through a served AI feature. Left to right: request → cache (hit? return) → guardrail (filter in) → model → guardrail (filter out) → user, with a fallback path on timeout or error.]
The model is one stage among several. Cache, guardrails, and a fallback path are what make the feature cheap, safe, and reliable.
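
The stages above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function `serve`, its parameters, and the canned refusal and fallback messages are all hypothetical names chosen for the example, and the model is assumed to be any callable that raises an exception on timeout or error.

```python
from typing import Callable, Dict


def serve(
    prompt: str,
    cache: Dict[str, str],
    model: Callable[[str], str],
    input_ok: Callable[[str], bool],
    output_ok: Callable[[str], bool],
    fallback: str = "Sorry, please try again shortly.",
) -> str:
    """Run one request through cache, guardrails, model, and fallback."""
    # 1. Cache: a hit returns without touching the model.
    if prompt in cache:
        return cache[prompt]
    # 2. Input guardrail: refuse before spending a model call.
    if not input_ok(prompt):
        return "Request blocked."
    # 3. Model call; any error or timeout takes the fallback path.
    try:
        answer = model(prompt)
    except Exception:
        return fallback
    # 4. Output guardrail: never return or cache an answer that fails it.
    if not output_ok(answer):
        return fallback
    cache[prompt] = answer
    return answer
```

Only one of the four stages calls the model; the rest is ordinary control flow that can be tested without any model at all.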

Cost and latency are design parameters

Training is capex; inference is opex — you pay per call, forever. At scale, model cost becomes a real line item, and the choices that control it are architectural, made early. The big levers: model size (use the smallest model that passes your eval), context length (every token in the prompt costs on every call), and caching.
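
To see why context length is a cost lever, it helps to run the arithmetic. The sketch below assumes simple per-token pricing with separate input and output rates; the function name, the traffic figures, and the prices are invented for illustration and are not any provider's real rates.

```python
def monthly_cost(
    calls_per_day: int,
    prompt_tokens: int,
    output_tokens: int,
    usd_per_1k_input: float,
    usd_per_1k_output: float,
    days: int = 30,
) -> float:
    """Estimate monthly spend under simple per-token pricing."""
    per_call = (
        (prompt_tokens / 1000) * usd_per_1k_input
        + (output_tokens / 1000) * usd_per_1k_output
    )
    return calls_per_day * days * per_call


# Hypothetical traffic and prices, for illustration only.
long_prompt = monthly_cost(100_000, 2_000, 300, 0.001, 0.002)
short_prompt = monthly_cost(100_000, 500, 300, 0.001, 0.002)
print(round(long_prompt, 2), round(short_prompt, 2))
```

With these made-up numbers, trimming the prompt from 2,000 to 500 tokens takes the monthly bill from about 7,800 USD to about 3,300 USD, with no change to the model or the traffic.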

Caching is the highest-leverage cost lever and the most overlooked. Many requests are duplicates or near-duplicates: an exact-match cache serves the former and a semantic cache the latter, without paying for a model call. Prompt caching, which reuses the already-processed prefix of a long, stable system prompt across calls instead of paying full price for it each time, cuts the bill further.
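
A minimal sketch of an exact-match cache follows. The class name `ExactMatchCache` and its methods are hypothetical. It assumes that lowercasing and collapsing whitespace is a safe normalization for your prompts (it is not if case carries meaning), and it puts a model version string in the key so that answers from an old model are never served after an upgrade. A semantic cache would need an embedding model and a similarity threshold, which are not shown here.

```python
import hashlib
from typing import Callable, Dict


class ExactMatchCache:
    """Serve repeated prompts from memory instead of calling the model."""

    def __init__(self, model: Callable[[str], str], model_version: str):
        self._model = model
        self._version = model_version
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, prompt: str) -> str:
        # Assumption: case and extra whitespace do not change the meaning.
        normalized = " ".join(prompt.lower().split())
        raw = f"{self._version}\n{normalized}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def ask(self, prompt: str) -> str:
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = self._model(prompt)  # the only place that costs money
        self._store[key] = answer
        return answer
```

Tracking `hits` and `misses` gives you the hit rate, which is the number that tells you whether the cache is earning its keep.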

[Figure: Context windows compared, in tokens. 4k ≈ 6 pages; 32k ≈ 50 pages; 128k ≈ a 300-page book; 1M ≈ 7 novels. 1 token ≈ 0.75 English words.]
Bigger context costs more on every call and can degrade quality. More tokens is a lever, not a default.

Latency is a product decision, not just a number. Streaming hides it — users tolerate a slow answer that starts immediately far better than a fast one that arrives all at once after a pause. And agent latency stacks: a 10-step agent at two seconds a step is twenty seconds, which is a different product than a one-second answer.
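
The two effects can be put into numbers. The helpers below are a back-of-the-envelope sketch: the function names, the time to first token, and the generation speed are all assumed values for illustration, and real latencies vary by model and load.

```python
def time_to_first_feedback(
    ttft_s: float, tokens: int, tokens_per_s: float, streaming: bool
) -> float:
    """Seconds until the user sees anything on screen."""
    total = ttft_s + tokens / tokens_per_s
    # With streaming the first token is visible as soon as it exists;
    # without it, the user waits for the whole answer.
    return ttft_s if streaming else total


def agent_latency(steps: int, seconds_per_step: float) -> float:
    """Sequential agent steps add up; nothing overlaps."""
    return steps * seconds_per_step


# Hypothetical figures: 0.5 s to first token, 300 tokens at 50 tokens/s.
print(time_to_first_feedback(0.5, 300, 50.0, streaming=True))
print(time_to_first_feedback(0.5, 300, 50.0, streaming=False))
print(agent_latency(10, 2.0))
```

Under these assumptions the same answer feels like half a second with streaming and six and a half seconds without, and the ten-step agent takes the twenty seconds described above.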

Reliability: the model will fail

Providers have outages. Models get rate-limited, time out, and occasionally return garbage. Your feature must degrade, not collapse. The defences are the familiar ones from distributed systems: timeouts, retries with backoff, a fallback (a smaller model, a cached answer, or an honest "try again shortly"), and a circuit breaker so one provider's bad afternoon doesn't take you down with it.
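
These defences compose in a small amount of code. The sketch below is one possible arrangement, not a production library: `CircuitBreaker` and `call_with_defences` are hypothetical names, the thresholds are arbitrary, and the primary callable is assumed to enforce its own timeout and raise an exception when it fails.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stop calling a failing provider until a cooldown has passed."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        # After the cooldown, let a trial call through (half-open).
        return self._clock() - self._opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()  # open, or re-open, the circuit


def call_with_defences(
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    breaker: CircuitBreaker,
    prompt: str,
    retries: int = 2,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    if breaker.allow():
        for attempt in range(retries + 1):
            try:
                answer = primary(prompt)
                breaker.record_success()
                return answer
            except Exception:
                breaker.record_failure()
                if attempt == retries or not breaker.allow():
                    break
                sleep(base_delay_s * 2 ** attempt)  # exponential backoff
    # Degrade: smaller model, cached answer, or an honest error message.
    return fallback(prompt)
```

The `clock` and `sleep` parameters exist so the behaviour can be tested without waiting; in normal use the defaults apply. Once the breaker is open, requests go straight to the fallback and the struggling provider gets no extra load.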

Changing a live system without breaking it

AI systems change constantly: prompts get tuned, models get upgraded, retrieval gets adjusted, providers deprecate versions out from under you. Every one of those is a chance to silently regress. The safe-change discipline is the same as any production system, applied to a probabilistic component.

  • Gate every change on the eval set — no eval pass, no ship (chapter 6).
  • Roll out gradually — canary a slice of traffic, watch the metrics, then widen.
  • Pin model versions — never let "latest" change your behaviour without your knowledge.
  • Keep a rollback — prompts and model choices revert as cleanly as code.
  • Watch production, not just evals — the cases users send will surprise your test set.
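
Several of these rules fit in one small routing function. The sketch below is illustrative only: `Release`, `in_canary`, and `pick_release` are hypothetical names, the model and prompt version strings are placeholders, and `eval_passed` is assumed to come from whatever runs your eval set.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    model: str           # a pinned version string, never "latest"
    prompt_version: str


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the first `percent` of 100 buckets."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def pick_release(user_id: str, stable: Release, candidate: Release,
                 percent: int, eval_passed: bool) -> Release:
    # No eval pass, no ship: the candidate is unreachable until it passes.
    if eval_passed and in_canary(user_id, percent):
        return candidate
    return stable


# Placeholder version strings, for illustration only.
stable = Release(model="example-model-2024-01-15", prompt_version="p12")
candidate = Release(model="example-model-2024-06-01", prompt_version="p13")
print(pick_release("user-42", stable, candidate, percent=5, eval_passed=True))
```

Hashing the user ID keeps each user on the same side of the split between requests. Widening the rollout means raising `percent`, and rolling back means setting it to 0, which needs no deploy if the value lives in configuration.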

Where to go from here

You now have the shape of a real AI system: a thin model in a thick, deterministic shell, fed by retrieval, empowered by tools, kept honest by evals, and operated like any other production service. Two directions deepen it: securing it against the new attack surface this all opens up, and the prompting and RAG guides for hands-on patterns.

In one line each

  • A served feature is a pipeline — cache, guardrails, model, fallback — not a bare model call.
  • Cost and latency are architectural: right-size the model, cache aggressively, trim context, stream the output.
  • The model will fail; degrade with timeouts, retries with backoff, fallbacks, and a circuit breaker.
  • Change a live system safely: gate on evals, roll out gradually, pin versions, keep a rollback, watch production.