Every change runs the eval set before it ships

“An eval is the smoke detector. Annoying until the night it saves the house.”

Evals score outputs against expectations before a change ships; observability watches the same system in production.

What an eval is

An eval is a set of inputs paired with a way to judge the output, plus a script that runs your system over them and reports a score. That's it. The discipline isn't complicated; it's just rarely done. Most teams "evaluate" by trying a few prompts by hand and going with what feels better, which is how regressions ship.
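The three parts named above (inputs, a way to judge, a script that reports a score) fit in a few lines. This is a minimal sketch, not a prescribed harness: `EvalCase`, `run_eval`, `exact_match`, and `toy_system` are hypothetical names, and `toy_system` stands in for whatever your real system does.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    input: str      # what the user sends
    expected: str   # the answer we want back


def exact_match(output: str, expected: str) -> bool:
    """Simplest possible grader: case-insensitive string equality."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(
    system: Callable[[str], str],
    cases: list[EvalCase],
    grade: Callable[[str, str], bool],
) -> float:
    """Run the system over every case and return the fraction that pass."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(grade(system(case.input), case.expected) for case in cases)
    return passed / len(cases)


# Hypothetical stand-in for the real system under test.
def toy_system(text: str) -> str:
    return "refund" if "money back" in text else "other"


CASES = [
    EvalCase("I want my money back", "refund"),
    EvalCase("Where is my parcel?", "shipping"),
]

score = run_eval(toy_system, CASES, exact_match)
print(f"score: {score:.2f}")
```

Swapping in a different grader or a different system changes nothing else, which is the point: the eval set and the script stay fixed while the thing being measured changes.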

The eval is the gate between a change and production. Score holds or improves → ship. Score regresses → block.
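As a sketch of that gate, assuming the eval script produces a single score between 0 and 1 and that the last shipped score is stored somewhere as a baseline. The names `gate` and `exit_code` and the optional `tolerance` parameter are illustrative assumptions, not part of any particular CI tool.

```python
def gate(candidate_score: float, baseline_score: float, tolerance: float = 0.0) -> bool:
    """Return True if the change may ship: the score held or improved.

    `tolerance` is an assumed knob for absorbing run-to-run noise;
    leave it at 0.0 for a strict gate.
    """
    return candidate_score >= baseline_score - tolerance


def exit_code(candidate_score: float, baseline_score: float, tolerance: float = 0.0) -> int:
    """Map the gate decision to a process exit code a CI job can act on."""
    if gate(candidate_score, baseline_score, tolerance):
        print(f"ship: {candidate_score:.3f} >= baseline {baseline_score:.3f}")
        return 0
    print(f"block: {candidate_score:.3f} < baseline {baseline_score:.3f}")
    return 1
```

In a pipeline, the last line of the eval script would be something like `sys.exit(exit_code(score, baseline))`, so a regression fails the build the same way a failing unit test does.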

Start absurdly small. Twenty to fifty real inputs, each one a case your system should handle, with the expected answer or a way to grade it. Add every failure you discover in production. That set grows into the single most valuable asset your AI team owns.

Three kinds of eval, ranked by leverage

Regression eval: real input/output cases, run on every prompt or model change. Catches "the fix that broke ten things."
Adversarial eval: inputs designed to break the system, such as ambiguous requests, prompt injection, irrelevant context, and edge cases. Run before every release.
Calibration eval: does the system know when it's unsure? Track whether high-confidence answers are actually right more often than low-confidence ones.
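One way to run the calibration check is to bucket answers by stated confidence and compare accuracy across buckets. The sketch below assumes the system reports a confidence between 0 and 1 alongside each answer; `accuracy_by_confidence` and the sample data are hypothetical.

```python
from collections import defaultdict


def accuracy_by_confidence(
    results: list[tuple[float, bool]], n_bins: int = 5
) -> dict[int, float]:
    """Group (confidence, correct) pairs into equal-width bins and
    return the accuracy of each non-empty bin, keyed by bin index."""
    hits: dict[int, int] = defaultdict(int)
    totals: dict[int, int] = defaultdict(int)
    for confidence, correct in results:
        # Clamp so that confidence == 1.0 lands in the top bin.
        bin_index = min(int(confidence * n_bins), n_bins - 1)
        totals[bin_index] += 1
        hits[bin_index] += int(correct)
    return {b: hits[b] / totals[b] for b in sorted(totals)}


# Made-up illustrative data: (confidence, was the answer correct?)
results = [
    (0.95, True), (0.90, True), (0.92, False),
    (0.30, False), (0.35, True),
    (0.10, False),
]
table = accuracy_by_confidence(results)
for bin_index, accuracy in table.items():
    print(f"bin {bin_index}: accuracy {accuracy:.2f}")
```

A well-calibrated system shows accuracy rising with the bin index. If the top bin is no more accurate than the bottom one, the confidence signal carries no information.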

The regression eval is the one to build first and run constantly. The others matter, but a regression eval that runs on every change is what turns AI development from guesswork into engineering.

How to grade outputs

Three grading methods, in order of preference. Exact or rule-based matching where the answer is structured (a number, a category, valid JSON): cheap, deterministic, trustworthy. LLM-as-judge where the answer is open-ended (a summary, an explanation): a model grades against a rubric. And human review for the cases that matter most.
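For the structured cases, rule-based graders can be very small. A sketch under assumptions: the function names, the numeric tolerance, and the idea of checking JSON for a set of required keys are illustrative choices, not a fixed recipe.

```python
import json


def grade_number(output: str, expected: float, tol: float = 1e-6) -> bool:
    """Pass if the output parses as a number within `tol` of the expected value."""
    try:
        return abs(float(output.strip()) - expected) <= tol
    except ValueError:
        return False


def grade_category(output: str, expected: str, allowed: set[str]) -> bool:
    """Pass if the output is one of the allowed labels and matches the expected one."""
    label = output.strip().lower()
    return label in allowed and label == expected


def grade_json(output: str, required_keys: set[str]) -> bool:
    """Pass if the output is a valid JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())
```

Each grader returns a plain boolean, so the same inputs always produce the same verdict and a failing case can be rerun and inspected without ambiguity.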

LLM-as-judge is seductive because it scales, but it's noisy and biased: judges favour longer answers, their own style, the first option shown. Pin it down with a clear rubric, validate it against human grades on a sample, and pair it with exact matching wherever you can. Never trust a judge you haven't audited.
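Auditing a judge can start with a simple comparison: grade the same sample with the judge and with a human, then measure how often they agree. The sketch below computes raw agreement and Cohen's kappa, which corrects agreement for what chance alone would produce. The labels are made-up illustrative data, and what counts as "good enough" agreement is left to you.

```python
from collections import Counter


def agreement(judge: list[str], human: list[str]) -> float:
    """Fraction of items where judge and human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement corrected for chance: 1.0 is perfect, 0.0 is chance level."""
    observed = agreement(judge, human)
    n = len(judge)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up illustrative labels for eight outputs.
judge_labels = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
human_labels = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

print(f"agreement: {agreement(judge_labels, human_labels):.2f}")
print(f"kappa:     {cohens_kappa(judge_labels, human_labels):.2f}")
```

Raw agreement looks flattering when most outputs pass anyway; kappa is the harsher number, and the disagreements themselves are usually the most useful thing to read.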

Observability: evals for production reality

Evals tell you about the cases you thought of. Observability tells you about the cases users actually send. Trace every request: the full prompt, the retrieved context, every tool call, the raw output, latency, and cost. When something goes wrong, and it will, you need to replay exactly what happened.
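A trace can be as plain as one JSON line per request. This sketch records the fields listed above; the `Trace` class, its field names, and the JSON Lines file format are assumptions for illustration, and a real deployment might send the same record to a tracing service instead.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Trace:
    prompt: str                   # the full prompt as sent to the model
    retrieved_context: list[str]  # documents or chunks supplied to the model
    tool_calls: list[dict]        # each call's name, arguments, and result
    raw_output: str               # the model output before any post-processing
    latency_ms: float
    cost_usd: float
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


def to_json_line(trace: Trace) -> str:
    """Serialise one trace as a single line of JSON."""
    return json.dumps(asdict(trace))


def from_json_line(line: str) -> Trace:
    """Rebuild a trace from a stored line so the request can be replayed."""
    return Trace(**json.loads(line))


def append_trace(trace: Trace, path: str) -> None:
    """Append the trace to a JSON Lines file, one request per line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(to_json_line(trace) + "\n")
```

Storing the full prompt and raw output, rather than a summary, is what makes replay possible: `from_json_line` gives back exactly what the system saw and said.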

The loop that compounds quality: production traces surface real failures; real failures become new eval cases; the eval set gets sharper; the system gets measurably better. Teams that close this loop pull away from teams that don't.
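The step where a real failure becomes a new eval case can be mechanical. A minimal sketch, assuming traces and eval cases are plain dictionaries and that someone has written down the answer the system should have given; `add_failure` and the `source` field are hypothetical.

```python
def add_failure(eval_set: list[dict], trace: dict, corrected_output: str) -> bool:
    """Turn a failed production trace into an eval case.

    Returns True if a case was added, False if the same input is
    already covered by the eval set.
    """
    if any(case["input"] == trace["prompt"] for case in eval_set):
        return False
    eval_set.append(
        {
            "input": trace["prompt"],
            "expected": corrected_output,  # what the system should have said
            "source": "production",        # marks where the case came from
        }
    )
    return True


eval_set = [{"input": "I want my money back", "expected": "refund", "source": "seed"}]
failed_trace = {"prompt": "Cancel my order pls", "raw_output": "shipping"}

added = add_failure(eval_set, failed_trace, corrected_output="cancellation")
print(f"added: {added}, eval set size: {len(eval_set)}")
```

Tagging the source makes it easy to see later how much of the eval set came from production rather than from guesses about what users would send.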

In one line each

An eval is inputs plus a way to grade outputs plus a script. Start with 20 to 50 real cases and grow from failures.
Three kinds: regression (run always), adversarial (before releases), calibration. Build the regression eval first.
Grade with exact matching where you can, LLM-as-judge (audited) where you can't, humans for what matters most.
Observability closes the loop: production traces become new eval cases. Never delete failing cases.

Where to go next

Chapter 7: Shipping & operating