“An eval is the smoke detector. Annoying until the night it saves the house.”
What an eval is
An eval is a set of inputs paired with a way to judge the output, plus a script that runs your system over them and reports a score. That's it. The discipline isn't complicated; it's just rarely done. Most teams "evaluate" by trying a few prompts by hand and going with what feels better — which is how regressions ship.
Start absurdly small: twenty to fifty real inputs, each one a case your system should handle, with the expected answer or a way to grade it. Add every failure you discover in production. Over time, this set becomes the single most valuable asset your AI team owns.
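The definition above (inputs, a way to judge, a script, a score) fits in a few lines. This is a minimal sketch, not a prescribed harness: `EvalCase`, `run_eval`, and `toy_system` are hypothetical names, the grader is a simple case-insensitive exact match, and `toy_system` stands in for whatever model call your system actually makes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    input: str
    expected: str


def run_eval(system: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run the system over every case and return the fraction that pass."""
    if not cases:
        raise ValueError("the eval set is empty")
    passed = 0
    for case in cases:
        output = system(case.input)
        # Grader: case-insensitive exact match (an assumption for this sketch).
        ok = output.strip().lower() == case.expected.strip().lower()
        if not ok:
            print(f"FAIL: {case.input!r} -> {output!r} (expected {case.expected!r})")
        passed += ok
    return passed / len(cases)


def toy_system(text: str) -> str:
    """Hypothetical stand-in for the real system (a model call in practice)."""
    return "refund" if "money back" in text else "other"


cases = [
    EvalCase("I want my money back", "refund"),
    EvalCase("Where is my order?", "shipping"),
    EvalCase("How do I get my money back?", "refund"),
    EvalCase("What are your hours?", "other"),
]

score = run_eval(toy_system, cases)
print(f"score: {score:.2f}")
```

The toy system misses the shipping case, so the script prints one failure and a score of 0.75. Every production failure you add becomes one more `EvalCase` in the list.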
Three kinds of eval, ranked by leverage
- Regression eval: real input/output cases, run on every prompt or model change. Catches "the fix that broke ten things."
- Adversarial eval: inputs designed to break the system, such as ambiguous requests, prompt injection, irrelevant context, and edge cases. Run before every release.
- Calibration eval: does the system know when it's unsure? Track whether high-confidence answers are actually right more often than low-confidence ones.
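The calibration check in the list above can be sketched as a bucketed accuracy report. This assumes your system emits a confidence score between 0 and 1 alongside each answer, which not every system does. The function name `accuracy_by_confidence` and the bucket edges are illustrative choices.

```python
from collections import defaultdict


def accuracy_by_confidence(results, edges=(0.5, 0.8)):
    """Group (confidence, correct) pairs into buckets; return accuracy per bucket.

    With the default edges: bucket 0 is below 0.5, bucket 1 is 0.5 up to 0.8,
    bucket 2 is 0.8 and above.
    """
    buckets = defaultdict(list)
    for confidence, correct in results:
        # Bucket index = number of edges the confidence meets or exceeds.
        idx = sum(confidence >= edge for edge in edges)
        buckets[idx].append(correct)
    return {idx: sum(v) / len(v) for idx, v in sorted(buckets.items())}


# Made-up illustrative results: (reported confidence, was the answer right?)
results = [
    (0.95, True), (0.90, True), (0.85, True), (0.82, False),
    (0.70, True), (0.60, False),
    (0.40, False), (0.30, True), (0.20, False), (0.10, False),
]

report = accuracy_by_confidence(results)
print(report)  # {0: 0.25, 1: 0.5, 2: 0.75}
```

A system that knows when it is unsure shows accuracy rising from the lowest bucket to the highest, as in this toy data. A flat or inverted report means the confidence score carries no useful signal.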
The regression eval is the one to build first and run constantly. The others matter, but a regression eval that runs on every change is what turns AI development from guesswork into engineering.
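One way to catch "the fix that broke ten things" is to compare per-case results rather than only the aggregate score. The sketch below assumes you store a pass/fail result per case ID for each run; `find_regressions` and the case IDs are hypothetical.

```python
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Return IDs of cases that passed on the baseline but do not pass on the candidate.

    A case missing from the candidate run is treated as a failure.
    """
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


# Results before and after a prompt change (illustrative data).
baseline = {"case-01": True, "case-02": True, "case-03": False}
candidate = {"case-01": True, "case-02": False, "case-03": True}

regressions = find_regressions(baseline, candidate)
print(regressions)  # ['case-02']
```

Both runs pass two of three cases, so the headline score does not move, yet `case-02` has regressed. That is why a per-case diff on every change is worth the extra few lines.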
How to grade outputs
There are three grading methods, in order of preference. Use exact or rule-based matching where the answer is structured (a number, a category, valid JSON): it is cheap, deterministic, and trustworthy. Use LLM-as-judge where the answer is open-ended (a summary, an explanation): a model grades the output against a rubric. Reserve human review for the cases that matter most.
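For the structured cases, rule-based graders can be plain functions. The three below cover the examples named above (a number, a category, valid JSON). The function names, the tolerance, and the "first number in the output" convention are assumptions made for illustration.

```python
import json
import re


def grade_number(output: str, expected: float, tol: float = 1e-6) -> bool:
    """Pass if the first number found in the output is within tol of expected."""
    match = re.search(r"-?\d+(?:\.\d+)?", output)
    return match is not None and abs(float(match.group()) - expected) <= tol


def grade_category(output: str, expected: str, allowed: set[str]) -> bool:
    """Pass if the output is exactly one allowed label and it is the expected one."""
    label = output.strip().lower()
    return label in allowed and label == expected


def grade_json(output: str, required_keys: set[str]) -> bool:
    """Pass if the output parses as a JSON object containing every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data.keys())


print(grade_number("The total is 42.0 units", 42))                   # True
print(grade_category(" Refund\n", "refund", {"refund", "other"}))    # True
print(grade_json('{"label": "refund"', {"label"}))                   # False (truncated JSON)
```

Because these graders are deterministic, a changed result always reflects a change in the system, never noise in the grader.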
LLM-as-judge is seductive because it scales, but it's noisy and biased — judges favour longer answers, their own style, the first option shown. Pin it down with a clear rubric, validate it against human grades on a sample, and pair it with exact matching wherever you can. Never trust a judge you haven't audited.
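Two audits from the paragraph above can be sketched without calling any model: agreement with human grades on a sample, and a swap test for first-position bias. The judge here is a hypothetical callable that takes two answers and returns "first" or "second". In practice it would wrap a model call with your rubric.

```python
from typing import Callable


def agreement_rate(judge_grades: list[bool], human_grades: list[bool]) -> float:
    """Fraction of sampled cases where the judge's grade matches the human's."""
    if len(judge_grades) != len(human_grades) or not human_grades:
        raise ValueError("need two non-empty grade lists of equal length")
    matches = sum(j == h for j, h in zip(judge_grades, human_grades))
    return matches / len(human_grades)


def position_consistent(judge: Callable[[str, str], str], a: str, b: str) -> bool:
    """True if a pairwise judge picks the same answer when the order is swapped."""
    forward = judge(a, b)   # a is shown first
    backward = judge(b, a)  # b is shown first
    # Consistent means: a wins both times, or b wins both times.
    return (forward == "first") == (backward == "second")


# Toy judges (stand-ins, not real model calls).
def always_first(x: str, y: str) -> str:
    return "first"


def prefers_longer(x: str, y: str) -> str:
    return "first" if len(x) >= len(y) else "second"


print(agreement_rate([True, True, False, True], [True, False, False, True]))  # 0.75
print(position_consistent(always_first, "short", "a much longer answer"))     # False
print(position_consistent(prefers_longer, "short", "a much longer answer"))   # True
```

Note that the length-biased toy judge passes the swap test. Each audit catches one kind of bias, so the comparison against human grades remains the check that matters most.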
Observability: evals for production reality
Evals tell you about the cases you thought of. Observability tells you about the cases users actually send. Trace every request: the full prompt, the retrieved context, every tool call, the raw output, latency, and cost. When something goes wrong — and it will — you need to replay exactly what happened.
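A trace record carrying the fields listed above might look like the following. This is a sketch under assumptions: the field names, the `Trace` class, and the JSON Lines serialization are illustrative choices, not the schema of any particular tracing tool.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Trace:
    prompt: str                   # the full prompt as sent to the model
    retrieved_context: list[str]  # documents or chunks injected into the prompt
    tool_calls: list[dict]        # one dict per call: name, arguments, result
    raw_output: str               # the model output before any post-processing
    latency_ms: float
    cost_usd: float
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


def to_jsonl_line(trace: Trace) -> str:
    """Serialize one trace as a single JSON line, ready to append to a log."""
    return json.dumps(asdict(trace), ensure_ascii=False)


trace = Trace(
    prompt="Summarize the refund policy for order 123.",
    retrieved_context=["Refunds are accepted within 30 days of delivery."],
    tool_calls=[{"name": "lookup_order", "args": {"order_id": "123"}, "result": "delivered"}],
    raw_output="You can request a refund within 30 days of delivery.",
    latency_ms=840.0,
    cost_usd=0.0021,
)

line = to_jsonl_line(trace)
print(line)
```

Storing the exact prompt and context, rather than a template name and document IDs, is what makes a faithful replay possible later.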
The loop that compounds quality: production traces surface real failures; real failures become new eval cases; the eval set gets sharper; the system gets measurably better. Teams that close this loop pull away from teams that don't.
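Closing the loop can be as simple as a function that turns a failed trace into a new eval case. The sketch assumes traces are dictionaries with `trace_id` and `prompt` keys and that a human reviewer supplies the expected answer. `trace_to_case` and `add_case` are hypothetical names.

```python
def trace_to_case(trace: dict, expected: str) -> dict:
    """Turn a failed production trace into an eval case.

    The expected answer comes from a human reviewer, not from the trace itself.
    """
    return {
        "id": trace["trace_id"],
        "input": trace["prompt"],
        "expected": expected,
        "source": "production",
    }


def add_case(eval_set: list[dict], case: dict) -> bool:
    """Append the case unless the same input is already covered. Returns True if added."""
    if any(existing["input"] == case["input"] for existing in eval_set):
        return False
    eval_set.append(case)
    return True


eval_set = [{"id": "seed-01", "input": "I want my money back", "expected": "refund", "source": "seed"}]
failed_trace = {"trace_id": "abc123", "prompt": "Cancel my subscription", "raw_output": "refund"}

added = add_case(eval_set, trace_to_case(failed_trace, expected="cancellation"))
print(added, len(eval_set))  # True 2
```

Adding the same trace a second time returns `False` and leaves the set unchanged, so the function can run over every flagged trace without creating duplicates.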
In one line each
- An eval is inputs plus a way to grade outputs plus a script. Start with 20 to 50 real cases and grow from failures.
- Three kinds: regression (run always), adversarial (before releases), calibration. Build the regression eval first.
- Grade with exact matching where you can, LLM-as-judge (audited) where you can't, humans for what matters most.
- Observability closes the loop: production traces become new eval cases. Never delete failing cases.
Where to go next