Chapter 06 · 12 min

Evals & observability

This is the chapter that separates teams who build AI products from teams who demo them. Without evals you cannot tell whether a change helped or hurt, which means you cannot improve on purpose. Everything else in this course is undermined if you skip this.

[Figure: Every change runs the eval set before it ships. A change to a prompt, model, or retrieval step is run against an eval set (50–500 cases). If the score holds or improves, it ships; if it regresses, it is blocked. No eval, no signal: you are shipping on vibes.]

An eval is the smoke detector. Annoying until the night it saves the house.

What an eval is

An eval is a set of inputs paired with a way to judge the output, plus a script that runs your system over them and reports a score. That's it. The discipline isn't complicated; it's just rarely done. Most teams "evaluate" by trying a few prompts by hand and going with what feels better — which is how regressions ship.
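
The definition above fits in a few lines of code. The following is a minimal sketch, assuming the system under test can be called as a plain function from input text to output text. The names `EvalCase`, `run_eval`, `exact_match`, and `toy_system` are illustrative and not taken from any library; `toy_system` stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    input: str     # what the user sends
    expected: str  # the answer we want back


def exact_match(output: str, expected: str) -> bool:
    """Grade by case-insensitive exact match, ignoring surrounding whitespace."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(system: Callable[[str], str],
             cases: list[EvalCase],
             grade: Callable[[str, str], bool]) -> float:
    """Run the system over every case and return the fraction that pass."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(grade(system(case.input), case.expected) for case in cases)
    return passed / len(cases)


def toy_system(text: str) -> str:
    """Hypothetical stand-in for the real system (for example, an LLM call)."""
    return "refund" if "money back" in text else "other"


cases = [
    EvalCase("I want my money back", "refund"),
    EvalCase("Where is my order?", "shipping"),
]
score = run_eval(toy_system, cases, exact_match)
print(f"score: {score:.2f}")
```

With these two made-up cases the toy system passes one and fails the other, so the script reports a score of 0.50. A real eval set would replace the toy cases with inputs collected from actual usage.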

The eval is the gate between a change and production. Score holds or improves → ship. Score regresses → block.
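
The ship-or-block rule can be sketched as a single comparison. This assumes the score of the last shipped version is stored somewhere as a baseline; the function name `gate` and the `tolerance` parameter are hypothetical additions, the latter for teams whose graders are noisy enough that tiny dips are not meaningful.

```python
def gate(baseline_score: float, candidate_score: float,
         tolerance: float = 0.0) -> bool:
    """Return True (ship) if the candidate holds or improves on the baseline.

    tolerance is an assumed knob: a drop of up to this size still passes.
    With the default of 0.0, any regression blocks the change.
    """
    return candidate_score >= baseline_score - tolerance


# Illustrative scores, not real measurements.
for baseline, candidate in [(0.80, 0.85), (0.80, 0.80), (0.80, 0.75)]:
    verdict = "ship" if gate(baseline, candidate) else "block"
    print(f"baseline={baseline:.2f} candidate={candidate:.2f} -> {verdict}")
```

In a CI pipeline, one option is to exit with a non-zero status when `gate` returns `False`, so the change cannot merge until the regression is understood.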

Start absurdly small. Twenty to fifty real inputs, each one a case your system should handle, with the expected answer or a way to grade it. Add every failure you discover in production. This grows into the single most valuable asset your AI team owns.

Three kinds of eval, ranked by leverage

  • Regression eval — real input/output cases, run on every prompt or model change. Catches "the fix that broke ten things."
  • Adversarial eval — inputs designed to break the system: ambiguous requests, prompt injection, irrelevant context, edge cases. Run before every release.
  • Calibration eval — does the system know when it's unsure? Track whether high-confidence answers are actually right more often than low-confidence ones.

The regression eval is the one to build first and run constantly. The others matter, but a regression eval that runs on every change is what turns AI development from guesswork into engineering.
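
Of the three kinds listed above, the calibration eval is the least familiar, so here is a rough sketch of one way to run it. It assumes the system reports a confidence between 0 and 1 alongside each answer and that each answer has already been graded as right or wrong. The bucket boundaries (0.5 and 0.8) and the sample results are arbitrary illustrations.

```python
from collections import defaultdict


def bucket(confidence: float) -> str:
    """Assign a confidence value to a coarse bucket (thresholds are assumed)."""
    if confidence >= 0.8:
        return "high"
    if confidence >= 0.5:
        return "medium"
    return "low"


def accuracy_by_confidence(results):
    """results: iterable of (confidence, was_correct) pairs.

    Returns a dict mapping each bucket name to the accuracy inside it.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket -> [correct, seen]
    for confidence, correct in results:
        counts = totals[bucket(confidence)]
        counts[0] += int(correct)
        counts[1] += 1
    return {name: right / seen for name, (right, seen) in totals.items()}


# Made-up graded results: (reported confidence, was the answer right?)
results = [
    (0.95, True), (0.90, True), (0.85, False),
    (0.60, True), (0.55, False),
    (0.30, False), (0.20, False), (0.10, True),
]
report = accuracy_by_confidence(results)
for name in ("high", "medium", "low"):
    print(f"{name}: {report[name]:.2f}")
```

A well-calibrated system shows accuracy falling as you move from the high bucket to the low one. If the buckets look the same, the reported confidence carries no information.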

How to grade outputs

Three grading methods, in order of preference. Exact or rule-based matching where the answer is structured (a number, a category, valid JSON) — cheap, deterministic, trustworthy. LLM-as-judge where the answer is open-ended (a summary, an explanation) — a model grades against a rubric. And human review for the cases that matter most.
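
For the structured cases, rule-based graders can be very small. The sketch below covers the three examples mentioned above (a number, a category, valid JSON) using only the Python standard library. The function names and the choice to read the first number found in the output are assumptions made for illustration.

```python
import json
import re


def grade_category(output: str, expected: str) -> bool:
    """Pass if the output is the expected label, ignoring case and whitespace."""
    return output.strip().lower() == expected.strip().lower()


def grade_number(output: str, expected: float, tol: float = 1e-6) -> bool:
    """Pass if the first number in the output is within tol of expected.

    Limitation: does not understand thousands separators such as "1,000".
    """
    match = re.search(r"-?\d+(?:\.\d+)?", output)
    return match is not None and abs(float(match.group()) - expected) <= tol


def grade_json(output: str, required_keys: set[str]) -> bool:
    """Pass if the output parses as a JSON object containing the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())


print(grade_category(" Refund\n", "refund"))
print(grade_number("The total is 42.0 units", 42))
print(grade_json('{"label": "refund", "confidence": 0.9}', {"label"}))
print(grade_json("Sure! Here is the JSON you asked for", {"label"}))
```

The first three calls pass and the last one fails, because the output is prose rather than JSON. Graders like these are deterministic, so the same output always gets the same grade.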

LLM-as-judge is seductive because it scales, but it's noisy and biased — judges favour longer answers, their own style, the first option shown. Pin it down with a clear rubric, validate it against human grades on a sample, and pair it with exact matching wherever you can. Never trust a judge you haven't audited.
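
One way to audit a judge is to have humans grade a sample of the same outputs and compare. The sketch below assumes pass/fail grades from both sides and reports raw agreement plus Cohen's kappa, which corrects agreement for chance. Kappa is a choice made here for illustration; the function name `audit_judge` and the sample grades are made up.

```python
def audit_judge(judge_grades: list[bool],
                human_grades: list[bool]) -> tuple[float, float]:
    """Compare judge and human pass/fail grades on the same outputs.

    Returns (agreement, kappa), where kappa is Cohen's kappa for two raters.
    """
    if not judge_grades or len(judge_grades) != len(human_grades):
        raise ValueError("need two non-empty grade lists of equal length")
    n = len(judge_grades)
    agreement = sum(j == h for j, h in zip(judge_grades, human_grades)) / n
    judge_pass = sum(judge_grades) / n
    human_pass = sum(human_grades) / n
    # Agreement expected by chance if the two raters graded independently.
    expected = judge_pass * human_pass + (1 - judge_pass) * (1 - human_pass)
    # Kappa is undefined when expected == 1; returning 1.0 is a convention here.
    kappa = 1.0 if expected == 1.0 else (agreement - expected) / (1 - expected)
    return agreement, kappa


# Illustrative grades for eight outputs (True = pass).
judge = [True, True, True, False, True, False, True, True]
human = [True, True, False, False, True, False, True, False]
agreement, kappa = audit_judge(judge, human)
print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

In this made-up sample the judge passes two outputs the humans failed, which gives 75% raw agreement but a kappa of only 0.50. Reading the disagreements one by one is usually where the rubric gets fixed.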

Observability: evals for production reality

Evals tell you about the cases you thought of. Observability tells you about the cases users actually send. Trace every request: the full prompt, the retrieved context, every tool call, the raw output, latency, and cost. When something goes wrong — and it will — you need to replay exactly what happened.
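
A trace can be as simple as one structured record per request. The following sketch captures the fields listed above as a dataclass and serialises it to a single JSON line. The field names, the `trace_id` and `timestamp` extras, and all sample values are assumptions for illustration rather than any tracing tool's schema.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Trace:
    prompt: str                   # the full prompt sent to the model
    retrieved_context: list[str]  # documents or chunks that were retrieved
    tool_calls: list[dict]        # each tool call with its arguments and result
    raw_output: str               # the model output before post-processing
    latency_s: float              # wall-clock time for the request
    cost_usd: float               # estimated cost of the request
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


def to_json_line(trace: Trace) -> str:
    """Serialise a trace as one JSON line, suitable for an append-only log."""
    return json.dumps(asdict(trace), ensure_ascii=False)


# Made-up example request.
trace = Trace(
    prompt="What is the refund window?",
    retrieved_context=["Refunds are accepted within 30 days of purchase."],
    tool_calls=[{"name": "search_policy", "args": {"query": "refund window"}}],
    raw_output="You can request a refund within 30 days.",
    latency_s=1.42,
    cost_usd=0.0031,
)
line = to_json_line(trace)
print(line)
```

Because each record holds the full prompt, context, and tool calls, it has what is needed to re-run the request later and see what happened.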

The loop that compounds quality: production traces surface real failures; real failures become new eval cases; the eval set gets sharper; the system gets measurably better. Teams that close this loop pull away from teams that don't.
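
The step from "real failure" to "new eval case" can be sketched as a small conversion function. This assumes traces are stored as dictionaries with `prompt`, `retrieved_context`, and `trace_id` keys, and that eval cases live in a JSON Lines file. Both are hypothetical conventions, and the demo writes to a temporary directory.

```python
import json
import tempfile
from pathlib import Path


def trace_to_case(trace: dict, expected: str) -> dict:
    """Turn a failed production trace into an eval case.

    expected is the answer a human decided the system should have given.
    """
    return {
        "input": trace["prompt"],
        "context": trace.get("retrieved_context", []),
        "expected": expected,
        "source_trace_id": trace.get("trace_id"),
    }


def append_case(path: Path, case: dict) -> None:
    """Append one case to a JSON Lines eval file, creating it if needed."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(case, ensure_ascii=False) + "\n")


# Made-up failed trace: the system answered 14 days, the policy says 30.
failed_trace = {
    "trace_id": "abc123",
    "prompt": "What is the refund window?",
    "retrieved_context": ["Refunds are accepted within 30 days of purchase."],
    "raw_output": "Refunds are accepted within 14 days.",
}
eval_file = Path(tempfile.mkdtemp()) / "regression_cases.jsonl"
append_case(eval_file, trace_to_case(failed_trace, expected="30 days"))

cases = [json.loads(l) for l in eval_file.read_text(encoding="utf-8").splitlines()]
print(len(cases), cases[0]["input"])
```

Keeping the source trace ID on each case makes it possible to go back to the original incident when the case fails again.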

In one line each

  • An eval is inputs plus a way to grade outputs plus a script. Start with 20 real cases and grow from failures.
  • Three kinds: regression (run always), adversarial (before releases), calibration. Build the regression eval first.
  • Grade with exact matching where you can, LLM-as-judge (audited) where you can't, humans for what matters most.
  • Observability closes the loop: production traces become new eval cases. Never delete failing cases.