Skip to content
Chapter 06 · 12 min

Defending AI systems

The previous chapters were the threats. This one is the playbook: the concrete, layered defences that make an AI system hard to break and survivable when it is. The organising idea is defence in depth — assume any single control will fail, and make sure the next one catches it.

Defence in depth around the modelConcentric layers around the model: input and output filtering, least-privilege tools, and monitoring. No single layer is trusted to hold; each assumes the one inside it may fail.modelinput / output filteringleast-privilege toolsmonitor & log everything

Build a castle, not a wall. Walls fall; layers buy you time to notice and respond.

Defence in depth, because the model will betray you

You cannot trust the model. Not because it's malicious, but because it follows instructions from untrusted content and you can't reliably stop that. So you wrap it in layers, each assuming the layer inside might fail: filter what goes in, constrain what the model can do, filter what comes out, and watch everything. No layer is load-bearing alone.

Defence in depth around the modelConcentric layers around the model: input and output filtering, least-privilege tools, and monitoring. No single layer is trusted to hold; each assumes the one inside it may fail.modelinput / output filteringleast-privilege toolsmonitor & log everything
Input/output filtering, least-privilege tools, and monitoring wrap the model. Each layer assumes the one inside it may fail.

Layer 1 — control the input

Before content reaches the model, you have a chance to reduce risk. Validate and constrain user input where the format allows it. Scan retrieved and user-supplied content for obvious injection patterns and known-bad payloads. Strip or neutralise hidden text (invisible characters, white-on-white, suspicious metadata) in documents. This catches the lazy attacks and raises the cost of the rest — but treat it as friction, never as a wall, because a determined payload gets through.

Layer 2 — constrain what the model can do

This is the strongest layer, and the one that holds regardless of how clever the attack is. If the model can only do a little, a compromised model can only do a little. Everything here is about capability, not detection.

  • Least privilege — every tool and data source scoped to exactly what the task needs, nothing "just in case."
  • Separate trust zones — untrusted content and privileged actions never meet in one model call without a gate (the dual-LLM idea).
  • Human-in-the-loop — irreversible or high-stakes actions require approval; the model proposes, a person disposes.
  • Sandboxing — tool execution and any model-generated code run isolated, with no path to the rest of your system.
  • Rate and budget limits — per-user caps so abuse can't exhaust resources or run up the bill.

Layer 3 — check the output

Before the model's output reaches a user or triggers an action, inspect it. Validate it against the schema you expect — and reject anything malformed. Scan for leaked secrets, PII, or other data that shouldn't be in the response. For actions, confirm the proposed tool call is within the allowed set and parameters. The output filter is your last chance to catch a compromise the input and capability layers missed.

Layer 4 — monitor and respond

You will not prevent every attack, so you must be able to see them. Log every prompt, retrieval, tool call, output, and decision (the same traces the evals chapter of the building course asked for — they serve double duty as a security audit trail). Watch for anomalies: spikes in refusals, unusual tool-call patterns, attempts to extract the system prompt, runaway costs. And have an incident path: how do you detect, contain, and respond when — not if — something gets through?

Monitoring is also how you learn. Every real attack you catch becomes an adversarial eval case and a tuning signal for your filters. The security posture of a live AI system isn't a state you reach; it's a loop you run.

Red-teaming: attack yourself first

Don't wait for an attacker to find the holes. Red-teaming means deliberately attacking your own system — trying every injection, jailbreak, and capability abuse you can think of — before it ships and continuously after. The output isn't just a list of bugs; it's a growing adversarial eval set that guards against regressions.

The red-team loopA cycle: attack the system to find a break, fix it with a patch and a guard, and add the attack to the regression eval so it can never return silently. Repeat forever.attackfind a breakfixpatch + guardregressadd to evalsecurity is a loop, not a checkbox
Attack to find a break, fix it with a patch and a guard, add it to the regression eval so it can't return silently. Repeat.

Make it a loop, not an event. Each finding gets a fix and a permanent test, so the same hole can never reopen unnoticed. This is exactly the eval discipline from the building course, pointed at security instead of quality — and it's the difference between a system that's secure today and one that stays secure as you change it.

In one line each

  • Defend in depth: assume the model will follow hostile instructions and wrap it in layers that each catch the last one's failure.
  • Control input (friction), constrain capability (the strong layer), check output (last catch), and monitor everything (see attacks).
  • Capability constraint beats detection: the question is what a compromised model can do, not whether it can be tricked.
  • Red-team continuously, turning every finding into a fix plus a permanent adversarial eval — security is a loop, not a checkbox.
Defending AI systems · AI courses · SDEN