“Build a castle, not a wall. Walls fall; layers buy you time to notice and respond.”
Defence in depth, because the model will betray you
You cannot trust the model. Not because it's malicious, but because it follows instructions from untrusted content and you can't reliably stop that. So you wrap it in layers, each assuming the layer inside might fail: filter what goes in, constrain what the model can do, filter what comes out, and watch everything. No layer is load-bearing alone.
Layer 1 — control the input
Before content reaches the model, you have a chance to reduce risk. Validate and constrain user input where the format allows it. Scan retrieved and user-supplied content for obvious injection patterns and known-bad payloads. Strip or neutralise hidden text (invisible characters, white-on-white, suspicious metadata) in documents. This catches the lazy attacks and raises the cost of the rest — but treat it as friction, never as a wall, because a determined payload gets through.
Layer 2 — constrain what the model can do
This is the strongest layer, and the one that holds regardless of how clever the attack is. If the model can only do a little, a compromised model can only do a little. Everything here is about capability, not detection.
- Least privilege — every tool and data source scoped to exactly what the task needs, nothing "just in case."
- Separate trust zones — untrusted content and privileged actions never meet in one model call without a gate (the dual-LLM idea).
- Human-in-the-loop — irreversible or high-stakes actions require approval; the model proposes, a person disposes.
- Sandboxing — tool execution and any model-generated code run isolated, with no path to the rest of your system.
- Rate and budget limits — per-user caps so abuse can't exhaust resources or run up the bill.
Layer 3 — check the output
Before the model's output reaches a user or triggers an action, inspect it. Validate it against the schema you expect — and reject anything malformed. Scan for leaked secrets, PII, or other data that shouldn't be in the response. For actions, confirm the proposed tool call is within the allowed set and parameters. The output filter is your last chance to catch a compromise the input and capability layers missed.
Layer 4 — monitor and respond
You will not prevent every attack, so you must be able to see them. Log every prompt, retrieval, tool call, output, and decision (the same traces the evals chapter of the building course asked for — they serve double duty as a security audit trail). Watch for anomalies: spikes in refusals, unusual tool-call patterns, attempts to extract the system prompt, runaway costs. And have an incident path: how do you detect, contain, and respond when — not if — something gets through?
Monitoring is also how you learn. Every real attack you catch becomes an adversarial eval case and a tuning signal for your filters. The security posture of a live AI system isn't a state you reach; it's a loop you run.
Red-teaming: attack yourself first
Don't wait for an attacker to find the holes. Red-teaming means deliberately attacking your own system — trying every injection, jailbreak, and capability abuse you can think of — before it ships and continuously after. The output isn't just a list of bugs; it's a growing adversarial eval set that guards against regressions.
Make it a loop, not an event. Each finding gets a fix and a permanent test, so the same hole can never reopen unnoticed. This is exactly the eval discipline from the building course, pointed at security instead of quality — and it's the difference between a system that's secure today and one that stays secure as you change it.
In one line each
- Defend in depth: assume the model will follow hostile instructions and wrap it in layers that each catch the last one's failure.
- Control input (friction), constrain capability (the strong layer), check output (last catch), and monitor everything (see attacks).
- Capability constraint beats detection: the question is what a compromised model can do, not whether it can be tricked.
- Red-team continuously, turning every finding into a fix plus a permanent adversarial eval — security is a loop, not a checkbox.
Where to go next