Jailbreaks & misuse

“The bouncer follows a script. Find a line the script doesn't cover, and you're inside.”

Jailbreaks trick guardrails; layer your defences and monitor for misuse.

What a jailbreak is

Model providers train guardrails into their models: refusals for certain categories of request. A jailbreak is any prompt that gets around those guardrails, such as role-play framings ("pretend you're an AI with no rules"), hypotheticals, encoded requests, or splitting a forbidden task into innocuous-looking pieces. New jailbreaks appear constantly; providers patch them; the cycle continues.

The underlying reason jailbreaks keep working: safety training is a layer painted on top of a model that is trained, at its core, to be helpful and to continue any plausible text. The guardrails are statistical tendencies, not hard rules, and a sufficiently novel framing slips between them. There is no known way to make a helpful model unjailbreakable.

Whose problem is it?

Here's the reframing most teams miss. If you build on a hosted model, the provider's guardrails are mostly about the provider's liability and brand, not your application's security. A user jailbreaking ChatGPT into writing something offensive is OpenAI's reputational problem. The question for you is different: what can a user actually do through your application by misbehaving?

The misuse that matters to you

Focus your effort on application-level misuse, which is yours to own regardless of how good the provider's guardrails are:

Scope escape: getting your customer-service bot to act as a general-purpose assistant, burning your tokens on the attacker's tasks.
Capability abuse: coaxing the model into using a tool or accessing data outside the intended task (this is the confused-deputy problem again).
Resource exhaustion: driving expensive operations (huge contexts, long agent loops) to run up your bill or degrade service for others.
Reputational output: your branded assistant producing content that embarrasses you, because in your UI it speaks for you.

The defences are the same architectural ones from the injection chapter, because the threat is the same: untrusted instructions meeting capability. Constrain the model's scope and tools, rate-limit and budget-cap per user, and validate that outputs and actions stay within the application's intended bounds. You are not trying to make the model refuse everything bad in the world; you're trying to make sure that within your app, it can only do your app's job.
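Two of those controls, a tool allowlist and a per-user budget cap, can be sketched in a few lines. This is a minimal illustration, not a production design: the tool names, the budget figures, and the UserBudget and PolicyGate classes are all hypothetical, and it assumes your application code sits between the model and every tool call, so the check runs outside the model and cannot be talked out of its decision.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical allowlist: the only tools this app's job requires.
ALLOWED_TOOLS = {"lookup_order", "create_ticket"}


@dataclass
class UserBudget:
    """Fixed-window token budget for one user (illustrative numbers)."""
    max_tokens: int = 20_000
    window_seconds: float = 3600.0
    used: int = 0
    window_start: float = field(default_factory=time.monotonic)

    def try_spend(self, tokens: int, now: Optional[float] = None) -> bool:
        """Record the spend and return True, or return False if over budget."""
        now = time.monotonic() if now is None else now
        # Start a fresh window once the current one has expired.
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.used = 0
        if self.used + tokens > self.max_tokens:
            return False
        self.used += tokens
        return True


class PolicyGate:
    """Checks that run in application code, before any model or tool call."""

    def __init__(self) -> None:
        self._budgets: dict[str, UserBudget] = {}

    def check_tool_call(self, tool_name: str) -> bool:
        # Deny by default: anything not explicitly listed is refused,
        # whatever the model was persuaded to ask for.
        return tool_name in ALLOWED_TOOLS

    def check_budget(self, user_id: str, tokens: int) -> bool:
        # Each user gets their own budget, created on first use.
        budget = self._budgets.setdefault(user_id, UserBudget())
        return budget.try_spend(tokens)


gate = PolicyGate()
print(gate.check_tool_call("lookup_order"))  # allowlisted tool
print(gate.check_tool_call("send_email"))    # not on the allowlist
```

The design choice worth noting is deny-by-default: a jailbroken model can request any tool it likes, but only the listed ones ever execute, and a user who exhausts their budget is refused before the expensive call is made.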

Content safety where it does matter

If your product genuinely exposes open-ended generation to the public under your brand (a writing assistant, a public chatbot), then content safety is part of your problem, and provider guardrails alone won't cover your specific risks. Add an output-moderation layer (a classifier or a moderation API) tuned to the categories that matter for your context and audience, and log and review what gets flagged. Match the control to the actual exposure, rather than treating every app as if it's one jailbreak away from catastrophe.
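An output-moderation layer of this kind might look like the sketch below. It assumes a classifier that returns a score between 0 and 1 per category; the Classifier interface, the category names, the thresholds, and the fallback message are all placeholders for illustration, not any particular vendor's API. In practice the classify argument would wrap whichever classifier or moderation API you have chosen.

```python
import logging
from typing import Callable, Mapping

logger = logging.getLogger("moderation")

# Assumed interface: a classifier maps text to a score in [0, 1] per category.
Classifier = Callable[[str], Mapping[str, float]]

# Hypothetical thresholds, tuned per category for your audience.
THRESHOLDS = {"harassment": 0.5, "self_harm": 0.2}

FALLBACK = "Sorry, I can't help with that here."


def moderate_output(text: str, classify: Classifier, user_id: str) -> str:
    """Return the text unchanged, or a fallback if any category is flagged."""
    scores = classify(text)
    # Only categories we have chosen to care about can trigger a block.
    flagged = {
        category: score
        for category, score in scores.items()
        if category in THRESHOLDS and score >= THRESHOLDS[category]
    }
    if flagged:
        # Log enough to review later: who, and which categories fired.
        logger.warning(
            "flagged output user=%s categories=%s", user_id, sorted(flagged)
        )
        return FALLBACK
    return text


def toy_classifier(text: str) -> Mapping[str, float]:
    """Stand-in for a real classifier, for demonstration only."""
    return {"harassment": 0.9 if "idiot" in text.lower() else 0.0}


print(moderate_output("Here is your draft.", toy_classifier, "user-1"))
print(moderate_output("You idiot.", toy_classifier, "user-1"))
```

Keeping the thresholds in one table makes the tuning explicit: a children's education product and an adult fiction tool would share the code and differ only in the categories and numbers.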

In one line each

A jailbreak gets around the model's trained guardrails; new ones appear constantly because safety is a tendency, not a hard rule.
Provider guardrails are mostly about the provider's liability; your problem is what a user can do through your app.
Focus on application misuse: scope escape, capability abuse, resource exhaustion, reputational output.
Defend with the same architectural controls as injection; add real content moderation only where you expose open-ended public generation.

Where to go next

Chapter 5: The AI supply chain