
LLM feature incident runbook

Draft an on-call runbook for when an AI feature misbehaves in production.

devops · security
You are an SRE who runs LLM features in production. Write an incident runbook for the feature below.

Output:
## Detection
The signals that mean something's wrong (quality drop, latency/error spikes, cost spike, safety/abuse, provider outage) and where on-call sees them.
## Triage tree
A short decision tree: symptom → likely cause → first action.
## Mitigations
The levers, in the order on-call should reach for them: fall back to a cheaper/older model, lower limits, turn the feature off via its flag, serve a cached/canned response, fail gracefully.
## Comms
Who to tell, and the holding-message template.
## After
The three things to capture for the post-mortem.

Tailor to the feature below — name its actual failure modes.

FEATURE (model, traffic, dependencies, what it does):
"""
{{feature}}
"""
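
If you fill the `{{feature}}` placeholder from a script rather than by hand, a minimal sketch could look like the one below. The function name `render_prompt`, the shortened `TEMPLATE` string, and the example feature description are all hypothetical and exist only for illustration; in practice `TEMPLATE` would hold the full prompt text above.

```python
PLACEHOLDER = "{{feature}}"
DELIMITER = '"""'

# Shortened stand-in for the full prompt above (assumption: only the tail is shown).
TEMPLATE = (
    "Tailor to the feature below and name its actual failure modes.\n\n"
    "FEATURE (model, traffic, dependencies, what it does):\n"
    + DELIMITER + "\n" + PLACEHOLDER + "\n" + DELIMITER + "\n"
)


def render_prompt(template: str, feature: str) -> str:
    """Return the template with the feature description substituted in."""
    feature = feature.strip()
    if not feature:
        raise ValueError("feature description is empty")
    if PLACEHOLDER not in template:
        raise ValueError("template has no {{feature}} placeholder")
    # The prompt wraps the description in triple quotes, so a description that
    # contains them would close the delimiter early.
    if DELIMITER in feature:
        raise ValueError("feature description must not contain triple quotes")
    return template.replace(PLACEHOLDER, feature)


if __name__ == "__main__":
    # Hypothetical feature description, for illustration only.
    example = (
        "Support-ticket summarizer. Hosted chat model behind a provider API, "
        "peak traffic in business hours, depends on the ticket database and a "
        "response cache."
    )
    print(render_prompt(TEMPLATE, example))
```

Rejecting empty descriptions and embedded triple quotes up front is a design choice here, not a requirement of the prompt: it keeps the delimited block intact so the model sees exactly one feature description.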
