AI SRE Incident Response
Apply SRE rigor to AI systems, where incidents include quality regressions, unsafe outputs, and runaway spend as well as outages.
AI Incident Classes
- Availability incident: model/provider unavailable, timeout storm.
- Quality incident: answer accuracy or tool success drops below SLO.
- Safety incident: harmful or policy-violating outputs increase.
- Cost incident: unexpected token or provider spend spike.
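As a rough illustration, the four classes above can be derived from a snapshot of service metrics. This is a minimal sketch: the `Snapshot` and `Limits` types, the field names, and every threshold value are hypothetical and would need to come from your own SLOs.

```python
from dataclasses import dataclass
from enum import Enum


class IncidentClass(Enum):
    AVAILABILITY = "availability"
    QUALITY = "quality"
    SAFETY = "safety"
    COST = "cost"


@dataclass
class Snapshot:
    error_rate: float        # fraction of requests that failed or timed out
    eval_accuracy: float     # answer accuracy or tool success rate
    violation_rate: float    # fraction of outputs flagged as policy-violating
    spend_per_minute: float  # provider spend in USD per minute


@dataclass
class Limits:
    # Placeholder thresholds (assumptions, not recommendations).
    max_error_rate: float = 0.05
    min_eval_accuracy: float = 0.90
    max_violation_rate: float = 0.01
    max_spend_per_minute: float = 50.0


def classify(s: Snapshot, limits: Limits) -> list[IncidentClass]:
    """Return every incident class whose limit is breached by the snapshot."""
    classes = []
    if s.error_rate > limits.max_error_rate:
        classes.append(IncidentClass.AVAILABILITY)
    if s.eval_accuracy < limits.min_eval_accuracy:
        classes.append(IncidentClass.QUALITY)
    if s.violation_rate > limits.max_violation_rate:
        classes.append(IncidentClass.SAFETY)
    if s.spend_per_minute > limits.max_spend_per_minute:
        classes.append(IncidentClass.COST)
    return classes
```

A single event can fall into more than one class (for example, a timeout storm that also triggers expensive retries), which is why the function returns a list.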
Severity Framework (Example)
- SEV1: user-facing outage, critical compliance risk, or active data leak.
- SEV2: major degradation affecting key flows.
- SEV3: limited impact or internal-only issue.
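The example framework above could be encoded as a small decision function so that triage is consistent across responders. The `Impact` type and its field names are assumptions made for illustration; the rules mirror the three example levels only.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    user_facing_outage: bool = False
    critical_compliance_risk: bool = False
    active_data_leak: bool = False
    key_flow_degraded: bool = False


def severity(impact: Impact) -> str:
    """Map an impact assessment to the example SEV levels, worst first."""
    if (
        impact.user_facing_outage
        or impact.critical_compliance_risk
        or impact.active_data_leak
    ):
        return "SEV1"
    if impact.key_flow_degraded:
        return "SEV2"
    # Limited impact or internal-only issue.
    return "SEV3"
```

Checking the most severe conditions first means an incident that matches several rules is always assigned the highest applicable level.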
Golden Signals for AI Services
- Request success rate
- Latency (queue + generation + tool execution)
- Hallucination/groundedness proxy metrics
- Cost per minute and per tenant
- Guardrail violation rate
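Most of these signals can be computed from per-request records. The sketch below assumes a hypothetical `RequestRecord` shape and uses a nearest-rank p95 for latency; groundedness proxies are left out because they depend on whatever eval tooling is in place.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RequestRecord:
    tenant: str
    ok: bool
    queue_ms: float
    generation_ms: float
    tool_ms: float
    cost_usd: float
    guardrail_violation: bool


def golden_signals(records: list[RequestRecord], window_minutes: float) -> dict:
    """Aggregate one observation window of requests into golden signals."""
    if not records or window_minutes <= 0:
        raise ValueError("need at least one record and a positive window")
    n = len(records)
    # End-to-end latency is the sum of queue, generation, and tool time.
    latencies = sorted(r.queue_ms + r.generation_ms + r.tool_ms for r in records)
    p95_index = math.ceil(0.95 * n) - 1  # nearest-rank percentile
    cost_by_tenant: dict[str, float] = defaultdict(float)
    for r in records:
        cost_by_tenant[r.tenant] += r.cost_usd
    return {
        "success_rate": sum(r.ok for r in records) / n,
        "latency_p95_ms": latencies[p95_index],
        "cost_per_minute": sum(cost_by_tenant.values()) / window_minutes,
        "cost_by_tenant": dict(cost_by_tenant),
        "guardrail_violation_rate": sum(r.guardrail_violation for r in records) / n,
    }
```

In practice these values would usually be emitted by a metrics pipeline rather than computed in a batch, but the definitions stay the same.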
Response Playbooks
Model Outage
- Freeze deployments.
- Shift traffic to fallback model/provider.
- Enforce stricter rate limits.
- Communicate ETA and mitigation.
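The traffic-shift step can be sketched as an ordered fallback chain. Here each provider is represented by a plain callable that takes a prompt and returns text; these callables, their names, and the `AllProvidersFailed` exception are hypothetical stand-ins for real client wrappers.

```python
from typing import Callable, Sequence


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try providers in priority order; return (provider_name, completion)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Returning the provider name alongside the completion makes it possible to track how much traffic is being served by the fallback path during the incident.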
Quality Regression
- Roll back prompt/model version.
- Disable risky optimization flags.
- Increase sampling for trace review.
- Re-run latest eval baseline.
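After re-running the eval baseline, the comparison against the stored scores can be automated. This is a minimal sketch that assumes higher-is-better metrics stored as plain dictionaries; the function name, the metric names, and the 0.02 tolerance are illustrative assumptions.

```python
def regressed_metrics(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return metrics whose score dropped by more than `tolerance`.

    A metric missing from `current` is reported as a full drop, since a
    missing eval result should not silently pass.
    """
    drops = {}
    for name, base in baseline.items():
        drop = base - current.get(name, 0.0)
        if drop > tolerance:
            drops[name] = drop
    return drops


def should_roll_back(baseline: dict[str, float], current: dict[str, float]) -> bool:
    """Suggest a rollback if any tracked metric regressed beyond tolerance."""
    return bool(regressed_metrics(baseline, current))
```

For example, a baseline of `{"accuracy": 0.90}` against a current run of `{"accuracy": 0.80}` would flag `accuracy` and suggest rolling back the prompt or model version.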
Cost Spike
- Identify top tenants/routes/models.
- Enable cache + cheaper fallback path.
- Apply temporary token caps.
- Open postmortem with prevention actions.
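The first and third steps lend themselves to a short sketch: rank spend by tenant, route, and model, then clamp the output budget of new requests. The usage-row layout and both function names are assumptions for illustration; real billing exports will differ.

```python
from collections import Counter

# Each usage row is (tenant, route, model, cost_usd); a hypothetical layout.
UsageRow = tuple[str, str, str, float]


def top_spenders(rows: list[UsageRow], n: int = 3) -> list[tuple[tuple[str, str, str], float]]:
    """Sum spend per (tenant, route, model) and return the n largest."""
    totals: Counter = Counter()
    for tenant, route, model, cost_usd in rows:
        totals[(tenant, route, model)] += cost_usd
    return totals.most_common(n)


def capped_max_tokens(requested: int, temporary_cap: int) -> int:
    """Apply a temporary per-request output token cap during the incident."""
    return min(requested, temporary_cap)
```

Grouping on the full (tenant, route, model) key helps distinguish a single abusive tenant from a route that became expensive for everyone, which call for different mitigations.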
Postmortem Requirements
- Timeline with detector and responder timestamps
- Blast radius by tenant and feature
- Missed signals and alert tuning actions
- Concrete hardening tasks with owners and due dates
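Capturing detector and responder timestamps in a structured form makes it straightforward to derive response intervals for the postmortem. The `Timeline` type and its field names below are hypothetical; adapt them to whatever your incident tracker records.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Timeline:
    started_at: datetime        # when the incident actually began
    detected_at: datetime       # when a detector or alert first fired
    responder_ack_at: datetime  # when a responder acknowledged the page
    mitigated_at: datetime      # when user impact ended


def response_intervals(t: Timeline) -> dict[str, timedelta]:
    """Derive detection, acknowledgement, and mitigation intervals."""
    ordered = [t.started_at, t.detected_at, t.responder_ack_at, t.mitigated_at]
    if ordered != sorted(ordered):
        raise ValueError("timeline timestamps are out of order")
    return {
        "time_to_detect": t.detected_at - t.started_at,
        "time_to_ack": t.responder_ack_at - t.detected_at,
        "time_to_mitigate": t.mitigated_at - t.started_at,
    }
```

A long time-to-detect relative to time-to-mitigate points toward missed signals and alert tuning, while the reverse points toward playbook or tooling gaps.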
Related Skills
- incident-response - Standard incident process and evidence
- alerting-oncall - Paging and escalation policy
- llm-cost-optimization - Spend controls and efficiency patterns