ai-sre-incident-response

AI SRE Incident Response

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "ai-sre-incident-response" with this command: npx skills add bagelhole/devops-security-agent-skills/bagelhole-devops-security-agent-skills-ai-sre-incident-response

Apply SRE rigor to AI systems, where incidents include not only outages but also quality regressions, unsafe outputs, and budget explosions.

AI Incident Classes

  • Availability incident: model/provider unavailable, timeout storm.

  • Quality incident: answer accuracy or tool-call success rate drops below its SLO.

  • Safety incident: harmful or policy-violating outputs increase.

  • Cost incident: unexpected token or provider spend spike.
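
The four classes above can be sketched as a small classifier over a metrics snapshot. This is a minimal illustration, not part of the upstream skill: the `Snapshot` fields and every threshold below are hypothetical and should be replaced with your own SLOs.

```python
from dataclasses import dataclass
from enum import Enum


class IncidentClass(Enum):
    AVAILABILITY = "availability"
    QUALITY = "quality"
    SAFETY = "safety"
    COST = "cost"


@dataclass
class Snapshot:
    # Hypothetical fields, chosen only to mirror the four incident classes.
    error_rate: float      # fraction of failed or timed-out requests
    eval_pass_rate: float  # fraction of sampled answers passing evals
    violation_rate: float  # fraction of outputs flagged as policy-violating
    spend_ratio: float     # current spend divided by expected spend


def classify(s: Snapshot) -> list[IncidentClass]:
    """Return every incident class whose (assumed) threshold is breached."""
    hits = []
    if s.error_rate > 0.05:        # assumed availability threshold
        hits.append(IncidentClass.AVAILABILITY)
    if s.eval_pass_rate < 0.90:    # assumed quality SLO
        hits.append(IncidentClass.QUALITY)
    if s.violation_rate > 0.01:    # assumed safety threshold
        hits.append(IncidentClass.SAFETY)
    if s.spend_ratio > 2.0:        # assumed cost threshold
        hits.append(IncidentClass.COST)
    return hits
```

A single snapshot can match several classes at once, which is why the function returns a list rather than one value.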

Severity Framework (Example)

  • SEV1: user-facing outage, critical compliance risk, or active data leak.

  • SEV2: major degradation affecting key flows.

  • SEV3: limited impact or internal-only issue.
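
One way to encode the example framework above is a function that checks the SEV1 conditions first and falls through to lower severities. The boolean parameter names are hypothetical; how each flag gets set is left to your detection tooling.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


def assign_severity(
    user_facing_outage: bool,
    compliance_risk: bool,
    active_data_leak: bool,
    key_flows_degraded: bool,
) -> Severity:
    """Map incident facts to the example severity levels, most severe first."""
    if user_facing_outage or compliance_risk or active_data_leak:
        return Severity.SEV1
    if key_flows_degraded:
        return Severity.SEV2
    # Limited impact or internal-only issue.
    return Severity.SEV3
```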

Golden Signals for AI Services

  • Request success rate

  • Latency (queue + generation + tool execution)

  • Hallucination/groundedness proxy metrics

  • Cost per minute and per tenant

  • Guardrail violation rate
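
Most of these signals can be computed from per-request records. The sketch below assumes a hypothetical `RequestRecord` shape and uses a nearest-rank p95 for latency. It leaves out the hallucination/groundedness proxy, since that depends on whichever evaluator you run.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RequestRecord:
    # Hypothetical record shape; adapt to your own tracing schema.
    tenant: str
    ok: bool
    queue_ms: float
    generation_ms: float
    tool_ms: float
    cost_usd: float
    guardrail_violation: bool


def golden_signals(records: list[RequestRecord], window_minutes: float) -> dict:
    """Aggregate success rate, p95 latency, cost, and guardrail violations."""
    n = len(records)
    if n == 0:
        return {}
    # Latency is the sum of queue, generation, and tool execution time.
    latencies = sorted(r.queue_ms + r.generation_ms + r.tool_ms for r in records)
    p95 = latencies[math.ceil(0.95 * n) - 1]  # nearest-rank percentile
    cost_by_tenant = defaultdict(float)
    for r in records:
        cost_by_tenant[r.tenant] += r.cost_usd
    return {
        "success_rate": sum(r.ok for r in records) / n,
        "p95_latency_ms": p95,
        "cost_per_minute": sum(cost_by_tenant.values()) / window_minutes,
        "cost_by_tenant": dict(cost_by_tenant),
        "guardrail_violation_rate": sum(r.guardrail_violation for r in records) / n,
    }
```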

Response Playbooks

Model Outage

  • Freeze deployments.

  • Shift traffic to fallback model/provider.

  • Enforce stricter rate limits.

  • Communicate ETA and mitigation.
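
The traffic-shift step can be sketched as an ordered fallback chain. This is a minimal illustration: `call_with_fallback`, `AllProvidersFailed`, and the two stub providers are hypothetical names, and a real client would catch narrower error types than `Exception`.

```python
from typing import Callable, Sequence


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


def call_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, response) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # assumption: any exception means "try the next one"
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stub providers standing in for real model clients.
def primary(prompt: str) -> str:
    raise TimeoutError("provider timeout")


def fallback(prompt: str) -> str:
    return f"fallback answer to: {prompt}"
```

Returning the provider name alongside the response lets dashboards show how much traffic the fallback is actually serving during the outage.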

Quality Regression

  • Roll back prompt/model version.

  • Disable risky optimization flags.

  • Increase sampling for trace review.

  • Re-run latest eval baseline.
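
After re-running the eval baseline, a simple comparison can flag which metrics regressed and so justify the rollback. The metric names and the `max_drop` tolerance below are assumptions for illustration.

```python
def regressed_metrics(
    baseline: dict[str, float],
    current: dict[str, float],
    max_drop: float = 0.02,  # assumed tolerance; tune per metric in practice
) -> list[str]:
    """Return names of metrics that fell more than max_drop below baseline.

    A metric missing from `current` is treated as 0.0, so it counts as regressed.
    """
    return sorted(
        name
        for name, base in baseline.items()
        if base - current.get(name, 0.0) > max_drop
    )
```

For example, with a baseline of `{"accuracy": 0.92, "tool_success": 0.97}` and a current run of `{"accuracy": 0.85, "tool_success": 0.96}`, only `accuracy` is reported.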

Cost Spike

  • Identify top tenants/routes/models.

  • Enable cache + cheaper fallback path.

  • Apply temporary token caps.

  • Open a postmortem with prevention actions.
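
The first and third steps can be sketched with the standard library. The usage tuple layout `(tenant, route, model, cost_usd)` and both function names are hypothetical.

```python
from collections import Counter
from typing import Iterable


def top_spenders(
    usage: Iterable[tuple[str, str, str, float]],
    n: int = 3,
) -> list[tuple[tuple[str, str, str], float]]:
    """Sum cost per (tenant, route, model) and return the n largest."""
    totals = Counter()
    for tenant, route, model, cost_usd in usage:
        totals[(tenant, route, model)] += cost_usd
    return totals.most_common(n)


def apply_token_cap(requested_max_tokens: int, cap: int) -> int:
    """Clamp a request's max tokens to a temporary incident-time cap."""
    return min(requested_max_tokens, cap)
```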

Postmortem Requirements

  • Timeline with detector and responder timestamps

  • Blast radius by tenant and feature

  • Missed signals and alert tuning actions

  • Concrete hardening tasks with owners and due dates
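
These requirements can be enforced with a structured record that reports which sections are still empty. The sketch below is one possible shape; all class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date, datetime, timedelta


@dataclass
class HardeningTask:
    description: str
    owner: str
    due: date


@dataclass
class Postmortem:
    detected_at: datetime                 # when the detector fired
    responded_at: datetime                # when a responder acknowledged
    blast_radius: dict[str, list[str]]    # tenant -> affected features
    missed_signals: list[str] = field(default_factory=list)
    alert_tuning_actions: list[str] = field(default_factory=list)
    tasks: list[HardeningTask] = field(default_factory=list)

    @property
    def time_to_respond(self) -> timedelta:
        return self.responded_at - self.detected_at

    def missing_sections(self) -> list[str]:
        """List required sections that are empty or incomplete."""
        missing = []
        if not self.blast_radius:
            missing.append("blast_radius")
        if not self.missed_signals:
            missing.append("missed_signals")
        if not self.alert_tuning_actions:
            missing.append("alert_tuning_actions")
        # Every hardening task needs an owner; due dates are required by the type.
        if not self.tasks or any(not t.owner for t in self.tasks):
            missing.append("hardening_tasks")
        return missing
```

A review gate could refuse to close the postmortem while `missing_sections()` is non-empty.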

Related Skills

  • incident-response - Standard incident process and evidence

  • alerting-oncall - Paging and escalation policy

  • llm-cost-optimization - Spend controls and efficiency patterns

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.