ai-sre-incident-response | V50.AI

AI SRE Incident Response

Safety Notice

This listing is imported from the skills.sh public index metadata. Review the upstream SKILL.md and repository scripts before running them.

Install skill "ai-sre-incident-response" with this command: npx skills add bagelhole/devops-security-agent-skills/bagelhole-devops-security-agent-skills-ai-sre-incident-response

AI SRE Incident Response

Apply SRE rigor to AI systems, where incidents include not only outages but also quality regressions, unsafe outputs, and budget overruns.

AI Incident Classes

Availability incident: model/provider unavailable, timeout storm.
Quality incident: answer accuracy or tool-call success rate drops below its SLO.
Safety incident: harmful or policy-violating outputs increase.
Cost incident: unexpected token or provider spend spike.
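
The four classes above can be turned into a simple triage check. The following is a minimal sketch, not part of the upstream skill: the `Snapshot` fields, the `Thresholds` defaults, and the `classify` function are all hypothetical names and values chosen for illustration, and real thresholds would come from your own SLOs.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class IncidentClass(Enum):
    AVAILABILITY = "availability"
    QUALITY = "quality"
    SAFETY = "safety"
    COST = "cost"


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical monitoring snapshot; field names are illustrative.
    error_rate: float         # fraction of failed or timed-out requests
    eval_accuracy: float      # fraction of sampled answers judged correct
    violation_rate: float     # fraction of outputs flagged by guardrails
    spend_per_min_usd: float  # provider spend per minute


@dataclass(frozen=True)
class Thresholds:
    # Placeholder limits (assumptions); replace with your own SLO values.
    max_error_rate: float = 0.05
    min_eval_accuracy: float = 0.90
    max_violation_rate: float = 0.01
    max_spend_per_min_usd: float = 20.0


def classify(snapshot: Snapshot, limits: Thresholds = Thresholds()) -> List[IncidentClass]:
    """Return every incident class whose threshold is breached."""
    hits = []
    if snapshot.error_rate > limits.max_error_rate:
        hits.append(IncidentClass.AVAILABILITY)
    if snapshot.eval_accuracy < limits.min_eval_accuracy:
        hits.append(IncidentClass.QUALITY)
    if snapshot.violation_rate > limits.max_violation_rate:
        hits.append(IncidentClass.SAFETY)
    if snapshot.spend_per_min_usd > limits.max_spend_per_min_usd:
        hits.append(IncidentClass.COST)
    return hits
```

A single snapshot can breach several thresholds at once, so the function returns a list rather than one class.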

Severity Framework (Example)

SEV1: user-facing outage, critical compliance risk, or active data leak.
SEV2: major degradation affecting key flows.
SEV3: limited impact or internal-only issue.
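
As a rough illustration, the example framework above maps onto a small decision function. The function name `assign_severity` and its boolean inputs are assumptions made for this sketch; how each condition is actually detected is left to your own tooling.

```python
def assign_severity(
    *,
    user_facing_outage: bool = False,
    critical_compliance_risk: bool = False,
    active_data_leak: bool = False,
    key_flow_degraded: bool = False,
) -> str:
    """Map incident facts to the example SEV levels, most severe first."""
    if user_facing_outage or critical_compliance_risk or active_data_leak:
        return "SEV1"
    if key_flow_degraded:
        return "SEV2"
    # Anything else is treated as limited impact or internal-only.
    return "SEV3"
```

Checking the SEV1 conditions first means an incident that matches several levels is always assigned the most severe one.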

Golden Signals for AI Services

Request success rate
Latency (queue + generation + tool execution)
Hallucination/groundedness proxy metrics
Cost per minute and per tenant
Guardrail violation rate
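
To make the signals above concrete, here is a minimal sketch that derives them from a batch of per-request records. The `RequestRecord` shape, the `golden_signals` function, and the choice of a nearest-rank p95 for latency are all assumptions for illustration; the `grounded` flag stands in for whatever groundedness proxy you use.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RequestRecord:
    # Hypothetical per-request log record; field names are illustrative.
    ok: bool
    queue_ms: float
    generation_ms: float
    tool_ms: float
    grounded: bool             # result of a groundedness proxy check
    cost_usd: float
    tenant: str
    guardrail_violation: bool


def golden_signals(records: List[RequestRecord], window_minutes: float) -> Dict[str, object]:
    """Aggregate the five signals over one time window."""
    if not records or window_minutes <= 0:
        raise ValueError("need at least one record and a positive window")
    n = len(records)
    # Latency is the sum of queue, generation, and tool execution time.
    latencies = sorted(r.queue_ms + r.generation_ms + r.tool_ms for r in records)
    p95_index = max(0, math.ceil(0.95 * n) - 1)  # nearest-rank percentile
    cost_per_tenant: Dict[str, float] = {}
    for r in records:
        cost_per_tenant[r.tenant] = cost_per_tenant.get(r.tenant, 0.0) + r.cost_usd
    return {
        "success_rate": sum(r.ok for r in records) / n,
        "latency_p95_ms": latencies[p95_index],
        "groundedness_rate": sum(r.grounded for r in records) / n,
        "cost_per_minute_usd": sum(cost_per_tenant.values()) / window_minutes,
        "cost_per_tenant_usd": cost_per_tenant,
        "guardrail_violation_rate": sum(r.guardrail_violation for r in records) / n,
    }
```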

Response Playbooks

Model Outage

Freeze deployments.
Shift traffic to fallback model/provider.
Enforce stricter rate limits.
Communicate ETA and mitigation.
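
The traffic-shift step can be sketched as an ordered fallback chain. This is an illustrative assumption, not the skill's implementation: `ProviderError`, `call_with_fallback`, and the provider callables are hypothetical, and a real client would also map timeouts and HTTP errors onto the exception type.

```python
from typing import Callable, List, Sequence, Tuple


class ProviderError(Exception):
    """Raised by a provider callable when the model or provider is unavailable."""


def call_with_fallback(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, response) from the first success."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            # Record the failure and move on to the next provider in the chain.
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the response lets dashboards show how much traffic is being served by the fallback path during the incident.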

Quality Regression

Roll back prompt/model version.
Disable risky optimization flags.
Increase sampling for trace review.
Re-run latest eval baseline.
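
Once the eval baseline has been re-run, the comparison against the stored baseline can be as simple as the sketch below. The metric names, the `regressions` function, and the default `tolerance` are assumptions for illustration; scores are assumed to be fractions where higher is better.

```python
from typing import Dict


def regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> Dict[str, float]:
    """Return the score change for each metric that dropped by more than `tolerance`.

    A metric missing from `current` is treated as a score of 0.0.
    """
    report: Dict[str, float] = {}
    for name, base_score in baseline.items():
        delta = current.get(name, 0.0) - base_score
        if -delta > tolerance:
            report[name] = delta
    return report
```

An empty result after a rollback is one signal that the regression came from the reverted prompt or model version.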

Cost Spike

Identify top tenants/routes/models.
Enable cache + cheaper fallback path.
Apply temporary token caps.
Open postmortem with prevention actions.
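
The first and third steps above can be sketched as follows. The usage-row layout `(tenant, route, model, cost_usd)` and the helper names `top_spenders` and `capped_max_tokens` are hypothetical; the actual source of usage data depends on your billing or tracing pipeline.

```python
from collections import Counter
from typing import Iterable, List, Tuple

UsageRow = Tuple[str, str, str, float]  # (tenant, route, model, cost_usd)


def top_spenders(usage: Iterable[UsageRow], n: int = 3) -> List[Tuple[Tuple[str, str, str], float]]:
    """Sum cost per (tenant, route, model) and return the n largest totals."""
    totals: Counter = Counter()
    for tenant, route, model, cost_usd in usage:
        totals[(tenant, route, model)] += cost_usd
    return totals.most_common(n)


def capped_max_tokens(requested: int, cap: int) -> int:
    """Apply a temporary per-request token cap without raising small requests."""
    return min(requested, cap)
```

Grouping by the full (tenant, route, model) triple keeps the three dimensions from the playbook visible in one ranking; grouping by a single dimension is a straightforward variation.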

Postmortem Requirements

Timeline with detector and responder timestamps
Blast radius by tenant and feature
Missed signals and alert tuning actions
Concrete hardening tasks with owners and due dates
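
One way to enforce these requirements is a structured record with a completeness check. The following is a sketch under assumptions: the `Postmortem` and `HardeningTask` classes and their field names are hypothetical, not a format defined by the skill.

```python
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import Dict, List, Optional


@dataclass
class HardeningTask:
    description: str
    owner: str
    due: date


@dataclass
class Postmortem:
    # Hypothetical record layout mirroring the four requirements above.
    detected_at: Optional[datetime] = None
    responded_at: Optional[datetime] = None
    blast_radius: Dict[str, List[str]] = field(default_factory=dict)  # tenant -> features
    missed_signals: List[str] = field(default_factory=list)
    alert_tuning_actions: List[str] = field(default_factory=list)
    tasks: List[HardeningTask] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of required sections that are still incomplete."""
        missing = []
        if self.detected_at is None or self.responded_at is None:
            missing.append("timeline")
        if not self.blast_radius:
            missing.append("blast radius")
        if not self.missed_signals or not self.alert_tuning_actions:
            missing.append("missed signals and alert tuning")
        if not self.tasks or not all(t.owner for t in self.tasks):
            missing.append("hardening tasks")
        return missing
```

A review gate could refuse to close the incident until `missing_sections()` returns an empty list.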

Related Skills

incident-response - Standard incident process and evidence
alerting-oncall - Paging and escalation policy
llm-cost-optimization - Spend controls and efficiency patterns

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
