Kubernetes Incident Response
Runbooks and diagnostic workflows for common Kubernetes incidents.
When to Apply
Use this skill when:
-
User mentions: "incident", "outage", "emergency", "down", "not working"
-
Operations: emergency response, production issues, service degradation
-
Keywords: "urgent", "broken", "fix", "restore", "recover"
Priority Rules
Priority Rule Impact Tools
1 Check control plane first CRITICAL get_pods(namespace="kube-system")
2 Assess node health CRITICAL get_nodes
3 Gather events before changes HIGH get_events
4 Document timeline HIGH Manual notes
5 Rollback if safe MEDIUM rollback_deployment
Quick Reference
Incident First Tool Next Steps
Pod failure get_pod_logs(previous=True)
describe_pod , get_events
Node down describe_node
Check kubelet logs
Service unreachable get_endpoints
get_network_policies
Control plane get_pods(namespace="kube-system")
Check API server logs
Incident Triage
Quick Health Check
get_nodes() get_pods(namespace="kube-system") get_events(namespace)
Severity Assessment
Indicator Severity Action
Multiple nodes NotReady Critical Escalate immediately
kube-system pods failing Critical Control plane issue
Single pod CrashLoop Medium Debug pod
High latency Medium Check resources
Runbook: Pod Failures
CrashLoopBackOff
get_pod_logs(name, namespace, previous=True) describe_pod(name, namespace) get_events(namespace, field_selector="involvedObject.name=<pod>") get_pod_metrics(name, namespace)
Common Causes:
-
OOMKilled → Increase memory limits
-
Exit code 1 → Application error in logs
-
Exit code 137 → Killed by OOM or SIGKILL
-
Exit code 143 → Graceful SIGTERM
ImagePullBackOff
describe_pod(name, namespace) get_secrets(namespace)
Pending Pod
describe_pod(name, namespace) get_nodes() get_events(namespace)
Runbook: Node Issues
Node NotReady
describe_node(name) get_events(namespace="", field_selector="involvedObject.name=<node>") node_logs_tool(name, "kubelet")
Node DiskPressure
describe_node(name) get_pods(field_selector="spec.nodeName=<node>")
Runbook: Network Issues
Service Not Accessible
get_services(namespace) get_endpoints(namespace) get_pods(namespace, label_selector="<service-selector>") get_network_policies(namespace)
DNS Resolution Failures
get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns") get_pod_logs("coredns-xxx", "kube-system")
With Cilium
cilium_status_tool() cilium_endpoints_list_tool(namespace) hubble_flows_query_tool(namespace)
With Istio
istio_analyze_tool(namespace) istio_proxy_status_tool()
Runbook: Storage Issues
PVC Pending
describe_pvc(name, namespace) get_storage_classes() get_events(namespace)
Pod Stuck in ContainerCreating
describe_pod(name, namespace) get_pvc(namespace) get_events(namespace)
Runbook: Control Plane Issues
API Server Unavailable
get_pods(namespace="kube-system", label_selector="component=kube-apiserver") get_events(namespace="kube-system")
etcd Issues
get_pods(namespace="kube-system", label_selector="component=etcd") get_pod_logs("etcd-xxx", "kube-system")
Emergency Actions
Force Delete Pod
delete_pod(name, namespace, grace_period=0, force=True)
Rollback Deployment
rollback_deployment(name, namespace, revision=0)
Helm Rollback
rollback_helm_release(name, namespace, revision=1)
Diagnostic Collection Script
For comprehensive incident diagnostics, see scripts/collect-diagnostics.py.
Multi-Cluster Incident Response
Check all clusters:
for context in ["prod-1", "prod-2", "staging"]: get_nodes(context=context) get_pods(namespace="kube-system", context=context) get_events(namespace="kube-system", context=context)
Post-Incident
Document Timeline
-
When did the incident start?
-
What was the impact?
-
What was the root cause?
-
What fixed it?
Prevent Recurrence
-
Add monitoring/alerting
-
Improve resource limits
-
Add readiness probes
-
Document runbook
Related Skills
-
k8s-troubleshoot - Detailed debugging
-
k8s-security - Security incidents