# Incident Response and Remediation

Patterns for diagnosing and fixing production issues.
## Healer Mode Workflow

- **Investigate** - Gather metrics, logs, and system state
- **Diagnose** - Identify root cause before fixing
- **Fix** - Implement minimal targeted fix
- **Validate** - Confirm metrics improve after deployment
- **Document** - Store learnings for future incidents
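The workflow above can be sketched as a simple loop. This is a minimal illustration, not a real implementation - every function here is a hypothetical stand-in for actual tool calls (Prometheus queries, kubectl, memory search):

```python
# Sketch of the healer workflow. All helpers are hypothetical placeholders.

def investigate():
    # Gather metrics, logs, and system state (stubbed sample evidence).
    return {"error_rate": 0.12, "recent_deploy": "v1.4.2"}

def diagnose(evidence):
    # Identify the root cause before fixing; None means nothing to fix.
    if evidence["error_rate"] > 0.05:
        return f"regression in {evidence['recent_deploy']}"
    return None

def heal():
    evidence = investigate()
    root_cause = diagnose(evidence)
    if root_cause is None:
        return {"action": "none"}
    fix = f"revert {evidence['recent_deploy']}"  # minimal targeted fix
    validated = True                             # confirm metrics improve (stubbed)
    # Document: store learnings for future incidents.
    return {"cause": root_cause, "fix": fix, "validated": validated}

print(heal())
```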
## Tool Usage Priority

- **Observability Tools** - Query Prometheus, Loki, Grafana for metrics and logs
- **Kubernetes Tools** - Check pod status, events, deployments
- **ArgoCD Tools** - Verify GitOps sync status
- **Memory Search** - Look for similar past incidents
- **Code Fix** - Implement minimal targeted fix
## Observability Queries

### Prometheus Metrics

Error rate:

```promql
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
```
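This expression divides the per-second rate of 5xx responses by the total request rate over a 5-minute window. As a rough sketch of the same arithmetic, with hypothetical counter samples:

```python
# PromQL rate() is approximately (counter_now - counter_then) / window.

def rate(now, five_min_ago, window_s=300):
    return (now - five_min_ago) / window_s

# Hypothetical samples of http_requests_total at two instants:
errors_rate = rate(now=1_050, five_min_ago=1_020)   # status=~"5.."
total_rate = rate(now=21_000, five_min_ago=20_000)  # all statuses

error_ratio = errors_rate / total_rate
print(round(error_ratio, 3))  # -> 0.03, i.e. 3% of requests are failing
```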
Latency P99:

```promql
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
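`histogram_quantile` estimates the quantile by linear interpolation inside the first bucket whose cumulative count reaches the target rank. A simplified sketch (it omits Prometheus's special handling of the `+Inf` bucket; the bucket counts are hypothetical):

```python
# Simplified histogram_quantile over cumulative buckets.
# buckets: sorted list of (upper_bound_seconds, cumulative_count).

def histogram_quantile(q, buckets):
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            # Linearly interpolate within this bucket.
            return lower_bound + (upper_bound - lower_bound) * (
                (rank - lower_count) / (count - lower_count)
            )
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]

buckets = [(0.1, 900), (0.5, 990), (1.0, 999), (2.5, 1000)]
p99 = histogram_quantile(0.99, buckets)
print(p99)  # -> 0.5 (the 990th of 1000 observations falls at the 0.5s bound)
```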
CPU usage:

```promql
sum(rate(container_cpu_usage_seconds_total{pod=~"app-.*"}[5m])) by (pod)
```

Memory usage:

```promql
container_memory_working_set_bytes{pod=~"app-.*"}
```
### Loki Log Queries

Errors in last hour:

```logql
{namespace="production", pod=~"app-.*"} |= "error" | json | level="error"
```

Stack traces (LogQL has no `or` between line filters; use a regex filter instead):

```logql
{namespace="production"} |~ "panic|stack trace"
```

Slow requests:

```logql
{namespace="production"} | json | latency_ms > 1000
```
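The `| json` stage parses each log line's JSON fields into labels that the trailing comparison can filter on. A rough Python equivalent of the slow-request query, over hypothetical log lines:

```python
import json

# Hypothetical raw log lines as Loki would stream them.
log_lines = [
    '{"path": "/api/users", "latency_ms": 2300, "status": 200}',
    '{"path": "/healthz", "latency_ms": 4, "status": 200}',
    '{"path": "/api/orders", "latency_ms": 1500, "status": 200}',
]

# Equivalent of: | json | latency_ms > 1000
entries = [json.loads(line) for line in log_lines]
slow = [entry for entry in entries if entry["latency_ms"] > 1000]
print([entry["path"] for entry in slow])  # -> ['/api/users', '/api/orders']
```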
## Kubernetes Diagnostics

Pod status and events:

```bash
kubectl get pods -n production -l app=myapp
kubectl describe pod <pod-name> -n production
kubectl get events -n production --sort-by='.lastTimestamp'
```

Logs:

```bash
kubectl logs -n production -l app=myapp --tail=100
kubectl logs -n production <pod-name> --previous  # Previous container
```

Resource usage:

```bash
kubectl top pods -n production
kubectl top nodes
```

Deployment status:

```bash
kubectl rollout status deployment/myapp -n production
kubectl rollout history deployment/myapp -n production
```
## ArgoCD Status

Application status:

```bash
argocd app get myapp
argocd app diff myapp
```

Preview a sync without applying it:

```bash
argocd app sync myapp --dry-run
```

Rollback:

```bash
argocd app rollback myapp <revision>
```
## Common Issues and Solutions

### High Error Rate

- Check recent deployments
- Review error logs for patterns
- Check dependency health
- Verify configuration changes
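"Review error logs for patterns" usually means grouping errors by message and looking at the dominant ones first. A minimal sketch over hypothetical log lines:

```python
from collections import Counter

# Hypothetical error lines pulled from Loki or kubectl logs.
errors = [
    "error: connection refused to db:5432",
    "error: timeout calling payments-service",
    "error: connection refused to db:5432",
    "error: connection refused to db:5432",
]

# Group by message; the most frequent pattern points at the likely culprit.
patterns = Counter(errors).most_common()
print(patterns[0])  # -> ('error: connection refused to db:5432', 3)
```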
### High Latency

- Check database query performance
- Review external service latency
- Check resource constraints (CPU/memory)
- Look for lock contention
### OOMKilled Pods

- Increase memory limits
- Check for memory leaks
- Review recent code changes
- Consider horizontal scaling
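Memory limits are raised on the container spec. An illustrative fragment - the values here are assumptions; size them from the observed `container_memory_working_set_bytes`, not by guessing:

```yaml
# Illustrative values only; derive them from observed working-set usage.
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"
```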
### CrashLoopBackOff

- Check logs for startup errors
- Verify secrets and configs exist
- Check health check endpoints
- Review recent deployments
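Health check endpoints are configured as container probes; a wrong path or an initial delay shorter than the app's startup time will crash-loop an otherwise healthy pod. An illustrative probe fragment (path, port, and timings are assumptions):

```yaml
# Illustrative probe config; path, port, and delays are assumptions.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15   # must exceed application startup time
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
```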
### ImagePullBackOff

- Verify image exists in registry
- Check image pull secrets
- Verify image tag is correct
- Check registry connectivity
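Pull secrets are referenced from the pod spec. An illustrative fragment - the secret name, registry, and tag are assumptions, not real values:

```yaml
# Illustrative; secret name, registry host, and tag are assumptions.
# The secret is typically created with: kubectl create secret docker-registry ...
spec:
  imagePullSecrets:
    - name: regcred
  containers:
    - name: myapp
      image: registry.example.com/myapp:v1.4.2  # verify this tag exists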
## Healing Guidelines

- **Diagnose first** - Understand the root cause before fixing
- **Minimal changes** - Fix only what's broken
- **Document findings** - Store learnings in memory for future incidents
- **Validate the fix** - Confirm metrics improve after deployment
- **Roll back if needed** - Don't hesitate to roll back if the fix doesn't work
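"Validate the fix" can be made concrete by comparing the error rate before and after the deploy. A sketch with a stubbed metric query - in practice the values would come from the error-rate expression in the Observability Queries section:

```python
# query_error_rate is a stand-in for a real Prometheus HTTP API call.

def query_error_rate(snapshot):
    # Stub: return a pre-captured error-rate sample.
    return snapshot

def validate_fix(before, after, min_improvement=0.5):
    # Require the error rate to drop by at least min_improvement (50% here);
    # otherwise the fix did not work and we should roll back.
    if after <= before * (1 - min_improvement):
        return "fix validated"
    return "rollback"

print(validate_fix(query_error_rate(0.12), query_error_rate(0.01)))  # -> fix validated
```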
## Post-Incident

- Update metrics/alerts if needed
- Document root cause and fix
- Store learnings in memory for similar incidents
- Consider preventive measures
- Update runbooks if applicable