# Incident Response and Remediation

Patterns for diagnosing and fixing production issues.
## Healer Mode Workflow

- **Investigate** - Gather metrics, logs, and system state
- **Diagnose** - Identify root cause before fixing
- **Fix** - Implement minimal targeted fix
- **Validate** - Confirm metrics improve after deployment
- **Document** - Store learnings for future incidents
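The workflow above can be sketched as a simple loop. This is a minimal illustration, not a real implementation - every function here is a hypothetical stand-in for actual tool calls (Prometheus queries, kubectl, memory search):

```python
# Sketch of the healer workflow. All helpers are hypothetical placeholders.

def investigate():
    # Gather metrics, logs, and system state (stubbed sample evidence).
    return {"error_rate": 0.12, "recent_deploy": "v1.4.2"}

def diagnose(evidence):
    # Identify the root cause before fixing; None means nothing to fix.
    if evidence["error_rate"] > 0.05:
        return f"regression in {evidence['recent_deploy']}"
    return None

def heal():
    evidence = investigate()
    root_cause = diagnose(evidence)
    if root_cause is None:
        return {"action": "none"}
    fix = f"revert {evidence['recent_deploy']}"  # minimal targeted fix
    validated = True                             # confirm metrics improve (stubbed)
    # Document: store learnings for future incidents.
    return {"cause": root_cause, "fix": fix, "validated": validated}

print(heal())
```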
## Tool Usage Priority

- **Observability Tools** - Query Prometheus, Loki, Grafana for metrics and logs
- **Kubernetes Tools** - Check pod status, events, deployments
- **ArgoCD Tools** - Verify GitOps sync status
- **Memory Search** - Look for similar past incidents
- **Code Fix** - Implement minimal targeted fix
## Observability Queries

### Prometheus Metrics

Error rate:

```promql
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
```
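This expression divides the per-second rate of 5xx responses by the total request rate over a 5-minute window. As a rough sketch of the same arithmetic, with hypothetical counter samples:

```python
# PromQL rate() is approximately (counter_now - counter_then) / window.

def rate(now, five_min_ago, window_s=300):
    return (now - five_min_ago) / window_s

# Hypothetical samples of http_requests_total at two instants:
errors_rate = rate(now=1_050, five_min_ago=1_020)   # status=~"5.."
total_rate = rate(now=21_000, five_min_ago=20_000)  # all statuses

error_ratio = errors_rate / total_rate
print(round(error_ratio, 3))  # -> 0.03, i.e. 3% of requests are failing
```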
Latency P99:

```promql
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
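`histogram_quantile` estimates the quantile by linear interpolation inside the first bucket whose cumulative count reaches the target rank. A simplified sketch (it omits Prometheus's special handling of the `+Inf` bucket; the bucket counts are hypothetical):

```python
# Simplified histogram_quantile over cumulative buckets.
# buckets: sorted list of (upper_bound_seconds, cumulative_count).

def histogram_quantile(q, buckets):
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            # Linearly interpolate within this bucket.
            return lower_bound + (upper_bound - lower_bound) * (
                (rank - lower_count) / (count - lower_count)
            )
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]

buckets = [(0.1, 900), (0.5, 990), (1.0, 999), (2.5, 1000)]
p99 = histogram_quantile(0.99, buckets)
print(p99)  # -> 0.5 (the 990th of 1000 observations falls at the 0.5s bound)
```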
CPU usage:

```promql
sum(rate(container_cpu_usage_seconds_total{pod=~"app-.*"}[5m])) by (pod)
```

Memory usage:

```promql
container_memory_working_set_bytes{pod=~"app-.*"}
```
### Loki Log Queries

Errors in last hour:

```logql
{namespace="production", pod=~"app-.*"} |= "error" | json | level="error"
```

Stack traces (LogQL has no `or` between line filters; use a regex filter instead):

```logql
{namespace="production"} |~ "panic|stack trace"
```

Slow requests:

```logql
{namespace="production"} | json | latency_ms > 1000
```
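The `| json` stage parses each log line's JSON fields into labels that the trailing comparison can filter on. A rough Python equivalent of the slow-request query, over hypothetical log lines:

```python
import json

# Hypothetical raw log lines as Loki would stream them.
log_lines = [
    '{"path": "/api/users", "latency_ms": 2300, "status": 200}',
    '{"path": "/healthz", "latency_ms": 4, "status": 200}',
    '{"path": "/api/orders", "latency_ms": 1500, "status": 200}',
]

# Equivalent of: | json | latency_ms > 1000
entries = [json.loads(line) for line in log_lines]
slow = [entry for entry in entries if entry["latency_ms"] > 1000]
print([entry["path"] for entry in slow])  # -> ['/api/users', '/api/orders']
```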
## Kubernetes Diagnostics

Pod status and events:

```bash
kubectl get pods -n production -l app=myapp
kubectl describe pod <pod-name> -n production
kubectl get events -n production --sort-by='.lastTimestamp'
```

Logs:

```bash
kubectl logs -n production -l app=myapp --tail=100
kubectl logs -n production <pod-name> --previous  # Previous container
```

Resource usage:

```bash
kubectl top pods -n production
kubectl top nodes
```

Deployment status:

```bash
kubectl rollout status deployment/myapp -n production
kubectl rollout history deployment/myapp -n production
```
## ArgoCD Status

Application status:

```bash
argocd app get myapp
argocd app diff myapp
```

Preview a sync without applying it:

```bash
argocd app sync myapp --dry-run
```

Rollback:

```bash
argocd app rollback myapp <revision>
```
## Common Issues and Solutions

### High Error Rate

- Check recent deployments
- Review error logs for patterns
- Check dependency health
- Verify configuration changes
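"Review error logs for patterns" usually means grouping errors by message and looking at the dominant ones first. A minimal sketch over hypothetical log lines:

```python
from collections import Counter

# Hypothetical error lines pulled from Loki or kubectl logs.
errors = [
    "error: connection refused to db:5432",
    "error: timeout calling payments-service",
    "error: connection refused to db:5432",
    "error: connection refused to db:5432",
]

# Group by message; the most frequent pattern points at the likely culprit.
patterns = Counter(errors).most_common()
print(patterns[0])  # -> ('error: connection refused to db:5432', 3)
```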
### High Latency

- Check database query performance
- Review external service latency
- Check resource constraints (CPU/memory)
- Look for lock contention
### OOMKilled Pods

- Increase memory limits
- Check for memory leaks
- Review recent code changes
- Consider horizontal scaling
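Memory limits are raised on the container spec. An illustrative fragment - the values here are assumptions; size them from the observed `container_memory_working_set_bytes`, not by guessing:

```yaml
# Illustrative values only; derive them from observed working-set usage.
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"
```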
### CrashLoopBackOff

- Check logs for startup errors
- Verify secrets and configs exist
- Check health check endpoints
- Review recent deployments
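Health check endpoints are configured as container probes; a wrong path or an initial delay shorter than the app's startup time will crash-loop an otherwise healthy pod. An illustrative probe fragment (path, port, and timings are assumptions):

```yaml
# Illustrative probe config; path, port, and delays are assumptions.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15   # must exceed application startup time
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
```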
### ImagePullBackOff

- Verify image exists in registry
- Check image pull secrets
- Verify image tag is correct
- Check registry connectivity
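Pull secrets are referenced from the pod spec. An illustrative fragment - the secret name, registry, and tag are assumptions, not real values:

```yaml
# Illustrative; secret name, registry host, and tag are assumptions.
# The secret is typically created with: kubectl create secret docker-registry ...
spec:
  imagePullSecrets:
    - name: regcred
  containers:
    - name: myapp
      image: registry.example.com/myapp:v1.4.2  # verify this tag exists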
## Healing Guidelines

- **Diagnose first** - Understand the root cause before fixing
- **Minimal changes** - Fix only what's broken
- **Document findings** - Store learnings in memory for future incidents
- **Validate the fix** - Confirm metrics improve after deployment
- **Roll back if needed** - Don't hesitate to roll back if the fix doesn't work
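"Validate the fix" can be made concrete by comparing the error rate before and after the deploy. A sketch with a stubbed metric query - in practice the values would come from the error-rate expression in the Observability Queries section:

```python
# query_error_rate is a stand-in for a real Prometheus HTTP API call.

def query_error_rate(snapshot):
    # Stub: return a pre-captured error-rate sample.
    return snapshot

def validate_fix(before, after, min_improvement=0.5):
    # Require the error rate to drop by at least min_improvement (50% here);
    # otherwise the fix did not work and we should roll back.
    if after <= before * (1 - min_improvement):
        return "fix validated"
    return "rollback"

print(validate_fix(query_error_rate(0.12), query_error_rate(0.01)))  # -> fix validated
```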
## Post-Incident

- Update metrics/alerts if needed
- Document root cause and fix
- Store learnings in memory for similar incidents
- Consider preventive measures
- Update runbooks if applicable