# Kubernetes Debugging Expertise

## Golden Rule: Events Before Logs
When debugging Kubernetes issues, ALWAYS check events first:
- `get_pod_events`: shows scheduling, pulling, starting, probes, OOM
- THEN `get_pod_logs`: application-level errors
Events explain most crash/scheduling issues faster than logs.
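For reference, here is roughly what that ordering looks like against the raw Kubernetes API. This is a minimal sketch using the official `kubernetes` Python client with a configured kubeconfig; the pod and namespace names are placeholders, and `get_pod_events`/`get_pod_logs` presumably wrap equivalent calls:

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

ns, pod = "default", "my-app-7d4b9c-x2x1z"  # placeholder names

# Step 1: events cover scheduling, image pulls, container starts, probes, OOM
events = v1.list_namespaced_event(ns, field_selector=f"involvedObject.name={pod}")
for e in events.items:
    print(e.last_timestamp, e.reason, e.message)

# Step 2: only if events don't explain the state, read application logs
print(v1.read_namespaced_pod_log(pod, ns, tail_lines=50))
```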
## Typical Investigation Flow

- `list_pods` → Get overview of pod health in namespace
- `get_pod_events` → Understand WHY pods are in their state
- `get_pod_logs` → Only if events don't explain the issue
- `get_pod_resources` → For performance/resource issues
- `describe_deployment` → Check deployment status and conditions
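Step 1 of this flow can be pictured as a small triage loop: list the pods and surface anything that isn't cleanly Running, together with its waiting reason. A sketch under the same assumptions (Python client, placeholder namespace):

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("default").items:  # placeholder namespace
    statuses = pod.status.container_statuses or []
    # Waiting reasons are the familiar ones: CrashLoopBackOff, ImagePullBackOff, ...
    reasons = [s.state.waiting.reason for s in statuses if s.state.waiting]
    if pod.status.phase != "Running" or reasons:
        print(pod.metadata.name, pod.status.phase, reasons)
```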
## Common Issue Patterns

### CrashLoopBackOff
First check: `get_pod_events`

| Event Reason | Likely Cause | Next Step |
|--------------|--------------|-----------|
| OOMKilled | Memory limit too low or memory leak | Check `get_pod_resources`, increase limits |
| Error | Application crash | Check `get_pod_logs` for stack trace |
| BackOff | Repeated failures | Check logs for startup errors |
Checklist:

- Memory limits vs actual usage
- Recent deployment changes (`get_deployment_history`)
- Missing config/secrets
- Dependency failures (database, external services)
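Much of this checklist can be read off the pod's container statuses directly. A sketch (placeholder names) that confirms the crash loop and pulls the last termination state; exit code 137 is the classic OOMKilled signature:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod("my-app-7d4b9c-x2x1z", "default")  # placeholders
for s in pod.status.container_statuses or []:
    if s.state.waiting and s.state.waiting.reason == "CrashLoopBackOff":
        last = s.last_state.terminated  # why the previous run died
        print(s.name, "restarts:", s.restart_count,
              "last reason:", last.reason if last else None,
              "exit code:", last.exit_code if last else None)
```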
### OOMKilled

- First check: `get_pod_events` (confirms OOMKilled)
- Then: `get_pod_resources` (compare usage to limits)
Common causes:

- Memory limit set too low for workload
- Memory leak (usage increases over time)
- Sudden traffic spike causing memory pressure
- Large request payloads cached in memory
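Comparing usage to limits takes two reads: limits live on the pod spec, live usage comes from the metrics API. A sketch assuming metrics-server is installed and reachable; the quantity parser is deliberately simplified (Ki/Mi/Gi and plain bytes only), and all names are placeholders:

```python
from kubernetes import client, config

config.load_kube_config()
ns, pod = "default", "my-app-7d4b9c-x2x1z"  # placeholders

UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}

def to_bytes(q: str) -> int:
    # Simplified: real quantities also allow decimal suffixes like M and G
    for suffix, factor in UNITS.items():
        if q.endswith(suffix):
            return int(float(q[:-len(suffix)]) * factor)
    return int(q)

spec = client.CoreV1Api().read_namespaced_pod(pod, ns)
limits = {c.name: c.resources.limits.get("memory")
          for c in spec.spec.containers
          if c.resources and c.resources.limits}

usage = client.CustomObjectsApi().get_namespaced_custom_object(
    "metrics.k8s.io", "v1beta1", ns, "pods", pod)
for c in usage["containers"]:
    limit = limits.get(c["name"])
    if limit:
        pct = to_bytes(c["usage"]["memory"]) / to_bytes(limit)
        print(c["name"], f"{pct:.0%} of memory limit")  # near 100% = OOM risk
```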
### ImagePullBackOff

First check: `get_pod_events`

Common causes:

- Wrong image name or tag
- Private registry without imagePullSecrets
- Rate limiting from registry
- Network issues reaching registry
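The exact registry error usually sits in the container's waiting status alongside the event. A sketch (placeholder names) that prints the failing image and the registry's message:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod("my-app-7d4b9c-x2x1z", "default")  # placeholders
for s in pod.status.container_statuses or []:
    w = s.state.waiting
    if w and w.reason in ("ImagePullBackOff", "ErrImagePull"):
        # Message distinguishes a bad tag ("manifest unknown"), auth failures
        # ("pull access denied"), rate limits, and network timeouts
        print(s.image, w.reason, w.message)
```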
### Pending Pods

First check: `get_pod_events`

Look for `FailedScheduling` events with messages indicating:

- Insufficient resources (CPU/memory)
- Unschedulable (cordoned) nodes
- Node affinity/taints excluding every node
- No matching nodes for nodeSelector
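FailedScheduling events can be filtered server-side by reason, so the "why" of a Pending pod is a one-call lookup. A sketch with a placeholder namespace:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

events = v1.list_namespaced_event(
    "default", field_selector="reason=FailedScheduling")  # placeholder ns
for e in events.items:
    # Messages read like: "0/5 nodes are available: 3 Insufficient memory,
    # 2 node(s) had untolerated taint {...}"
    print(e.involved_object.name, e.message)
```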
### Readiness/Liveness Probe Failures

- First check: `describe_pod` (shows probe config)
- Then: `get_pod_events` (probe failure events)
- Then: `get_pod_logs` (why the endpoint isn't responding)
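The probe configuration itself is on the pod spec, so a wrong path, wrong port, or an overly tight timeout can be spotted before touching logs. A sketch (placeholder names, HTTP probes only):

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod("my-app-7d4b9c-x2x1z", "default")  # placeholders
for c in pod.spec.containers:
    for kind, probe in (("readiness", c.readiness_probe),
                        ("liveness", c.liveness_probe)):
        if probe and probe.http_get:
            print(c.name, kind, probe.http_get.path, probe.http_get.port,
                  "timeout:", probe.timeout_seconds,
                  "failureThreshold:", probe.failure_threshold)
```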
### Evicted Pods

First check: `get_pod_events`

Causes:

- Node resource pressure (disk, memory)
- Priority preemption
- Taint-based eviction
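Evicted pods linger as Failed objects whose status names the resource under pressure, so they are easy to enumerate. A sketch with a placeholder namespace:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("default").items:  # placeholder namespace
    if pod.status.phase == "Failed" and pod.status.reason == "Evicted":
        # Message names the trigger, e.g. "The node was low on resource:
        # ephemeral-storage."
        print(pod.metadata.name, pod.status.message)
```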
## Deployment Issues

### Stuck Rollout

- `describe_deployment` → Check replicas (desired vs ready vs available)
- `get_deployment_history` → Compare current vs previous revision
- `get_pod_events` → For pods in the new ReplicaSet
Common causes:

- New pods failing (CrashLoopBackOff)
- Readiness probes failing
- Resource constraints preventing scheduling
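Both the replica comparison and the telltale stuck condition live on the Deployment status. A sketch with placeholder names; a rollout that has given up shows Progressing=False with reason ProgressDeadlineExceeded:

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

dep = apps.read_namespaced_deployment("my-app", "default")  # placeholders
s = dep.status
print("desired:", dep.spec.replicas, "updated:", s.updated_replicas,
      "ready:", s.ready_replicas, "available:", s.available_replicas)

for cond in s.conditions or []:
    print(cond.type, cond.status, cond.reason, cond.message)
```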
### Rollback Decision

Use `get_deployment_history` to see previous working versions.
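Kubernetes keeps each revision as a ReplicaSet annotated with its revision number, which is presumably what `get_deployment_history` reads. A sketch (placeholder names) that lists revisions with their images; the rollback itself is then `kubectl rollout undo deployment/my-app --to-revision=<n>`:

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

dep = apps.read_namespaced_deployment("my-app", "default")  # placeholders
selector = ",".join(f"{k}={v}"
                    for k, v in dep.spec.selector.match_labels.items())

# Each revision survives as a (possibly scaled-to-zero) ReplicaSet
for rs in apps.list_namespaced_replica_set(
        "default", label_selector=selector).items:
    rev = rs.metadata.annotations.get("deployment.kubernetes.io/revision")
    image = rs.spec.template.spec.containers[0].image
    print("revision", rev, image, "replicas:", rs.status.replicas)
```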
## Error Classification

### Non-Retryable (Stop Immediately)

- 401 Unauthorized - Invalid credentials
- 403 Forbidden - No permission
- 404 Not Found - Resource doesn't exist
- `"config_required": true` - Integration not configured
### Retryable (May Retry Once)

- 429 Too Many Requests
- 500/502/503/504 Server errors
- Timeout
- Connection refused
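This classification reduces to a small decision function. A sketch; the helper and its defaults are illustrative rather than part of any tool here:

```python
RETRYABLE_EXCEPTIONS = (TimeoutError, ConnectionRefusedError)

def classify(status_code: int, body: str = "") -> str:
    """Return "stop" or "retry_once" per the classification above."""
    if status_code in (401, 403, 404) or '"config_required": true' in body:
        return "stop"        # non-retryable: fix credentials/config first
    if status_code == 429 or 500 <= status_code <= 504:
        return "retry_once"  # transient rate limit or server error
    return "stop"            # default: don't retry unknown errors blindly
```

A real handler would also honor Retry-After headers and catch the connection-level errors in `RETRYABLE_EXCEPTIONS` around the call itself rather than mapping them to status codes.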
## Resource Investigation Pattern
For memory/CPU issues:
- `get_pod_resources` → See allocation vs usage
- `describe_pod` → See full container spec
- `get_cloudwatch_metrics`/`query_datadog_metrics` → Historical usage
- `detect_anomalies` on historical data → Find when issue started (a rough stand-in sketch follows)
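If no anomaly-detection tool is available, the last step can be approximated with a trailing-window z-score over the historical series. A rough stand-in, assuming evenly spaced samples (e.g. one per minute):

```python
import statistics

def find_anomaly_start(series: list[float], window: int = 10,
                       threshold: float = 3.0) -> int | None:
    """Index where usage first deviates strongly from the trailing window."""
    for i in range(window, len(series)):
        base = series[i - window:i]
        mean, stdev = statistics.mean(base), statistics.pstdev(base)
        if stdev and abs(series[i] - mean) > threshold * stdev:
            return i  # first sample that breaks the recent baseline
    return None
```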