Kubernetes Debugging Guide

A systematic approach to diagnosing and resolving Kubernetes issues. Always start with the basics: check events and logs first.

Common Error Patterns

CrashLoopBackOff

What it means: Container repeatedly crashes and fails to start. Kubernetes restarts it with exponential backoff (10s, 20s, 40s... up to 5 minutes).

Common causes:

Insufficient memory/CPU resources
Missing dependencies in container image
Misconfigured liveness/readiness probes
Application code errors or misconfigurations
Missing environment variables or secrets

Debug steps:

Check pod events and status

kubectl describe pod <pod-name> -n <namespace>

View current and previous container logs

kubectl logs <pod-name> -n <namespace> kubectl logs <pod-name> -n <namespace> --previous

Check resource limits vs actual usage

kubectl top pod <pod-name> -n <namespace>

Solutions:

Tune probe initialDelaySeconds and timeoutSeconds
Increase resource limits if hitting memory/CPU caps
Fix missing dependencies in Dockerfile
Review application startup code for errors

ImagePullBackOff

What it means: Kubernetes cannot pull the container image. Retries with increasing delay (5s, 10s, 20s... up to 5 minutes).

Common causes:

Incorrect image name or tag
Missing registry authentication credentials
Private registry without imagePullSecrets configured
Network connectivity issues to registry
Image does not exist in registry

Debug steps:

Check pod events for specific error

kubectl describe pod <pod-name> -n <namespace>

Verify image name in deployment

kubectl get deployment <name> -n <namespace> -o yaml | grep image:

Check if imagePullSecrets are configured

kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A5 imagePullSecrets

Test pulling image from node (if you have node access)

docker pull <image-name>

Solutions:

Correct image name/tag in deployment spec
Create and attach imagePullSecret for private registries
Verify network access to container registry
Check registry credentials haven't expired

Pending Pods (Scheduling Failures)

What it means: Pod cannot be scheduled to any node.

Common causes:

Insufficient cluster resources (CPU/memory)
Node selectors or affinity rules cannot be satisfied
Taints without matching tolerations
PersistentVolumeClaim not bound
Resource quotas exceeded

Debug steps:

Check why pod is pending

kubectl describe pod <pod-name> -n <namespace>

View cluster events

kubectl get events -n <namespace> --sort-by='.lastTimestamp'

Check node resources

kubectl describe nodes | grep -A5 "Allocated resources" kubectl top nodes

Check PVC status if using persistent storage

kubectl get pvc -n <namespace>

Solutions:

Scale up cluster or reduce resource requests
Adjust nodeSelector/affinity rules
Add tolerations for node taints
Create or fix PersistentVolume bindings
Increase namespace resource quotas

OOMKilled

What it means: Container was forcefully terminated (SIGKILL, exit code 137) for exceeding memory limit.

Common causes:

Memory limit set too low for application
Memory leak in application code
Processing large files or datasets
High concurrency causing memory spikes
JVM/runtime heap misconfiguration

Debug steps:

Check termination reason

kubectl describe pod <pod-name> -n <namespace> | grep -A10 "Last State"

View logs before termination

kubectl logs <pod-name> -n <namespace> --previous

Check memory limits vs usage

kubectl top pod <pod-name> -n <namespace> kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A5 resources:

Solutions:

Increase memory limits in deployment spec
Profile application for memory leaks
Configure application memory settings (e.g., JVM -Xmx)
Implement memory-efficient processing patterns
Add horizontal pod autoscaling for load distribution

Service Not Reachable

What it means: Cannot connect to service from within or outside cluster.

Common causes:

Service selector doesn't match pod labels
Pod not ready (failing readiness probe)
NetworkPolicy blocking traffic
Service port mismatch with container port
Ingress/LoadBalancer misconfiguration

Debug steps:

Check service and endpoints

kubectl get svc <service-name> -n <namespace> kubectl get endpoints <service-name> -n <namespace> kubectl describe svc <service-name> -n <namespace>

Verify pod labels match service selector

kubectl get pods -n <namespace> --show-labels kubectl get svc <service-name> -n <namespace> -o yaml | grep -A5 selector

Test connectivity from within cluster

kubectl run debug --rm -it --image=busybox -- wget -qO- http://<service-name>.<namespace>:port

Check network policies

kubectl get networkpolicy -n <namespace>

Solutions:

Fix service selector to match pod labels
Ensure pods are passing readiness probes
Update NetworkPolicy to allow required traffic
Verify port configurations match

PVC Binding Failures

What it means: PersistentVolumeClaim cannot bind to a PersistentVolume.

Common causes:

No PV available matching PVC requirements
StorageClass not found or misconfigured
Access mode mismatch (RWO vs RWX)
Storage capacity insufficient
Zone/region constraints not met

Debug steps:

Check PVC status

kubectl get pvc -n <namespace> kubectl describe pvc <pvc-name> -n <namespace>

List available PVs

kubectl get pv

Check StorageClass

kubectl get storageclass kubectl describe storageclass <name>

View provisioner events

kubectl get events -n <namespace> --field-selector reason=ProvisioningFailed

Solutions:

Create matching PersistentVolume manually
Fix StorageClass name or create required class
Adjust access mode or capacity requirements
Enable dynamic provisioning if available

RBAC Permission Denied

What it means: Service account lacks required permissions for API operations.

Common causes:

Missing Role or ClusterRole
RoleBinding not created for service account
Wrong namespace for RoleBinding
Insufficient permissions in Role

Debug steps:

Check what service account pod uses

kubectl get pod <pod-name> -n <namespace> -o yaml | grep serviceAccountName

Test permissions

kubectl auth can-i <verb> <resource> --as=system:serviceaccount:<namespace>:<sa-name>

List roles and bindings

kubectl get roles,rolebindings -n <namespace> kubectl get clusterroles,clusterrolebindings | grep <relevant-name>

Describe specific binding

kubectl describe rolebinding <name> -n <namespace>

Solutions:

Create Role/ClusterRole with required permissions
Create RoleBinding/ClusterRoleBinding
Verify binding references correct service account
Use namespace-scoped roles when possible

Debugging Tools Reference

kubectl describe

Get detailed information about any resource including events.

kubectl describe pod <pod-name> -n <namespace> kubectl describe node <node-name> kubectl describe svc <service-name> -n <namespace> kubectl describe deployment <name> -n <namespace>

kubectl logs

View container stdout/stderr logs.

kubectl logs <pod-name> -n <namespace> kubectl logs <pod-name> -n <namespace> -c <container-name> # specific container kubectl logs <pod-name> -n <namespace> --previous # previous instance kubectl logs <pod-name> -n <namespace> -f # follow/stream kubectl logs <pod-name> -n <namespace> --tail=100 # last 100 lines kubectl logs -l app=myapp -n <namespace> # by label selector

kubectl exec

Execute commands inside running container.

kubectl exec -it <pod-name> -n <namespace> -- /bin/sh kubectl exec -it <pod-name> -n <namespace> -- /bin/bash kubectl exec <pod-name> -n <namespace> -- cat /etc/config/app.conf kubectl exec <pod-name> -n <namespace> -c <container> -- env # specific container

kubectl debug

Debug nodes or pods with ephemeral containers (K8s 1.23+).

Debug a running pod with ephemeral container

kubectl debug -it <pod-name> -n <namespace> --image=busybox --target=<container>

Debug a node

kubectl debug node/<node-name> -it --image=busybox

Create debug copy of pod with different image

kubectl debug <pod-name> -it --copy-to=debug-pod --container=app --image=busybox

kubectl get events

View cluster events for troubleshooting.

kubectl get events -n <namespace> kubectl get events -n <namespace> --sort-by='.lastTimestamp' kubectl get events -n <namespace> --field-selector type=Warning kubectl get events -n <namespace> -w # watch for new events kubectl get events -A --sort-by='.lastTimestamp' | tail -20 # cluster-wide recent

kubectl top

View resource usage metrics (requires metrics-server).

kubectl top pods -n <namespace> kubectl top pods -n <namespace> --sort-by=memory kubectl top nodes kubectl top pod <pod-name> -n <namespace> --containers

The Four Phases of Kubernetes Debugging

Phase 1: Gather Information

Start broad, then narrow down. Never assume the cause.

Quick status overview

kubectl get pods,svc,deploy,rs -n <namespace>

Recent events (often reveals the issue immediately)

kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

Describe the problematic resource

kubectl describe <resource-type> <name> -n <namespace>

Phase 2: Check Logs and Metrics

Logs reveal application-level issues; metrics reveal resource issues.

Application logs

kubectl logs <pod-name> -n <namespace> --tail=200 kubectl logs <pod-name> -n <namespace> --previous # if crashed

Resource metrics

kubectl top pod <pod-name> -n <namespace> kubectl top nodes

Phase 3: Interactive Investigation

Get inside the environment when logs aren't enough.

Shell into the container

kubectl exec -it <pod-name> -n <namespace> -- /bin/sh

Use ephemeral debug container

kubectl debug -it <pod-name> -n <namespace> --image=nicolaka/netshoot

Check connectivity from inside

wget -qO- http://service-name:port nslookup service-name curl -v http://endpoint

Phase 4: Validate and Fix

Make changes, verify they work, document the solution.

Apply fix

kubectl apply -f fixed-manifest.yaml

Watch for success

kubectl get pods -n <namespace> -w kubectl get events -n <namespace> -w

Verify health

kubectl logs <pod-name> -n <namespace> -f

Quick Reference Commands

Essential One-Liners

Get all pods with their status across namespaces

kubectl get pods -A -o wide

Find pods not in Running state

kubectl get pods -A --field-selector=status.phase!=Running

Get pod restart counts

kubectl get pods -n <namespace> -o=custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount'

Show pod resource requests and limits

kubectl get pods -n <namespace> -o=custom-columns='NAME:.metadata.name,MEM_REQ:.spec.containers[0].resources.requests.memory,MEM_LIM:.spec.containers[0].resources.limits.memory'

Get events sorted by time (most recent last)

kubectl get events --sort-by='.lastTimestamp' -n <namespace>

Watch pods in real-time

kubectl get pods -n <namespace> -w

Get logs from all pods with label

kubectl logs -l app=myapp -n <namespace> --all-containers=true

Check endpoints for a service

kubectl get endpoints <service-name> -n <namespace>

Test DNS resolution

kubectl run dnstest --rm -it --image=busybox --restart=Never -- nslookup <service-name>.<namespace>

Check RBAC permissions

kubectl auth can-i --list --as=system:serviceaccount:<namespace>:<sa-name>

Debugging Network Issues

Run network debug pod

kubectl run netshoot --rm -it --image=nicolaka/netshoot -- /bin/bash

Test service connectivity

kubectl run curl --rm -it --image=curlimages/curl -- curl -v http://<service>:<port>

Check CoreDNS logs

kubectl logs -n kube-system -l k8s-app=kube-dns

Trace network path

kubectl exec -it <pod> -- traceroute <destination>

Debugging Storage Issues

List all PVCs with status

kubectl get pvc -A

Describe PVC for binding issues

kubectl describe pvc <name> -n <namespace>

Check storage provisioner logs

kubectl logs -n kube-system -l app=<provisioner-name>

Verify mount inside pod

kubectl exec -it <pod> -n <namespace> -- df -h kubectl exec -it <pod> -n <namespace> -- ls -la /path/to/mount

Useful Debug Images

Image Use Case

busybox

Basic shell, networking tools

nicolaka/netshoot

Comprehensive network debugging

curlimages/curl

HTTP testing

alpine

Minimal Linux with package manager

gcr.io/kubernetes-e2e-test-images/jessie-dnsutils

DNS debugging

Prevention Best Practices

Always set resource requests and limits - Prevents noisy neighbor issues and OOMKilled
Configure proper health probes - Liveness, readiness, and startup probes with appropriate delays
Use namespaces - Isolate workloads for easier debugging
Label everything - Makes filtering and selection reliable
Implement monitoring - Prometheus, Grafana, ELK stack for visibility
Validate manifests before deployment - Use kubeval, kube-linter, or similar tools
Use GitOps - Track changes and enable rollback
Document runbooks - For common issues specific to your applications

debug:kubernetes

Safety Notice

Copy this and send it to your AI assistant to learn

Check pod events and status

View current and previous container logs

Check resource limits vs actual usage

Check pod events for specific error

Verify image name in deployment

Check if imagePullSecrets are configured

Test pulling image from node (if you have node access)

Check why pod is pending

View cluster events

Check node resources

Check PVC status if using persistent storage

Check termination reason

View logs before termination

Check memory limits vs usage

Check service and endpoints

Verify pod labels match service selector

Test connectivity from within cluster

Check network policies

Check PVC status

List available PVs

Check StorageClass

View provisioner events

Check what service account pod uses

Test permissions

List roles and bindings

Describe specific binding

Debug a running pod with ephemeral container

Debug a node

Create debug copy of pod with different image

Quick status overview

Recent events (often reveals the issue immediately)

Describe the problematic resource

Application logs

Resource metrics

Shell into the container

Use ephemeral debug container

Check connectivity from inside

Apply fix

Watch for success

Verify health

Get all pods with their status across namespaces

Find pods not in Running state

Get pod restart counts

Show pod resource requests and limits

Get events sorted by time (most recent last)

Watch pods in real-time

Get logs from all pods with label

Check endpoints for a service

Test DNS resolution

Check RBAC permissions

Run network debug pod

Test service connectivity

Check CoreDNS logs

Trace network path

List all PVCs with status

Describe PVC for binding issues

Check storage provisioner logs

Verify mount inside pod

Source Transparency

Related Skills

refactor:flutter

refactor:nestjs

debug:flutter

refactor:spring-boot