Kubernetes Debugging Skill
Overview
Systematic toolkit for debugging Kubernetes clusters, workloads, networking, and storage with a deterministic, safety-first workflow.
Trigger Phrases
Use this skill when requests resemble:
-
"My pod is in CrashLoopBackOff ; help me find the root cause."
-
"Service DNS works in one pod but not another."
-
"Deployment rollout is stuck."
-
"Pods are Pending and not scheduling."
-
"Cluster health looks degraded after a change."
-
"PVC is pending and pods cannot mount storage."
Prerequisites
Run from the skill directory (devops-skills-plugin/skills/k8s-debug ) so relative script paths work as written.
Required
-
kubectl installed and configured.
-
An active cluster context.
-
Read access to namespaces, pods, events, services, and nodes.
Quick preflight:
kubectl config current-context kubectl auth can-i get pods -A kubectl auth can-i get events -A kubectl get ns
Optional but Recommended
-
jq for more precise filtering in ./scripts/cluster_health.sh .
-
Metrics API (metrics-server ) for kubectl top .
-
In-container debug tools (nslookup , getent , curl , wget , ip ) for deep network tests.
Fallback behavior:
-
If optional tools are missing, scripts continue and print warnings with reduced output.
-
If kubectl top is unavailable, continue with kubectl describe and events.
When to Use This Skill
Use this skill for:
-
Pod failures (CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled)
-
Service connectivity or DNS resolution issues
-
Network policy or ingress problems
-
Volume and storage mount failures
-
Deployment rollout issues
-
Cluster health or performance degradation
-
Resource exhaustion (CPU/memory)
-
Configuration problems (ConfigMaps, Secrets, RBAC)
Safety Rules for Disruptive Commands
Default mode is read-only diagnosis first. Only execute disruptive commands after confirming blast radius and rollback.
Commands requiring explicit confirmation:
-
kubectl delete pod ... --force --grace-period=0
-
kubectl drain ...
-
kubectl rollout restart ...
-
kubectl rollout undo ...
-
kubectl debug ... --copy-to=...
Before disruptive actions:
Snapshot current state for rollback and incident notes
kubectl get deploy,rs,pod,svc -n <namespace> -o wide kubectl get pod <pod-name> -n <namespace> -o yaml > before-<pod-name>.yaml kubectl get events -n <namespace> --sort-by='.lastTimestamp' > before-events.txt
Reference Navigation Map
Load only the section needed for the observed symptom.
Symptom / Need Open Start section
You need an end-to-end diagnosis path ./references/troubleshooting_workflow.md
General Debugging Workflow
Pod state is Pending , CrashLoopBackOff , or ImagePullBackOff
./references/troubleshooting_workflow.md
Pod Lifecycle Troubleshooting
Service reachability or DNS failure ./references/troubleshooting_workflow.md
Network Troubleshooting Workflow
Node pressure or performance regression ./references/troubleshooting_workflow.md
Resource and Performance Workflow
PVC / PV / storage class issues ./references/troubleshooting_workflow.md
Storage Troubleshooting Workflow
Quick symptom-to-fix lookup ./references/common_issues.md
matching issue heading
Post-mortem fix options for known issues ./references/common_issues.md
Solutions sections
Scripts Overview
Script Purpose Required args Optional args Output Fallback behavior
./scripts/cluster_health.sh
Cluster-wide health snapshot (nodes, workloads, events, common failure states) None --strict , K8S_REQUEST_TIMEOUT env var Sectioned report to stdout Continues on check failures, tracks them in summary and exit code
./scripts/network_debug.sh
Pod-centric network and DNS diagnostics <pod-name> (<namespace> defaults to default ) --strict , --insecure , K8S_REQUEST_TIMEOUT env var Sectioned report to stdout Uses secure API probe by default; insecure TLS requires explicit --insecure
./scripts/pod_diagnostics.py
Deep pod diagnostics (status, describe, YAML, events, per-container logs, node context) <pod-name>
-n/--namespace , -o/--output
Sectioned report to stdout or file Fails fast on missing access; skips optional metrics/log blocks with clear messages
Script Exit Codes
./scripts/cluster_health.sh and ./scripts/network_debug.sh share the same contract:
-
0 : checks completed with no check failures (warnings allowed unless --strict is set).
-
1 : one or more checks failed, or warnings occurred in --strict mode.
-
2 : blocked preconditions (for example: missing kubectl , no active context, inaccessible namespace/pod).
Deterministic Debugging Workflow
Follow this systematic approach for any Kubernetes issue:
- Preflight and Scope
kubectl config current-context kubectl get ns kubectl auth can-i get pods -n <namespace>
If preflight fails, stop and fix access/context first.
- Identify the Problem Layer
Categorize the issue:
-
Application Layer: Application crashes, errors, bugs
-
Pod Layer: Pod not starting, restarting, or pending
-
Service Layer: Network connectivity, DNS issues
-
Node Layer: Node not ready, resource exhaustion
-
Cluster Layer: Control plane issues, API problems
-
Storage Layer: Volume mount failures, PVC issues
-
Configuration Layer: ConfigMap, Secret, RBAC issues
- Gather Diagnostics with the Right Script
Use the appropriate diagnostic script based on scope:
Pod-Level Diagnostics
Use ./scripts/pod_diagnostics.py for comprehensive pod analysis:
python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace>
This script gathers:
-
Pod status and description
-
Pod events
-
Container logs (current and previous)
-
Resource usage
-
Node information
-
YAML configuration
Output can be saved for analysis:
python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace> -o diagnostics.txt
Cluster-Level Health Check
Use ./scripts/cluster_health.sh for overall cluster diagnostics:
./scripts/cluster_health.sh > cluster-health-$(date +%Y%m%d-%H%M%S).txt
This script checks:
-
Cluster info and version
-
Node status and resources
-
Pods across all namespaces
-
Failed/pending pods
-
Recent events
-
Deployments, services, statefulsets, daemonsets
-
PVCs and PVs
-
Component health
-
Common error states (CrashLoopBackOff, ImagePullBackOff)
Network Diagnostics
Use ./scripts/network_debug.sh for connectivity issues:
./scripts/network_debug.sh <namespace> <pod-name>
or force warning sensitivity / insecure TLS only when explicitly needed:
./scripts/network_debug.sh --strict <namespace> <pod-name> ./scripts/network_debug.sh --insecure <namespace> <pod-name>
This script analyzes:
-
Pod network configuration
-
DNS setup and resolution
-
Service endpoints
-
Network policies
-
Connectivity tests
-
CoreDNS logs
- Follow Issue-Specific Reference Workflow
Based on the identified issue, consult ./references/troubleshooting_workflow.md :
-
Pod Pending: Resource/scheduling workflow
-
CrashLoopBackOff: Application crash workflow
-
ImagePullBackOff: Image pull workflow
-
Service issues: Network connectivity workflow
-
DNS failures: DNS troubleshooting workflow
-
Resource exhaustion: Performance investigation workflow
-
Storage issues: PVC binding workflow
-
Deployment stuck: Rollout workflow
- Apply Targeted Fixes
Refer to ./references/common_issues.md for symptom-specific fixes.
- Verify and Close
Run final verification:
kubectl get pods -n <namespace> -o wide kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20 kubectl rollout status deployment/<name> -n <namespace>
Issue is done when user-visible behavior is healthy and no new critical warning events appear.
Example Flows
Example 1: CrashLoopBackOff in payments Namespace
python3 ./scripts/pod_diagnostics.py payments-api-7c97f95dfb-q9l7k -n payments -o payments-diagnostics.txt kubectl logs payments-api-7c97f95dfb-q9l7k -n payments --previous --tail=100 kubectl get deploy payments-api -n payments -o yaml | grep -A 8 livenessProbe
Then open ./references/common_issues.md and apply the CrashLoopBackOff solutions.
Example 2: Service DNS/Connectivity Failure
./scripts/network_debug.sh checkout checkout-api-75f49c9d8f-z6qtm kubectl get svc checkout-api -n checkout kubectl get endpoints checkout-api -n checkout kubectl get networkpolicies -n checkout
Then follow Service Connectivity Workflow in ./references/troubleshooting_workflow.md .
Essential Manual Commands
Pod Debugging
View pod status
kubectl get pods -n <namespace> -o wide
Detailed pod information
kubectl describe pod <pod-name> -n <namespace>
View logs
kubectl logs <pod-name> -n <namespace> kubectl logs <pod-name> -n <namespace> --previous # Previous container kubectl logs <pod-name> -n <namespace> -c <container> # Specific container
Execute commands in pod
kubectl exec <pod-name> -n <namespace> -it -- /bin/sh
Get pod YAML
kubectl get pod <pod-name> -n <namespace> -o yaml
Service and Network Debugging
Check services
kubectl get svc -n <namespace> kubectl describe svc <service-name> -n <namespace>
Check endpoints
kubectl get endpoints -n <namespace>
Test DNS
kubectl exec <pod-name> -n <namespace> -- nslookup kubernetes.default
View events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
Resource Monitoring
Node resources
kubectl top nodes kubectl describe nodes
Pod resources
kubectl top pods -n <namespace> kubectl top pod <pod-name> -n <namespace> --containers
Emergency Operations
Restart deployment
kubectl rollout restart deployment/<name> -n <namespace>
Rollback deployment
kubectl rollout undo deployment/<name> -n <namespace>
Force delete stuck pod
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0
Drain node (maintenance)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
Cordon node (prevent scheduling)
kubectl cordon <node-name>
Completion Criteria
Troubleshooting session is complete when all are true:
-
Cluster context and namespace are confirmed.
-
Relevant diagnostic script output is captured.
-
Root cause is identified and tied to evidence (events/logs/config/state).
-
Any disruptive action was preceded by snapshot and rollback plan.
-
Fix verification commands show healthy state.
-
Reference path used (./references/troubleshooting_workflow.md or ./references/common_issues.md ) is documented in notes.
Related Tools
Useful additional tools for Kubernetes debugging:
-
kubectl-debug: Advanced debugging plugin
-
stern: Multi-pod log tailing
-
kubectx/kubens: Context and namespace switching
-
k9s: Terminal UI for Kubernetes
-
lens: Desktop IDE for Kubernetes
-
Prometheus/Grafana: Monitoring and alerting
-
Jaeger/Zipkin: Distributed tracing