Kubernetes Troubleshooting
Expert debugging and diagnostics for Kubernetes clusters using kubectl-mcp-server tools.
When to Apply
Use this skill when:
- User mentions: "debug", "troubleshoot", "diagnose", "failing", "crash", "not starting", "broken"
- Pod states: Pending, CrashLoopBackOff, ImagePullBackOff, OOMKilled, Error, Unknown
- Node issues: NotReady, MemoryPressure, DiskPressure, NetworkUnavailable, PIDPressure
- Keywords: "logs", "events", "describe", "why isn't working", "stuck", "not responding"
Priority Rules
| Priority | Rule | Impact | Tools |
|---|---|---|---|
| 1 | Check pod status first | CRITICAL | get_pods, describe_pod |
| 2 | View recent events | CRITICAL | get_events |
| 3 | Inspect logs (including previous) | HIGH | get_pod_logs |
| 4 | Check resource metrics | HIGH | get_pod_metrics |
| 5 | Verify endpoints | MEDIUM | get_endpoints |
| 6 | Review network policies | MEDIUM | get_network_policies |
| 7 | Examine node status | LOW | get_nodes, describe_node |
Quick Reference
| Symptom | First Tool | Next Steps |
|---|---|---|
| Pod Pending | describe_pod | Check events, node capacity, resource requests |
| CrashLoopBackOff | get_pod_logs(previous=True) | Check exit code, resources, liveness probes |
| ImagePullBackOff | describe_pod | Verify image name, registry auth, network |
| OOMKilled | get_pod_metrics | Increase memory limits, check for memory leaks |
| ContainerCreating | describe_pod | Check PVC binding, secrets, configmaps |
| Terminating (stuck) | describe_pod | Check finalizers, PDBs, preStop hooks |
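The Quick Reference above can be encoded as a small lookup so that a script or agent picks the first tool consistently. This is a minimal sketch: `QUICK_REFERENCE` and `first_step` are hypothetical names, the tool names are plain strings copied from the table rather than calls into kubectl-mcp-server, and the fallback for unknown symptoms is an assumption.

```python
from typing import Dict, Tuple

# Symptom -> (first tool, next steps), copied from the Quick Reference table.
QUICK_REFERENCE: Dict[str, Tuple[str, str]] = {
    "Pending": ("describe_pod", "Check events, node capacity, resource requests"),
    "CrashLoopBackOff": ("get_pod_logs(previous=True)", "Check exit code, resources, liveness probes"),
    "ImagePullBackOff": ("describe_pod", "Verify image name, registry auth, network"),
    "OOMKilled": ("get_pod_metrics", "Increase memory limits, check for memory leaks"),
    "ContainerCreating": ("describe_pod", "Check PVC binding, secrets, configmaps"),
    "Terminating": ("describe_pod", "Check finalizers, PDBs, preStop hooks"),
}


def first_step(symptom: str) -> Tuple[str, str]:
    """Return (first tool, next steps) for a symptom.

    Assumption: unknown symptoms fall back to describe_pod, since its
    events and conditions are the starting point for most rows above.
    """
    fallback = ("describe_pod", "Check events and conditions")
    return QUICK_REFERENCE.get(symptom, fallback)
```

For example, `first_step("CrashLoopBackOff")` returns the previous-logs call together with its follow-up checks.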
Diagnostic Workflows
Pod Not Starting
- get_pods(namespace, label_selector) - Get pod status
- describe_pod(name, namespace) - See events and conditions
- get_events(namespace, field_selector="involvedObject.name=<pod>") - Check events
- get_pod_logs(name, namespace, previous=True) - For crash loops
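The four steps above can be sketched as one ordered routine. How the tools are actually invoked depends on the MCP client, so this sketch takes them as a dictionary of callables. The `tools` mapping, `diagnose_pod_not_starting`, and the keyword-argument calling convention are assumptions for illustration, not part of kubectl-mcp-server.

```python
from typing import Any, Callable, Dict

Tool = Callable[..., Any]


def diagnose_pod_not_starting(tools: Dict[str, Tool], name: str, namespace: str) -> Dict[str, Any]:
    """Run the four steps in order and collect raw results keyed by tool name."""
    steps = [
        ("get_pods", {"namespace": namespace}),
        ("describe_pod", {"name": name, "namespace": namespace}),
        ("get_events", {"namespace": namespace,
                        "field_selector": f"involvedObject.name={name}"}),
        # Previous logs only exist if the container has restarted at least once.
        ("get_pod_logs", {"name": name, "namespace": namespace, "previous": True}),
    ]
    report: Dict[str, Any] = {}
    for tool_name, kwargs in steps:
        try:
            report[tool_name] = tools[tool_name](**kwargs)
        except Exception as exc:  # keep going so later steps still run
            report[tool_name] = f"error: {exc}"
    return report
```

Recording a failed step instead of aborting keeps the later results available, which matters when a pod has never started and has no previous logs.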
Common Pod States
| State | Likely Cause | Tools to Use |
|---|---|---|
| Pending | Scheduling issues | describe_pod, get_nodes, get_events |
| ImagePullBackOff | Registry/auth | describe_pod, check image name |
| CrashLoopBackOff | App crash | get_pod_logs(previous=True) |
| OOMKilled | Memory limit | get_pod_metrics, adjust limits |
| ContainerCreating | Volume/network | describe_pod, get_pvc |
Node Issues
- get_nodes() - List nodes and status
- describe_node(name) - See conditions and capacity
- Check: Ready, MemoryPressure, DiskPressure, PIDPressure
- node_logs_tool(name, "kubelet") - Kubelet logs
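The condition check above can be sketched as a small filter. It assumes the node is available as a dictionary in the Kubernetes API JSON shape (`status.conditions`, each entry with `type`, `status`, and `reason`). Whether `describe_node` returns exactly that shape is not confirmed here, and `unhealthy_conditions` is a hypothetical helper.

```python
from typing import Any, Dict, List

# Ready is healthy when "True"; the pressure/availability conditions are healthy when "False".
HEALTHY_STATUS = {
    "Ready": "True",
    "MemoryPressure": "False",
    "DiskPressure": "False",
    "PIDPressure": "False",
    "NetworkUnavailable": "False",
}


def unhealthy_conditions(node: Dict[str, Any]) -> List[str]:
    """List node conditions whose status differs from the healthy value."""
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        expected = HEALTHY_STATUS.get(cond.get("type"))
        # "Unknown" counts as unhealthy for every tracked condition.
        if expected is not None and cond.get("status") != expected:
            problems.append(f"{cond['type']}={cond.get('status')}: {cond.get('reason', '')}")
    return problems
```

An empty result means none of the tracked conditions is raised. It does not rule out problems that only show up in the kubelet logs.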
Deep Debugging Workflows
CrashLoopBackOff Investigation
- get_pod_logs(name, namespace, previous=True) - See why it crashed
- describe_pod(name, namespace) - Check resource limits, probes
- get_pod_metrics(name, namespace) - Current memory/CPU usage (a point-in-time reading, so it may not reflect usage at the moment of the crash)
- If OOM: compare requests/limits to actual usage
- If app error: check logs for stack trace
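The OOM-versus-app-error branch above can be sketched from a container's last termination record. This assumes a dictionary shaped like one entry of `status.containerStatuses` in the Kubernetes Pod API (`lastState.terminated.reason` and `exitCode`). `classify_last_termination` is a hypothetical helper, and the exit-code interpretations are common conventions rather than guarantees.

```python
from typing import Any, Dict


def classify_last_termination(container_status: Dict[str, Any]) -> str:
    """Map a container's last termination to the next diagnostic step."""
    terminated = (container_status.get("lastState") or {}).get("terminated")
    if not terminated:
        return "no previous termination recorded"
    reason = terminated.get("reason", "")
    exit_code = terminated.get("exitCode")
    if reason == "OOMKilled":
        return "OOM: compare requests/limits to actual usage"
    if exit_code == 137:
        # 137 = 128 + 9 (SIGKILL): sent by the kernel OOM killer, or by the
        # runtime once the termination grace period expires.
        return "SIGKILL (exit 137): check memory limits and liveness probes"
    if exit_code == 0:
        return "clean exit (exit 0): the main process finished; check the container command"
    return f"app error (exit code {exit_code}): check logs for stack trace"
```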
Networking Issues
- get_services(namespace) - Verify service exists
- get_endpoints(namespace) - Check endpoint backends
- If endpoints are empty: no pods match the Service selector, or the matching pods are not Ready
- get_network_policies(namespace) - Check traffic rules
- For Cilium: cilium_endpoints_list_tool(), hubble_flows_query_tool()
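When endpoints are empty, the usual first check is whether the Service selector matches any pod labels. The sketch below shows that equality-based matching rule on plain dictionaries. `selector_matches` and `pods_matching` are hypothetical helpers, and the selector and labels are assumed to have been extracted from the get_services and get_pods output.

```python
from typing import Dict, List


def selector_matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """True when every selector key/value pair is present in the pod's labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def pods_matching(selector: Dict[str, str], pods: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the names of pods (name -> labels) selected by a Service selector."""
    if not selector:
        # A Service without a selector gets no automatically managed endpoints.
        return []
    return sorted(name for name, labels in pods.items() if selector_matches(selector, labels))
```

If this returns pod names while the endpoints are still empty, the pods are probably failing their readiness probes rather than being mislabeled.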
Storage Problems
- get_pvc(namespace) - Check PVC status
- describe_pvc(name, namespace) - See binding issues
- get_storage_classes() - Verify provisioner exists
- If Pending: check storage class, access modes
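The Pending check above can be sketched as a few rules over the PVC spec. It assumes the PVC is a dictionary in the Kubernetes API JSON shape (`status.phase`, `spec.storageClassName`, `spec.accessModes`) and that the storage class names were collected from get_storage_classes. `pending_pvc_hints` is a hypothetical helper and covers only the two causes named in the list.

```python
from typing import Any, Dict, List, Set


def pending_pvc_hints(pvc: Dict[str, Any], storage_class_names: Set[str]) -> List[str]:
    """Suggest likely causes for a PVC stuck in Pending; empty list otherwise."""
    if pvc.get("status", {}).get("phase") != "Pending":
        return []
    spec = pvc.get("spec", {})
    hints = []
    storage_class = spec.get("storageClassName")
    if storage_class is None:
        hints.append("no storageClassName set: binding relies on a default StorageClass")
    elif storage_class == "":
        # An empty string disables dynamic provisioning for this claim.
        hints.append("storageClassName is empty: only a pre-created PV without a class can bind")
    elif storage_class not in storage_class_names:
        hints.append(f"storage class '{storage_class}' not found")
    if "ReadWriteMany" in spec.get("accessModes", []):
        hints.append("ReadWriteMany requested: confirm the provisioner supports it")
    return hints
```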
DNS Resolution
- kubectl_exec(pod, namespace, "nslookup kubernetes.default") - Test DNS
- If it fails: check the CoreDNS pods in kube-system
- get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
- get_pod_logs(name="<coredns-pod>", namespace="kube-system") - Repeat for each CoreDNS pod returned above
Multi-Cluster Debugging
All tools accept a context parameter for targeting a specific cluster:

    get_pods(namespace="kube-system", context="production-cluster")
    get_events(namespace="default", context="staging-cluster")
    describe_pod(name="myapp-xyz", namespace="prod", context="prod-east")
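To compare the same check across clusters, the context parameter can be varied in a loop. This is a minimal sketch: `across_contexts` is a hypothetical helper, and the tool is passed in as a callable because the actual invocation mechanism depends on the MCP client in use.

```python
from typing import Any, Callable, Dict, Iterable


def across_contexts(tool: Callable[..., Any], contexts: Iterable[str], **kwargs: Any) -> Dict[str, Any]:
    """Call one tool once per context and collect the results by context name."""
    results: Dict[str, Any] = {}
    for context in contexts:
        try:
            results[context] = tool(context=context, **kwargs)
        except Exception as exc:  # an unreachable cluster should not hide the others
            results[context] = f"error: {exc}"
    return results
```

A call such as `across_contexts(get_pods, ["production-cluster", "staging-cluster"], namespace="kube-system")` would then return one entry per cluster, assuming `get_pods` is bound to a callable in the calling environment.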
Diagnostic Scripts
For comprehensive diagnostics, run the bundled scripts:
- See scripts/diagnose-pod.py for automated pod analysis
- See scripts/health-check.sh for cluster health checks
Decision Tree
See references/DECISION-TREE.md for visual troubleshooting flowcharts.
Common Errors Reference
See references/COMMON-ERRORS.md for error message explanations and fixes.
Related Tools
Core Diagnostics
- get_pods, describe_pod, get_pod_logs, get_pod_metrics
- get_events, get_nodes, describe_node
- get_resource_usage, compare_namespaces
Advanced (Ecosystem)
- Cilium: cilium_endpoints_list_tool, hubble_flows_query_tool
- Istio: istio_proxy_status_tool, istio_analyze_tool
Related Skills
- k8s-diagnostics - Metrics and health checks
- k8s-incident - Emergency runbooks
- k8s-networking - Network troubleshooting