kubernetes-debug

Core Principle: Gateway First, Events Before Logs

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "kubernetes-debug" with this command: npx skills add incidentfox/incidentfox/incidentfox-incidentfox-kubernetes-debug

Kubernetes Debugging

Core Principle: Gateway First, Events Before Logs

ALWAYS start by discovering clusters via the gateway. Do NOT use kubectl directly — this sandbox has no direct k8s API access. All k8s queries go through the k8s-gateway.

Step 1: Discover clusters (MANDATORY first step)

python .claude/skills/infrastructure-kubernetes/scripts/list_clusters.py

Step 2: Use --cluster-id on all scripts

python .claude/skills/infrastructure-kubernetes/scripts/list_namespaces.py --cluster-id <CLUSTER_ID> python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n production --cluster-id <CLUSTER_ID>

NEVER run kubectl directly. NEVER run scripts without --cluster-id . If list_clusters.py returns no clusters, tell the user they need to install the k8s-agent on their cluster first.

Gateway-capable scripts: list_pods, get_events, get_logs, describe_pod, describe_deployment, list_namespaces. Direct-only scripts (not available in SaaS): describe_node, get_resources.

ALWAYS check pod events BEFORE logs. Events explain 80% of issues faster:

  • OOMKilled → Memory limit exceeded

  • ImagePullBackOff → Image not found or auth issue

  • FailedScheduling → No nodes with enough resources

  • CrashLoopBackOff → Container crashing repeatedly

Available Scripts

All scripts are in .claude/skills/infrastructure-kubernetes/scripts/

list_clusters.py - Discover available remote clusters

python .claude/skills/infrastructure-kubernetes/scripts/list_clusters.py python .claude/skills/infrastructure-kubernetes/scripts/list_clusters.py --json

list_pods.py - List pods with status

python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n <namespace> [--label <selector>] [--cluster-id <id>]

Examples:

python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n otel-demo python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n otel-demo --label app.kubernetes.io/name=payment python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n production --cluster-id abc123

get_events.py - Get pod events (USE FIRST!)

python .claude/skills/infrastructure-kubernetes/scripts/get_events.py <pod-name> -n <namespace> [--cluster-id <id>]

Examples:

python .claude/skills/infrastructure-kubernetes/scripts/get_events.py payment-7f8b9c6d5-x2k4m -n otel-demo python .claude/skills/infrastructure-kubernetes/scripts/get_events.py payment-7f8b9c6d5-x2k4m -n production --cluster-id abc123

get_logs.py - Get pod logs

python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py <pod-name> -n <namespace> [--tail N] [--container NAME] [--cluster-id <id>]

Examples:

python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py payment-7f8b9c6d5-x2k4m -n otel-demo --tail 100 python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py payment-7f8b9c6d5-x2k4m -n otel-demo --container payment

describe_pod.py - Detailed pod info

python .claude/skills/infrastructure-kubernetes/scripts/describe_pod.py <pod-name> -n <namespace> [--cluster-id <id>]

describe_deployment.py - Deployment status and rollout history

python .claude/skills/infrastructure-kubernetes/scripts/describe_deployment.py <deployment-name> -n <namespace> [--cluster-id <id>]

Example:

python .claude/skills/infrastructure-kubernetes/scripts/describe_deployment.py payment -n otel-demo

list_namespaces.py - List all namespaces

python .claude/skills/infrastructure-kubernetes/scripts/list_namespaces.py [--cluster-id <id>]

get_resources.py - Resource usage vs limits (direct-only)

python .claude/skills/infrastructure-kubernetes/scripts/get_resources.py <pod-name> -n <namespace>

describe_node.py - Node status, conditions, and resource usage (direct-only)

python .claude/skills/infrastructure-kubernetes/scripts/describe_node.py <node-name> python .claude/skills/infrastructure-kubernetes/scripts/describe_node.py --all

Examples:

python .claude/skills/infrastructure-kubernetes/scripts/describe_node.py ip-10-0-1-42.ec2.internal python .claude/skills/infrastructure-kubernetes/scripts/describe_node.py --all --json

Debugging Workflows

Pod Not Starting (Pending/CrashLoopBackOff)

  • list_pods.py

  • Check pod status

  • get_events.py

  • Look for scheduling/pull/crash events

  • describe_pod.py

  • Check conditions and container states

  • get_logs.py

  • Only if events don't explain

Pod Restarting (OOMKilled/Crashes)

  • get_events.py

  • Check for OOMKilled or error events

  • get_resources.py

  • Compare usage vs limits

  • get_logs.py

  • Check for errors before crash

  • describe_pod.py

  • Check restart count and state

Deployment Not Progressing

  • describe_deployment.py

  • Check replica counts and rollout history

  • list_pods.py

  • Find stuck pods

  • get_events.py

  • Check events on stuck pods

Node Resource Issues (High CPU/Memory, FailedScheduling)

  • describe_node.py --all

  • Check all nodes for conditions and resource usage

  • describe_node.py <node>

  • Deep dive into specific node

  • list_pods.py

  • Check if pods are Pending/FailedScheduling

  • get_events.py

  • Look for FailedScheduling with resource reasons

Common Issues & Solutions

Event Reason Meaning Action

OOMKilled Container exceeded memory limit Increase limits or fix memory leak

ImagePullBackOff Can't pull image Check image name, registry auth

CrashLoopBackOff Container keeps crashing Check logs for startup errors

FailedScheduling No node can run pod Check node resources, taints

Unhealthy Liveness probe failed Check probe config, app health

Output Format

When reporting findings, use this structure:

Kubernetes Analysis

Pod: <name> Namespace: <namespace> Status: <phase> (Restarts: N)

Events

  • [timestamp] <reason>: <message>

Issues Found

  1. [Issue description with evidence]

Root Cause Hypothesis

[Based on events and logs]

Recommended Action

[Specific remediation step]

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

investigate

No summary provided by upstream source.

Repository SourceNeeds Review
General

azure-infrastructure

No summary provided by upstream source.

Repository SourceNeeds Review
General

docker-debugging

No summary provided by upstream source.

Repository SourceNeeds Review
General

database-mysql

No summary provided by upstream source.

Repository SourceNeeds Review