gcp-gke-troubleshooting

Systematically diagnoses and resolves common GKE issues including pod failures, networking problems, database connection errors, and Pub/Sub issues. Use when pods are stuck in Pending, CrashLoopBackOff, ImagePullBackOff, experiencing DNS failures, Cloud SQL connection timeouts, or Pub/Sub message processing problems. Provides systematic debugging workflows and solution patterns for Spring Boot applications.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "gcp-gke-troubleshooting" with this command: npx skills add dawiddutoit/custom-claude/dawiddutoit-custom-claude-gcp-gke-troubleshooting

GKE Troubleshooting

Purpose

Systematically diagnose and resolve common GKE issues. This skill provides structured debugging workflows, common causes, and proven solutions for the most frequent problems encountered in production deployments.

When to Use

Use this skill when you need to:

  • Debug pods stuck in Pending, CrashLoopBackOff, or ImagePullBackOff status
  • Troubleshoot networking issues (DNS failures, service connectivity)
  • Fix Cloud SQL connection problems or IAM authentication errors
  • Resolve Pub/Sub message processing issues
  • Investigate resource exhaustion or scheduling failures
  • Debug health probe failures
  • Diagnose application crashes or startup issues

Trigger phrases: "pod not starting", "CrashLoopBackOff", "debug GKE issue", "Cloud SQL connection failed", "Pub/Sub not working", "pod pending"

Table of Contents

Quick Start

Quick diagnostic flow for any pod issue:

# 1. Check pod status
kubectl get pods -n wtr-supplier-charges

# 2. View detailed pod information
kubectl describe pod <pod-name> -n wtr-supplier-charges

# 3. Check logs
kubectl logs <pod-name> -n wtr-supplier-charges

# 4. Check previous logs if crashed
kubectl logs <pod-name> -n wtr-supplier-charges --previous

# 5. Check events for scheduling issues
kubectl get events -n wtr-supplier-charges --sort-by='.lastTimestamp'

# 6. Check resource availability
kubectl top nodes
kubectl top pods -n wtr-supplier-charges

Instructions

Step 1: Identify the Pod Status

Understand what the pod status means:

kubectl get pods -n wtr-supplier-charges -o wide
StatusMeaningAction
RunningPod is executingCheck logs if issues
PendingWaiting to be scheduledCheck events, node resources
CrashLoopBackOffApp crashes repeatedlyCheck logs, configuration
ImagePullBackOffCan't pull imageVerify image, permissions
CompletedPod ran successfully and exitedNormal for batch jobs
ErrorPod exited with errorCheck logs

Step 2: Investigate Based on Status

Pod Status: ImagePullBackOff

Diagnose:

# Get detailed error
kubectl describe pod <pod-name> -n wtr-supplier-charges

# Look for "Failed to pull image" in Events section
# Example: "Failed to pull image ... access denied"

# Check if image exists in registry
gcloud artifacts docker images list \
  europe-west2-docker.pkg.dev/ecp-artifact-registry/wtr-supplier-charges-container-images

Solutions:

  1. Image doesn't exist:
# Verify image tag is correct
kubectl get deployment supplier-charges-hub -n wtr-supplier-charges \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
  1. Missing Artifact Registry permissions:
# Grant Artifact Registry Reader role
gcloud artifacts repositories add-iam-policy-binding \
  wtr-supplier-charges-container-images \
  --location=europe-west2 \
  --member="serviceAccount:app-runtime@project.iam.gserviceaccount.com" \
  --role="roles/artifactregistry.reader"
  1. Private image registry authentication:
# Create image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=europe-west2-docker.pkg.dev \
  --docker-username=_json_key \
  --docker-password="$(cat key.json)" \
  -n wtr-supplier-charges

# Add to deployment
spec:
  imagePullSecrets:
  - name: regcred

Pod Status: CrashLoopBackOff

Diagnose:

# Check current logs
kubectl logs <pod-name> -n wtr-supplier-charges

# Check logs from previous container (if crashed)
kubectl logs <pod-name> -n wtr-supplier-charges --previous

# Check liveness probe configuration
kubectl describe pod <pod-name> -n wtr-supplier-charges | grep -A 10 "Liveness"

Common Causes:

  1. Application exits immediately:
# Check startup logs for Java/Spring Boot errors
kubectl logs <pod-name> -n wtr-supplier-charges | head -50

# Look for: ClassNotFoundException, ConfigurationException, connection errors
  1. Liveness probe fails too early:
# Increase initialDelaySeconds from 20 to 60
kubectl patch deployment supplier-charges-hub -n wtr-supplier-charges \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"supplier-charges-hub-container","livenessProbe":{"initialDelaySeconds":60}}]}}}}'
  1. Out of memory:
# Check memory usage
kubectl top pods <pod-name> -n wtr-supplier-charges

# Increase memory limits
kubectl patch deployment supplier-charges-hub -n wtr-supplier-charges \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"supplier-charges-hub-container","resources":{"limits":{"memory":"4Gi"}}}]}}}}'
  1. Missing environment variables:
# Check what env vars are set
kubectl exec <pod-name> -n wtr-supplier-charges -- env | sort

# Verify ConfigMap/Secret values
kubectl get configmap supplier-charges-hub-config -n wtr-supplier-charges -o yaml
kubectl get secret db-credentials -n wtr-supplier-charges -o yaml

Pod Status: Pending (Unschedulable)

Diagnose:

# Check events for scheduling messages
kubectl describe pod <pod-name> -n wtr-supplier-charges

# Look for: "Insufficient memory", "Insufficient cpu", "PersistentVolumeClaim"

# Check node capacity
kubectl top nodes
kubectl describe nodes

Solutions:

  1. Insufficient cluster resources:
# Scale deployment down
kubectl scale deployment supplier-charges-hub --replicas=1 -n wtr-supplier-charges

# Or trigger autoscaling (if available)
# GKE Autopilot automatically provisions capacity
  1. Node affinity/taints preventing scheduling:
# Check node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# View pod's node affinity/tolerations
kubectl get pod <pod-name> -n wtr-supplier-charges -o yaml | grep -A 10 -B 2 "affinity\|toleration"

# Add toleration to deployment if needed
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "compute"
    effect: "NoSchedule"
  1. PersistentVolumeClaim not bound:
# Check PVC status
kubectl get pvc -n wtr-supplier-charges

# If Pending, check storage class
kubectl get storageclass

Step 3: Network and Connectivity Issues

DNS Resolution Failures

Diagnose:

# Test DNS from pod
kubectl exec <pod-name> -n wtr-supplier-charges -- nslookup postgres

# Test connectivity to service
kubectl exec <pod-name> -n wtr-supplier-charges -- curl -v http://postgres:5432

Solutions:

  1. CoreDNS pods not running:
# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Restart CoreDNS if needed
kubectl rollout restart deployment coredns -n kube-system
  1. Service doesn't exist or wrong namespace:
# Verify service exists
kubectl get svc postgres -n wtr-supplier-charges

# Use fully qualified DNS name if in different namespace
service-name.namespace.svc.cluster.local

Service Not Accessible

Diagnose:

# Check service endpoints
kubectl get endpoints supplier-charges-hub -n wtr-supplier-charges

# If empty, no pods match the selector
kubectl get svc supplier-charges-hub -n wtr-supplier-charges -o yaml | grep selector
kubectl get pods -n wtr-supplier-charges --show-labels

Solutions:

  1. Pod labels don't match service selector:
# Add/update labels on deployment
kubectl patch deployment supplier-charges-hub -n wtr-supplier-charges \
  -p '{"spec":{"template":{"metadata":{"labels":{"app":"supplier-charges-hub"}}}}}'
  1. Pods not in Ready state:
# Check readiness probe
kubectl describe pod <pod-name> -n wtr-supplier-charges | grep -A 10 "Readiness"

# Check health endpoint
kubectl exec <pod-name> -n wtr-supplier-charges -- \
  curl localhost:8080/actuator/health/readiness

Step 4: Database Connection Issues

Diagnose:

# Test connectivity to Cloud SQL Proxy
kubectl exec <pod-name> -n wtr-supplier-charges -- nc -zv localhost 5432

# Check Cloud SQL Proxy logs
kubectl logs <pod-name> -c cloud-sql-proxy -n wtr-supplier-charges

# Check application startup logs for DB connection errors
kubectl logs <pod-name> -c supplier-charges-hub-container -n wtr-supplier-charges | grep -i "database\|connection"

Solutions:

  1. IAM Authentication fails:
# Verify Workload Identity binding
kubectl get sa app-runtime -n wtr-supplier-charges -o yaml | grep iam.gke.io

# Grant cloudsql.client role
gcloud projects add-iam-policy-binding project-id \
  --member="serviceAccount:app-runtime@project.iam.gserviceaccount.com" \
  --role="roles/cloudsql.client"

# Check service account email format (must be {name}@{project}.iam)
  1. Wrong connection string:
# Verify DB_CONNECTION_NAME format: project:region:instance
kubectl get configmap db-config -n wtr-supplier-charges -o yaml

# Should be something like: ecp-wtr-supplier-charges-labs:europe-west2:supplier-charges-hub
  1. Cloud SQL Proxy not running:
# Check sidecar logs
kubectl logs <pod-name> -c cloud-sql-proxy -n wtr-supplier-charges

# Check sidecar resources
kubectl describe pod <pod-name> -n wtr-supplier-charges | grep -A 15 "cloud-sql-proxy"

Step 5: Pub/Sub Issues

Diagnose:

# Check subscription backlog
gcloud pubsub subscriptions describe supplier-charges-incoming-sub \
  --project=ecp-wtr-supplier-charges-labs

# Check application Pub/Sub logs
kubectl logs <pod-name> -c supplier-charges-hub-container \
  -n wtr-supplier-charges | grep -i "pubsub\|subscription"

# Test pub/sub connectivity from pod
kubectl exec <pod-name> -n wtr-supplier-charges -- \
  gcloud pubsub topics list --project=ecp-wtr-supplier-charges-labs

Solutions:

  1. Missing Pub/Sub permissions:
# Grant Pub/Sub roles
gcloud projects add-iam-policy-binding project-id \
  --member="serviceAccount:app-runtime@project.iam.gserviceaccount.com" \
  --role="roles/pubsub.subscriber"

gcloud projects add-iam-policy-binding project-id \
  --member="serviceAccount:app-runtime@project.iam.gserviceaccount.com" \
  --role="roles/pubsub.publisher"
  1. High subscription backlog (messages not being consumed):
# Check if pod is running
kubectl get pods -n wtr-supplier-charges

# Check application logs for processing errors
kubectl logs -f <pod-name> -c supplier-charges-hub-container \
  -n wtr-supplier-charges | grep -i "error\|exception"

# Increase message processing timeout
# In application.yaml:
# spring.cloud.gcp.pubsub.subscriber.max-ack-extension-period: 600
  1. Message processing failures:
# Check for poison messages (causing repeated failures)
# Review DLQ (Dead Letter Queue) if configured

# Implement retry logic with exponential backoff
# See Spring Cloud GCP documentation for retry configuration

Examples

See examples/examples.md for comprehensive examples including:

  • Complete troubleshooting workflow
  • Database connectivity debugging
  • Pub/Sub debugging

Requirements

  • kubectl access to the cluster
  • gcloud CLI configured
  • Permissions to view pod logs and describe resources
  • For database debugging: access to view Cloud SQL configuration
  • For Pub/Sub debugging: access to view subscription details

See Also

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Automation

chrome-browser-automation

No summary provided by upstream source.

Repository SourceNeeds Review
Automation

manage-agents

No summary provided by upstream source.

Repository SourceNeeds Review
General

playwright-web-scraper

No summary provided by upstream source.

Repository SourceNeeds Review
Security

java-best-practices-security-audit

No summary provided by upstream source.

Repository SourceNeeds Review