# Production Troubleshooting

## Overview

Diagnose performance issues and errors in production/test environments using systematic investigation workflows with Sentry, kubectl, and Helm configuration analysis.

## When to Use This Skill

Use this skill when:
- User reports performance issues on test/production (not localhost)
- Need to investigate slow queries or high latency
- Debugging pod crashes or resource throttling
- Analyzing Sentry traces for errors
- Checking Kubernetes resource limits and configurations
## Investigation Workflow

Follow these steps in order when troubleshooting production issues:

### Step 1: Check Sentry Traces

Start with Sentry to identify slow queries and external API latency patterns.

Using Sentry MCP:
- Search for traces related to the reported issue
- Look for slow database queries (>500ms)
- Check external API call latency
- Identify error patterns and stack traces
What to look for:

- Database query times exceeding 500ms
- External API calls with high latency
- Repeated error patterns
- Performance degradation trends
### Step 2: Review Application Logs

Examine kubectl logs for timing information and error patterns.

Using agent-tools-k8s:

```shell
agent-tools-k8s logs --pod <pod-name> --env <env> --tail 200
```
Key log patterns to search for:

- `[Server]`: server startup and initialization timing
- `[SSR]`: server-side rendering timing
- `[tRPC]`: tRPC query execution timing
- `[DB Pool]`: database connection pool status
- `ERROR` or `WARN`: application errors and warnings
Common issues:

- Sequential API calls instead of parallel (`Promise.all`)
- Long DB connection acquisition times
- Slow SSR rendering
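A quick way to triage a saved log dump for the patterns above is a grep filter that keeps only pool activity and error lines. The sample log below is invented for illustration; real log line formats may differ.

```shell
# Save a log dump first, e.g.:
#   agent-tools-k8s logs --pod <pod> --env <env> --tail 200 > /tmp/app.log
# The sample below stands in for real output; the line format is assumed.
cat > /tmp/app.log <<'EOF'
[Server] startup complete in 1240ms
[SSR] rendered /dashboard in 2310ms
[tRPC] user.profile took 180ms
[DB Pool] connection acquired in 950ms
ERROR failed to fetch feature flags
EOF

# Keep only DB pool activity and error/warning lines
grep -E '^\[DB Pool\]|^(ERROR|WARN)' /tmp/app.log
```

Swapping the pattern for `^\[SSR\]` or `^\[tRPC\]` narrows the same dump to rendering or query timing lines.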
### Step 3: Check Pod Resource Usage

Verify CPU and memory usage to detect throttling.

Using agent-tools-k8s:

```shell
agent-tools-k8s top --env <env>
```
Warning signs:

- CPU usage >70% indicates potential throttling
- Memory usage >80% indicates potential OOM issues
- Consistently high utilization suggests under-provisioning
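To spot pods crossing these thresholds in a captured usage snapshot, a small awk filter works. The column layout below (name, CPU%, memory%) is an assumption for illustration; adjust the field numbers to match the actual `top` output.

```shell
# Flag pods over 70% CPU or 80% memory in a captured usage snapshot.
# Column layout (NAME CPU% MEM%) is assumed for illustration.
cat > /tmp/top.txt <<'EOF'
NAME           CPU%   MEM%
web-app-abc    82     45
web-app-def    35     91
worker-xyz     20     30
EOF

# Skip the header row, then compare fields 2 and 3 against the thresholds
awk 'NR > 1 && ($2 > 70 || $3 > 80) { print $1, "cpu=" $2 "%", "mem=" $3 "%" }' /tmp/top.txt
```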
### Step 4: Review Pod Configuration

Check resource limits and Helm values to identify misconfigurations.

Using kubectl:

```shell
kubectl get pod <pod-name> -n <namespace> -o yaml
```
Key sections to check:

- `resources.limits.cpu` and `resources.limits.memory`
- `resources.requests.cpu` and `resources.requests.memory`
- Environment variable configuration
- Image version and tags
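The full pod YAML is long, so it can help to filter the dump down to just the image and resource fields. The YAML fragment below is invented for illustration, not real output.

```shell
# Extract just the image and resource fields from a pod spec dump.
# This minimal YAML fragment is invented for illustration.
cat > /tmp/pod.yaml <<'EOF'
spec:
  containers:
  - name: web-app
    image: registry.example.com/web-app:1.4.2
    resources:
      requests:
        cpu: 250m
        memory: 512Mi
      limits:
        cpu: "1"
        memory: 1Gi
EOF

grep -E 'image:|cpu:|memory:' /tmp/pod.yaml
```

On a live cluster, `kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'` returns the resources object directly without dumping the full YAML.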
Helm values locations:

- web-app: `/kubernetes/helm/web-app/values.{test,prod}.yaml`

See `references/helm-values-locations.md` for the detailed Helm configuration structure.
## Common Causes & Solutions

### CPU/Memory Throttling

- Symptom: High CPU/memory usage (>70-80%)
- Solution: Increase resource limits in Helm values

### Network Latency

- Symptom: Slow external API calls, DNS resolution delays
- Solution: Check network policies, verify DNS configuration, consider retry logic

### Database Connection Pool Issues

- Symptom: `[DB Pool]` errors, slow connection acquisition
- Solution: Review `idleTimeoutMillis` and pool size configuration

### Sequential API Calls

- Symptom: Multiple API calls taking cumulative time
- Solution: Refactor to use `Promise.all()` for parallel execution
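The payoff of the `Promise.all()` refactor is wall-clock time: N independent calls cost roughly the latency of the slowest one instead of the sum. The shell sketch below demonstrates the same effect with three 1-second `sleep`s standing in for independent API calls; it is an analogy, not the application fix itself.

```shell
# Three independent 1-second "calls" launched concurrently finish in
# ~1s rather than the ~3s a sequential loop would take; this mirrors
# the effect of Promise.all() over independent requests.
start=$(date +%s)
sleep 1 & sleep 1 & sleep 1 &
wait   # join all three background jobs
end=$(date +%s)
echo "wall time: $((end - start))s"
```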
## Resources

### kubectl commands

Common kubectl operations (use via `agent-tools-k8s`):

- `agent-tools-k8s logs --pod <pod> --env <env> --tail 200`: extract and filter pod logs
- `agent-tools-k8s top --env <env>`: show CPU/memory usage for pods
- `agent-tools-k8s describe --resource pod --name <pod> --env <env>`: check resource limits and pod configuration
- `agent-tools-k8s kubectl --env <env> --cmd "get pods"`: raw kubectl for anything else
### references/

- `helm-values-locations.md`: detailed guide to Helm values file structure and locations
- `common-issues.md`: catalog of common production issues and solutions