# Production Troubleshooting

## Overview

Diagnose performance issues and errors in production/test environments using systematic investigation workflows with Sentry, kubectl, and Helm configuration analysis.

## When to Use This Skill

Use this skill when:
- User reports performance issues on test/production (not localhost)
- Need to investigate slow queries or high latency
- Debugging pod crashes or resource throttling
- Analyzing Sentry traces for errors
- Checking Kubernetes resource limits and configurations
## Investigation Workflow

Follow these steps in order when troubleshooting production issues:

### Step 1: Check Sentry Traces

Start with Sentry to identify slow queries and external API latency patterns.

Using Sentry MCP:
- Search for traces related to the reported issue
- Look for slow database queries (>500ms)
- Check external API call latency
- Identify error patterns and stack traces
What to look for:

- Database query times exceeding 500ms
- External API calls with high latency
- Repeated error patterns
- Performance degradation trends
### Step 2: Review Application Logs

Examine kubectl logs for timing information and error patterns.

Using agent-tools-k8s:

```shell
agent-tools-k8s logs --pod <pod-name> --env <env> --tail 200
```
Key log patterns to search for:

- `[Server]`: server startup and initialization timing
- `[SSR]`: server-side rendering timing
- `[tRPC]`: tRPC query execution timing
- `[DB Pool]`: database connection pool status
- `ERROR` or `WARN`: application errors and warnings
Common issues:

- Sequential API calls instead of parallel (`Promise.all`)
- Long DB connection acquisition times
- Slow SSR rendering
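A quick way to triage a saved log dump for the patterns above is a grep filter that keeps only pool activity and error lines. The sample log below is invented for illustration; real log line formats may differ.

```shell
# Save a log dump first, e.g.:
#   agent-tools-k8s logs --pod <pod> --env <env> --tail 200 > /tmp/app.log
# The sample below stands in for real output; the line format is assumed.
cat > /tmp/app.log <<'EOF'
[Server] startup complete in 1240ms
[SSR] rendered /dashboard in 2310ms
[tRPC] user.profile took 180ms
[DB Pool] connection acquired in 950ms
ERROR failed to fetch feature flags
EOF

# Keep only DB pool activity and error/warning lines
grep -E '^\[DB Pool\]|^(ERROR|WARN)' /tmp/app.log
```

Swapping the pattern for `^\[SSR\]` or `^\[tRPC\]` narrows the same dump to rendering or query timing lines.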
### Step 3: Check Pod Resource Usage

Verify CPU and memory usage to detect throttling.

Using agent-tools-k8s:

```shell
agent-tools-k8s top --env <env>
```
Warning signs:

- CPU usage >70% indicates potential throttling
- Memory usage >80% indicates potential OOM issues
- Consistently high utilization suggests under-provisioning
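To spot pods crossing these thresholds in a captured usage snapshot, a small awk filter works. The column layout below (name, CPU%, memory%) is an assumption for illustration; adjust the field numbers to match the actual `top` output.

```shell
# Flag pods over 70% CPU or 80% memory in a captured usage snapshot.
# Column layout (NAME CPU% MEM%) is assumed for illustration.
cat > /tmp/top.txt <<'EOF'
NAME           CPU%   MEM%
web-app-abc    82     45
web-app-def    35     91
worker-xyz     20     30
EOF

# Skip the header row, then compare fields 2 and 3 against the thresholds
awk 'NR > 1 && ($2 > 70 || $3 > 80) { print $1, "cpu=" $2 "%", "mem=" $3 "%" }' /tmp/top.txt
```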
### Step 4: Review Pod Configuration

Check resource limits and Helm values to identify misconfigurations.

Using kubectl:

```shell
kubectl get pod <pod-name> -n <namespace> -o yaml
```
Key sections to check:

- `resources.limits.cpu` and `resources.limits.memory`
- `resources.requests.cpu` and `resources.requests.memory`
- Environment variable configuration
- Image version and tags
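The full pod YAML is long, so it can help to filter the dump down to just the image and resource fields. The YAML fragment below is invented for illustration, not real output.

```shell
# Extract just the image and resource fields from a pod spec dump.
# This minimal YAML fragment is invented for illustration.
cat > /tmp/pod.yaml <<'EOF'
spec:
  containers:
  - name: web-app
    image: registry.example.com/web-app:1.4.2
    resources:
      requests:
        cpu: 250m
        memory: 512Mi
      limits:
        cpu: "1"
        memory: 1Gi
EOF

grep -E 'image:|cpu:|memory:' /tmp/pod.yaml
```

On a live cluster, `kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'` returns the resources object directly without dumping the full YAML.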
Helm values locations:

- web-app: `/kubernetes/helm/web-app/values.{test,prod}.yaml`

See `references/helm-values-locations.md` for the detailed Helm configuration structure.
## Common Causes & Solutions

### CPU/Memory Throttling

- Symptom: High CPU/memory usage (>70-80%)
- Solution: Increase resource limits in Helm values

### Network Latency

- Symptom: Slow external API calls, DNS resolution delays
- Solution: Check network policies, verify DNS configuration, consider retry logic

### Database Connection Pool Issues

- Symptom: `[DB Pool]` errors, slow connection acquisition
- Solution: Review `idleTimeoutMillis` and pool size configuration

### Sequential API Calls

- Symptom: Multiple API calls taking cumulative time
- Solution: Refactor to use `Promise.all()` for parallel execution
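The payoff of the `Promise.all()` refactor is wall-clock time: N independent calls cost roughly the latency of the slowest one instead of the sum. The shell sketch below demonstrates the same effect with three 1-second `sleep`s standing in for independent API calls; it is an analogy, not the application fix itself.

```shell
# Three independent 1-second "calls" launched concurrently finish in
# ~1s rather than the ~3s a sequential loop would take; this mirrors
# the effect of Promise.all() over independent requests.
start=$(date +%s)
sleep 1 & sleep 1 & sleep 1 &
wait   # join all three background jobs
end=$(date +%s)
echo "wall time: $((end - start))s"
```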
## Resources

### kubectl commands

Common kubectl operations (use via `agent-tools-k8s`):

- `agent-tools-k8s logs --pod <pod> --env <env> --tail 200`: extract and filter pod logs
- `agent-tools-k8s top --env <env>`: show CPU/memory usage for pods
- `agent-tools-k8s describe --resource pod --name <pod> --env <env>`: check resource limits and pod configuration
- `agent-tools-k8s kubectl --env <env> --cmd "get pods"`: raw kubectl for anything else
### references/

- `helm-values-locations.md`: detailed guide to Helm values file structure and locations
- `common-issues.md`: catalog of common production issues and solutions