/check-production
Audit production health. Output findings as a structured report.
What This Does
- Query Sentry for unresolved issues
- Check Vercel logs for recent errors
- Test health endpoints
- Check GitHub Actions for CI/CD failures
- Output prioritized findings (P0-P3)
This is a primitive. It only investigates and reports. Use /log-production-issues to create GitHub issues or /triage to fix.
Process
- Sentry Check
Run triage script if available
~/.claude/skills/triage/scripts/check_sentry.sh 2>/dev/null || echo "Sentry check unavailable"
Or spawn Sentry MCP query if configured.
- Vercel Logs Check
Check for recent errors
~/.claude/skills/triage/scripts/check_vercel_logs.sh 2>/dev/null || vercel logs --output json 2>/dev/null | head -50
- Health Endpoints
Test health endpoint
~/.claude/skills/triage/scripts/check_health_endpoints.sh 2>/dev/null || curl -sf "$(grep NEXT_PUBLIC_APP_URL .env.local 2>/dev/null | cut -d= -f2)/api/health" | jq .
- GitHub CI/CD Check
Check for failed workflow runs on default branch
gh run list --branch main --status failure --limit 5 2>/dev/null ||
gh run list --branch master --status failure --limit 5 2>/dev/null
Get details on most recent failure
gh run list --status failure --limit 1 --json databaseId,name,conclusion,createdAt,headBranch 2>/dev/null
Check for stale/stuck workflows
gh run list --status in_progress --json databaseId,name,createdAt 2>/dev/null
What to look for:
- Failed runs on main/master branch (broken CI)
- Failed runs on feature branches blocking PRs
- Stuck/in-progress runs that should have completed
- Patterns in failure types (tests, lint, build, deploy)
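To spot patterns in failure types, it can help to bucket failed workflow or step names before reading them one by one. The following is a minimal sketch, not part of the triage scripts: `classify_failure` and `summarize_failures` are hypothetical helper names, and the keyword lists are illustrative guesses that should be adapted to the project's actual workflow names.

```shell
# Hypothetical helper: bucket a workflow or step name into a failure type.
# The keyword lists are assumptions, not a fixed convention.
classify_failure() {
  local name
  name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$name" in
    *lint*|*eslint*|*prettier*)  echo "lint" ;;
    *test*|*jest*|*playwright*)  echo "tests" ;;
    *deploy*|*release*)          echo "deploy" ;;
    *build*|*type*check*|*tsc*)  echo "build" ;;
    *)                           echo "other" ;;
  esac
}

# Read one name per line on stdin and print a count per failure type,
# most frequent first.
summarize_failures() {
  local line
  while IFS= read -r line; do
    if [ -n "$line" ]; then
      classify_failure "$line"
    fi
  done | sort | uniq -c | sort -rn
}
```

One possible way to feed it: `gh run list --status failure --limit 20 --json name --jq '.[].name' | summarize_failures`.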
- Quick Application Checks
Check for error handling gaps
grep -rE "catch\s*(\([^)]*\))?\s*\{\s*\}" --include="*.ts" --include="*.tsx" src/ app/ 2>/dev/null | head -5
Empty catch blocks = silent failures (this pattern only finds single-line empty blocks)
Output Format
Production Health Check
P0: Critical (Active Production Issues)
- [SENTRY-123] PaymentIntent failed - 23 users affected (Score: 147)
  Location: api/checkout.ts:45
  First seen: 2h ago
P1: High (Degraded Performance / Broken CI)
- Health endpoint slow: /api/health responding in 2.3s (should be <500ms)
- Vercel logs show 5xx errors in last hour (count: 12)
- [CI] Main branch failing: "Build" workflow (run #1234)
  Failed step: "Type check"
  Error: Type 'string' is not assignable to type 'number'
P2: Medium (Warnings)
- 3 empty catch blocks found (silent failures)
- Health endpoint missing database connectivity check
- [CI] 3 feature branch workflows failing (blocking PRs)
P3: Low (Improvements)
- Consider adding Sentry performance monitoring
- Health endpoint could include more service checks
Summary
- P0: 1 | P1: 3 | P2: 3 | P3: 2
- Recommendation: Fix P0 immediately, then fix main branch CI
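The counts in the summary line can be derived from the report body rather than tallied by hand. This is a rough sketch under the assumption that each priority section starts with a line beginning `P0:` through `P3:` and that each finding is a top-level `- ` item; `summarize_report` is a hypothetical name, not an existing script.

```shell
# Hypothetical helper: read a report on stdin and print the summary counts.
# Assumes section headers start with "P0:".."P3:" and findings with "- ".
summarize_report() {
  awk '
    /^P[0-3]:/ { current = substr($0, 1, 2); next }  # enter a priority section
    /^Summary/ { current = "" }                      # stop counting at the summary
    current != "" && /^- / { count[current]++ }      # one top-level finding
    END {
      printf "P0: %d | P1: %d | P2: %d | P3: %d\n",
             count["P0"], count["P1"], count["P2"], count["P3"]
    }
  '
}
```

Indented continuation lines (such as a `Location:` detail under a finding) are ignored because they do not start with `- `.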
Priority Mapping
| Signal | Priority |
|--------|----------|
| Active errors affecting users | P0 |
| 5xx errors, slow responses | P1 |
| Main branch CI/CD failing | P1 |
| Feature branch CI blocking PRs | P2 |
| Silent failures, missing checks | P2 |
| Missing monitoring, improvements | P3 |
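If findings are assembled by a script, the mapping above could be encoded as a small lookup. This is only a sketch: the signal identifiers (`active_user_errors`, `http_5xx`, and so on) are hypothetical names invented for illustration, while the priorities follow the mapping in this section.

```shell
# Hypothetical helper: map a signal identifier to its priority.
# The identifiers are illustrative; only the P0-P3 assignments come from
# the priority mapping in this document.
priority_for() {
  case "$1" in
    active_user_errors)                              echo "P0" ;;
    http_5xx|slow_response|main_ci_failing)          echo "P1" ;;
    feature_ci_failing|silent_failure|missing_check) echo "P2" ;;
    missing_monitoring|improvement)                  echo "P3" ;;
    *) echo "unknown signal: $1" >&2; return 1 ;;
  esac
}
```

Returning a non-zero status for unknown signals keeps an unmapped finding from silently receiving a default priority.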
Health Endpoint Anti-Pattern
Health checks that lie are worse than no health check. Example:
// ❌ BAD: Reports "ok" without checking
return { status: "ok", services: { database: "ok" } };

// ✅ GOOD: Honest liveness probe (no fake service status)
return { status: "ok", timestamp: new Date().toISOString() };

// ✅ BETTER: Real readiness probe
const dbStatus = await checkDatabase() ? "ok" : "error";
return {
  status: dbStatus === "ok" ? "ok" : "degraded",
  services: { database: dbStatus }
};
If you can't verify a service, don't report on it. False "ok" status masks outages.
Analytics Note
This skill checks production health (errors, logs, endpoints), not product analytics.
For analytics auditing, see /check-observability. Note:
- PostHog is REQUIRED for product analytics (has MCP server)
- Vercel Analytics is NOT acceptable (no CLI/API/MCP - unusable for our workflow)
If you need to investigate user behavior or funnels during incident response, query PostHog via MCP.
- E2E Smoke Check
If Playwright is configured in the project:
Run smoke tests against production
PLAYWRIGHT_BASE_URL="$PROD_URL" npx playwright test e2e/smoke.spec.ts --reporter=list 2>&1 | head -30
Critical paths to verify:
- Landing page loads (anonymous)
- Dashboard loads (authenticated): the #1 incident class
- Subscribe page renders
- Session page loads
- No error boundaries triggered on any route
- Post-Deploy Health Check
Verify health endpoint
curl -sf "$PROD_URL/api/health" -w "\nHTTP %{http_code} in %{time_total}s\n" | head -5
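Verify no error boundary on dashboard (check for error text in HTML; a failed request is reported separately so it cannot pass as "OK")
if html=$(curl -sf "$PROD_URL/dashboard" 2>/dev/null); then printf '%s' "$html" | grep -q "Something went wrong" && echo "ERROR BOUNDARY DETECTED" || echo "Dashboard OK"; else echo "DASHBOARD UNREACHABLE"; fi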
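The response time that curl reports for the health endpoint can be turned into a pass/fail signal against the 500ms target used in the sample report. A minimal sketch, assuming `awk` is available; `latency_ok` is a hypothetical helper name, and the 500ms default is taken from the example finding rather than from any enforced limit.

```shell
# Hypothetical helper: exit 0 when a curl time_total value (in seconds)
# is under a threshold given in milliseconds (default 500), else exit 1.
latency_ok() {
  local seconds="$1" threshold_ms="${2:-500}"
  awk -v s="$seconds" -v t="$threshold_ms" 'BEGIN { exit !(s * 1000 < t) }'
}

# Usage sketch (PROD_URL is assumed to be set, as in the commands above):
# elapsed=$(curl -sf -o /dev/null -w '%{time_total}' "$PROD_URL/api/health") \
#   && latency_ok "$elapsed" \
#   || echo "P1: health endpoint slow or failing"
```

Chaining on curl's own exit status means an unreachable endpoint is reported as a problem instead of being treated as fast.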
Related
- /log-production-issues - Create GitHub issues from findings
- /triage - Fix production issues
- /observability - Set up monitoring infrastructure
- /flywheel-qa - Agentic QA for preview deployments