Health Checks Skill
Overview
This skill provides knowledge and procedures for monitoring infrastructure health across GitHub Actions, Railway, Supabase, and Postgres.
Health Check Philosophy
Why Regular Health Checks?
- Proactive Detection: Find issues before users do
- Trend Identification: Spot degradation early
- Capacity Planning: Know when to scale
- Compliance: Maintain system hygiene
- Documentation: Track system state over time
Health Check Frequency
| Check Type | Frequency | When |
|---|---|---|
| Quick | Every deploy | After any deployment |
| Daily | Daily | Morning/start of business |
| Weekly | Weekly | Beginning of week |
| Deep | Monthly | Beginning of month |
| Full Audit | Quarterly | Scheduled maintenance window |
Health Status Framework
Traffic Light System
GREEN: All systems healthy
- No critical issues
- Metrics within normal ranges
- Advisory count: 0
YELLOW: Warning state
- Non-critical issues present
- Metrics approaching limits
- Performance advisories present
RED: Critical state
- Service impaired or unavailable
- Critical metrics exceeded
- Security advisories present
- Immediate action required
Status Determination Rules
| Condition | Status |
|---|---|
| Security advisory exists | RED |
| Service unavailable | RED |
| Error rate > 5% | RED |
| Connection utilization > 85% | RED |
| CI success rate < 75% | RED |
| Performance advisory exists | YELLOW |
| Error rate 1-5% | YELLOW |
| Connection utilization 70-85% | YELLOW |
| CI success rate 75-90% | YELLOW |
| Long-running queries present | YELLOW |
| All metrics normal | GREEN |
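The rules above can be sketched as a single function. This is a minimal illustration, not a definitive implementation: the `Snapshot` container and its field names are hypothetical, and it assumes that any RED condition takes precedence over YELLOW ones, which the table implies but does not state outright.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    GREEN = "GREEN"
    YELLOW = "YELLOW"
    RED = "RED"


@dataclass
class Snapshot:
    """Hypothetical container for one round of collected metrics."""
    security_advisories: int = 0
    performance_advisories: int = 0
    service_available: bool = True
    error_rate_pct: float = 0.0
    connection_utilization_pct: float = 0.0
    ci_success_rate_pct: float = 100.0
    long_running_queries: int = 0


def determine_status(s: Snapshot) -> Status:
    # Assumption: a single RED condition decides the overall status.
    if (
        s.security_advisories > 0
        or not s.service_available
        or s.error_rate_pct > 5
        or s.connection_utilization_pct > 85
        or s.ci_success_rate_pct < 75
    ):
        return Status.RED
    # Otherwise any YELLOW condition downgrades the status from GREEN.
    if (
        s.performance_advisories > 0
        or s.error_rate_pct >= 1
        or s.connection_utilization_pct >= 70
        or s.ci_success_rate_pct <= 90
        or s.long_running_queries > 0
    ):
        return Status.YELLOW
    return Status.GREEN
```

For example, `determine_status(Snapshot(error_rate_pct=3))` yields `Status.YELLOW`, while adding a security advisory to the same snapshot yields `Status.RED`.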
Health Metrics
Key Performance Indicators
| Platform | Metric | Good | Warning | Critical |
|---|---|---|---|---|
| Database | Connection % | <70% | 70-85% | >85% |
| Database | Query Duration | <100ms | 100-500ms | >500ms |
| Database | Dead Rows % | <10% | 10-20% | >20% |
| API | Error Rate | <1% | 1-5% | >5% |
| API | Response Time P95 | <500ms | 500-2000ms | >2000ms |
| CI/CD | Success Rate | >90% | 75-90% | <75% |
| CI/CD | Build Time | <5min | 5-15min | >15min |
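One way to keep these thresholds in a single place is a data-driven classifier. The sketch below is illustrative only: the metric keys and the `classify` helper are hypothetical names, and the bounds are copied from the table, with boundary values (for example exactly 85%) treated as Warning.

```python
# metric key -> (warning_bound, critical_bound, higher_is_worse)
THRESHOLDS = {
    "connection_pct": (70, 85, True),
    "query_duration_ms": (100, 500, True),
    "dead_rows_pct": (10, 20, True),
    "error_rate_pct": (1, 5, True),
    "response_time_p95_ms": (500, 2000, True),
    "ci_success_rate_pct": (90, 75, False),  # lower is worse here
    "build_time_min": (5, 15, True),
}


def classify(metric: str, value: float) -> str:
    """Return "Good", "Warning", or "Critical" for one KPI reading."""
    warning, critical, higher_is_worse = THRESHOLDS[metric]
    if higher_is_worse:
        if value > critical:
            return "Critical"
        return "Warning" if value >= warning else "Good"
    # Inverted scale: small values are the bad ones (success rate).
    if value < critical:
        return "Critical"
    return "Warning" if value <= warning else "Good"
```

Keeping the bounds in a dictionary means a threshold change touches one line rather than several conditionals.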
Platform-Specific Metrics
Supabase
- API error rate
- Auth failure rate
- Storage utilization
- Edge function cold starts
- Realtime connection count
- Advisory count (security/performance)
GitHub Actions
- Workflow success rate
- Average build time
- Queue wait time
- Cache hit rate
- Failed workflow count
Railway
- Service uptime
- Deploy success rate
- Memory utilization
- CPU utilization
- Health check pass rate
Postgres
- Connection utilization
- Query duration distribution
- Lock contention
- Dead tuple ratio
- Index usage efficiency
- Table bloat
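Two of the Postgres metrics above reduce to simple ratios once the raw counts are collected. The sketch below assumes the counts come from the standard statistics views (`pg_stat_user_tables` for tuple counts, `pg_stat_activity` and `max_connections` for connections). The helper names are hypothetical, and defining the dead tuple ratio as dead / (live + dead) is an assumption, since the source does not give a formula.

```python
# Queries that would supply the inputs (shown for reference, not executed here).
TUPLE_COUNTS_SQL = "SELECT relname, n_live_tup, n_dead_tup FROM pg_stat_user_tables"
CONNECTION_COUNT_SQL = "SELECT count(*) FROM pg_stat_activity"
MAX_CONNECTIONS_SQL = "SHOW max_connections"


def dead_row_pct(n_live_tup: int, n_dead_tup: int) -> float:
    """Dead tuples as a percentage of all tuples in a table."""
    total = n_live_tup + n_dead_tup
    if total == 0:
        return 0.0  # an empty table has no dead rows to report
    return 100.0 * n_dead_tup / total


def connection_utilization_pct(current: int, max_connections: int) -> float:
    """Open connections as a percentage of the configured maximum."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return 100.0 * current / max_connections
```

A table with 900 live and 100 dead tuples comes out at 10%, which sits on the Good/Warning boundary of the KPI table.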
Health Check Procedures
Quick Health Check (5 min)
Purpose: Verify basic system functionality
- Check for active incidents (any platform)
- Verify all services responding
- Check for critical advisories
- Review last hour error rate
- Check connection pool status
Daily Health Check (15 min)
Purpose: Assess overall system health
- Run quick health check
- Review 24-hour error trends
- Check CI/CD success rate
- Review all advisories
- Check slow query log
- Verify backups completed
- Review resource utilization
Weekly Health Check (30 min)
Purpose: Comprehensive review and trending
- Run daily health check
- Analyze weekly error patterns
- Review index usage stats
- Check for table bloat
- Review connection patterns
- Assess capacity trends
- Review deployment frequency
- Check certificate expirations
Monthly Deep Check (1+ hours)
Purpose: Full system audit
- Run weekly health check
- Full index analysis
- Query performance review
- Security configuration audit
- Capacity planning review
- Cost analysis
- Documentation review
- Disaster recovery test
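Each procedure above starts by running the one before it, so the full step list for any check can be derived by following that chain. A minimal sketch, assuming a hypothetical `PROCEDURES` mapping and `expand` helper, with step text copied from the lists above:

```python
PROCEDURES = {
    "quick": {"includes": None, "steps": [
        "Check for active incidents (any platform)",
        "Verify all services responding",
        "Check for critical advisories",
        "Review last hour error rate",
        "Check connection pool status",
    ]},
    "daily": {"includes": "quick", "steps": [
        "Review 24-hour error trends",
        "Check CI/CD success rate",
        "Review all advisories",
        "Check slow query log",
        "Verify backups completed",
        "Review resource utilization",
    ]},
    "weekly": {"includes": "daily", "steps": [
        "Analyze weekly error patterns",
        "Review index usage stats",
        "Check for table bloat",
        "Review connection patterns",
        "Assess capacity trends",
        "Review deployment frequency",
        "Check certificate expirations",
    ]},
    "monthly": {"includes": "weekly", "steps": [
        "Full index analysis",
        "Query performance review",
        "Security configuration audit",
        "Capacity planning review",
        "Cost analysis",
        "Documentation review",
        "Disaster recovery test",
    ]},
}


def expand(check_type: str) -> list:
    """Return every step for a check, inherited steps first."""
    proc = PROCEDURES[check_type]
    inherited = expand(proc["includes"]) if proc["includes"] else []
    return inherited + proc["steps"]
```

Calling `expand("monthly")` returns all 25 steps in execution order, beginning with the quick check's incident review.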
Alert Thresholds
Immediate Alerts (Page)
- Service unavailable > 1 minute
- Error rate > 10%
- Database connections > 90%
- Security advisory created
- Deployment failure (production)
- Health check failure > 5 minutes
Warning Alerts (Slack/Email)
- Error rate > 2%
- Database connections > 75%
- Performance advisory created
- Build time increase > 50%
- Response time P95 > 1s
- Disk usage > 80%
Info Alerts (Daily Digest)
- New advisory (any type)
- Build time change
- Resource trend change
- Configuration change
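The alert tiers can be read as a routing decision: check the page-level conditions first, then the warning-level ones. The sketch below covers only a subset of the listed conditions. The function names and the channel labels `"page"` and `"warning"` are hypothetical, and the build time increase is assumed to be measured against a caller-supplied baseline.

```python
from typing import Optional


def build_time_increase_pct(baseline_min: float, current_min: float) -> float:
    """Relative build time change versus a baseline, in percent."""
    if baseline_min <= 0:
        raise ValueError("baseline_min must be positive")
    return 100.0 * (current_min - baseline_min) / baseline_min


def route_alert(
    error_rate_pct: float,
    db_connection_pct: float,
    unavailable_seconds: float = 0.0,
    build_increase_pct: float = 0.0,
) -> Optional[str]:
    """Return "page", "warning", or None for one set of readings."""
    # Immediate alerts: these conditions page someone.
    if unavailable_seconds > 60 or error_rate_pct > 10 or db_connection_pct > 90:
        return "page"
    # Warning alerts: these go to Slack/Email.
    if error_rate_pct > 2 or db_connection_pct > 75 or build_increase_pct > 50:
        return "warning"
    return None
```

For instance, a build that went from 4 to 7 minutes is a 75% increase, so `route_alert(0.5, 40, build_increase_pct=build_time_increase_pct(4, 7))` routes to `"warning"`.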
Health Report Template
Infrastructure Health Report
Generated: {TIMESTAMP}
Report Type: {Quick | Daily | Weekly | Monthly}
Overall Status: {GREEN | YELLOW | RED}
Executive Summary
{2-3 sentence overview}
Platform Status
| Platform | Status | Issues | Warnings |
|---|---|---|---|
| GitHub Actions | {STATUS} | {N} | {N} |
| Railway | {STATUS} | {N} | {N} |
| Supabase | {STATUS} | {N} | {N} |
| Postgres | {STATUS} | {N} | {N} |
Key Metrics
Database
- Connections: {N}/{MAX} ({PCT}%)
- Query P95: {MS}ms
- Dead Rows: {PCT}%
API
- Error Rate: {PCT}%
- Response Time P95: {MS}ms
CI/CD
- Success Rate: {PCT}%
- Avg Build Time: {MIN}m
Advisories
Security
{List or "None"}
Performance
{List or "None"}
Issues Requiring Attention
Immediate
{List or "None"}
This Week
{List or "None"}
Trends
{Notable changes from previous period}
Recommendations
{Specific actions to improve health}
Next health check: {TIMESTAMP}
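Parts of this template can be filled in programmatically. The following sketch renders the report header and the Platform Status table from per-platform results. The function names are hypothetical, and deriving the overall status as the worst platform status is an assumption, not something the template specifies.

```python
from typing import Dict, Tuple

# platform name -> (status, issue count, warning count)
PlatformResults = Dict[str, Tuple[str, int, int]]

_SEVERITY = {"GREEN": 0, "YELLOW": 1, "RED": 2}


def overall_status(platforms: PlatformResults) -> str:
    # Assumption: the report is as unhealthy as its worst platform.
    return max((status for status, _, _ in platforms.values()),
               key=_SEVERITY.__getitem__)


def render_platform_table(platforms: PlatformResults) -> str:
    lines = ["| Platform | Status | Issues | Warnings |", "|---|---|---|---|"]
    for name, (status, issues, warnings) in platforms.items():
        lines.append(f"| {name} | {status} | {issues} | {warnings} |")
    return "\n".join(lines)


def render_header(report_type: str, platforms: PlatformResults,
                  timestamp: str) -> str:
    return "\n".join([
        f"Generated: {timestamp}",
        f"Report Type: {report_type}",
        f"Overall Status: {overall_status(platforms)}",
    ])
```

The timestamp is passed in as a string so the caller controls its format and time zone.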
Remediation Playbooks
High Connection Utilization
- Check for connection leaks
- Identify idle connections
- Review connection pool settings
- Consider connection pooler (PgBouncer/Supavisor)
- Optimize application connection handling
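The "identify idle connections" step can be sketched as a filter over rows shaped like `pg_stat_activity` (using its `pid`, `state`, and `state_change` columns). The helper name and the 10-minute cutoff are assumptions for illustration. Choose a cutoff that matches your pool settings.

```python
from datetime import datetime, timedelta, timezone

IDLE_STATES = ("idle", "idle in transaction")


def find_idle_connections(rows, now, max_idle=timedelta(minutes=10)):
    """Return pids of connections idle for longer than max_idle.

    rows: iterable of dicts with "pid", "state", and "state_change" keys,
    for example fetched from pg_stat_activity.
    """
    return [
        row["pid"]
        for row in rows
        if row["state"] in IDLE_STATES and now - row["state_change"] > max_idle
    ]


# Usage with hand-written sample rows (not real data):
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
sample = [
    {"pid": 1, "state": "idle", "state_change": now - timedelta(minutes=30)},
    {"pid": 2, "state": "active", "state_change": now - timedelta(hours=1)},
    {"pid": 3, "state": "idle in transaction",
     "state_change": now - timedelta(minutes=1)},
]
stale = find_idle_connections(sample, now)
```

Only the first sample row is both idle and older than the cutoff, so `stale` contains its pid alone.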
High Error Rate
- Identify error types
- Check recent deployments
- Review affected endpoints
- Check downstream dependencies
- Roll back if deployment-related
Slow Queries
- Identify slow queries (pg_stat_statements)
- Run EXPLAIN ANALYZE
- Check for missing indexes
- Review query patterns
- Consider query optimization or caching
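The first step of this playbook can be sketched as a query against `pg_stat_statements`. This assumes the extension is installed, a PostgreSQL version that exposes `mean_exec_time` (13 or later), and a DB-API cursor that uses `%s` placeholders, as psycopg does. The function name and the 100 ms default, taken from the Warning bound in the KPI table, are illustrative choices.

```python
SLOW_QUERY_SQL = """
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
WHERE mean_exec_time >= %s
ORDER BY mean_exec_time DESC
LIMIT %s
"""


def top_slow_queries(cursor, limit=10, min_mean_ms=100.0):
    """Return (query, calls, mean_exec_time) rows, slowest first.

    cursor: any DB-API cursor connected to the target database.
    """
    cursor.execute(SLOW_QUERY_SQL, (min_mean_ms, limit))
    return cursor.fetchall()
```

Each returned query is then a candidate for the `EXPLAIN ANALYZE` step that follows.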
Build Failures
- Review failure logs
- Check for flaky tests
- Verify dependencies available
- Check for environment issues
- Review recent changes
See checklists.md for detailed health check checklists.