Health Checks Skill

Overview

This skill provides knowledge and procedures for monitoring infrastructure health across GitHub Actions, Railway, Supabase, and Postgres.

Health Check Philosophy

Why Regular Health Checks?

Proactive Detection: Find issues before users do
Trend Identification: Spot degradation early
Capacity Planning: Know when to scale
Compliance: Maintain system hygiene
Documentation: Track system state over time

Health Check Frequency

Check Type	Frequency	When
Quick	Every deploy	After any deployment
Daily	Daily	Morning/start of business
Weekly	Weekly	Beginning of week
Deep	Monthly	Beginning of month
Full Audit	Quarterly	Scheduled maintenance window

Health Status Framework

Traffic Light System

GREEN: All systems healthy

No critical issues
Metrics within normal ranges
Advisory count: 0

YELLOW: Warning state

Non-critical issues present
Metrics approaching limits
Performance advisories present

RED: Critical state

Service impaired or unavailable
Critical metrics exceeded
Security advisories present
Immediate action required

Status Determination Rules

Condition	Status
Security advisory exists	RED
Service unavailable	RED
Error rate > 5%	RED
Connection utilization > 85%	RED
CI success rate < 75%	RED
Performance advisory exists	YELLOW
Error rate 1-5%	YELLOW
Connection utilization 70-85%	YELLOW
CI success rate 75-90%	YELLOW
Long-running queries present	YELLOW
All metrics normal	GREEN
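The rules above can be sketched as a single function. This is a minimal illustration, not a definitive implementation: the `Snapshot` class and its field names are hypothetical, and treating the range endpoints (for example exactly 5% error rate) as YELLOW is an assumption, since the table does not say which side a boundary value falls on.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical container for one round of collected metrics."""
    service_available: bool = True
    security_advisories: int = 0
    performance_advisories: int = 0
    error_rate_pct: float = 0.0
    connection_utilization_pct: float = 0.0
    ci_success_rate_pct: float = 100.0
    long_running_queries: int = 0


def determine_status(s: Snapshot) -> str:
    """Return RED, YELLOW, or GREEN; any RED condition wins over YELLOW."""
    red = (
        s.security_advisories > 0
        or not s.service_available
        or s.error_rate_pct > 5
        or s.connection_utilization_pct > 85
        or s.ci_success_rate_pct < 75
    )
    if red:
        return "RED"
    # Assumption: boundary values (1%, 70%, 90%) count as YELLOW.
    yellow = (
        s.performance_advisories > 0
        or s.error_rate_pct >= 1
        or s.connection_utilization_pct >= 70
        or s.ci_success_rate_pct <= 90
        or s.long_running_queries > 0
    )
    return "YELLOW" if yellow else "GREEN"
```

Evaluating the RED conditions first means a snapshot with both a security advisory and a performance advisory is reported as RED, which matches the intent that the most severe condition decides the overall status.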

Health Metrics

Key Performance Indicators

Platform	Metric	Good	Warning	Critical
Database	Connection %	<70%	70-85%	>85%
Database	Query Duration	<100ms	100-500ms	>500ms
Database	Dead Rows %	<10%	10-20%	>20%
API	Error Rate	<1%	1-5%	>5%
API	Response Time P95	<500ms	500-2000ms	>2000ms
CI/CD	Success Rate	>90%	75-90%	<75%
CI/CD	Build Time	<5min	5-15min	>15min
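One way to apply these thresholds programmatically is a small lookup table plus a classifier. The sketch below is illustrative only: the metric keys and the `Threshold` type are hypothetical names, and boundary values are assumed to belong to the Warning band.

```python
from typing import NamedTuple


class Threshold(NamedTuple):
    warning: float    # value at which the metric leaves the Good band
    critical: float   # value beyond which the metric is Critical
    higher_is_worse: bool = True


# Hypothetical metric keys; the numbers come from the KPI table above.
THRESHOLDS = {
    "db_connection_pct": Threshold(70, 85),
    "db_query_duration_ms": Threshold(100, 500),
    "db_dead_rows_pct": Threshold(10, 20),
    "api_error_rate_pct": Threshold(1, 5),
    "api_response_p95_ms": Threshold(500, 2000),
    "ci_success_rate_pct": Threshold(90, 75, higher_is_worse=False),
    "ci_build_time_min": Threshold(5, 15),
}


def classify(metric: str, value: float) -> str:
    """Map a metric value to Good, Warning, or Critical."""
    t = THRESHOLDS[metric]
    if t.higher_is_worse:
        if value > t.critical:
            return "Critical"
        return "Warning" if value >= t.warning else "Good"
    # Success rate is the one metric where lower values are worse.
    if value < t.critical:
        return "Critical"
    return "Warning" if value <= t.warning else "Good"
```

For example, `classify("db_connection_pct", 72)` falls in the Warning band, while `classify("ci_success_rate_pct", 95)` is Good.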

Platform-Specific Metrics

Supabase

API error rate
Auth failure rate
Storage utilization
Edge function cold starts
Realtime connection count
Advisory count (security/performance)

GitHub Actions

Workflow success rate
Average build time
Queue wait time
Cache hit rate
Failed workflow count

Railway

Service uptime
Deploy success rate
Memory utilization
CPU utilization
Health check pass rate

Postgres

Connection utilization
Query duration distribution
Lock contention
Dead tuple ratio
Index usage efficiency
Table bloat

Health Check Procedures

Quick Health Check (5 min)

Purpose: Verify basic system functionality

Check for active incidents (any platform)
Verify all services responding
Check for critical advisories
Review last hour error rate
Check connection pool status
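The quick check can be driven by a small runner that executes each step and records a pass or fail. This is a sketch under stated assumptions: `run_quick_check` and the check names are hypothetical, and each check is assumed to be a zero-argument callable that returns a truthy value on success. How each check talks to GitHub Actions, Railway, Supabase, or Postgres is left to the caller.

```python
from typing import Callable, Dict


def run_quick_check(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every check and record whether it passed.

    A check that raises is recorded as failed so that one broken probe
    does not stop the remaining checks from running.
    """
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def failed_checks(results: Dict[str, bool]) -> list:
    """Return the names of the checks that did not pass, in run order."""
    return [name for name, ok in results.items() if not ok]
```

A caller would register one callable per step in the list above (incidents, service responses, advisories, error rate, connection pool) and treat any name returned by `failed_checks` as a reason to escalate beyond the quick check.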

Daily Health Check (15 min)

Purpose: Assess overall system health

Weekly Health Check (30 min)

Purpose: Comprehensive review and trending

Monthly Deep Check (1+ hours)

Purpose: Full system audit

Alert Thresholds

Immediate Alerts (Page)

Service unavailable > 1 minute
Error rate > 10%
Database connections > 90%
Security advisory created
Deployment failure (production)
Health check failure > 5 minutes

Warning Alerts (Slack/Email)

Error rate > 2%
Database connections > 75%
Performance advisory created
Build time increase > 50%
Response time P95 > 1s
Disk usage > 80%

Info Alerts (Daily Digest)

New advisory (any type)
Build time change
Resource trend change
Configuration change
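The numeric page and warning thresholds above could be evaluated with a rule table like the one below. This is a rough sketch: the metric keys are hypothetical, missing metrics are assumed to be zero, and suppressing a warning when the same metric has already paged is a design choice rather than something the thresholds prescribe. The event-based Info alerts are not modeled.

```python
# (channel, hypothetical metric key, threshold, message); page rules come first.
RULES = [
    ("page", "service_down_minutes", 1, "Service unavailable > 1 minute"),
    ("page", "error_rate_pct", 10, "Error rate > 10%"),
    ("page", "db_connection_pct", 90, "Database connections > 90%"),
    ("page", "new_security_advisories", 0, "Security advisory created"),
    ("page", "failed_production_deploys", 0, "Deployment failure (production)"),
    ("page", "health_check_failing_minutes", 5, "Health check failure > 5 minutes"),
    ("warn", "error_rate_pct", 2, "Error rate > 2%"),
    ("warn", "db_connection_pct", 75, "Database connections > 75%"),
    ("warn", "new_performance_advisories", 0, "Performance advisory created"),
    ("warn", "build_time_increase_pct", 50, "Build time increase > 50%"),
    ("warn", "response_p95_ms", 1000, "Response time P95 > 1s"),
    ("warn", "disk_usage_pct", 80, "Disk usage > 80%"),
]


def route_alerts(metrics: dict) -> dict:
    """Return the alert messages to send, grouped by channel."""
    routed = {"page": [], "warn": []}
    paged = set()
    for channel, key, threshold, message in RULES:
        if metrics.get(key, 0) <= threshold:
            continue
        if channel == "warn" and key in paged:
            continue  # the same metric already triggered a page
        if channel == "page":
            paged.add(key)
        routed[channel].append(message)
    return routed
```

With this ordering, an error rate of 12% produces a single page instead of a page plus a redundant warning.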

Health Report Template

Infrastructure Health Report

Generated: {TIMESTAMP}
Report Type: {Quick | Daily | Weekly | Monthly}
Overall Status: {GREEN | YELLOW | RED}

Executive Summary

{2-3 sentence overview}

Platform Status

Platform	Status	Issues	Warnings
GitHub Actions	{STATUS}	{N}	{N}
Railway	{STATUS}	{N}	{N}
Supabase	{STATUS}	{N}	{N}
Postgres	{STATUS}	{N}	{N}

Key Metrics

Database

Connections: {N}/{MAX} ({PCT}%)
Query P95: {MS}ms
Dead Rows: {PCT}%

API

Error Rate: {PCT}%
Response Time P95: {MS}ms

CI/CD

Success Rate: {PCT}%
Avg Build Time: {MIN}m

Advisories

Security

{List or "None"}

Performance

{List or "None"}

Issues Requiring Attention

Immediate

{List or "None"}

This Week

{List or "None"}

Trends

{Notable changes from previous period}

Recommendations

{Specific actions to improve health}

Next health check: {TIMESTAMP}
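Filling in the top of this template can be automated. The sketch below renders the header fields and the Platform Status table only. The function name and the shape of the `platforms` dictionary are hypothetical, and deriving the overall status as the worst platform status is an assumption the template does not state.

```python
from datetime import datetime, timezone

SEVERITY = {"GREEN": 0, "YELLOW": 1, "RED": 2}


def render_report_header(report_type, platforms, generated=None):
    """Render the header and Platform Status table as tab-separated text.

    `platforms` maps a platform name to a dict with the hypothetical keys
    'status', 'issues', and 'warnings'.
    """
    generated = generated or datetime.now(timezone.utc)
    # Assumption: the overall status is the most severe platform status.
    overall = max(
        (row["status"] for row in platforms.values()),
        key=SEVERITY.__getitem__,
        default="GREEN",
    )
    lines = [
        "Infrastructure Health Report",
        "",
        f"Generated: {generated.isoformat(timespec='seconds')}",
        f"Report Type: {report_type}",
        f"Overall Status: {overall}",
        "",
        "Platform Status",
        "",
        "Platform\tStatus\tIssues\tWarnings",
    ]
    for name, row in platforms.items():
        lines.append(f"{name}\t{row['status']}\t{row['issues']}\t{row['warnings']}")
    return "\n".join(lines)
```

The remaining sections (Key Metrics, Advisories, Issues, Trends, Recommendations) would be appended in the same way once their data is collected.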

Remediation Playbooks

High Connection Utilization

Check for connection leaks
Identify idle connections
Review connection pool settings
Consider connection pooler (PgBouncer/Supavisor)
Optimize application connection handling
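For the first two steps, Postgres exposes current sessions through the `pg_stat_activity` view. The sketch below groups connections by application and state, which tends to make leaks and idle sessions visible. It assumes `conn` is a DB-API connection to Postgres (for example one opened with psycopg) whose cursor returns tuples; the function name is hypothetical.

```python
CONNECTION_BREAKDOWN_SQL = """
SELECT application_name,
       state,
       count(*) AS connections,
       max(now() - state_change) AS longest_in_state
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY application_name, state
ORDER BY connections DESC
"""


def connection_breakdown(conn):
    """Return (application_name, state, connections, longest_in_state) rows.

    Assumes `conn` follows the Python DB-API and points at Postgres.
    """
    cur = conn.cursor()
    try:
        cur.execute(CONNECTION_BREAKDOWN_SQL)
        return cur.fetchall()
    finally:
        cur.close()
```

A large count in the `idle` or `idle in transaction` state for a single application, combined with a long `longest_in_state`, is a common sign that the application is not returning connections to its pool.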

High Error Rate

Identify error types
Check recent deployments
Review affected endpoints
Check downstream dependencies
Roll back if deployment-related

Slow Queries

Identify slow queries (pg_stat_statements)
Run EXPLAIN ANALYZE
Check for missing indexes
Review query patterns
Consider query optimization or caching
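The first step can be sketched as a query against `pg_stat_statements`, flagging statements against the 100ms and 500ms query-duration thresholds from the KPI table. Several things here are assumptions: the extension must be installed and enabled, the column names `mean_exec_time` and `total_exec_time` apply to PostgreSQL 13 and later (older versions use `mean_time` and `total_time`), `conn` is a DB-API connection using the `%s` parameter style (as psycopg does), and the function name is hypothetical.

```python
SLOW_QUERY_SQL = """
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT %s
"""


def find_slow_queries(conn, limit=20, warning_ms=100.0, critical_ms=500.0):
    """Return statements whose mean execution time reaches the warning band.

    Assumes `conn` follows the Python DB-API and its cursor returns tuples.
    """
    cur = conn.cursor()
    try:
        cur.execute(SLOW_QUERY_SQL, (limit,))
        rows = cur.fetchall()
    finally:
        cur.close()

    flagged = []
    for query, calls, mean_ms, total_ms in rows:
        if mean_ms < warning_ms:
            continue  # within the Good band
        severity = "Critical" if mean_ms > critical_ms else "Warning"
        flagged.append(
            {"query": query, "calls": calls, "mean_ms": mean_ms, "severity": severity}
        )
    return flagged
```

Each flagged statement is a candidate for the next step, running EXPLAIN ANALYZE, starting with the Critical entries.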

Build Failures

Review failure logs
Check for flaky tests
Verify dependencies available
Check for environment issues
Review recent changes

See checklists.md for detailed health check checklists.

