error-patterns

Error Patterns Skill

Overview

This skill provides knowledge for recognizing, categorizing, and resolving common infrastructure errors. It covers error classification, diagnostic techniques, and resolution strategies.

Error Classification Framework

By Severity

| Severity | Definition | Response Time | Example |
|----------|------------|---------------|---------|
| Critical | Service completely down | Immediate | Database unreachable |
| High | Major functionality broken | < 1 hour | Auth failures |
| Medium | Partial functionality affected | < 4 hours | Slow queries |
| Low | Minor issues, workarounds exist | < 24 hours | Deprecation warnings |
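
The response-time targets can be encoded so that tooling can flag incidents that have waited too long. This is a minimal sketch: the names `RESPONSE_TARGET` and `is_overdue` are hypothetical, and modelling "Immediate" as a zero-length target is an assumption.

```python
from datetime import timedelta

# Response-time targets taken from the severity table.
# Assumption: "Immediate" is modelled as a zero-length target.
RESPONSE_TARGET = {
    "critical": timedelta(0),
    "high": timedelta(hours=1),
    "medium": timedelta(hours=4),
    "low": timedelta(hours=24),
}


def is_overdue(severity: str, elapsed: timedelta) -> bool:
    """True once an unacknowledged incident has reached its response-time target."""
    return elapsed >= RESPONSE_TARGET[severity]
```

Under these assumptions a critical incident counts as overdue as soon as it is raised, which matches the intent of an immediate response.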

By Category

| Category | Subcategories | Typical Causes |
|----------|---------------|----------------|
| Database | Connection, Query, Transaction, Replication | Pool exhaustion, locks, slow queries |
| Network | DNS, Timeout, Connection | Misconfiguration, service down |
| Authentication | Token, Permission, Provider | Expired tokens, wrong credentials |
| Application | Logic, Memory, Timeout | Bugs, resource leaks |
| Infrastructure | Disk, CPU, Memory | Resource exhaustion |
| External | API, Service, Rate limit | Third-party issues |
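
A first-pass categorizer can be sketched as keyword matching on the error message. The keyword lists below are illustrative assumptions, not a catalog from this skill, and real messages will need patterns tuned to the platforms involved.

```python
import re

# Hypothetical keyword patterns, one per category from the table.
# Order matters: the first matching category wins.
CATEGORY_PATTERNS = [
    ("Database", re.compile(r"deadlock|too many connections|replication|transaction", re.I)),
    ("Authentication", re.compile(r"token|unauthorized|forbidden|credential", re.I)),
    ("Network", re.compile(r"dns|connection refused|timed? ?out|unreachable", re.I)),
    ("Infrastructure", re.compile(r"disk|no space left|cpu|out of memory|\boom\b", re.I)),
    ("External", re.compile(r"rate limit|429|upstream|third-party", re.I)),
]


def categorize(message: str) -> str:
    """Return the first category whose pattern matches, else 'Application'."""
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(message):
            return category
    return "Application"
```

Falling back to "Application" is a design choice for this sketch: anything that does not look like a known platform error is treated as a likely code issue until investigated.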

By Pattern Type

| Pattern | Description | Example |
|---------|-------------|---------|
| Transient | Self-resolving, retry works | Network blip |
| Persistent | Consistent, needs fix | Misconfiguration |
| Cascading | One failure causes others | DB down → API errors |
| Intermittent | Random occurrence | Race condition |
| Load-dependent | Appears under load | Connection exhaustion |
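
The transient versus persistent distinction matters most when deciding whether to retry. The sketch below retries only errors marked as transient, with exponential backoff and jitter. `TransientError` and `retry_transient` are hypothetical names, and the default attempt count and delay are assumptions.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors expected to clear on their own."""


def retry_transient(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying only TransientError with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # still failing after the last attempt: treat as persistent
            # Delay doubles each attempt; jitter avoids synchronized retries.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Any other exception propagates immediately, since retrying a persistent error such as a misconfiguration only adds load.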

Diagnostic Methodology

The 5 Whys

Dig deeper for root cause:

Symptom: API returning 500 errors
Why? → Database query failing
Why? → Connection timeout
Why? → Connection pool exhausted
Why? → Connections not released
Why? → Missing finally block in error handler

ROOT CAUSE: Code bug in error handling
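
The root cause in this example, a connection that is never released when the handler raises, can be illustrated as follows. `ConnectionPool` here is a toy counter standing in for a real driver's pool, and `run_query` is a hypothetical helper.

```python
class ConnectionPool:
    """Toy pool that only counts checked-out connections (illustrative, not a real driver)."""

    def __init__(self, size: int):
        self.size = size
        self.in_use = 0

    def acquire(self) -> object:
        if self.in_use >= self.size:
            raise RuntimeError("connection pool exhausted")
        self.in_use += 1
        return object()  # stand-in for a real connection

    def release(self, conn: object) -> None:
        self.in_use -= 1


def run_query(pool: ConnectionPool, query):
    """Run query(conn) and always return the connection, even if query raises."""
    conn = pool.acquire()
    try:
        return query(conn)
    finally:
        pool.release(conn)  # without this, each failed query leaks one connection
```

With the `finally` block removed, a pool of size 2 would be exhausted after two failed queries, which is the chain the 5 Whys above walks back through.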

Timeline Analysis

Map events chronologically:

T-60m: Deployment completed
T-45m: Memory usage started climbing
T-30m: First slow query warning
T-15m: Connection pool warnings
T-0: Service unavailable
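
A timeline in this format can be produced from raw timestamps. This is a minimal sketch: `build_timeline` is a hypothetical helper that assumes events arrive as `(timestamp, description)` pairs and that the incident time is already known.

```python
from datetime import datetime


def build_timeline(events, incident_time):
    """Sort (timestamp, description) pairs and label each with minutes relative to the incident."""
    lines = []
    for ts, description in sorted(events):
        minutes = round((ts - incident_time).total_seconds() / 60)
        label = "T-0" if minutes == 0 else f"T{minutes:+d}m"
        lines.append(f"{label}: {description}")
    return lines
```

Events after the incident get a positive offset (for example `T+5m`), which helps when recovery steps belong on the same timeline.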

Fault Tree

Break down possible causes:

            [Service Down]
                  |
    +-------------+-------------+
    |             |             |
[Database]    [Network]    [Application]
    |             |             |
+---+---+     +---+---+     +---+---+
|       |     |       |     |       |
[Conn] [Query] [DNS]  [FW]  [OOM]  [Bug]
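
The same tree can be held as data so that every candidate cause is enumerated and checked off in turn. This is a sketch: the nested-dict representation and the `leaf_paths` helper are assumptions, not part of the skill.

```python
# The fault tree from the diagram as a nested dict; leaves are candidate causes.
FAULT_TREE = {
    "Service Down": {
        "Database": {"Conn": {}, "Query": {}},
        "Network": {"DNS": {}, "FW": {}},
        "Application": {"OOM": {}, "Bug": {}},
    }
}


def leaf_paths(tree, prefix=()):
    """Yield the path from the root to every leaf, depth first."""
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            yield from leaf_paths(children, path)
        else:
            yield path
```

Iterating over `leaf_paths(FAULT_TREE)` gives one checklist entry per leaf, such as `("Service Down", "Database", "Conn")`.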

Error Resolution Process

Step 1: Identify

What is the exact error message?
When did it start?
What's the impact?

Step 2: Categorize

Which category does this fall into?
Is it transient or persistent?
What's the severity?

Step 3: Investigate

Gather relevant logs
Check recent changes
Look for patterns

Step 4: Diagnose

Apply 5 Whys
Build timeline
Identify root cause

Step 5: Remediate

Apply immediate fix
Verify resolution
Document for prevention

Error Correlation Techniques

Cross-Platform Correlation

Match errors across systems:

14:30:01 [Railway] Connection refused to db:5432
14:30:01 [Supabase] Too many connections
14:30:00 [GitHub] Deployment completed
         ↑ Deployment triggered connection spike
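
Matching by timestamp can be sketched as grouping events that fall close together and keeping only the groups that involve more than one platform. `correlate` is a hypothetical helper, and the 5-second window is an assumption to tune against clock skew between systems.

```python
from datetime import datetime, timedelta


def correlate(events, window=timedelta(seconds=5)):
    """Group (timestamp, source, message) events whose neighbours are within `window` of each other."""
    groups = []
    for event in sorted(events):
        if groups and event[0] - groups[-1][-1][0] <= window:
            groups[-1].append(event)  # close enough to the previous event: same group
        else:
            groups.append([event])
    # Only groups spanning more than one source suggest a cross-platform link.
    return [g for g in groups if len({source for _, source, _ in g}) > 1]
```

Each returned group is sorted by time, so the earliest event (here the deployment) is the first candidate trigger to examine. Proximity in time is only a hint and still needs confirming against the logs.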

Error Chains

Follow the cascade:

[1] Initial: Database connection timeout
[2] Result: API endpoint returns 500
[3] Result: Frontend shows error page
[4] Result: User reports "site is down"

Impact Mapping

Error: Auth service down
├── Direct Impact
│   └── No new logins
├── Cascade Impact
│   ├── API requests fail (no token validation)
│   └── Realtime connections drop
└── User Impact
    └── All users affected
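
Both the error chain and the impact map follow dependencies outward from the failing component. A minimal sketch of that walk, assuming a hand-maintained dependency map (the `DEPENDENTS` contents and the `cascade` helper are hypothetical):

```python
from collections import deque

# Hypothetical dependency map: each key lists the components that depend on it.
DEPENDENTS = {
    "auth": ["api", "realtime"],
    "api": ["frontend"],
    "realtime": ["frontend"],
    "frontend": [],
}


def cascade(failed, dependents):
    """Breadth-first walk returning every component affected by `failed`, nearest first."""
    affected, queue, seen = [], deque([failed]), {failed}
    while queue:
        for child in dependents.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected
```

Breadth-first order lists direct impact before cascade impact, mirroring the layout of the impact map.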

Resolution Strategies

Immediate Mitigation

| Strategy | Use When | Example |
|----------|----------|---------|
| Rollback | Recent deployment caused issue | git revert |
| Restart | Service stuck/crashed | Container restart |
| Scale up | Resource exhaustion | Add replicas |
| Failover | Primary system down | Switch to backup |
| Rate limit | Overload | Block/throttle traffic |
| Circuit break | Cascading failures | Disable failing component |
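
Of these strategies, circuit breaking is the one most often implemented in application code. The sketch below is a deliberately small illustration, not a production implementation: the `CircuitBreaker` class, its failure threshold, and its cooldown are all assumptions.

```python
import time


class CircuitBreaker:
    """Minimal breaker: opens after `threshold` consecutive failures, retries after `cooldown` seconds."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: call skipped")
            self.opened_at = None  # cooldown elapsed: allow one trial call (half-open)
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

While the circuit is open, callers fail fast instead of piling more requests onto the failing component, which is what stops the cascade.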

Root Cause Fix

| Cause | Fix Approach |
|-------|--------------|
| Code bug | Deploy fix, add tests |
| Configuration | Update config, validate |
| Resource limit | Increase limits or optimize |
| External dependency | Add retry/fallback |
| Infrastructure | Scale or redesign |

Prevention

| Issue | Prevention |
|-------|------------|
| Connection leaks | Connection pooling, timeouts |
| Memory leaks | Profiling, limits |
| Slow queries | Indexes, query optimization |
| Deployment failures | Canary deployments, rollback automation |
| External failures | Circuit breakers, fallbacks |

Common Resolution Templates

Database Connection Issues

Issue: Database Connection Error

Immediate Actions

Check connection count: SELECT count(*) FROM pg_stat_activity;
Identify connections stuck idle inside a transaction: SELECT * FROM pg_stat_activity WHERE state = 'idle in transaction';
Kill stuck connections if safe: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE ...;

Root Cause Fix

Add connection pooling (PgBouncer)
Implement connection timeouts
Fix connection leak in application code

Prevention

Monitor connection metrics
Alert on pool usage > 80%
Regular connection audits
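
The 80% alert can be sketched as a simple threshold check. This assumes `active` comes from the `pg_stat_activity` count above and `max_connections` from the server's configured limit (for example via `SHOW max_connections;`). The function name and message format are hypothetical.

```python
def pool_usage_alert(active: int, max_connections: int, threshold: float = 0.80):
    """Return a warning string when active/max exceeds threshold, else None."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    usage = active / max_connections
    if usage > threshold:
        return f"connection pool at {usage:.0%} ({active}/{max_connections})"
    return None
```

In practice the check would run on a schedule and feed whatever alerting channel is already in place.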

API Error Spike

Issue: API 500 Errors

Immediate Actions

Check API logs for error pattern
Identify failing endpoint(s)
Check downstream dependencies

Root Cause Fix

Fix code bug causing exception
Handle edge cases
Add proper error handling

Prevention

Add error monitoring
Implement circuit breakers
Add integration tests

See common-errors.md for a catalog of specific errors and solutions.

