Error Handling
- Never expose stack traces to clients—log internally, return generic message
- Structured error responses: code, message, request ID—enables debugging without leaking
- Fail fast on bad input—validate at entry point, not deep in business logic
- Unexpected errors: 500 + alert—expected errors: appropriate 4xx
Input Validation
- Validate everything from outside—query params, headers, body, path params
- Whitelist valid input, don't blacklist bad—reject unknown fields
- Validate early, before any processing—save resources, clearer errors
- Size limits on all inputs—prevent memory exhaustion attacks
Timeouts Everywhere
- Database queries: set timeout, typically 5-30s
- External HTTP calls: connect timeout + read timeout—don't wait forever
- Overall request timeout—gateway or middleware level
- Background jobs: max execution time—prevent zombie processes
Retry Patterns
- Exponential backoff: 1s, 2s, 4s, 8s...—prevents thundering herd
- Add jitter: randomize delay—prevents synchronized retries
- Idempotency keys for non-idempotent operations—safe to retry
- Circuit breaker for failing dependencies—stop hammering, fail fast
Database Practices
- Connection pooling: reuse connections—creating is expensive
- Transactions scoped minimal—hold locks briefly
- Read replicas for read-heavy workloads—separate read/write traffic
- Prepared statements always—SQL injection prevention, query plan cache
Caching Strategy
- Cache invalidation strategy decided upfront—TTL, event-based, or both
- Cache at right layer: query result, computed value, HTTP response
- Cache stampede prevention—lock or probabilistic early expiration
- Monitor hit rate—low hit rate = wasted resources
Rate Limiting
- Per-user/IP limits on expensive operations—login, signup, search
- Different limits for different operations—read vs write
- Return Retry-After header—tell clients when to retry
- Rate limit early in request pipeline—save resources
Health Checks
- Liveness: is process running—restart if fails
- Readiness: can handle traffic—remove from load balancer if fails
- Startup probe for slow-starting services—don't kill during init
- Health checks fast and cheap—don't hit database on every probe
Graceful Shutdown
- Stop accepting new requests first—drain load balancer
- Wait for in-flight requests to complete—with timeout
- Close database connections cleanly—prevent connection leaks
- SIGTERM handling: graceful; SIGKILL after timeout
Logging
- Structured logs (JSON)—parseable by log aggregators
- Request ID in every log—trace request across services
- Log level appropriate: debug for dev, info/error for prod
- Sensitive data never logged—passwords, tokens, PII
API Design
- Versioning strategy from day one—path (/v1/) or header
- Pagination for list endpoints—cursor or offset; include total count
- Consistent response format—same envelope everywhere
- Meaningful status codes—201 for create, 204 for delete, 404 for not found
Security Hygiene
- Secrets from environment or vault—never in code or config files
- Dependencies updated regularly—automated with Dependabot/Renovate
- Principle of least privilege—service accounts with minimal permissions
- Authentication and authorization separated—who you are vs what you can do
Observability
- Metrics: request count, latency percentiles, error rate—the RED method
- Distributed tracing for microservices—follow request across services
- Alerting on symptoms, not causes—high error rate, not CPU usage
- Dashboards for operational visibility—know normal to spot abnormal