Production Readiness (Meta-Skill)
Coordinates all operational concerns into a single readiness review. Instead of duplicating domain expertise, this skill routes to specialized skills and agents for each area, then synthesizes results into a unified go/no-go assessment.
Installation
OpenClaw / Moltbot / Clawbot
npx clawhub@latest install production-readiness
Purpose
Ensure a service is production-ready by systematically checking every operational concern — logging, error handling, performance, security, deployment, testing, and documentation — before traffic hits it.
A production-ready service:
- Fails gracefully under load and partial outages
- Observes itself with structured logs, metrics, and traces
- Recovers automatically from transient failures
- Communicates health to orchestrators and operators
- Documents operations so on-call engineers can respond without tribal knowledge
When to Use
| Trigger | Context |
|---|---|
| Before first deploy | New service going to production for the first time |
| Before major release | Significant feature or architectural change shipping |
| Quarterly production review | Scheduled audit of existing services |
| After incident | Post-incident hardening to prevent recurrence |
| Dependency upgrade | Major framework, runtime, or infrastructure change |
| Team handoff | Transferring ownership of a service to another team |
Orchestration Flow
Run each area sequentially or in parallel. Each step delegates to a specialized skill or agent — this skill does not re-implement their logic.
┌─────────────────────────────────────────────────┐
│ Production Readiness Review │
├─────────────────────────────────────────────────┤
│ │
│ 1. Logging & Observability ──► logging-observability skill
│ 2. Error Handling ───────────► error-handling-patterns skill
│ 3. Performance ──────────────► performance-agent
│ 4. Security ─────────────────► security-review meta-skill
│ 5. Deployment ───────────────► deployment-agent + docker-expert skill
│ 6. Testing ──────────────────► testing-workflow meta-skill
│ 7. Documentation ────────────► /generate-docs command
│ │
│ ──► Synthesize results into go/no-go report │
└─────────────────────────────────────────────────┘
Step Details
- Logging & Observability — Structured logging, log levels, correlation IDs, metrics endpoints, distributed tracing, alerting rules
- Error Handling — Global error boundaries, retry policies, dead-letter queues, error classification, user-facing error messages
- Performance — Load testing results, P95/P99 latency baselines, memory/CPU profiling, database query analysis, caching strategy
- Security — Auth/authz verification, input validation, dependency audit, secrets management, OWASP top-10 review
- Deployment — Container hardening, rollback strategy, blue-green/canary configuration, infrastructure-as-code review
- Testing — Unit/integration/e2e coverage, contract tests, chaos/failure injection, smoke tests in staging
- Documentation — API docs, runbooks, architecture diagrams, on-call playbooks, ADRs for key decisions
Skill Routing Table
| Concern | Skill / Agent | Path |
|---|---|---|
| Logging & Observability | logging-observability | ai/skills/tools/logging-observability/SKILL.md |
| Error Handling | error-handling-patterns | ai/skills/backend/error-handling-patterns/SKILL.md |
| Performance | performance-agent | ai/agents/performance/ |
| Security | security-review | ai/skills/meta/security-review/SKILL.md |
| Deployment (containers) | docker-expert | ai/skills/devops/docker/SKILL.md |
| Deployment (pipelines) | deployment-agent | ai/agents/deployment/ |
| Testing | testing-workflow | ai/skills/testing/testing-workflow/SKILL.md |
| Rate Limiting | rate-limiting-patterns | ai/skills/backend/rate-limiting-patterns/SKILL.md |
| Documentation | /generate-docs | ai/commands/documentation/ |
Routing rule: Read the target skill first, follow its instructions, then return results here for synthesis.
Production Readiness Checklist
Health & Lifecycle
- Health check endpoint (
/healthzor/health) returns dependency status - Readiness probe distinguishes "starting" from "ready to serve"
- Liveness probe detects deadlocks and unrecoverable states
- Graceful shutdown drains in-flight requests before exit
- Startup probe allows for slow initialization without false restarts
Resilience
- Circuit breakers on all external service calls
- Retry with exponential backoff and jitter on transient failures
- Rate limiting configured per endpoint and per client
- Backpressure mechanisms prevent cascade failures under load
- Timeouts set on every outbound call (HTTP, DB, queue)
- Bulkhead isolation separates critical from non-critical paths
Configuration & Secrets
- All configuration externalized (env vars, config service, or feature flags)
- No secrets in code, images, or environment variable defaults
- Secrets loaded from a vault (e.g., AWS Secrets Manager, HashiCorp Vault)
- Configuration changes do not require redeployment
- Feature flags in place for high-risk changes
Data Safety
- Backup strategy defined and tested (RPO/RTO documented)
- Restore procedure verified in a non-production environment
- Database migrations are backward-compatible and reversible
- Data retention policies implemented and enforced
Operational Readiness
- Runbooks exist for top 5 most likely failure scenarios
- SLOs defined (availability, latency, error rate) with error budgets
- SLAs communicated to dependent teams or customers
- On-call rotation staffed and escalation path documented
- Dashboards show golden signals (latency, traffic, errors, saturation)
- Alerting rules configured with appropriate thresholds and severity
Maturity Levels
| Level | Name | Requirements |
|---|---|---|
| L1 | MVP | Health check, basic logging, error handling, manual deploy, unit tests, README |
| L2 | Stable | Structured logging, metrics, graceful shutdown, CI/CD pipeline, integration tests, runbooks |
| L3 | Resilient | Distributed tracing, circuit breakers, auto-scaling, chaos testing, SLOs, on-call rotation |
| L4 | Optimized | Adaptive rate limiting, predictive alerting, canary deploys, full observability, error budgets, postmortem culture |
Progression Guidance
- L1 → L2: Add structured logging, metrics endpoint, and a CI/CD pipeline. Write runbooks for known failure modes.
- L2 → L3: Instrument distributed tracing. Add circuit breakers to external calls. Define SLOs and set up on-call.
- L3 → L4: Implement canary deployments. Adopt error budgets. Run regular game days. Build predictive alerting.
Incident Response
On-Call Rotation
- Minimum two engineers per rotation (primary + secondary)
- Handoff includes review of recent deploys, open issues, and known risks
- Escalation targets defined: primary → secondary → engineering lead → VP Eng
Escalation Matrix
| Severity | Response Time | Escalation After | Stakeholder Notification |
|---|---|---|---|
| SEV-1 (outage) | 15 min | 30 min | Immediate — exec + customers |
| SEV-2 (degraded) | 30 min | 1 hour | Within 1 hour — eng lead |
| SEV-3 (minor) | 4 hours | Next business day | Daily standup |
| SEV-4 (cosmetic) | Next sprint | N/A | Backlog |
Postmortem Template
## Incident: [Title]
**Date:** YYYY-MM-DD | **Duration:** X hours | **Severity:** SEV-N
### Summary
One-paragraph description of what happened and impact.
### Timeline
- HH:MM — First alert fired
- HH:MM — Engineer paged, investigation started
- HH:MM — Root cause identified
- HH:MM — Mitigation applied
- HH:MM — Full resolution confirmed
### Root Cause
What broke and why. Link to code/config change if applicable.
### Impact
- Users affected: N
- Revenue impact: $X (if applicable)
- SLO budget consumed: X%
### Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| Fix X | @eng | YYYY-MM-DD | Open |
### Lessons Learned
- What went well
- What went poorly
- Where we got lucky
NEVER Do
- NEVER skip health checks — every service must expose health endpoints; no exceptions for "simple" services
- NEVER store secrets in code or container images — use a secrets manager; never default env vars with real values
- NEVER deploy without a rollback plan — if you cannot roll back in under 5 minutes, you are not ready to deploy
- NEVER ignore error budget violations — when the error budget is exhausted, freeze feature work and fix reliability
- NEVER treat logging as optional — a service without structured logging is a service you cannot debug at 3 AM
- NEVER go to production without runbooks — if on-call cannot resolve the top 5 failure modes without the original author, the service is not production-ready