Operational Readiness Checklist
Comprehensive checklist to validate services before production launch. Analyzes codebase + asks interactive questions for items that cannot be detected from code.
Workflow Overview
- Gather context - Identify service type, tech stack, and traffic expectations
- Analyze codebase - Scan for CI/CD configs, infrastructure code, security patterns
- Interactive verification - Ask about items that cannot be detected from code
- Generate report - Produce checklist report with priorities and remediation guidance
Step 1: Gather Context
Ask the user these questions using AskUserQuestion:
Service Classification:
- Service type: Backend API, Frontend/Web App, Infrastructure/Platform, or Hybrid
- Expected traffic: <100 req/min (low), 100-1000 req/min (medium), >1000 req/min (high)
- Data handling: Stores user data (yes/no), Processes PII (yes/no)
- Public-facing: Yes/No
- Has email functionality: Yes/No
- Uses database: Yes/No (if yes, which: PostgreSQL, Supabase, DynamoDB, etc.)
Tech Stack Detection: Auto-detect from files:
Cargo.toml→ Rust servicepackage.json→ Node.js/TypeScript*.tfor*.tfvars→ Terraformcdk.jsonor*.cdk.ts→ AWS CDK.github/workflows/*.yml→ GitHub Actions CI/CDnext.config.js→ Next.js frontendDockerfile→ Containerized service
Step 2: Codebase Analysis
Analyze the codebase for evidence of checklist items. Use Glob and Grep to find:
CI/CD Detection:
.github/workflows/*.yml - GitHub Actions
Cargo.toml + [profile.release] - Rust build config
jest.config.* / vitest.config.* - Test configuration
*.tf - Terraform files
cdk.json - CDK configuration
Security Detection:
**/security*.yml - Security scanning workflows
dependabot.yml - Dependency updates
CODEOWNERS - Code ownership
*.lock files - Dependency locking
Observability Detection:
**/tracing*.rs or opentelemetry* - Distributed tracing
sentry.* or @sentry/* - Error tracking
prometheus* or metrics* - Metrics collection
**/logging*.* or log4* or tracing* - Logging config
Infrastructure Detection:
**/autoscaling* in .tf files - Autoscaling config
**/secretsmanager* or **/ssm* - Secrets management
health* endpoints in code - Health checks
Step 3: Interactive Verification
For items that cannot be detected from code, ask yes/no questions. Group questions by category to avoid overwhelming the user.
Step 4: Generate Report
Output format:
# Operational Readiness Report: [Service Name]
**Service Type:** [Backend API / Frontend / Infrastructure]
**Tech Stack:** [Detected stack]
**Generated:** [Date]
## Summary
- **Overall Readiness:** [X/Y items passing] ([Z%])
- **Launch Blockers (P0):** [count]
- **High Priority (P1):** [count]
- **Medium Priority (P2):** [count]
- **Low Priority (P3):** [count]
## Observability
| Item | Status | Priority | Notes |
|------|--------|----------|-------|
| ... | ✅/❌/⚠️ | P0-P3 | ... |
[Repeat for each category]
## Remediation Summary
[List failing items with links to remediation guidance]
Checklist Items by Category
Observability (O11Y)
| Item | Priority | Applies To | Detection Method |
|---|---|---|---|
| Alarmable top-level metric OR Canary (OpsGenie integrated) | P0 | High traffic (>100 req/min) | Ask |
| Canary coverage (if <100 req/min) | P0 | Low traffic | Ask |
| DB/Queue monitoring (CPU/Disk/Memory) | P1 | Services with DB/Queue | Ask |
| Logging configured and viewable | P1 | All | Grep for logging config |
| Audit/security log retention (min 1 year for SOC 2 Type 2) | P1 | All | Ask |
| Distributed tracing (OpenTelemetry/Jaeger) | P2 | Backend services | Grep for otel/tracing |
| Sentry instrumentation | P1 | Frontend only | Grep for @sentry |
| status.reown.com integration | P3 | Public-facing | Ask |
Note on log retention scope: The 1-year minimum retention applies specifically to audit/security event logs — authentication attempts, authorization decisions, admin actions, data access events, and configuration changes. General application logs and error tracking (e.g. Sentry) are not subject to this requirement. This aligns with SOC 2 Type 2 audit trail requirements.
Remediation: See references/remediation-o11y.md
CI/CD & Testing
| Item | Priority | Applies To | Detection Method |
|---|---|---|---|
| CI runs unit/functional tests (>80% critical path coverage) | P0 | All | Check workflow files |
| CD runs integration/e2e tests | P1 | All | Check workflow files |
| Load testing performed | P1 | High traffic / user-facing | Ask |
| Rollback procedure documented and tested | P1 | All | Ask |
| Post-deploy health checks | P2 | All | Check workflow files |
Remediation: See references/remediation-cicd.md
Primitives (Infrastructure)
| Item | Priority | Applies To | Detection Method |
|---|---|---|---|
| Runbook documented (failure modes, troubleshooting, escalation) | P0 | All | Ask |
| Infrastructure as code (Terraform/CDK) | P0 | All | Check for .tf or cdk files |
| Autoscaling configured | P1 | Backend services | Grep .tf for autoscaling |
| Healthcheck endpoint (memory, filesystem, dependencies) | P1 | All | Grep for /health endpoint |
| Multi-AZ deployment (2+ pods/instances) | P1 | All | Ask |
| Secrets management (AWS SM, Vault) - no secrets in code | P0 | All | Grep for hardcoded secrets, check .tf |
| Configuration management (env separation) | P2 | All | Check for env-specific configs |
| Data Lake integration | P3 | Analytics needs | Ask |
Remediation: See references/remediation-primitives.md
Security
| Item | Priority | Applies To | Detection Method |
|---|---|---|---|
| OWASP Top 10 2025 validation | P0 | All | Ask |
| Secure design review (threat modeling) | P1 | All | Ask |
| Dependency scanning enabled + SBOM | P1 | All | Check for dependabot, snyk |
| Software/data integrity (code signing, CI/CD security) | P2 | All | Ask |
| Fail-secure exception handling | P1 | All | Code review |
| Service-to-service auth (mTLS, JWT, API keys) | P1 | Backend with internal APIs | Ask |
| Clickjacking headers (X-Frame-Options, CSP) | P1 | Frontend only | Grep for security headers |
| SPF records | P2 | Services with email | Ask |
| DKIM records | P2 | Services with email | Ask |
| RLS policies (Supabase/DB) | P0 | Services with Supabase | Ask |
| Rate limiting | P1 | Public APIs | Grep for rate limit config |
| DDoS protection (Cloudflare/AWS Shield) | P1 | Public-facing | Ask |
| API authentication | P1 | Public APIs | Grep for auth middleware |
| Audit logging (auth, admin, data access) | P2 | All | Grep for audit log |
Remediation: See references/remediation-security.md
3rd Party Services
| Item | Priority | Applies To | Detection Method |
|---|---|---|---|
| Metrics integration for 3rd parties | P2 | Services using 3rd parties | Ask |
| Status page integration (Slack channel minimum) | P2 | Services using 3rd parties | Ask |
| RPC rate limits configured | P1 | Services using RPCs | Ask |
Remediation: See references/remediation-dependencies.md
Service Dependencies
| Item | Priority | Applies To | Detection Method |
|---|---|---|---|
| Upstream dependencies documented | P1 | All | Ask |
| Downstream dependencies documented | P1 | All | Ask |
| Dependency health in service health endpoint | P2 | All | Code review |
| Fallback behavior for non-critical deps | P2 | All | Ask |
Remediation: See references/remediation-dependencies.md
Data Retention & Privacy
| Item | Priority | Applies To | Detection Method |
|---|---|---|---|
| Data retention policy defined | P1 | Services with persistent data | Ask |
| GDPR: Personal data identified | P1 | Services handling user data | Ask |
| GDPR: DSAR process defined | P1 | Services handling user data | Ask |
| GDPR: Right to be forgotten process | P1 | Services handling user data | Ask |
| Privacy policy updated | P2 | User-facing services | Ask |
| DPAs with third-party processors | P2 | Services sharing data | Ask |
Remediation: See references/remediation-privacy.md
Efficiency & Frugality
| Item | Priority | Applies To | Detection Method |
|---|---|---|---|
| Resource-efficient implementation | P2 | All | Code review |
| Cost scaling model documented | P2 | All | Ask |
| Spend caps / usage alerts configured | P2 | All | Ask |
| FinOps review completed | P3 | All | Ask |
Remediation: See references/remediation-efficiency.md
Priority Definitions
| Priority | Meaning | Action Required |
|---|---|---|
| P0 | Launch blocker | Must fix before production |
| P1 | High priority | Fix within current sprint |
| P2 | Medium priority | Fix within quarter |
| P3 | Nice to have | Address when convenient |
Status Indicators
- ✅ Pass - Item verified as compliant
- ❌ Fail - Item not compliant, needs remediation
- ⚠️ Partial - Partially compliant, improvements needed
- ➖ N/A - Not applicable to this service type