Operational Readiness Checklist

Comprehensive checklist for validating services before production launch. It analyzes the codebase and asks interactive questions about items that cannot be detected from code.

Workflow Overview

1. Gather context - Identify service type, tech stack, and traffic expectations
2. Analyze codebase - Scan for CI/CD configs, infrastructure code, security patterns
3. Interactive verification - Ask about items that cannot be detected from code
4. Generate report - Produce checklist report with priorities and remediation guidance

Step 1: Gather Context

Ask the user these questions using AskUserQuestion:

Service Classification:

Service type: Backend API, Frontend/Web App, Infrastructure/Platform, or Hybrid
Expected traffic: <100 req/min (low), 100-1000 req/min (medium), >1000 req/min (high)
Data handling: Stores user data (yes/no), Processes PII (yes/no)
Public-facing: Yes/No
Has email functionality: Yes/No
Uses database: Yes/No (if yes, which: PostgreSQL, Supabase, DynamoDB, etc.)
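
The traffic tiers above can be turned into a small classifier. This is a minimal sketch: the function name `traffic_tier` is hypothetical, and it assumes both ends of the medium range (100 and 1000 req/min) count as medium, as the ranges above imply.

```python
def traffic_tier(requests_per_minute: float) -> str:
    """Map expected traffic (req/min) to the tier names used in Step 1."""
    if requests_per_minute < 0:
        raise ValueError("requests_per_minute must be non-negative")
    if requests_per_minute < 100:
        return "low"
    if requests_per_minute <= 1000:
        return "medium"
    return "high"
```

Keeping the tier as a plain string makes it easy to compare against the "Applies To" column of the checklist tables later on.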

Tech Stack Detection (auto-detect from files):

Cargo.toml → Rust service
package.json → Node.js/TypeScript
*.tf or *.tfvars → Terraform
cdk.json or *.cdk.ts → AWS CDK
.github/workflows/*.yml → GitHub Actions CI/CD
next.config.js → Next.js frontend
Dockerfile → Containerized service
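
The file-to-stack mapping above could be automated roughly as follows. This is a sketch, not the skill's actual implementation: `STACK_MARKERS` and `detect_stack` are hypothetical names, and searching bare file names recursively (to cover monorepos) is an assumption.

```python
from pathlib import Path

# Marker patterns and labels copied from the list above.
STACK_MARKERS = {
    "Cargo.toml": "Rust service",
    "package.json": "Node.js/TypeScript",
    "*.tf": "Terraform",
    "*.tfvars": "Terraform",
    "cdk.json": "AWS CDK",
    "*.cdk.ts": "AWS CDK",
    ".github/workflows/*.yml": "GitHub Actions CI/CD",
    "next.config.js": "Next.js frontend",
    "Dockerfile": "Containerized service",
}


def detect_stack(repo_root: str) -> list[str]:
    """Return the sorted, de-duplicated labels whose marker files exist."""
    root = Path(repo_root)
    found = set()
    for pattern, label in STACK_MARKERS.items():
        # Patterns with a directory part are matched from the repo root;
        # bare file names are searched recursively (assumption).
        matches = root.glob(pattern) if "/" in pattern else root.rglob(pattern)
        if next(iter(matches), None) is not None:
            found.add(label)
    return sorted(found)
```

A call such as `detect_stack(".")` returns a list of labels that can be placed in the report's Tech Stack field. A recursive search will also descend into vendored directories, so a real scan would probably exclude those.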

Step 2: Codebase Analysis

Analyze the codebase for evidence of checklist items. Use Glob and Grep to find:

CI/CD Detection:

.github/workflows/*.yml - GitHub Actions
Cargo.toml with a [profile.release] section - Rust build config
jest.config.* / vitest.config.* - Test configuration
*.tf - Terraform files
cdk.json - CDK configuration

Security Detection:

**/security*.yml - Security scanning workflows
dependabot.yml - Dependency updates
CODEOWNERS - Code ownership
*.lock files or package-lock.json - Dependency locking

Observability Detection:

**/tracing*.rs or opentelemetry* - Distributed tracing
sentry.* or @sentry/* - Error tracking
prometheus* or metrics* - Metrics collection
**/logging*.* or log4* or tracing* - Logging config

Infrastructure Detection:

**/autoscaling* in .tf files - Autoscaling config
**/secretsmanager* or **/ssm* - Secrets management
health* endpoints in code - Health checks
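
Several of the signals above live in file contents rather than file names. The sketch below shows one way to grep for them. The signal names, regexes, and skipped directories are illustrative assumptions, not a definitive rule set, and a match is only a hint that still needs review.

```python
import re
from pathlib import Path

# Hypothetical signal table: evidence name -> regex searched in file contents.
CONTENT_SIGNALS = {
    "tracing": re.compile(r"opentelemetry|tracing", re.IGNORECASE),
    "error_tracking": re.compile(r"@sentry/|sentry\."),
    "metrics": re.compile(r"prometheus|metrics", re.IGNORECASE),
    "health_check": re.compile(r"/health"),
    "secrets_management": re.compile(r"secretsmanager|aws_ssm", re.IGNORECASE),
}

# Assumption: dependency and build output directories are not evidence.
SKIP_DIRS = {".git", "node_modules", "target"}


def scan_contents(repo_root: str) -> dict[str, list[str]]:
    """Return, per signal, the relative paths of files whose text matches."""
    root = Path(repo_root)
    evidence: dict[str, list[str]] = {name: [] for name in CONTENT_SIGNALS}
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        if not path.is_file() or SKIP_DIRS & set(relative.parts):
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # binary or unreadable file
        for name, pattern in CONTENT_SIGNALS.items():
            if pattern.search(text):
                evidence[name].append(relative.as_posix())
    return evidence
```

An empty list for a signal means no evidence was found in code, so the corresponding item is a candidate for an interactive question in Step 3.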

Step 3: Interactive Verification

For items that cannot be detected from code, ask yes/no questions. Group questions by category to avoid overwhelming the user.
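
Grouping can be as simple as bucketing every item whose Detection Method is "Ask" under its category heading. This is a minimal sketch: `ChecklistItem`, `group_questions`, and the question wording are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    category: str
    name: str
    detection: str  # value of the "Detection Method" column, e.g. "Ask"


def group_questions(items: list[ChecklistItem]) -> dict[str, list[str]]:
    """Collect yes/no questions for "Ask" items, keyed by category."""
    grouped: dict[str, list[str]] = {}
    for item in items:
        if item.detection != "Ask":
            continue  # detectable from code; handled in Step 2
        grouped.setdefault(item.category, []).append(
            f"Is this in place: {item.name}? (yes/no)"
        )
    return grouped
```

Because dictionaries preserve insertion order, categories come out in the order they first appear, so each batch of questions can be presented one category at a time.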

Step 4: Generate Report

Output format:

# Operational Readiness Report: [Service Name]

**Service Type:** [Backend API / Frontend / Infrastructure / Hybrid]
**Tech Stack:** [Detected stack]
**Generated:** [Date]

## Summary
- **Overall Readiness:** [X/Y items passing] ([Z%])
- **Launch Blockers (P0):** [count]
- **High Priority (P1):** [count]
- **Medium Priority (P2):** [count]
- **Low Priority (P3):** [count]

## Observability
| Item | Status | Priority | Notes |
|------|--------|----------|-------|
| ... | ✅/❌/⚠️/➖ | P0-P3 | ... |

[Repeat for each category]

## Remediation Summary
[List failing items with links to remediation guidance]
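
The Summary section of the template can be computed from the per-item results. The sketch below rests on two assumptions the template does not spell out: N/A items are excluded from the X/Y ratio, and the per-priority counts cover items that are not passing (fail or partial). `Result` and `render_summary` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    item: str
    status: str    # "pass", "fail", "partial", or "na"
    priority: str  # "P0", "P1", "P2", or "P3"


LABELS = {
    "P0": "Launch Blockers (P0)",
    "P1": "High Priority (P1)",
    "P2": "Medium Priority (P2)",
    "P3": "Low Priority (P3)",
}


def render_summary(results: list[Result]) -> str:
    """Render the "## Summary" block of the report template."""
    # Assumption: N/A items are left out of the X/Y ratio.
    applicable = [r for r in results if r.status != "na"]
    passing = sum(r.status == "pass" for r in applicable)
    percent = round(100 * passing / len(applicable)) if applicable else 0
    # Assumption: per-priority counts cover items that are not passing.
    open_counts = Counter(r.priority for r in applicable if r.status != "pass")
    lines = [
        "## Summary",
        f"- **Overall Readiness:** {passing}/{len(applicable)} items passing ({percent}%)",
    ]
    lines += [f"- **{label}:** {open_counts[p]}" for p, label in LABELS.items()]
    return "\n".join(lines)
```

Counting partial items as open keeps the summary conservative. A team that treats partial compliance as acceptable for launch would adjust that filter.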

Checklist Items by Category

Observability (O11Y)

Item	Priority	Applies To	Detection Method
Alarmable top-level metric OR Canary (OpsGenie integrated)	P0	Medium/high traffic (≥100 req/min)	Ask
Canary coverage (if <100 req/min)	P0	Low traffic	Ask
DB/Queue monitoring (CPU/Disk/Memory)	P1	Services with DB/Queue	Ask
Logging configured and viewable	P1	All	Grep for logging config
Audit/security log retention (min 1 year for SOC 2 Type 2)	P1	All	Ask
Distributed tracing (OpenTelemetry/Jaeger)	P2	Backend services	Grep for otel/tracing
Sentry instrumentation	P1	Frontend only	Grep for @sentry
status.reown.com integration	P3	Public-facing	Ask

Note on log retention scope: The 1-year minimum retention applies specifically to audit/security event logs — authentication attempts, authorization decisions, admin actions, data access events, and configuration changes. General application logs and error tracking (e.g. Sentry) are not subject to this requirement. This aligns with SOC 2 Type 2 audit trail requirements.

Remediation: See references/remediation-o11y.md

CI/CD & Testing

Item	Priority	Applies To	Detection Method
CI runs unit/functional tests (>80% critical path coverage)	P0	All	Check workflow files
CD runs integration/e2e tests	P1	All	Check workflow files
Load testing performed	P1	High traffic / user-facing	Ask
Rollback procedure documented and tested	P1	All	Ask
Post-deploy health checks	P2	All	Check workflow files

Remediation: See references/remediation-cicd.md

Primitives (Infrastructure)

Item	Priority	Applies To	Detection Method
Runbook documented (failure modes, troubleshooting, escalation)	P0	All	Ask
Infrastructure as code (Terraform/CDK)	P0	All	Check for .tf or cdk files
Autoscaling configured	P1	Backend services	Grep .tf for autoscaling
Healthcheck endpoint (memory, filesystem, dependencies)	P1	All	Grep for /health endpoint
Multi-AZ deployment (2+ pods/instances)	P1	All	Ask
Secrets management (AWS SM, Vault) - no secrets in code	P0	All	Grep for hardcoded secrets, check .tf
Configuration management (env separation)	P2	All	Check for env-specific configs
Data Lake integration	P3	Analytics needs	Ask

Remediation: See references/remediation-primitives.md

Security

Item	Priority	Applies To	Detection Method
OWASP Top 10 2025 validation	P0	All	Ask
Secure design review (threat modeling)	P1	All	Ask
Dependency scanning enabled + SBOM	P1	All	Check for dependabot, snyk
Software/data integrity (code signing, CI/CD security)	P2	All	Ask
Fail-secure exception handling	P1	All	Code review
Service-to-service auth (mTLS, JWT, API keys)	P1	Backend with internal APIs	Ask
Clickjacking headers (X-Frame-Options, CSP frame-ancestors)	P1	Frontend only	Grep for security headers
SPF records	P2	Services with email	Ask
DKIM records	P2	Services with email	Ask
RLS policies (Supabase/DB)	P0	Services with Supabase	Ask
Rate limiting	P1	Public APIs	Grep for rate limit config
DDoS protection (Cloudflare/AWS Shield)	P1	Public-facing	Ask
API authentication	P1	Public APIs	Grep for auth middleware
Audit logging (auth, admin, data access)	P2	All	Grep for audit log

Remediation: See references/remediation-security.md

3rd Party Services

Item	Priority	Applies To	Detection Method
Metrics integration for 3rd parties	P2	Services using 3rd parties	Ask
Status page integration (Slack channel minimum)	P2	Services using 3rd parties	Ask
RPC rate limits configured	P1	Services using RPCs	Ask

Remediation: See references/remediation-dependencies.md

Service Dependencies

Item	Priority	Applies To	Detection Method
Upstream dependencies documented	P1	All	Ask
Downstream dependencies documented	P1	All	Ask
Dependency health in service health endpoint	P2	All	Code review
Fallback behavior for non-critical deps	P2	All	Ask

Remediation: See references/remediation-dependencies.md

Data Retention & Privacy

Item	Priority	Applies To	Detection Method
Data retention policy defined	P1	Services with persistent data	Ask
GDPR: Personal data identified	P1	Services handling user data	Ask
GDPR: DSAR process defined	P1	Services handling user data	Ask
GDPR: Right to be forgotten process	P1	Services handling user data	Ask
Privacy policy updated	P2	User-facing services	Ask
DPAs with third-party processors	P2	Services sharing data	Ask

Remediation: See references/remediation-privacy.md

Efficiency & Frugality

Item	Priority	Applies To	Detection Method
Resource-efficient implementation	P2	All	Code review
Cost scaling model documented	P2	All	Ask
Spend caps / usage alerts configured	P2	All	Ask
FinOps review completed	P3	All	Ask

Remediation: See references/remediation-efficiency.md

Priority Definitions

Priority	Meaning	Action Required
P0	Launch blocker	Must fix before production
P1	High priority	Fix within current sprint
P2	Medium priority	Fix within quarter
P3	Nice to have	Address when convenient
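
The priority table above maps directly onto a remediation ordering and a launch gate. This is a minimal sketch: the function names are hypothetical, and treating every non-passing P0 item as a blocker is an assumption about how partial results should be handled.

```python
# Required actions copied from the Priority Definitions table.
ACTIONS = {
    "P0": "Must fix before production",
    "P1": "Fix within current sprint",
    "P2": "Fix within quarter",
    "P3": "Address when convenient",
}


def remediation_plan(open_items: list[tuple[str, str]]) -> list[str]:
    """Order (item name, priority) pairs by priority and attach the action."""
    # "P0" < "P1" < "P2" < "P3" sorts correctly as plain strings, and
    # sorted() is stable, so items keep their order within a priority.
    ordered = sorted(open_items, key=lambda pair: pair[1])
    return [f"{priority} {name}: {ACTIONS[priority]}" for name, priority in ordered]


def is_launch_blocked(open_items: list[tuple[str, str]]) -> bool:
    """True if any open item is a P0 launch blocker."""
    return any(priority == "P0" for _, priority in open_items)
```

For example, passing `[("Rate limiting", "P1"), ("Runbook documented", "P0")]` lists the runbook first and reports the launch as blocked.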

Status Indicators

✅ Pass - Item verified as compliant
❌ Fail - Item not compliant, needs remediation
⚠️ Partial - Partially compliant, improvements needed
➖ N/A - Not applicable to this service type