deployment-runbook

This skill provides deployment procedures, automated health checks, and rollback strategies to ensure safe, reliable deployments. Use it to standardize deployment workflows and reduce deployment-related incidents.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "deployment-runbook" with this command: npx skills add pfangueiro/claude-code-agents/pfangueiro-claude-code-agents-deployment-runbook

Deployment Runbook

Overview

This skill provides deployment procedures, automated health checks, and rollback strategies to ensure safe, reliable deployments. Use it to standardize deployment workflows and reduce deployment-related incidents.

When to Use This Skill

  • Planning production deployments

  • Executing staged rollouts (canary, blue-green)

  • Performing post-deployment health checks

  • Rolling back failed deployments

  • Troubleshooting deployment issues

  • Establishing deployment best practices

  • Complementing the devops-automation agent for deployments

Pre-Deployment Checklist

Before any production deployment:

  • Code Review: All changes reviewed and approved

  • Tests Pass: CI/CD pipeline green

  • Unit tests: ✓

  • Integration tests: ✓

  • E2E tests: ✓

  • Database Migrations: Tested in staging

  • Backward compatible

  • Rollback script prepared

  • Configuration: Environment variables verified

  • Secrets rotated if needed

  • Feature flags configured

  • Monitoring: Dashboards and alerts ready

  • Error tracking enabled

  • Performance monitoring active

  • Log aggregation configured

  • Communication: Stakeholders notified

  • Deployment window announced

  • On-call engineer assigned

  • Rollback plan documented

  • Backups: Recent backup verified

  • Database backed up < 1 hour ago

  • Backup restoration tested

  • Capacity: Resources scaled appropriately

  • Auto-scaling configured

  • Rate limits reviewed

  • CDN cache warmed

Deployment Strategies

  1. Blue-Green Deployment

Best for: Zero-downtime deployments, easy rollbacks

Process:

  • Deploy to inactive (green) environment

  • Run health checks on green

  • Switch traffic from blue to green

  • Monitor for issues

  • Keep blue as instant rollback option

Commands:

Deploy to green environment

./deploy.sh --env green

Run health checks

python3 scripts/health_check.py --env green

Switch traffic (gradual)

./switch_traffic.sh --from blue --to green --percentage 10 ./switch_traffic.sh --from blue --to green --percentage 50 ./switch_traffic.sh --from blue --to green --percentage 100

If issues: instant rollback

./switch_traffic.sh --from green --to blue --percentage 100

  1. Canary Deployment

Best for: Risk-averse deployments, gradual rollouts

Process:

  • Deploy to small subset of servers (5-10%)

  • Monitor metrics closely

  • Gradually increase percentage

  • Roll back if metrics degrade

Monitoring During Canary:

  • Error rate < baseline + 1%

  • Response time < baseline + 10%

  • Success rate > 99.9%

  1. Rolling Deployment

Best for: Standard updates, resource-constrained environments

Process:

  • Take one instance out of load balancer

  • Deploy new version

  • Run health checks

  • Add back to load balancer

  • Repeat for remaining instances

Deployment Workflow

Phase 1: Pre-Deployment (T-30 minutes)

1. Verify staging environment

./verify_staging.sh

2. Create deployment tag

git tag -a v1.2.3 -m "Release 1.2.3" git push origin v1.2.3

3. Trigger production build

./build_production.sh --tag v1.2.3

4. Backup database

./backup_db.sh --environment production

5. Notify team

./notify_slack.sh "🚀 Starting deployment v1.2.3 in 30 minutes"

Phase 2: Deployment (T-0)

1. Enable maintenance mode (if needed)

./maintenance_mode.sh --enable

2. Run database migrations

./run_migrations.sh --environment production

3. Deploy application

./deploy.sh --environment production --version v1.2.3

4. Disable maintenance mode

./maintenance_mode.sh --disable

Phase 3: Post-Deployment Health Checks

Run comprehensive health checks

python3 scripts/health_check.py --environment production

Expected output:

✓ API health endpoint responding

✓ Database connectivity OK

✓ Cache layer accessible

✓ External services reachable

✓ Error rate within threshold

✓ Response time within SLA

Phase 4: Monitoring (T+30 minutes)

Monitor these metrics:

Application Metrics:

  • Error rate: < 0.1%

  • Response time (p95): < 200ms

  • Request throughput: within expected range

  • Success rate: > 99.9%

Infrastructure Metrics:

  • CPU utilization: < 70%

  • Memory usage: < 80%

  • Disk I/O: normal patterns

  • Network latency: < 50ms

Business Metrics:

  • Conversion rate: no significant drop

  • User signups: within expected range

  • Transaction volume: normal patterns

Rollback Procedures

When to Rollback

Rollback immediately if:

  • Error rate > 1%

  • Critical functionality broken

  • Data corruption detected

  • Security vulnerability introduced

  • Performance degradation > 50%

Rollback Methods

Method 1: Traffic Switch (Fastest)

Blue-green: instant rollback

./switch_traffic.sh --from green --to blue --percentage 100

Verification

python3 scripts/health_check.py --environment production

Method 2: Version Revert

Deploy previous version

./deploy.sh --environment production --version v1.2.2

Run health checks

python3 scripts/health_check.py --environment production

Method 3: Database Rollback

If migrations were applied

./rollback_migration.sh --environment production --steps 1

Restore from backup (last resort)

./restore_db.sh --backup latest --environment production

Post-Rollback

Verify system health

python3 scripts/health_check.py --environment production

Notify stakeholders

./notify_slack.sh "⚠️ Deployment v1.2.3 rolled back. System stable on v1.2.2"

Create postmortem

  • What went wrong?

  • Why didn't we catch it?

  • How do we prevent recurrence?

Health Check Script

Use the included health check script:

Run all checks

python3 scripts/health_check.py --env production

Run specific check

python3 scripts/health_check.py --env production --check api

Verbose output

python3 scripts/health_check.py --env production --verbose

See scripts/health_check.py for implementation.

Troubleshooting Guide

Issue: Deployment Hangs

Symptoms:

  • Deployment script doesn't complete

  • Services not starting

Diagnosis:

Check service logs

kubectl logs -f deployment/app-name

Check events

kubectl get events --sort-by='.lastTimestamp'

Resolution:

  • Increase timeout values

  • Check resource constraints

  • Verify image pull secrets

Issue: High Error Rate Post-Deployment

Symptoms:

  • Error rate spike

  • 500 errors in logs

Diagnosis:

Check application logs

tail -f /var/log/app/error.log

Check error distribution

grep "ERROR" /var/log/app/* | awk '{print $NF}' | sort | uniq -c | sort -nr

Resolution:

  • Check configuration changes

  • Verify environment variables

  • Review recent code changes

  • Consider immediate rollback

Issue: Database Connection Failures

Symptoms:

  • "Connection refused" errors

  • Timeout errors

Diagnosis:

Test database connectivity

python3 scripts/test_db_connection.py

Check connection pool

psql -h db-host -U user -c "SELECT * FROM pg_stat_activity;"

Resolution:

  • Verify connection strings

  • Check firewall rules

  • Increase connection pool size

  • Verify credentials

Communication Templates

Pre-Deployment Announcement

🚀 Production Deployment Scheduled

Version: v1.2.3 Time: 2024-01-15 14:00 UTC (30 minutes) Duration: ~15 minutes Impact: No expected downtime

Changes:

  • Feature: New user dashboard
  • Fix: Payment processing bug
  • Performance: API response time improvements

Rollback Plan: Blue-green switch (instant) On-Call: @engineer-name

Deployment Success

Deployment Complete

Version: v1.2.3 Status: Successful Duration: 12 minutes

Health Checks: All passing ✓ Metrics: Within normal range Next Check: T+30 minutes

Monitoring dashboard: [link]

Deployment Rollback

⚠️ Deployment Rolled Back

Version: v1.2.3 → v1.2.2 (rollback) Reason: Elevated error rate (2.1%) Status: System stable on v1.2.2

Action Items:

  • Root cause analysis
  • Fix identified issue
  • Re-test in staging
  • Schedule re-deployment

Incident report: [link]

Resources

scripts/

  • health_check.py: Comprehensive deployment health checks

  • test_db_connection.py: Database connectivity verification

references/

  • deployment-checklist.md: Detailed pre/post deployment checklist

  • monitoring-guide.md: Metrics to monitor during deployments

Best Practices

  • Always deploy during low-traffic windows

  • Never deploy on Fridays (unless critical hotfix)

  • Keep deployments small (< 200 lines changed)

  • Monitor for 30+ minutes post-deployment

  • Document every rollback with postmortem

  • Test rollback procedure in staging first

  • Use feature flags for risky changes

  • Automate health checks (don't rely on manual verification)

Quick Reference

Emergency Rollback:

./switch_traffic.sh --from green --to blue --percentage 100

Health Check:

python3 scripts/health_check.py --env production

View Logs:

kubectl logs -f deployment/app-name --tail=100

Check Metrics:

curl https://metrics.example.com/api/health

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Coding

deep-read

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

git-workflow

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

ci-cd-templates

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

skill-creator

No summary provided by upstream source.

Repository SourceNeeds Review