backup-restore-runbook-generator

Backup/Restore Runbook Generator

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "backup-restore-runbook-generator" with this command: npx skills add patricio0312rev/skills/patricio0312rev-skills-backup-restore-runbook-generator

Backup/Restore Runbook Generator

Create reliable disaster recovery procedures for your databases.

Backup Strategy

Database Backup Strategy

Backup Types

1. Full Backup (Daily)

  • When: 2:00 AM UTC
  • Retention: 30 days
  • Storage: S3 s3://backups/full/
  • Size: ~50 GB
  • Duration: ~45 minutes

2. Incremental Backup (Hourly)

  • When: Every hour
  • Retention: 7 days
  • Storage: S3 s3://backups/incremental/
  • Size: ~500 MB
  • Duration: ~5 minutes

3. Transaction Log Backup (Every 15 min)

  • When: Every 15 minutes
  • Retention: 3 days
  • Storage: S3 s3://backups/wal/
  • Point-in-time recovery capability

Backup Automation

PostgreSQL

#!/bin/bash
# scripts/backup-postgres.sh

set -e

# Configuration
DB_NAME="production"
DB_USER="postgres"
DB_HOST="postgres.example.com"
BACKUP_DIR="/var/backups/postgres"
S3_BUCKET="s3://my-backups/postgres"
DATE=$(date +%Y%m%d_%H%M%S)
FILENAME="${DB_NAME}_${DATE}.sql.gz"

# Create backup directory
mkdir -p $BACKUP_DIR

echo "🔄 Starting backup: $FILENAME"

# Full backup with pg_dump
pg_dump \
  --host=$DB_HOST \
  --username=$DB_USER \
  --dbname=$DB_NAME \
  --format=custom \
  --compress=9 \
  --file=$BACKUP_DIR/$FILENAME \
  --verbose

# Verify backup
if [ -f "$BACKUP_DIR/$FILENAME" ]; then
  SIZE=$(du -h "$BACKUP_DIR/$FILENAME" | cut -f1)
  echo "✅ Backup created: $SIZE"
else
  echo "❌ Backup failed"
  exit 1
fi

# Upload to S3
echo "📤 Uploading to S3..."
aws s3 cp $BACKUP_DIR/$FILENAME $S3_BUCKET/ \
  --storage-class STANDARD_IA

# Verify upload
if aws s3 ls $S3_BUCKET/$FILENAME; then
  echo "✅ Uploaded to S3"
else
  echo "❌ S3 upload failed"
  exit 1
fi

# Cleanup old local backups (keep last 7 days)
find $BACKUP_DIR -type f -name "*.sql.gz" -mtime +7 -delete
echo "🗑️  Cleaned up old local backups"

# Send notification
curl -X POST $SLACK_WEBHOOK \
  -H 'Content-Type: application/json' \
  -d "{\"text\": \"✅ Database backup complete: $FILENAME ($SIZE)\"}"

echo "✅ Backup complete!"

MySQL

#!/bin/bash

scripts/backup-mysql.sh

set -e

DB_NAME="production" DB_USER="root" DB_PASSWORD=$MYSQL_PASSWORD DATE=$(date +%Y%m%d_%H%M%S) FILENAME="${DB_NAME}_${DATE}.sql.gz"

echo "🔄 Starting MySQL backup..."

Backup with mysqldump

mysqldump
--user=$DB_USER
--password=$DB_PASSWORD
--single-transaction
--quick
--lock-tables=false
--databases $DB_NAME
| gzip > /var/backups/mysql/$FILENAME

Upload to S3

aws s3 cp /var/backups/mysql/$FILENAME s3://my-backups/mysql/

echo "✅ Backup complete!"

Restore Procedures

Full Restore

#!/bin/bash

scripts/restore-postgres.sh

set -e

BACKUP_FILE=$1 RESTORE_DB="production_restored"

if [ -z "$BACKUP_FILE" ]; then echo "Usage: ./restore-postgres.sh <backup-file>" exit 1 fi

echo "🔄 Starting restore from: $BACKUP_FILE"

1. Download from S3

echo "📥 Downloading backup..." aws s3 cp s3://my-backups/postgres/$BACKUP_FILE /tmp/

2. Create new database

echo "🗄️ Creating database..." psql -h $DB_HOST -U postgres -c "CREATE DATABASE $RESTORE_DB;"

3. Restore backup

echo "🔄 Restoring data..." pg_restore
--host=$DB_HOST
--username=postgres
--dbname=$RESTORE_DB
--verbose
/tmp/$BACKUP_FILE

4. Verify restore

echo "✅ Verifying restore..." TABLE_COUNT=$(psql -h $DB_HOST -U postgres -d $RESTORE_DB -t -c "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='public';") echo " Tables restored: $TABLE_COUNT"

ROW_COUNT=$(psql -h $DB_HOST -U postgres -d $RESTORE_DB -t -c "SELECT COUNT(*) FROM users;") echo " User rows: $ROW_COUNT"

echo "✅ Restore complete!" echo " Database: $RESTORE_DB" echo " To use: UPDATE application config to point to $RESTORE_DB"

Point-in-Time Recovery (PITR)

#!/bin/bash

scripts/pitr-restore.sh

TARGET_TIME=$1 # Format: 2024-01-15 14:30:00

echo "🔄 Point-in-Time Restore to: $TARGET_TIME"

1. Restore base backup

echo "📦 Restoring base backup..." pg_basebackup -D /var/lib/postgresql/data -X stream

2. Configure recovery

cat > /var/lib/postgresql/data/recovery.conf << EOF restore_command = 'aws s3 cp s3://my-backups/wal/%f %p' recovery_target_time = '$TARGET_TIME' recovery_target_action = 'promote' EOF

3. Start PostgreSQL

echo "🚀 Starting PostgreSQL in recovery mode..." systemctl start postgresql

4. Wait for recovery

while ! pg_isready; do echo " Waiting for recovery..." sleep 5 done

echo "✅ PITR complete!"

Validation Checks

#!/bin/bash

scripts/validate-restore.sh

DB=$1

echo "🔍 Validating restore..."

1. Check table count

TABLES=$(psql -d $DB -t -c "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='public';") echo "Tables: $TABLES"

if [ "$TABLES" -lt 10 ]; then echo "❌ Too few tables restored" exit 1 fi

2. Check row counts

for table in users products orders; do ROWS=$(psql -d $DB -t -c "SELECT COUNT(*) FROM $table;") echo " $table: $ROWS rows"

if [ "$ROWS" -lt 1 ]; then echo "❌ Table $table is empty" exit 1 fi done

3. Check constraints

CONSTRAINTS=$(psql -d $DB -t -c "SELECT COUNT(*) FROM information_schema.table_constraints WHERE constraint_type='FOREIGN KEY';") echo "Foreign keys: $CONSTRAINTS"

4. Check indexes

INDEXES=$(psql -d $DB -t -c "SELECT COUNT(*) FROM pg_indexes WHERE schemaname='public';") echo "Indexes: $INDEXES"

5. Test query performance

START=$(date +%s%N) psql -d $DB -c "SELECT COUNT(*) FROM users WHERE email LIKE '%@example.com%';" > /dev/null END=$(date +%s%N) DURATION=$(( (END - START) / 1000000 )) echo "Query performance: ${DURATION}ms"

if [ "$DURATION" -gt 1000 ]; then echo "⚠️ Slow query - missing indexes?" fi

echo "✅ Validation complete!"

Disaster Recovery Runbook

Disaster Recovery Runbook

Incident Response

1. Assess Situation (5 minutes)

  • Identify incident severity (P0/P1/P2)
  • Determine data loss window
  • Notify stakeholders

Contacts:

  • DBA On-Call: [phone]
  • Engineering Lead: [phone]
  • CTO: [phone]

2. Stop the Bleeding (10 minutes)

  • Enable maintenance mode
  • Stop writes to corrupted database
  • Preserve evidence (logs, backups)
# Enable maintenance mode
kubectl scale deployment/api --replicas=0
  1. Identify Recovery Point (15 minutes)
  • Determine last good backup

  • Check backup integrity

  • Calculate data loss

List available backups

aws s3 ls s3://my-backups/postgres/ | tail -20

Check backup size

aws s3 ls s3://my-backups/postgres/production_20240115_020000.sql.gz --human-readable

  1. Prepare Recovery Environment (30 minutes)
  • Spin up new database instance

  • Configure networking

  • Test connectivity

Create RDS instance

aws rds create-db-instance
--db-instance-identifier production-recovery
--db-instance-class db.r6g.xlarge
--engine postgres
--master-username postgres
--master-user-password [secure-password]

  1. Execute Restore (1-2 hours)
  • Download backup from S3

  • Run restore script

  • Apply transaction logs (if PITR)

  • Verify data integrity

Run restore

./scripts/restore-postgres.sh production_20240115_020000.sql.gz

Validate

./scripts/validate-restore.sh production_restored

  1. Validate and Test (30 minutes)
  • Run validation scripts

  • Test critical queries

  • Verify row counts

  • Check data consistency

  1. Cutover (15 minutes)
  • Update application config

  • Point DNS to new database

  • Disable maintenance mode

  • Monitor for errors

Update connection string

kubectl set env deployment/api DATABASE_URL=postgresql://...

Scale up

kubectl scale deployment/api --replicas=3

  1. Post-Recovery (1 hour)
  • Monitor system health

  • Verify user reports

  • Document incident

  • Schedule postmortem

Recovery Time Objective (RTO)

Scenario Target Actual

Full restore 2 hours [measured]

PITR restore 3 hours [measured]

Region failover 15 minutes [measured]

Recovery Point Objective (RPO)

Backup Type Data Loss Window

Full backup 24 hours

Incremental 1 hour

Transaction logs 15 minutes

Automated Backup Monitoring

// scripts/monitor-backups.ts
import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3';

const s3 = new S3Client({ region: 'us-east-1' });

async function checkBackupHealth() {
  const bucket = 'my-backups';
  const prefix = 'postgres/';

  // List recent backups
  const command = new ListObjectsV2Command({
    Bucket: bucket,
    Prefix: prefix,
    MaxKeys: 10,
  });

  const response = await s3.send(command);
  const backups = response.Contents || [];

  // Check last backup age
  const latestBackup = backups[0];
  const age = Date.now() - new Date(latestBackup.LastModified!).getTime();
  const ageHours = age / (1000 * 60 * 60);

  if (ageHours > 25) {
    console.error('❌ No backup in last 24 hours!');
    // Send alert
    await sendSlackAlert('No recent database backup!');
    process.exit(1);
  }

  // Check backup size
  const size = latestBackup.Size! / (1024 * 1024 * 1024); // GB
  if (size &#x3C; 10) {
    console.error('⚠️  Backup size suspiciously small');
  }

  console.log('✅ Backup health check passed');
  console.log(`  Latest: ${latestBackup.Key}`);
  console.log(`  Age: ${ageHours.toFixed(1)} hours`);
  console.log(`  Size: ${size.toFixed(2)} GB`);
}

checkBackupHealth();

Role Assignments

## DR Team Roles

### Database Administrator (Primary)

- Execute restore procedures
- Verify data integrity
- Monitor recovery progress

### Engineering Lead

- Coordinate response
- Communicate with stakeholders
- Make cutover decisions

### DevOps Engineer

- Provision infrastructure
- Update application configs
- Monitor system health

### Product Manager

- Assess business impact
- Prioritize recovery
- Customer communication

## Escalation Path

1. DBA on-call →
2. Engineering Lead →
3. CTO →
4. CEO (P0 incidents only)

Best Practices

- Test restores regularly: Quarterly DR drills

- Automate backups: Never rely on manual processes

- Multiple locations: Cross-region backup storage

- Monitor backup health: Alert on failures

- Document procedures: Keep runbook updated

- Encrypt backups: Protect sensitive data

- Version control: Track backup script changes

Output Checklist

-  Backup automation scripts

-  Restore procedures documented

-  Validation checks defined

-  PITR procedure (if applicable)

-  DR runbook created

-  Role assignments documented

-  RTO/RPO defined

-  Backup monitoring configured

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

framer-motion-animator

No summary provided by upstream source.

Repository SourceNeeds Review
General

eslint-prettier-config

No summary provided by upstream source.

Repository SourceNeeds Review
General

postman-collection-generator

No summary provided by upstream source.

Repository SourceNeeds Review
General

nginx-config-optimizer

No summary provided by upstream source.

Repository SourceNeeds Review