runbook-generator

Expert in creating comprehensive, standardized runbooks for operational procedures, incident response, and system maintenance tasks.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "runbook-generator" with this command: npx skills add dengineproblem/agents-monorepo/dengineproblem-agents-monorepo-runbook-generator

Runbook Structure

runbook_template:
  metadata:
    title: "Runbook title"
    version: "1.0"
    last_updated: "2024-01-15"
    owner: "Team/Person"
    reviewers: ["Name 1", "Name 2"]

  overview:
    purpose: "What this runbook accomplishes"
    scope: "Systems/services affected"
    audience: "Who should use this"

  prerequisites:
    access:
      - "AWS Console access"
      - "SSH key for production servers"
      - "Database credentials"
    tools:
      - "kubectl configured"
      - "AWS CLI installed"
      - "jq for JSON parsing"
    knowledge:
      - "Basic Kubernetes concepts"
      - "Understanding of service architecture"

  execution:
    estimated_time: "15-30 minutes"
    risk_level: "Medium"
    requires_change_ticket: true
    requires_approval: true
    can_be_automated: true

  steps: []  # Detailed steps below

  verification: []  # How to confirm success

  rollback: []  # How to undo changes

  troubleshooting: []  # Common issues

  contacts:
    primary_oncall: "PagerDuty"
    escalation: "Engineering Manager"
    subject_experts: ["DBA Team", "Platform Team"]

Standard Runbook Template

[Runbook Title]

Version: 1.0
Last Updated: YYYY-MM-DD
Owner: Team Name
Risk Level: Low | Medium | High | Critical

Overview

Purpose

Brief description of what this runbook accomplishes.

When to Use

  • Trigger condition 1
  • Trigger condition 2
  • Alert: "Alert Name" fires

Scope

Systems and services affected:

  • Service A
  • Database B
  • External dependency C

Prerequisites

Required Access

  • Production AWS Console
  • Kubernetes cluster access
  • Database read/write permissions

Required Tools

# Verify kubectl
kubectl version --client

# Verify AWS CLI
aws sts get-caller-identity

# Verify database connectivity
psql -h "$DB_HOST" -U "$DB_USER" -c "SELECT 1"
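
The tool checks above can be gathered into a single preflight step that fails fast when something is missing. The following is a minimal sketch in POSIX shell; `check_tools` is a hypothetical helper name, not part of the original template, and it only confirms that each tool is on `PATH`, not that it is configured or authenticated.

```shell
# check_tools TOOL...
# Reports every tool that cannot be found on PATH.
# Returns 0 when all tools are present, 1 otherwise.
# (check_tools is a hypothetical helper introduced for illustration.)
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "MISSING: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: stop before the procedure starts if a tool is absent.
# check_tools kubectl aws psql jq || exit 1
```

Presence on `PATH` is a weaker guarantee than the individual commands above, which also exercise credentials and connectivity, so a check like this would complement them rather than replace them.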

Required Knowledge

- Kubernetes pod management

- Service architecture overview

- Incident response process

Pre-Execution Checklist

- [ ] Change ticket created: CHG-XXXXX

- [ ] Approval obtained from: [Name]

- [ ] Backup verified (if applicable)

- [ ] Stakeholders notified

- [ ] Maintenance window scheduled (if applicable)

Execution Steps

Step 1: [Action Name]

Purpose: Why this step is necessary

Command:

kubectl get pods -n production -l app=myservice

Expected Output:

NAME                        READY   STATUS    RESTARTS   AGE
myservice-abc123-xyz        1/1     Running   0          2d
myservice-def456-uvw        1/1     Running   0          2d

Verification: Confirm all pods show STATUS=Running

If unexpected: See Troubleshooting section

Step 2: [Next Action]

Purpose: Description

Command:

# Command with explanation
kubectl scale deployment myservice --replicas=3 -n production

Expected Output:

deployment.apps/myservice scaled

Verification:

# Verify new replicas are running
kubectl get pods -n production -l app=myservice -w

Wait for: All 3 pods to show Running status (typically 2-5 minutes)
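
A "wait for" instruction like the one above is easier to follow when it has an explicit timeout. One possible sketch, in POSIX shell, is a small polling helper; `wait_for` is a hypothetical name, and the 300-second and 10-second values in the example are assumptions derived from the "2-5 minutes" estimate.

```shell
# wait_for TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or TIMEOUT_SECONDS have elapsed.
# Returns 0 on success, 1 on timeout.
# (wait_for is a hypothetical helper introduced for illustration.)
wait_for() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      echo "Timed out after ${timeout_s}s: $*" >&2
      return 1
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
}

# Example: poll the health endpoint for up to 5 minutes, every 10 seconds.
# wait_for 300 10 curl -sf https://api.example.com/health
```

The elapsed time counts only the sleeps, not the run time of the command itself, so the real upper bound is slightly longer than the nominal timeout.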

Post-Execution Verification

Verify Service Health

# Check deployment status
kubectl rollout status deployment/myservice -n production

# Check service endpoints
kubectl get endpoints myservice -n production

# Verify application health
curl -s https://api.example.com/health | jq .

Expected:

{
  "status": "healthy",
  "version": "1.2.3",
  "uptime": "2h30m"
}

Verify Metrics

- [ ] Error rate returned to normal (<0.1%)

- [ ] Latency within SLA (<200ms p99)

- [ ] No new alerts firing

Rollback Procedure

When to Rollback

- Error rate exceeds 1%

- Latency exceeds 500ms p99

- Critical functionality broken
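
The two numeric criteria above can be expressed as a simple check, which is useful when the rollback decision is scripted. This is a sketch under assumptions: `should_rollback` is a hypothetical name, the inputs are supplied by the caller (the template does not say where the metrics come from), and "exceeds" is read as strictly greater than. The third criterion, broken critical functionality, still needs human judgment.

```shell
# should_rollback ERROR_RATE_PERCENT LATENCY_P99_MS
# Exit status 0 means a rollback threshold is exceeded; 1 means it is not.
# Thresholds (1% error rate, 500 ms p99) are taken from the list above.
# (should_rollback is a hypothetical helper introduced for illustration.)
should_rollback() {
  awk -v err="$1" -v lat="$2" 'BEGIN { exit !(err > 1 || lat > 500) }'
}

# Example:
# if should_rollback "$ERROR_RATE" "$LATENCY_P99"; then
#   echo "Thresholds exceeded, start the rollback steps"
# fi
```

`awk` handles the comparison because POSIX `test` only compares integers, and error rates are usually fractional.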

Rollback Steps

# Rollback to previous deployment
kubectl rollout undo deployment/myservice -n production

# Verify rollback
kubectl rollout status deployment/myservice -n production

# Confirm previous version
kubectl get deployment myservice -n production -o jsonpath='{.spec.template.spec.containers[0].image}'

Troubleshooting

| Symptom | Likely Cause | Resolution |
|---------|--------------|------------|
| Pods stuck in Pending | Resource constraints | Check node capacity: `kubectl describe nodes` |
| CrashLoopBackOff | Application error | Check logs: `kubectl logs -f <pod>` |
| ImagePullBackOff | Registry auth issue | Verify secret: `kubectl get secret regcred` |
| Connection refused | Service not ready | Wait for readiness probe, check endpoints |

Common Issues

Issue: Deployment times out

# Check pod events
kubectl describe pod <pod-name> -n production

# Check current CPU and memory usage (requires metrics-server)
kubectl top pods -n production

Issue: Database connection failures

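# Verify database connectivity from inside the pod
# ($DB_HOST is expanded by your local shell; psql must exist in the container image)
kubectl exec -it <pod> -n production -- psql -h "$DB_HOST" -c "SELECT 1"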

# Check connection pool
kubectl logs <pod> -n production | grep -i "connection"

Emergency Contacts

| Role | Contact | When to Engage |
|------|---------|----------------|
| On-call Engineer | PagerDuty | Any issue |
| Database Team | #dba-oncall | Database issues |
| Platform Team | #platform-oncall | Infrastructure issues |
| Engineering Manager | [Name] | Escalation |

Change Log

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2024-01-15 | Author | Initial version |

Related Documentation

- Service Architecture

- Incident Response Process

- Monitoring Dashboard

Runbook Types

Incident Response Runbook

incident_runbook:
  sections:
    detection:
      alert_name: "High Error Rate - Payment Service"
      threshold: "Error rate > 5% for 5 minutes"
      severity: "P1"

    immediate_actions:
      - step: "Acknowledge alert"
        action: "Acknowledge the incident in PagerDuty"
        time: "< 5 min"

      - step: "Assess impact"
        command: |
          # Check error rate (-G with --data-urlencode keeps curl from globbing "[5m]")
          curl -s -G "https://metrics.example.com/api/v1/query" \
            --data-urlencode 'query=rate(http_errors[5m])'
        time: "< 2 min"

      - step: "Notify stakeholders"
        action: "Post in #incident-channel"
        template: |
          🚨 INCIDENT: Payment Service High Errors
          Severity: P1
          Status: Investigating
          Impact: Payment processing affected
          IC: @oncall

    investigation:
      - "Check recent deployments"
      - "Review error logs"
      - "Check dependent services"
      - "Review infrastructure metrics"

    mitigation:
      options:
        - name: "Rollback deployment"
          when: "Error started after deploy"
          command: "kubectl rollout undo deployment/payment -n prod"

        - name: "Scale up"
          when: "Load-related errors"
          command: "kubectl scale deployment/payment --replicas=10 -n prod"

        - name: "Enable circuit breaker"
          when: "Downstream dependency failing"
          action: "Toggle feature flag: payment.circuit_breaker=true"

    resolution:
      checklist:
        - "[ ] Error rate < 0.1%"
        - "[ ] No P1 alerts"
        - "[ ] Stakeholders notified"
        - "[ ] Incident documented"

Deployment Runbook

deployment_runbook:
  pre_deployment:
    checklist:
      - "[ ] Code review approved"
      - "[ ] CI/CD pipeline passed"
      - "[ ] Staging tested"
      - "[ ] Change ticket approved"
      - "[ ] Rollback plan documented"

    verification:
      - step: "Verify staging health"
        command: |
          curl -s https://staging.example.com/health

      - step: "Check staging pods"
        command: |
          kubectl get pods -n staging -l app=myservice

  deployment:
    - step: "Apply deployment"
      command: |
        kubectl apply -f k8s/production/deployment.yaml

    - step: "Monitor rollout"
      command: |
        kubectl rollout status deployment/myservice -n production --timeout=10m

    - step: "Verify new version"
      command: |
        kubectl get deployment myservice -n production \
          -o jsonpath='{.spec.template.spec.containers[0].image}'

  post_deployment:
    - step: "Smoke test"
      command: |
        ./scripts/smoke-test.sh production

    - step: "Monitor metrics"
      duration: "15 minutes"
      watch:
        - "Error rate"
        - "Latency p99"
        - "Request rate"

    - step: "Update ticket"
      action: "Mark CHG ticket as completed"
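
The deployment steps above (apply, monitor rollout, undo on failure) can be chained so that a failed rollout triggers the rollback automatically. The sketch below assumes `kubectl` is already configured for the target cluster; `deploy_with_rollback` and its exit codes are hypothetical and not part of the original runbook.

```shell
# deploy_with_rollback MANIFEST DEPLOYMENT NAMESPACE
# Applies the manifest, waits up to 10 minutes for the rollout,
# and runs "kubectl rollout undo" if the rollout does not complete.
# Returns 0 on success, 1 if the apply fails, 2 if a rollback was triggered.
# (deploy_with_rollback is a hypothetical helper introduced for illustration.)
deploy_with_rollback() {
  manifest=$1
  deployment=$2
  namespace=$3

  kubectl apply -f "$manifest" || return 1

  if kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout=10m; then
    echo "Rollout of $deployment succeeded"
    return 0
  fi

  echo "Rollout of $deployment failed, rolling back" >&2
  kubectl rollout undo "deployment/$deployment" -n "$namespace"
  return 2
}

# Example:
# deploy_with_rollback k8s/production/deployment.yaml myservice production
```

A rollout that completes can still be unhealthy, so the smoke test and metric monitoring in the post-deployment steps remain necessary after a successful return.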

Maintenance Runbook

maintenance_runbook:
  log_rotation:
    schedule: "Weekly, Sunday 02:00 UTC"

    steps:
      - step: "Connect to server"
        command: |
          ssh admin@logs.example.com

      - step: "Rotate logs"
        command: |
          sudo logrotate -f /etc/logrotate.d/application

      - step: "Verify rotation"
        command: |
          ls -la /var/log/application/
          # Should see rotated files with date suffix

      - step: "Clean old logs"
        command: |
          # Remove logs older than 30 days
          find /var/log/application/ -name "*.log.*" -mtime +30 -delete

      - step: "Verify disk space"
        command: |
          df -h /var/log
          # Should show > 20% free

  database_maintenance:
    schedule: "Monthly, first Sunday 03:00 UTC"

    steps:
      - step: "Check table sizes"
        command: |
          psql -c "
            SELECT tablename,
                   pg_size_pretty(pg_total_relation_size(tablename::text))
            FROM pg_tables
            WHERE schemaname = 'public'
            ORDER BY pg_total_relation_size(tablename::text) DESC
            LIMIT 10;
          "

      - step: "Run VACUUM ANALYZE"
        command: |
          psql -c "VACUUM ANALYZE;"

      - step: "Reindex if needed"
        command: |
          psql -c "REINDEX DATABASE mydb;"
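
The "Verify disk space" step above relies on reading `df` output by eye. A scripted version of the same check might look like the following sketch; `free_percent` and `has_free_space` are hypothetical names, and the code assumes the POSIX `df -P` column layout with a filesystem name that contains no spaces.

```shell
# free_percent
# Reads "df -P" output on stdin and prints the free percentage of the
# last filesystem listed (100 minus the Capacity column).
# (Both helpers here are hypothetical and introduced for illustration.)
free_percent() {
  awk 'NR > 1 { gsub(/%/, "", $5); used = $5 } END { print 100 - used }'
}

# has_free_space PATH MIN_FREE_PERCENT
# Succeeds when the filesystem holding PATH has more than MIN_FREE_PERCENT free.
has_free_space() {
  free=$(df -P "$1" | free_percent)
  [ "$free" -gt "$2" ]
}

# Example: the log rotation runbook expects more than 20% free.
# has_free_space /var/log 20 || echo "Low disk space on /var/log" >&2
```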

Writing Guidelines

principles:
  clarity:
    - "Use active voice"
    - "Be explicit, never assume"
    - "One action per step"

  completeness:
    - "Include all commands"
    - "Show expected output"
    - "Document verification"

  safety:
    - "Test in non-prod first"
    - "Include rollback steps"
    - "Document risks"

formatting:
  commands:
    - "Use fenced code blocks with a language tag"
    - "Include full paths"
    - "Add comments for complex commands"

  steps:
    - "Number sequentially"
    - "Include purpose"
    - "Show expected result"
    - "Note time estimate"

  variables:
    format: "$VARIABLE_NAME or <placeholder>"
    document: "List all variables at start"
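
Listing variables at the start of a runbook pairs well with a check that they are actually set before any command runs. A minimal sketch in POSIX shell follows; `require_vars` is a hypothetical helper, and because it uses `eval`, it should only be given trusted, literal variable names.

```shell
# require_vars NAME...
# Names every variable that is unset or empty on stderr.
# Returns 0 when all are set, non-zero otherwise.
# Uses eval for POSIX-compatible indirect lookup, so pass only
# trusted identifiers, never user input.
# (require_vars is a hypothetical helper introduced for illustration.)
require_vars() {
  unset_count=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Variable not set: $name" >&2
      unset_count=$((unset_count + 1))
    fi
  done
  [ "$unset_count" -eq 0 ]
}

# Example: the database checks in this template rely on these two.
# require_vars DB_HOST DB_USER || exit 1
```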

Quality Checklist

validation:
  structure:
    - "[ ] Clear title and metadata"
    - "[ ] Prerequisites listed"
    - "[ ] Steps numbered and clear"
    - "[ ] Expected outputs included"
    - "[ ] Verification steps present"
    - "[ ] Rollback documented"
    - "[ ] Troubleshooting section"
    - "[ ] Contacts listed"

  testing:
    - "[ ] All commands tested"
    - "[ ] Outputs verified"
    - "[ ] Rollback tested"
    - "[ ] Time estimates accurate"

  maintenance:
    - "[ ] Version number updated"
    - "[ ] Change log maintained"
    - "[ ] Quarterly review scheduled"
    - "[ ] Owner assigned"
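
Part of the structure checklist above can be checked mechanically. The sketch below looks for the section names used in the standard template; `lint_runbook` is a hypothetical helper, and it matches the text anywhere in the file, case-insensitively, rather than only in headings, so it is a coarse first pass.

```shell
# lint_runbook FILE
# Prints each required section name that FILE does not mention.
# Returns 0 when all sections are present, 1 otherwise.
# Section names are taken from the standard runbook template.
# (lint_runbook is a hypothetical helper introduced for illustration.)
lint_runbook() {
  file=$1
  status=0
  for section in "Prerequisites" "Execution Steps" "Post-Execution Verification" \
                 "Rollback Procedure" "Troubleshooting" "Emergency Contacts"; do
    if ! grep -qiF -- "$section" "$file"; then
      echo "Missing section: $section"
      status=1
    fi
  done
  return "$status"
}

# Example:
# lint_runbook runbooks/restart-myservice.md || exit 1
```

The path in the example is a placeholder. Items such as "all commands tested" or "time estimates accurate" cannot be verified this way and still need a manual review.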

Best Practices

- Test everything: every command must be verified

- Show expected output: the user should know what they will see

- Include rollback: always have a rollback plan

- Keep updated: review every quarter

- Version control: keep runbooks in git

- Automate when possible: automate recurring procedures

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
