checkpoint-workflow-builder

Build resumable state-machine workflows with checkpoint patterns, progress preservation, and automatic recovery for complex multi-phase operations that need to survive interruptions, timeouts, and failures.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "checkpoint-workflow-builder" with this command: npx skills add jackspace/claudeskillz/jackspace-claudeskillz-checkpoint-workflow-builder

Checkpoint Workflow Builder

Design and implement fault-tolerant workflows that can resume from any point of failure.

Overview

Complex workflows often fail mid-execution due to:

  • Network timeouts
  • System crashes
  • Resource exhaustion
  • External service failures
  • User interruptions

This skill teaches you to build workflows that:

  • Save progress at key checkpoints
  • Resume from last successful state
  • Handle partial failures gracefully
  • Provide clear progress visibility
  • Enable manual intervention points

When to Use

Use this skill when:

  • Building multi-phase data pipelines
  • Implementing long-running migration scripts
  • Creating deployment workflows
  • Processing large batches of items
  • Orchestrating multi-system operations
  • Building ETL (Extract, Transform, Load) workflows
  • Implementing saga patterns for distributed systems
  • Creating user-facing wizards with save/resume

Core Concepts

State Machine Pattern

┌─────────┐
│  INIT   │
└────┬────┘
     │
     ▼
┌────────────┐
│  DOWNLOAD  │
└─────┬──────┘
      │
      ▼
┌────────────┐
│  PROCESS   │
└─────┬──────┘
      │
      ▼
┌────────────┐
│  VALIDATE  │
└─────┬──────┘
      │
      ▼
┌────────────┐
│  FINALIZE  │
└─────┬──────┘
      │
      ▼
┌────────────┐
│  COMPLETE  │
└────────────┘

Each state:

  • Has clear entry conditions
  • Performs specific operations
  • Saves checkpoint before transition
  • Can be resumed independently
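Checkpoint writes should themselves be crash-safe: writing the state file directly can leave it truncated if the process dies mid-write. A minimal sketch of an atomic transition helper (the `transition` name is illustrative, not part of any library):

```shell
#!/bin/bash
# Atomically record the next state: write a temp file, then rename.
# rename on the same filesystem is atomic, so a reader sees either
# the old state or the new one, never a half-written file.

STATE_FILE=".workflow_state"

transition() {
    local next_state="$1"
    printf '%s\n' "$next_state" > "$STATE_FILE.tmp"
    mv "$STATE_FILE.tmp" "$STATE_FILE"
}

transition "DOWNLOAD"
cat "$STATE_FILE"
```

The same temp-file-then-`mv` pattern is used for the JSON progress files later in this skill.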

Basic Implementation

Simple State Machine

#!/bin/bash
# state-machine.sh - Basic resumable workflow

STATE_FILE=".workflow_state"

# Read current state (default: INIT)
CURRENT_STATE=$(cat "$STATE_FILE" 2>/dev/null || echo "INIT")

echo "Current state: $CURRENT_STATE"

case "$CURRENT_STATE" in
    INIT)
        echo "=== Phase 1: Initialization ==="

        # Initialize workspace
        mkdir -p workspace
        mkdir -p results

        # Download dependencies
        echo "Setting up environment..."

        # Save next state
        echo "DOWNLOAD" > "$STATE_FILE"
        echo "✓ Initialization complete"
        echo "Run again to continue"
        ;;

    DOWNLOAD)
        echo "=== Phase 2: Download Data ==="

        # Download data files
        echo "Downloading data..."
        # curl -o workspace/data.zip https://example.com/data.zip

        # Verify download
        if [ -f "workspace/data.zip" ]; then
            echo "EXTRACT" > "$STATE_FILE"
            echo "✓ Download complete"
            echo "Run again to continue"
        else
            echo "✗ Download failed - fix and run again"
            exit 1
        fi
        ;;

    EXTRACT)
        echo "=== Phase 3: Extract Data ==="

        # Extract files
        echo "Extracting data..."
        # unzip workspace/data.zip -d workspace/

        echo "PROCESS" > "$STATE_FILE"
        echo "✓ Extraction complete"
        echo "Run again to continue"
        ;;

    PROCESS)
        echo "=== Phase 4: Process Data ==="

        # Process data
        echo "Processing data..."
        # ./process_data.sh workspace/ results/

        echo "VALIDATE" > "$STATE_FILE"
        echo "✓ Processing complete"
        echo "Run again to continue"
        ;;

    VALIDATE)
        echo "=== Phase 5: Validate Results ==="

        # Validate results; run the validator inside the if so the
        # branch reflects its exit status (checking $? after the echo
        # below would always see 0, the echo's own status)
        echo "Validating results..."

        if ./validate.sh results/; then
            echo "FINALIZE" > "$STATE_FILE"
            echo "✓ Validation passed"
            echo "Run again to finalize"
        else
            echo "✗ Validation failed"
            echo "Fix issues and set the state back to PROCESS to reprocess"
            exit 1
        fi
        ;;

    FINALIZE)
        echo "=== Phase 6: Finalize ==="

        # Cleanup and finalize
        echo "Finalizing workflow..."
        # mv results/ final/
        # rm -rf workspace/

        echo "COMPLETE" > "$STATE_FILE"
        echo "✓ Workflow complete!"
        ;;

    COMPLETE)
        echo "=== Workflow Already Complete ==="
        echo "Results available in: final/"
        ;;

    *)
        echo "✗ Unknown state: $CURRENT_STATE"
        echo "Reset with: echo 'INIT' > $STATE_FILE"
        exit 1
        ;;
esac
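To run every phase without manually re-invoking the script, a small driver can loop until the state file reads COMPLETE. A self-contained sketch with a stubbed two-phase machine standing in for the real script:

```shell
#!/bin/bash
# run-to-completion.sh - keep stepping the machine until COMPLETE
STATE_FILE=".workflow_state"
echo "INIT" > "$STATE_FILE"   # start fresh for the demo

# Stub machine; in practice this would be ./state-machine.sh
step() {
    case "$(cat "$STATE_FILE")" in
        INIT) echo "phase 1"; echo "WORK" > "$STATE_FILE" ;;
        WORK) echo "phase 2"; echo "COMPLETE" > "$STATE_FILE" ;;
        *)    echo "unexpected state" >&2; return 1 ;;
    esac
}

while [ "$(cat "$STATE_FILE")" != "COMPLETE" ]; do
    step || { echo "failed in state $(cat "$STATE_FILE")"; exit 1; }
done
echo "All phases done"
```

Because each invocation re-reads the state file, the driver can be killed at any point and restarted without losing progress.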

Enhanced with Progress Tracking

#!/bin/bash
# enhanced-state-machine.sh - With detailed progress

STATE_FILE=".workflow_state"
PROGRESS_FILE=".workflow_progress.json"

# Initialize progress tracking
init_progress() {
    cat > "$PROGRESS_FILE" << EOF
{
  "current_state": "INIT",
  "started_at": "$(date -Iseconds)",
  "updated_at": "$(date -Iseconds)",
  "phases": {
    "INIT": {"status": "pending", "started": null, "completed": null},
    "DOWNLOAD": {"status": "pending", "started": null, "completed": null},
    "PROCESS": {"status": "pending", "started": null, "completed": null},
    "VALIDATE": {"status": "pending", "started": null, "completed": null},
    "FINALIZE": {"status": "pending", "started": null, "completed": null}
  }
}
EOF
}

# Update progress
update_progress() {
    local state="$1"
    local status="$2"  # "running", "completed", "failed"

    local timestamp="$(date -Iseconds)"

    if [ ! -f "$PROGRESS_FILE" ]; then
        init_progress
    fi

    jq --arg state "$state" \
       --arg status "$status" \
       --arg timestamp "$timestamp" \
       '.current_state = $state |
        .updated_at = $timestamp |
        .phases[$state].status = $status |
        (.phases[$state].started //= $timestamp) |
        (if $status == "completed" then .phases[$state].completed = $timestamp else . end)' \
       "$PROGRESS_FILE" > "$PROGRESS_FILE.tmp"

    mv "$PROGRESS_FILE.tmp" "$PROGRESS_FILE"
}

# Show progress
show_progress() {
    if [ ! -f "$PROGRESS_FILE" ]; then
        echo "No progress file found"
        return
    fi

    echo "=== Workflow Progress ==="
    echo ""
    jq -r '.phases | to_entries | .[] |
        "[\(.value.status | ascii_upcase)] \(.key)" +
        (if .value.started then " (started: " + .value.started + ")" else "" end)' \
        "$PROGRESS_FILE"
    echo ""
    echo "Current: $(jq -r '.current_state' "$PROGRESS_FILE")"
    echo "Updated: $(jq -r '.updated_at' "$PROGRESS_FILE")"
}

# Workflow implementation
run_workflow() {
    CURRENT_STATE=$(cat "$STATE_FILE" 2>/dev/null || echo "INIT")

    # Show progress before executing
    show_progress

    case "$CURRENT_STATE" in
        INIT)
            update_progress "INIT" "running"
            echo "Initializing..."
            # ... initialization logic ...
            update_progress "INIT" "completed"
            echo "DOWNLOAD" > "$STATE_FILE"
            ;;

        DOWNLOAD)
            update_progress "DOWNLOAD" "running"
            echo "Downloading..."
            # ... download logic ...
            update_progress "DOWNLOAD" "completed"
            echo "PROCESS" > "$STATE_FILE"
            ;;

        # ... other states ...
    esac
}

run_workflow
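The same progress file can also be summarized as a completion count. A sketch against the JSON schema shown above, using a demo fixture in place of a live run (assumes `jq` is installed, as the rest of this skill does):

```shell
#!/bin/bash
# progress-summary.sh - count completed phases in the progress JSON
PROGRESS_FILE=".workflow_progress.json"

# Demo fixture matching the schema written by init_progress
cat > "$PROGRESS_FILE" << 'EOF'
{
  "current_state": "PROCESS",
  "phases": {
    "INIT":     {"status": "completed"},
    "DOWNLOAD": {"status": "completed"},
    "PROCESS":  {"status": "running"},
    "VALIDATE": {"status": "pending"},
    "FINALIZE": {"status": "pending"}
  }
}
EOF

# Collect all statuses, then count how many are "completed"
jq -r '.phases
       | [to_entries[].value.status]
       | "\(map(select(. == "completed")) | length) of \(length) phases complete"' \
   "$PROGRESS_FILE"
```

This prints `2 of 5 phases complete` for the fixture above, and works unchanged as phases are added, since it derives the total from the JSON itself.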

Advanced Patterns

Pattern 1: Batched Checkpoint

#!/bin/bash
# batched-checkpoint.sh - Process items with batch checkpoints

ITEMS_FILE="items.txt"
CHECKPOINT_FILE=".batch_checkpoint"
BATCH_SIZE=10

# Load checkpoint
if [ -f "$CHECKPOINT_FILE" ]; then
    LAST_COMPLETED=$(cat "$CHECKPOINT_FILE")
    echo "Resuming from item $LAST_COMPLETED"
else
    LAST_COMPLETED=0
fi

TOTAL_ITEMS=$(wc -l < "$ITEMS_FILE")
ITEMS_PROCESSED=0

# Process in batches. Use process substitution, not a pipe: a pipe
# would run the loop in a subshell, where the counter updates and the
# early `exit` below are lost - and the final checkpoint after the
# loop would then wrongly mark every item complete.
while read -r item; do
    # Process item
    echo "Processing: $item"
    process_item "$item"

    ITEMS_PROCESSED=$((ITEMS_PROCESSED + 1))

    # Checkpoint every batch
    if [ $((ITEMS_PROCESSED % BATCH_SIZE)) -eq 0 ]; then
        CURRENT_POSITION=$((LAST_COMPLETED + ITEMS_PROCESSED))
        echo "$CURRENT_POSITION" > "$CHECKPOINT_FILE"

        echo "Checkpoint: $CURRENT_POSITION/$TOTAL_ITEMS"

        # Optional: stop after 5 batches for timeout management
        if [ "$ITEMS_PROCESSED" -ge $((BATCH_SIZE * 5)) ]; then
            echo "Processed 5 batches ($((BATCH_SIZE * 5)) items)"
            echo "Run again to continue"
            exit 0
        fi
    fi
done < <(tail -n +$((LAST_COMPLETED + 1)) "$ITEMS_FILE")

# Final checkpoint
echo "$TOTAL_ITEMS" > "$CHECKPOINT_FILE"
echo "✓ All items processed!"
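Piping `tail` into `while` runs the loop in a subshell, where counter updates and `exit` are lost; feeding the loop from process substitution avoids that. A runnable mini-demo of the resume semantics (file names and items are demo-only):

```shell
#!/bin/bash
# checkpoint-resume-demo.sh - a resumed run skips checkpointed items
ITEMS_FILE="items.txt"
CHECKPOINT_FILE=".batch_checkpoint"

printf '%s\n' one two three four five > "$ITEMS_FILE"
echo 2 > "$CHECKPOINT_FILE"   # pretend items 1-2 were done last run

LAST_COMPLETED=$(cat "$CHECKPOINT_FILE" 2>/dev/null || echo 0)

# Process substitution keeps the loop in the current shell, so COUNT
# is still visible after the loop ends
COUNT=0
while read -r item; do
    echo "processing $item"
    COUNT=$((COUNT + 1))
done < <(tail -n +$((LAST_COMPLETED + 1)) "$ITEMS_FILE")

echo "$((LAST_COMPLETED + COUNT))" > "$CHECKPOINT_FILE"
echo "resumed past $LAST_COMPLETED, handled $COUNT more"
```

The demo processes only items three through five and leaves the checkpoint at 5, so a further run would process nothing.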

Pattern 2: Rollback Support

#!/bin/bash
# rollback-workflow.sh - State machine with rollback

STATE_FILE=".state"
ROLLBACK_DIR=".rollback"

mkdir -p "$ROLLBACK_DIR"

# Save rollback point
save_rollback() {
    local state="$1"
    local timestamp=$(date +%s)

    tar -czf "$ROLLBACK_DIR/${state}_${timestamp}.tar.gz" workspace/ 2>/dev/null || true
    echo "${state}_${timestamp}" > "$ROLLBACK_DIR/latest"
}

# Perform rollback
rollback() {
    local rollback_point="$1"

    if [ -z "$rollback_point" ]; then
        rollback_point=$(cat "$ROLLBACK_DIR/latest" 2>/dev/null)
    fi

    if [ -z "$rollback_point" ] || [ ! -f "$ROLLBACK_DIR/${rollback_point}.tar.gz" ]; then
        echo "✗ No rollback point found"
        exit 1
    fi

    echo "Rolling back to: $rollback_point"

    # Restore from backup
    rm -rf workspace/
    tar -xzf "$ROLLBACK_DIR/${rollback_point}.tar.gz"

    # Set state: strip the trailing "_timestamp" suffix. Using
    # `cut -d'_' -f1` here would truncate names that themselves
    # contain underscores (e.g. PRE_DOWNLOAD would become PRE).
    STATE="${rollback_point%_*}"
    echo "$STATE" > "$STATE_FILE"

    echo "✓ Rolled back to state: $STATE"
}

# In workflow
case "$CURRENT_STATE" in
    DOWNLOAD)
        save_rollback "PRE_DOWNLOAD"
        # ... download logic ...
        ;;

    PROCESS)
        save_rollback "PRE_PROCESS"
        # ... process logic ...
        ;;
esac
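Rollback archives accumulate over long-running workflows, so a retention helper is worth adding. A sketch that keeps only the newest three archives, sorted by the epoch suffix that `save_rollback` embeds in the name (the limit and demo fixtures are illustrative):

```shell
#!/bin/bash
# prune-rollbacks.sh - keep only the newest KEEP rollback archives
ROLLBACK_DIR=".rollback"
KEEP=3

mkdir -p "$ROLLBACK_DIR"
# Demo fixtures: five empty archives with epoch-stamped names, as
# save_rollback above would produce
for ts in 100 200 300 400 500; do
    : > "$ROLLBACK_DIR/PROCESS_${ts}.tar.gz"
done

# Sort newest-first by the numeric epoch field, then delete
# everything past the first KEEP entries
ls -1 "$ROLLBACK_DIR"/*.tar.gz \
    | sort -t_ -k2 -rn \
    | tail -n +$((KEEP + 1)) \
    | xargs -r rm --

ls -1 "$ROLLBACK_DIR"/*.tar.gz | wc -l
```

Sorting on the embedded timestamp rather than file mtime keeps pruning correct even if archives are copied between machines.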

Pattern 3: Parallel Phase Execution

#!/bin/bash
# parallel-phases.sh - Execute independent phases in parallel

STATE_FILE=".parallel_state.json"

# Initialize parallel state
init_parallel_state() {
    cat > "$STATE_FILE" << EOF
{
  "phase_a": "pending",
  "phase_b": "pending",
  "phase_c": "pending",
  "finalize": "pending"
}
EOF
}

# Update phase status
update_phase() {
    local phase="$1"
    local status="$2"

    jq --arg phase "$phase" --arg status "$status" \
       '.[$phase] = $status' "$STATE_FILE" > "$STATE_FILE.tmp"
    mv "$STATE_FILE.tmp" "$STATE_FILE"
}

# Check if all phases complete
all_phases_complete() {
    local incomplete=$(jq -r '[.phase_a, .phase_b, .phase_c] | map(select(. != "completed")) | length' "$STATE_FILE")
    [ "$incomplete" -eq 0 ]
}

# Execute phase
execute_phase() {
    local phase="$1"
    local current_status=$(jq -r ".$phase" "$STATE_FILE")

    if [ "$current_status" = "completed" ]; then
        echo "$phase: Already complete"
        return 0
    fi

    echo "$phase: Running"
    update_phase "$phase" "running"

    # Phase-specific logic
    case "$phase" in
        phase_a)
            # Independent task A
            task_a
            ;;
        phase_b)
            # Independent task B
            task_b
            ;;
        phase_c)
            # Independent task C
            task_c
            ;;
    esac

    if [ $? -eq 0 ]; then
        update_phase "$phase" "completed"
        echo "$phase: Completed"
    else
        update_phase "$phase" "failed"
        echo "$phase: Failed"
        return 1
    fi
}

# Main workflow
[ ! -f "$STATE_FILE" ] && init_parallel_state

# Run independent phases
execute_phase "phase_a"
execute_phase "phase_b"
execute_phase "phase_c"

# Check if ready for finalization
if all_phases_complete; then
    finalize_status=$(jq -r '.finalize' "$STATE_FILE")

    if [ "$finalize_status" != "completed" ]; then
        echo "All phases complete. Finalizing..."
        finalize
        update_phase "finalize" "completed"
        echo "✓ Workflow complete!"
    fi
else
    echo "Some phases incomplete. Run again to retry."
fi
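As written, the three phases still execute one after another. To actually overlap them, each phase can run as a background job, with `wait` collecting per-job exit statuses. A sketch where the task body is a stand-in for real work:

```shell
#!/bin/bash
# parallel-exec.sh - run independent phases as background jobs
run_phase() {
    # Stand-in for real work; each phase drops a marker file
    local phase="$1"
    sleep 0.1
    echo done > ".${phase}.done"
}

pids=()
for phase in phase_a phase_b phase_c; do
    run_phase "$phase" &
    pids+=($!)
done

failed=0
for pid in "${pids[@]}"; do
    wait "$pid" || failed=1   # wait PID returns that job's exit status
done

if [ "$failed" -eq 0 ]; then
    echo "all phases complete"
else
    echo "a phase failed" >&2
    exit 1
fi
```

Each background job should update its own entry in the state JSON; since `jq` rewrites the whole file, concurrent writers would need a lock (e.g. `flock`) or per-phase status files merged afterward.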

Real-World Examples

Example 1: Database Migration Workflow

#!/bin/bash
# db-migration-workflow.sh

STATE_FILE=".migration_state"
MIGRATION_LOG="migration.log"

log() {
    echo "[$(date -Iseconds)] $1" | tee -a "$MIGRATION_LOG"
}

# Allow an explicit state/command as $1 (e.g. "./db-migration-workflow.sh rollback",
# as suggested below); otherwise resume from the state file
CURRENT_STATE="${1:-$(cat "$STATE_FILE" 2>/dev/null || echo "INIT")}"

case "$CURRENT_STATE" in
    INIT)
        log "=== Starting Migration ==="
        log "Checking prerequisites..."

        # Verify database connection
        psql -U user -d database -c "SELECT 1" > /dev/null 2>&1 || {
            log "✗ Database connection failed"
            exit 1
        }

        # Verify backup destination
        [ -d "/backups" ] || mkdir -p /backups

        echo "BACKUP" > "$STATE_FILE"
        log "✓ Prerequisites checked"
        ;;

    BACKUP)
        log "=== Creating Backup ==="

        BACKUP_FILE="/backups/db_$(date +%Y%m%d_%H%M%S).sql"
        pg_dump -U user database > "$BACKUP_FILE"

        if [ -f "$BACKUP_FILE" ] && [ -s "$BACKUP_FILE" ]; then
            log "✓ Backup created: $BACKUP_FILE"
            echo "MIGRATE" > "$STATE_FILE"
        else
            log "✗ Backup failed"
            exit 1
        fi
        ;;

    MIGRATE)
        log "=== Running Migration ==="

        # Apply migration scripts in order
        for script in migrations/*.sql; do
            log "Applying: $(basename "$script")"

            psql -U user -d database -f "$script" >> "$MIGRATION_LOG" 2>&1

            if [ $? -ne 0 ]; then
                log "✗ Migration failed: $script"
                log "To rollback: ./db-migration-workflow.sh rollback"
                exit 1
            fi
        done

        echo "VALIDATE" > "$STATE_FILE"
        log "✓ Migrations applied"
        ;;

    VALIDATE)
        log "=== Validating Migration ==="

        # Run validation queries
        VALIDATION_RESULT=$(psql -U user -d database -f validation.sql 2>&1)

        if echo "$VALIDATION_RESULT" | grep -q "PASS"; then
            log "✓ Validation passed"
            echo "COMPLETE" > "$STATE_FILE"
        else
            log "✗ Validation failed"
            log "$VALIDATION_RESULT"
            log "Review and fix, then change state to MIGRATE to retry"
            exit 1
        fi
        ;;

    COMPLETE)
        log "=== Migration Complete ==="
        log "Database migrated successfully"
        ;;

    rollback)
        log "=== Rolling Back Migration ==="

        # Find latest backup (mtime order; silence ls if none exist)
        LATEST_BACKUP=$(ls -t /backups/db_*.sql 2>/dev/null | head -1)

        if [ -z "$LATEST_BACKUP" ]; then
            log "✗ No backup found"
            exit 1
        fi

        log "Restoring from: $LATEST_BACKUP"

        psql -U user -d database < "$LATEST_BACKUP"

        log "✓ Rollback complete"
        echo "INIT" > "$STATE_FILE"
        ;;
esac
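One caveat: re-entering the MIGRATE state re-applies every script in migrations/. A common refinement is tracking applied scripts (in a file, or a schema_migrations table) so a resumed run skips them. A file-based sketch with demo fixtures (the file and script names are illustrative):

```shell
#!/bin/bash
# applied-tracking.sh - skip migration scripts already applied
APPLIED_FILE=".applied_migrations"
mkdir -p migrations

# Demo fixtures: three migration scripts, one already recorded
: > migrations/001_init.sql
: > migrations/002_add_index.sql
: > migrations/003_backfill.sql
echo "001_init.sql" > "$APPLIED_FILE"

for script in migrations/*.sql; do
    name=$(basename "$script")
    # -x matches the whole line, -F disables regex interpretation
    if grep -qxF "$name" "$APPLIED_FILE"; then
        echo "skip $name"
        continue
    fi
    echo "apply $name"        # real run: psql ... -f "$script"
    echo "$name" >> "$APPLIED_FILE"
done
```

Recording the name only after a successful apply makes the loop itself a per-script checkpoint: a failure mid-list resumes exactly at the failed script.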

Example 2: Data Pipeline Workflow

#!/bin/bash
# data-pipeline.sh - ETL pipeline with checkpoints

STATE_FILE=".pipeline_state.json"

init_state() {
    cat > "$STATE_FILE" << EOF
{
  "state": "EXTRACT",
  "extracted_files": [],
  "transformed_files": [],
  "loaded_records": 0,
  "total_records": 0
}
EOF
}

[ ! -f "$STATE_FILE" ] && init_state

CURRENT_STATE=$(jq -r '.state' "$STATE_FILE")

case "$CURRENT_STATE" in
    EXTRACT)
        echo "=== Extract Phase ==="

        # Extract from sources
        for source in source1 source2 source3; do
            # Check if already extracted
            if jq -e ".extracted_files | index(\"$source\")" "$STATE_FILE" > /dev/null; then
                echo "✓ $source already extracted"
                continue
            fi

            echo "Extracting from: $source"
            extract_from_source "$source" > "raw_data/$source.csv"

            # Update state
            jq ".extracted_files += [\"$source\"]" "$STATE_FILE" > "$STATE_FILE.tmp"
            mv "$STATE_FILE.tmp" "$STATE_FILE"
        done

        # Move to next state
        jq '.state = "TRANSFORM"' "$STATE_FILE" > "$STATE_FILE.tmp"
        mv "$STATE_FILE.tmp" "$STATE_FILE"

        echo "✓ Extraction complete"
        ;;

    TRANSFORM)
        echo "=== Transform Phase ==="

        for file in raw_data/*.csv; do
            filename=$(basename "$file")

            # Check if already transformed
            if jq -e ".transformed_files | index(\"$filename\")" "$STATE_FILE" > /dev/null; then
                echo "✓ $filename already transformed"
                continue
            fi

            echo "Transforming: $filename"
            transform_data "$file" > "processed_data/$filename"

            # Update state
            jq ".transformed_files += [\"$filename\"]" "$STATE_FILE" > "$STATE_FILE.tmp"
            mv "$STATE_FILE.tmp" "$STATE_FILE"
        done

        # Count total records for loading
        TOTAL=$(wc -l processed_data/*.csv | tail -1 | awk '{print $1}')
        jq ".total_records = $TOTAL | .state = \"LOAD\"" "$STATE_FILE" > "$STATE_FILE.tmp"
        mv "$STATE_FILE.tmp" "$STATE_FILE"

        echo "✓ Transformation complete"
        ;;

    LOAD)
        echo "=== Load Phase ==="

        LOADED=$(jq -r '.loaded_records' "$STATE_FILE")
        TOTAL=$(jq -r '.total_records' "$STATE_FILE")

        echo "Progress: $LOADED/$TOTAL records"

        # Load data in batches. Process substitution keeps the loop in
        # the current shell, so the LOADED counter survives past the
        # loop; with a pipe, the counter would be lost and up to 99
        # records could be silently reloaded on the next run.
        BATCH_SIZE=1000

        while read -r record; do
            load_record "$record"
            LOADED=$((LOADED + 1))

            # Checkpoint every 100 records
            if [ $((LOADED % 100)) -eq 0 ]; then
                jq ".loaded_records = $LOADED" "$STATE_FILE" > "$STATE_FILE.tmp"
                mv "$STATE_FILE.tmp" "$STATE_FILE"
            fi
        done < <(tail -n +$((LOADED + 1)) processed_data/combined.csv | head -n "$BATCH_SIZE")

        # Checkpoint the final position for this batch
        jq ".loaded_records = $LOADED" "$STATE_FILE" > "$STATE_FILE.tmp"
        mv "$STATE_FILE.tmp" "$STATE_FILE"

        # Check if complete
        if [ "$LOADED" -ge "$TOTAL" ]; then
            jq '.state = "COMPLETE"' "$STATE_FILE" > "$STATE_FILE.tmp"
            mv "$STATE_FILE.tmp" "$STATE_FILE"
            echo "✓ Loading complete"
        else
            echo "Loaded $LOADED/$TOTAL - run again to continue"
        fi
        ;;

    COMPLETE)
        echo "=== Pipeline Complete ==="
        jq '.' "$STATE_FILE"
        ;;
esac

Best Practices

✅ DO

  1. Design clear states - Each state has single responsibility
  2. Save frequently - Checkpoint after each significant operation
  3. Validate transitions - Verify state before transitioning
  4. Handle failures - Plan for failure at every state
  5. Log everything - Comprehensive logging for debugging
  6. Enable inspection - Make state file human-readable (JSON)
  7. Support rollback - Save enough info to undo
  8. Test resumption - Verify workflow resumes correctly

❌ DON'T

  1. Don't mix concerns - Keep states focused
  2. Don't skip validation - Validate before each transition
  3. Don't hide state - Make current state obvious
  4. Don't lose progress - Always save before risky operations
  5. Don't ignore errors - Handle errors explicitly
  6. Don't hardcode paths - Use configuration
  7. Don't forget cleanup - Remove checkpoint files when done
  8. Don't nest too deep - Keep state machine flat

Quick Reference

# Basic state machine
STATE=$(cat .state 2>/dev/null || echo "INIT")
case "$STATE" in
    STATE1) action1; echo "STATE2" > .state ;;
    STATE2) action2; echo "STATE3" > .state ;;
esac

# With progress tracking (JSON) - jq cannot edit in place, so write
# to a temp file and rename
jq --arg now "$(date -Iseconds)" '.state = "NEW_STATE" | .updated = $now' \
   .state.json > .state.json.tmp && mv .state.json.tmp .state.json

# Checkpoint pattern
process_batch && echo "$POSITION" > .checkpoint

# Rollback support
tar -czf .rollback/pre_${STATE}.tar.gz data/

# Reset workflow
echo "INIT" > .state && rm .progress.json

Version: 1.0.0
Author: Harvested from timeout-prevention and state machine patterns
Last Updated: 2025-11-18
License: MIT
Key Principle: Every operation should be resumable from its last successful checkpoint.

Build workflows that never lose progress! 🔄

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
