Databricks Incident Runbook
Overview

Rapid incident response procedures for Databricks-related outages.

Prerequisites

  • Access to Databricks workspace

  • CLI configured with appropriate permissions

  • Access to monitoring dashboards

  • Communication channels (Slack, PagerDuty)

Severity Levels

| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| P1 | Production data pipeline down | < 15 min | Critical ETL failed, data not updating |
| P2 | Degraded performance | < 1 hour | Slow queries, partial failures |
| P3 | Non-critical issues | < 4 hours | Dev cluster issues, delayed non-critical jobs |
| P4 | No user impact | Next business day | Monitoring gaps, documentation |

Quick Triage

```bash
#!/bin/bash
set -euo pipefail

# quick-triage.sh - Run this first during any incident

echo "=== Databricks Quick Triage ==="
echo "Time: $(date)"
echo ""

# 1. Check Databricks status
echo "--- Databricks Status ---"
curl -s https://status.databricks.com/api/v2/status.json | jq '.status.description'
echo ""

# 2. Check workspace connectivity
echo "--- Workspace Connectivity ---"
if databricks workspace list / --output json | jq -r '.[] | .path' | head -5; then
  echo "Workspace: CONNECTED"
else
  echo "Workspace: CONNECTION FAILED"
fi
echo ""

# 3. Check recent job failures
echo "--- Recent Job Failures (last 1 hour) ---"
databricks runs list --limit 20 --output json |
  jq -r '.runs[] | select(.state.result_state == "FAILED") | "\(.run_id): \(.run_name) - \(.state.state_message)"'
echo ""

# 4. Check cluster status
echo "--- Running Clusters ---"
databricks clusters list --output json |
  jq -r '.clusters[] | select(.state == "RUNNING" or .state == "ERROR") | "\(.cluster_id): \(.cluster_name) [\(.state)]"'
echo ""

# 5. Check for errors in last hour
echo "--- Recent Errors ---"
# Query system tables via SQL warehouse or notebook
```
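The failed-run filter used in step 3 can be rehearsed offline before an incident. The sketch below runs the same jq expression against a hypothetical sample payload that only mimics the shape of `databricks runs list --output json`; real responses carry many more fields.

```shell
#!/usr/bin/env bash
# Hypothetical sample of `databricks runs list --output json` output.
sample='{"runs":[
  {"run_id":101,"run_name":"daily_etl","state":{"result_state":"FAILED","state_message":"Cluster terminated"}},
  {"run_id":102,"run_name":"hourly_sync","state":{"result_state":"SUCCESS","state_message":""}}
]}'

# Keep only failed runs and print one line per failure.
echo "$sample" | jq -r '.runs[]
  | select(.state.result_state == "FAILED")
  | "\(.run_id): \(.run_name) - \(.state.state_message)"'
# → 101: daily_etl - Cluster terminated
```

Note the `\(...)` escapes inside the jq string: without the backslashes, jq prints the literal text instead of interpolating the fields.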

Decision Tree

```
Job/Pipeline failing?
├─ YES: Is it a single job or multiple?
│   ├─ SINGLE JOB → Check job-specific issues
│   │   ├─ Cluster failed to start → Check cluster events
│   │   ├─ Code error → Check task output/logs
│   │   ├─ Data issue → Check source data
│   │   └─ Permission error → Check grants
│   │
│   └─ MULTIPLE JOBS → Likely infrastructure issue
│       ├─ Check Databricks status page
│       ├─ Check workspace quotas
│       └─ Check network connectivity
│
└─ NO: Is it a performance issue?
    ├─ Slow queries → Check query plan, cluster sizing
    ├─ Slow cluster startup → Check instance availability
    └─ Data freshness → Check upstream pipelines
```
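The first branch of the tree (single vs. multiple failing jobs) can be decided mechanically from a runs snapshot. This is a minimal sketch, assuming a saved copy of `databricks runs list --output json` named `runs.json`; the file contents here are invented fixture data.

```shell
#!/usr/bin/env bash
# Hypothetical snapshot of `databricks runs list --output json`.
cat > runs.json <<'EOF'
{"runs":[
  {"job_id":1,"state":{"result_state":"FAILED"}},
  {"job_id":2,"state":{"result_state":"FAILED"}},
  {"job_id":3,"state":{"result_state":"SUCCESS"}}
]}
EOF

# Count distinct jobs with a failed run to pick a decision-tree branch.
failed=$(jq '[.runs[] | select(.state.result_state == "FAILED") | .job_id] | unique | length' runs.json)
if [ "$failed" -gt 1 ]; then
  echo "MULTIPLE JOBS failing ($failed) - suspect infrastructure"
elif [ "$failed" -eq 1 ]; then
  echo "SINGLE JOB failing - check job-specific issues"
else
  echo "No failed runs in snapshot"
fi
# → MULTIPLE JOBS failing (2) - suspect infrastructure
```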

Immediate Actions by Error Type

Job Failed - Code Error

```bash
# 1. Get run details
RUN_ID="your-run-id"
databricks runs get --run-id $RUN_ID

# 2. Get detailed error output
databricks runs get-output --run-id $RUN_ID | jq '.error'

# 3. Check task-level errors
databricks runs get --run-id $RUN_ID |
  jq '.tasks[] | select(.state.result_state == "FAILED") | {task: .task_key, error: .state.state_message}'

# 4. If notebook task, get notebook output
# (View in UI or use the Jobs API to get cell outputs)
```
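One way to get notebook output outside the UI is the Jobs API export endpoint (`GET /api/2.1/jobs/runs/export`), which returns exported views as `{"views":[{"name":...,"type":...,"content":...}]}`. The sketch below parses a hypothetical saved response rather than calling a live workspace; the fetch command in the comment is an illustration, not something the test executes.

```shell
#!/usr/bin/env bash
# In a real incident, fetch the export first (illustrative only):
#   curl -s -H "Authorization: Bearer $DATABRICKS_TOKEN" \
#     "$DATABRICKS_HOST/api/2.1/jobs/runs/export?run_id=$RUN_ID" > export.json
# Here we use a hypothetical response instead.
cat > export.json <<'EOF'
{"views":[{"name":"etl_notebook","type":"NOTEBOOK","content":"<html>...</html>"}]}
EOF

# Write each exported view to its own HTML file for offline review.
jq -r '.views[] | .name' export.json | while read -r name; do
  jq -r --arg n "$name" '.views[] | select(.name == $n) | .content' export.json > "${name}.html"
  echo "Saved ${name}.html"
done
```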

Cluster Failed to Start

```bash
# 1. Check cluster events
CLUSTER_ID="your-cluster-id"
databricks clusters events --cluster-id $CLUSTER_ID --limit 20

# 2. Common causes and fixes
#    - QUOTA_EXCEEDED: Terminate unused clusters
#    - CLOUD_PROVIDER_LAUNCH_ERROR: Check instance availability
#    - DRIVER_UNREACHABLE: Network/firewall issue

# 3. Quick fix - restart cluster
databricks clusters restart --cluster-id $CLUSTER_ID

# 4. Check cluster logs
databricks clusters get --cluster-id $CLUSTER_ID | jq '.termination_reason'
```
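The `termination_reason` object can be unpacked further for the incident log. This is a sketch against a hypothetical `clusters get` snapshot; the field layout follows the Clusters API, but the cluster ID and values are invented.

```shell
#!/usr/bin/env bash
# Hypothetical snapshot of `databricks clusters get` output.
cat > cluster.json <<'EOF'
{"cluster_id":"0101-abc","state":"TERMINATED",
 "termination_reason":{"code":"CLOUD_PROVIDER_LAUNCH_FAILURE","type":"CLOUD_FAILURE",
   "parameters":{"azure_error_message":"Quota exceeded"}}}
EOF

# Print the termination code plus any provider-supplied detail.
jq -r '"\(.termination_reason.code): \(.termination_reason.parameters // {} | to_entries | map(.value) | join("; "))"' cluster.json
# → CLOUD_PROVIDER_LAUNCH_FAILURE: Quota exceeded
```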

Permission/Auth Errors

```bash
# 1. Check current user
databricks current-user me

# 2. Check job permissions
databricks permissions get jobs --job-id $JOB_ID

# 3. Check table permissions (run in a notebook):
#    SHOW GRANTS ON TABLE catalog.schema.table

# 4. Fix: grant necessary permissions
databricks permissions update jobs --job-id $JOB_ID --json '{
  "access_control_list": [{
    "user_name": "user@company.com",
    "permission_level": "CAN_MANAGE_RUN"
  }]
}'
```

Data Quality Failures

```sql
-- Quick data quality check
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT id) AS unique_ids,
  SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS null_amounts,
  MIN(created_at) AS oldest_record,
  MAX(created_at) AS newest_record
FROM catalog.schema.table
WHERE created_at > current_timestamp() - INTERVAL 1 DAY;

-- Check for recent changes
DESCRIBE HISTORY catalog.schema.table LIMIT 10;

-- Restore to previous version if needed
RESTORE TABLE catalog.schema.table TO VERSION AS OF 5;
```

Communication Templates

Internal (Slack)

```
:red_circle: P1 INCIDENT: [Brief Description]

Status: INVESTIGATING
Impact: [Describe user/business impact]
Started: [Time]
Current Action: [What you're doing now]
Next Update: [Time]

Incident Commander: @[name]
Thread: [link]
```

External (Status Page)

Data Pipeline Delay

We are experiencing delays in data processing. Some reports may show stale data.

Impact: Dashboard data may be up to [X] hours delayed
Started: [Time] UTC
Current Status: Our team is actively investigating

We will provide updates every 30 minutes.

Last updated: [Timestamp]

Post-Incident

Evidence Collection

```bash
#!/bin/bash
# collect-incident-evidence.sh

INCIDENT_ID=$1
RUN_ID=$2
CLUSTER_ID=$3

mkdir -p "incident-$INCIDENT_ID"

# Job run details
databricks runs get --run-id $RUN_ID > "incident-$INCIDENT_ID/run_details.json"
databricks runs get-output --run-id $RUN_ID > "incident-$INCIDENT_ID/run_output.json"

# Cluster info
if [ -n "$CLUSTER_ID" ]; then
  databricks clusters get --cluster-id $CLUSTER_ID > "incident-$INCIDENT_ID/cluster_info.json"
  databricks clusters events --cluster-id $CLUSTER_ID --limit 50 > "incident-$INCIDENT_ID/cluster_events.json"
fi

# Create summary
cat << EOF > "incident-$INCIDENT_ID/summary.md"
# Incident $INCIDENT_ID

Date: $(date)
Run ID: $RUN_ID
Cluster ID: $CLUSTER_ID

## Evidence Collected

- run_details.json
- run_output.json
- cluster_info.json
- cluster_events.json
EOF

tar -czf "incident-$INCIDENT_ID.tar.gz" "incident-$INCIDENT_ID"
echo "Evidence collected: incident-$INCIDENT_ID.tar.gz"
```

Postmortem Template

Incident: [Title]

Date: YYYY-MM-DD
Duration: X hours Y minutes
Severity: P[1-4]
Incident Commander: [Name]

Summary

[1-2 sentence description of what happened]

Timeline (UTC)

| Time | Event |
|-------|-------|
| HH:MM | [First alert/detection] |
| HH:MM | [Investigation started] |
| HH:MM | [Root cause identified] |
| HH:MM | [Mitigation applied] |
| HH:MM | [Incident resolved] |

Root Cause

[Technical explanation of what went wrong]

Impact

  • Data Impact: [Tables affected, rows impacted]
  • Users Affected: [Number, types]
  • Duration: [How long data was unavailable/stale]
  • Financial Impact: [If applicable]

Detection

  • How detected: [Alert, user report, monitoring]
  • Time to detect: [Minutes from issue start]
  • Detection gap: [What could have caught this sooner]

Response

  • Time to respond: [Minutes from detection]
  • What worked: [Effective response actions]
  • What didn't: [Ineffective actions, dead ends]

Action Items

| Priority | Action | Owner | Due Date |
|----------|--------|-------|----------|
| P1 | [Preventive measure] | [Name] | [Date] |
| P2 | [Monitoring improvement] | [Name] | [Date] |
| P3 | [Documentation update] | [Name] | [Date] |

Lessons Learned

  1. [Key learning 1]
  2. [Key learning 2]
  3. [Key learning 3]

Instructions

Step 1: Quick Triage

Run the triage script to identify the issue source.

Step 2: Follow Decision Tree

Determine whether the issue is on the Databricks side or in your code or data.

Step 3: Execute Immediate Actions

Apply the appropriate remediation for the error type.

Step 4: Communicate Status

Update internal and external stakeholders.

Step 5: Collect Evidence

Document everything for postmortem.

Output

  • Issue identified and categorized

  • Remediation applied

  • Stakeholders notified

  • Evidence collected for postmortem

Error Handling

| Issue | Cause | Solution |
|-------|-------|----------|
| Can't access workspace | Token expired | Re-authenticate |
| CLI commands fail | Network issue | Check VPN |
| Logs unavailable | Cluster terminated | Check cluster events |
| Restore fails | Retention exceeded | Check vacuum settings |

Examples

One-Line Job Health Check

```bash
databricks runs list --job-id $JOB_ID --limit 5 --output json |
  jq -r '.runs[] | "\(.start_time): \(.state.result_state)"'
```

Quick Cluster Restart

```bash
databricks clusters restart --cluster-id $CLUSTER_ID &&
  echo "Cluster restart initiated"
```

Resources

  • Databricks Status Page

  • Databricks Support

  • Community Forum

Next Steps

For data handling, see databricks-data-handling.
