# Databricks Incident Runbook

## Overview

Rapid incident response procedures for Databricks-related outages.
## Prerequisites

- Access to the Databricks workspace
- CLI configured with appropriate permissions
- Access to monitoring dashboards
- Communication channels (Slack, PagerDuty)
## Severity Levels

| Level | Definition | Response Time | Examples |
|---|---|---|---|
| P1 | Production data pipeline down | < 15 min | Critical ETL failed, data not updating |
| P2 | Degraded performance | < 1 hour | Slow queries, partial failures |
| P3 | Non-critical issues | < 4 hours | Dev cluster issues, delayed non-critical jobs |
| P4 | No user impact | Next business day | Monitoring gaps, documentation |
## Quick Triage

```bash
#!/bin/bash
set -euo pipefail

# quick-triage.sh - Run this first during any incident

echo "=== Databricks Quick Triage ==="
echo "Time: $(date)"
echo ""

# 1. Check Databricks status
echo "--- Databricks Status ---"
curl -s https://status.databricks.com/api/v2/status.json | jq '.status.description'
echo ""

# 2. Check workspace connectivity
echo "--- Workspace Connectivity ---"
if databricks workspace list / --output json | jq -r '.[] | .path' | head -5; then
  echo "Workspace: CONNECTED"
else
  echo "Workspace: CONNECTION FAILED"
fi
echo ""

# 3. Check recent job failures
echo "--- Recent Job Failures (last 1 hour) ---"
databricks runs list --limit 20 --output json |
  jq -r '.runs[] | select(.state.result_state == "FAILED") | "\(.run_id): \(.run_name) - \(.state.state_message)"'
echo ""

# 4. Check cluster status
echo "--- Running Clusters ---"
databricks clusters list --output json |
  jq -r '.clusters[] | select(.state == "RUNNING" or .state == "ERROR") | "\(.cluster_id): \(.cluster_name) [\(.state)]"'
echo ""

# 5. Check for errors in the last hour
echo "--- Recent Errors ---"
# Query system tables via a SQL warehouse or notebook (see the sketch below)
```
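The triage script leaves step 5 to a SQL warehouse or notebook. A minimal sketch, assuming Unity Catalog system tables are enabled in the workspace; the audit table's column names can vary by release:

```sql
-- Sketch: recent failed API actions from the audit log
-- (assumes system.access.audit is enabled; columns may differ by release)
SELECT event_time, service_name, action_name, response.status_code
FROM system.access.audit
WHERE event_time > current_timestamp() - INTERVAL 1 HOUR
  AND response.status_code >= 400
ORDER BY event_time DESC
LIMIT 50;
```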
## Decision Tree

```
Job/Pipeline failing?
├─ YES: Is it a single job or multiple?
│   ├─ SINGLE JOB → Check job-specific issues
│   │   ├─ Cluster failed to start → Check cluster events
│   │   ├─ Code error → Check task output/logs
│   │   ├─ Data issue → Check source data
│   │   └─ Permission error → Check grants
│   │
│   └─ MULTIPLE JOBS → Likely infrastructure issue
│       ├─ Check Databricks status page
│       ├─ Check workspace quotas
│       └─ Check network connectivity
│
└─ NO: Is it a performance issue?
    ├─ Slow queries → Check query plan, cluster sizing
    ├─ Slow cluster startup → Check instance availability
    └─ Data freshness → Check upstream pipelines
```
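For the MULTIPLE JOBS branch, one quick way to gauge quota pressure is to count active and pending clusters per node type; a pile-up of PENDING clusters on one instance family usually points at cloud capacity or quota limits. A sketch, assuming the same CLI JSON shape used in the triage script:

```bash
# Sketch: count active/pending clusters per node type to spot quota pressure
databricks clusters list --output json |
  jq -r '.clusters[] | select(.state == "PENDING" or .state == "RUNNING") | .node_type_id' |
  sort | uniq -c | sort -rn
```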
## Immediate Actions by Error Type

### Job Failed - Code Error

```bash
# 1. Get run details
RUN_ID="your-run-id"
databricks runs get --run-id $RUN_ID

# 2. Get detailed error output
databricks runs get-output --run-id $RUN_ID | jq '.error'

# 3. Check task-level errors
databricks runs get --run-id $RUN_ID |
  jq '.tasks[] | select(.state.result_state == "FAILED") | {task: .task_key, error: .state.state_message}'

# 4. If it is a notebook task, get the notebook output
#    (view in the UI, or use the Jobs API to get cell outputs - see the sketch below)
```
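For step 4, the notebook output can also be pulled through the CLI by targeting the failed task's own run ID. A sketch, assuming a multi-task job and the Jobs API 2.1 response shape:

```bash
# Sketch: fetch notebook output for the first failed task (Jobs API 2.1 shape assumed)
TASK_RUN_ID=$(databricks runs get --run-id $RUN_ID |
  jq -r '[.tasks[] | select(.state.result_state == "FAILED")][0].run_id')
databricks runs get-output --run-id $TASK_RUN_ID | jq '.notebook_output'
```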
### Cluster Failed to Start

```bash
# 1. Check cluster events
CLUSTER_ID="your-cluster-id"
databricks clusters events --cluster-id $CLUSTER_ID --limit 20

# 2. Common causes and fixes:
#    - QUOTA_EXCEEDED: terminate unused clusters (see the sketch below)
#    - CLOUD_PROVIDER_LAUNCH_ERROR: check instance availability
#    - DRIVER_UNREACHABLE: network/firewall issue

# 3. Quick fix - restart the cluster
databricks clusters restart --cluster-id $CLUSTER_ID

# 4. Check the recorded termination reason
databricks clusters get --cluster-id $CLUSTER_ID | jq '.termination_reason'
```
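For QUOTA_EXCEEDED specifically, the usual remediation is to terminate clusters nobody is using. A sketch for finding candidates; treating clusters without auto-termination as likely offenders is a heuristic, not a rule:

```bash
# Sketch: list running clusters with auto-termination disabled (likely idle candidates)
databricks clusters list --output json |
  jq -r '.clusters[] | select(.state == "RUNNING" and .autotermination_minutes == 0) |
         "\(.cluster_id): \(.cluster_name)"'

# Terminate a cluster once its owner confirms it is unused
databricks clusters delete --cluster-id "candidate-cluster-id"
```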
### Permission/Auth Errors

```bash
# 1. Check the current user
databricks current-user me

# 2. Check job permissions
databricks permissions get jobs --job-id $JOB_ID

# 3. Check table permissions (run in a notebook or SQL warehouse):
#    SHOW GRANTS ON TABLE catalog.schema.table

# 4. Fix: grant the necessary job permissions
databricks permissions update jobs --job-id $JOB_ID --json '{
  "access_control_list": [{
    "user_name": "user@company.com",
    "permission_level": "CAN_MANAGE_RUN"
  }]
}'
```
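If the failure is a missing table grant rather than a job ACL, the fix belongs in Unity Catalog instead. A minimal sketch; the principal names are placeholders:

```sql
-- Sketch: grant read access on the affected table (Unity Catalog syntax)
GRANT SELECT ON TABLE catalog.schema.table TO `user@company.com`;

-- Prefer granting to a group so the fix survives team changes
GRANT SELECT ON TABLE catalog.schema.table TO `data-engineers`;
```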
### Data Quality Failures

```sql
-- Quick data quality check
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT id) AS unique_ids,
  SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS null_amounts,
  MIN(created_at) AS oldest_record,
  MAX(created_at) AS newest_record
FROM catalog.schema.table
WHERE created_at > current_timestamp() - INTERVAL 1 DAY;

-- Check for recent changes
DESCRIBE HISTORY catalog.schema.table LIMIT 10;

-- Restore to a previous version if needed
RESTORE TABLE catalog.schema.table TO VERSION AS OF 5;
```
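Before running RESTORE, it is worth previewing the candidate version to confirm it actually contains good data. A sketch using Delta time travel; version 5 is just the example from above:

```sql
-- Sketch: compare the live table against the candidate restore version
SELECT 'current' AS snapshot, COUNT(*) AS total_rows FROM catalog.schema.table
UNION ALL
SELECT 'version 5' AS snapshot, COUNT(*) AS total_rows FROM catalog.schema.table VERSION AS OF 5;
```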
## Communication Templates

### Internal (Slack)

```
:red_circle: P1 INCIDENT: [Brief Description]

Status: INVESTIGATING
Impact: [Describe user/business impact]
Started: [Time]
Current Action: [What you're doing now]
Next Update: [Time]

Incident Commander: @[name]
Thread: [link]
```
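If you want the notice posted automatically when an incident is opened, an incoming webhook call is enough. A sketch; `SLACK_WEBHOOK_URL` is an assumption about your Slack setup, not something this runbook provisions:

```bash
# Sketch: post the incident notice to Slack via an incoming webhook (URL assumed to exist)
curl -s -X POST "$SLACK_WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d '{"text": ":red_circle: P1 INCIDENT: [Brief Description]\nStatus: INVESTIGATING\nNext Update: [Time]"}'
```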
### External (Status Page)

```
Data Pipeline Delay

We are experiencing delays in data processing. Some reports may show stale data.

Impact: Dashboard data may be up to [X] hours delayed
Started: [Time] UTC
Current Status: Our team is actively investigating

We will provide updates every 30 minutes.

Last updated: [Timestamp]
```
## Post-Incident

### Evidence Collection

```bash
#!/bin/bash
# collect-incident-evidence.sh

INCIDENT_ID=$1
RUN_ID=$2
CLUSTER_ID=$3

mkdir -p "incident-$INCIDENT_ID"

# Job run details
databricks runs get --run-id $RUN_ID > "incident-$INCIDENT_ID/run_details.json"
databricks runs get-output --run-id $RUN_ID > "incident-$INCIDENT_ID/run_output.json"

# Cluster info
if [ -n "$CLUSTER_ID" ]; then
  databricks clusters get --cluster-id $CLUSTER_ID > "incident-$INCIDENT_ID/cluster_info.json"
  databricks clusters events --cluster-id $CLUSTER_ID --limit 50 > "incident-$INCIDENT_ID/cluster_events.json"
fi

# Create a summary
cat << EOF > "incident-$INCIDENT_ID/summary.md"
# Incident $INCIDENT_ID

Date: $(date)
Run ID: $RUN_ID
Cluster ID: $CLUSTER_ID

## Evidence Collected

- run_details.json
- run_output.json
- cluster_info.json
- cluster_events.json
EOF

tar -czf "incident-$INCIDENT_ID.tar.gz" "incident-$INCIDENT_ID"
echo "Evidence collected: incident-$INCIDENT_ID.tar.gz"
```
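Typical invocation; the incident ID, run ID, and cluster ID below are made-up examples:

```bash
./collect-incident-evidence.sh INC-2024-0042 123456789 0101-123456-abcd1234
```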
### Postmortem Template

#### Incident: [Title]

- Date: YYYY-MM-DD
- Duration: X hours Y minutes
- Severity: P[1-4]
- Incident Commander: [Name]

#### Summary

[1-2 sentence description of what happened]

#### Timeline (UTC)
| Time | Event |
|---|---|
| HH:MM | [First alert/detection] |
| HH:MM | [Investigation started] |
| HH:MM | [Root cause identified] |
| HH:MM | [Mitigation applied] |
| HH:MM | [Incident resolved] |
#### Root Cause
[Technical explanation of what went wrong]
#### Impact
- Data Impact: [Tables affected, rows impacted]
- Users Affected: [Number, types]
- Duration: [How long data was unavailable/stale]
- Financial Impact: [If applicable]
#### Detection
- How detected: [Alert, user report, monitoring]
- Time to detect: [Minutes from issue start]
- Detection gap: [What could have caught this sooner]
#### Response
- Time to respond: [Minutes from detection]
- What worked: [Effective response actions]
- What didn't: [Ineffective actions, dead ends]
#### Action Items
| Priority | Action | Owner | Due Date |
|---|---|---|---|
| P1 | [Preventive measure] | [Name] | [Date] |
| P2 | [Monitoring improvement] | [Name] | [Date] |
| P3 | [Documentation update] | [Name] | [Date] |
#### Lessons Learned
- [Key learning 1]
- [Key learning 2]
- [Key learning 3]
## Instructions

### Step 1: Quick Triage

Run the triage script to identify the source of the issue.

### Step 2: Follow the Decision Tree

Determine whether the issue is on the Databricks side or a code/data issue.

### Step 3: Execute Immediate Actions

Apply the appropriate remediation for the error type.

### Step 4: Communicate Status

Update internal and external stakeholders.

### Step 5: Collect Evidence

Document everything for the postmortem.
## Output

- Issue identified and categorized
- Remediation applied
- Stakeholders notified
- Evidence collected for postmortem
## Error Handling

| Issue | Cause | Solution |
|---|---|---|
| Can't access workspace | Token expired | Re-authenticate |
| CLI commands fail | Network issue | Check VPN |
| Logs unavailable | Cluster terminated | Check cluster events |
| Restore fails | Retention exceeded | Check vacuum settings (see sketch below) |
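On the last row: RESTORE can only reach versions whose underlying data files have not been removed by VACUUM. A sketch for checking the relevant table properties; the property names and defaults follow standard Delta Lake behaviour:

```sql
-- Sketch: inspect retention-related Delta properties on the affected table
SHOW TBLPROPERTIES catalog.schema.table;
-- Look for delta.deletedFileRetentionDuration (default 7 days)
-- and delta.logRetentionDuration (default 30 days)
```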
## Examples

### One-Line Job Health Check

```bash
databricks runs list --job-id $JOB_ID --limit 5 --output json |
  jq -r '.runs[] | "\(.start_time): \(.state.result_state)"'
```

### Quick Cluster Restart

```bash
databricks clusters restart --cluster-id $CLUSTER_ID && echo "Cluster restart initiated"
```
## Resources

- Databricks Status Page: https://status.databricks.com
- Databricks Support
- Community Forum
## Next Steps

For data handling, see databricks-data-handling.