# Databricks Asset Bundles Skill

## Overview
Databricks Asset Bundles (DABs) are a deployment framework that packages notebooks, DLT pipelines, jobs, and configuration into versioned, environment-aware bundles, bringing Infrastructure as Code to Databricks.
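For a first look at the workflow, the Databricks CLI can scaffold a working starter bundle. A minimal sketch, assuming a recent CLI version, the built-in `default-python` template, and `my_project` as the name entered at the prompt:

```bash
# Generate a starter bundle (prompts for a project name, e.g. my_project)
databricks bundle init default-python

# Review the generated databricks.yml, then validate against the dev target
cd my_project
databricks bundle validate -t dev
```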
**Key Benefits:**

- Infrastructure as Code
- Multi-environment support (dev, staging, prod)
- Version control for all artifacts
- Automated deployment
- Environment-specific configurations
- Integration with CI/CD
## When to Use This Skill

Use Databricks Asset Bundles when you need to:

- Deploy pipelines across multiple environments
- Implement Infrastructure as Code
- Automate deployment workflows
- Manage environment-specific configurations
- Version control Databricks artifacts
- Enable collaborative development
- Standardize deployment processes
## Core Concepts

### 1. Bundle Structure

**Standard Bundle Layout:**

```
my-bundle/
├── databricks.yml              # Main configuration
├── environments/
│   ├── dev.yml                 # Development overrides
│   ├── staging.yml             # Staging overrides
│   └── prod.yml                # Production overrides
├── src/
│   ├── notebooks/
│   │   ├── bronze_ingestion.py
│   │   └── silver_transformation.py
│   └── pipelines/
│       └── dlt_pipeline.py
├── resources/
│   ├── jobs.yml
│   ├── pipelines.yml
│   └── clusters.yml
└── tests/
    └── test_transformations.py
```
### 2. Main Configuration

**databricks.yml:**

```yaml
bundle:
  name: data-platform-bundle

  # Optional git configuration
  git:
    branch: main
    origin_url: https://github.com/org/repo.git

workspace:
  host: https://your-workspace.databricks.com
  root_path: /Workspace/bundles/${bundle.name}

# Define variables
variables:
  catalog_name:
    description: "Unity Catalog name"
    default: "dev_catalog"

  storage_path:
    description: "Base storage path"
    default: "/mnt/dev/data"

  cluster_size:
    description: "Cluster size"
    default: "small"

# Include other configuration files
include:
  - resources/*.yml

# Define resources
resources:
  jobs:
    daily_pipeline:
      name: "[${bundle.environment}] Daily Pipeline"

      tasks:
        - task_key: bronze_ingestion
          notebook_task:
            notebook_path: ./src/notebooks/bronze_ingestion
            source: WORKSPACE
            base_parameters:
              catalog: ${var.catalog_name}
              storage: ${var.storage_path}
          new_cluster:
            num_workers: 2
            spark_version: 13.3.x-scala2.12
            node_type_id: i3.xlarge
            spark_conf:
              spark.databricks.delta.preview.enabled: "true"

        - task_key: silver_transformation
          depends_on:
            - task_key: bronze_ingestion
          notebook_task:
            notebook_path: ./src/notebooks/silver_transformation
            source: WORKSPACE
          job_cluster_key: shared_cluster

      job_clusters:
        - job_cluster_key: shared_cluster
          new_cluster:
            num_workers: "${var.cluster_size == 'small' ? 2 : 8}"
            spark_version: 13.3.x-scala2.12
            node_type_id: i3.xlarge

      schedule:
        quartz_cron_expression: "0 0 1 * * ?"  # Daily at 1 AM
        timezone_id: "America/New_York"

      email_notifications:
        on_failure:
          - data-team@company.com

  pipelines:
    bronze_to_gold:
      name: "[${bundle.environment}] Bronze to Gold Pipeline"
      target: ${var.catalog_name}
      storage: ${var.storage_path}/dlt

      libraries:
        - notebook:
            path: ./src/pipelines/dlt_pipeline.py

      clusters:
        - label: default
          num_workers: 4
          node_type_id: i3.xlarge

      configuration:
        source_path: ${var.storage_path}/landing
        checkpoint_path: ${var.storage_path}/checkpoints

      development: false
      continuous: false

targets:
  dev:
    mode: development
    workspace:
      host: https://dev-workspace.databricks.com
      root_path: /Workspace/dev/${bundle.name}
    variables:
      catalog_name: dev_catalog
      storage_path: /mnt/dev/data
      cluster_size: small

  staging:
    mode: production
    workspace:
      host: https://staging-workspace.databricks.com
      root_path: /Workspace/staging/${bundle.name}
    variables:
      catalog_name: staging_catalog
      storage_path: /mnt/staging/data
      cluster_size: medium

  prod:
    mode: production
    workspace:
      host: https://prod-workspace.databricks.com
      root_path: /Workspace/prod/${bundle.name}
    variables:
      catalog_name: prod_catalog
      storage_path: /mnt/prod/data
      cluster_size: large
```
### 3. Environment-Specific Configuration

**environments/prod.yml:**

```yaml
# Production-specific overrides
variables:
  catalog_name: prod_catalog
  storage_path: /mnt/prod/data
  cluster_size: large

resources:
  jobs:
    daily_pipeline:
      # Production-specific settings
      max_concurrent_runs: 1
      timeout_seconds: 7200

      job_clusters:
        - job_cluster_key: shared_cluster
          new_cluster:
            num_workers: 8
            node_type_id: i3.2xlarge
            autoscale:
              min_workers: 4
              max_workers: 16

      email_notifications:
        on_start:
          - data-team@company.com
        on_success:
          - data-team@company.com
        on_failure:
          - data-team@company.com
          - oncall@company.com

  pipelines:
    bronze_to_gold:
      development: false
      continuous: true  # Continuous processing in prod

      clusters:
        - label: default
          num_workers: 8
          node_type_id: i3.2xlarge
          autoscale:
            min_workers: 4
            max_workers: 16

      notifications:
        - email_recipients:
            - data-team@company.com
          on_failure: true
          on_success: false
```
### 4. Deployment Workflow

**CLI Commands:**

```bash
# Install the Databricks CLI
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

# Authenticate
databricks auth login --host https://your-workspace.databricks.com

# Validate the bundle
databricks bundle validate -t dev

# Deploy to development
databricks bundle deploy -t dev

# Run a job
databricks bundle run -t dev daily_pipeline

# Deploy to production
databricks bundle deploy -t prod

# Destroy the bundle (cleanup)
databricks bundle destroy -t dev
```
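When each target lives in a different workspace, as in the configuration above, separate CLI authentication profiles keep credentials apart. A sketch, assuming profile names `dev` and `prod` and the unified CLI's `--profile`/`-p` flag:

```bash
# Store one login per workspace under its own profile
databricks auth login --host https://dev-workspace.databricks.com --profile dev
databricks auth login --host https://prod-workspace.databricks.com --profile prod

# Select the matching profile when working against each target
databricks bundle deploy -t dev -p dev
databricks bundle deploy -t prod -p prod
```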
## Implementation Patterns

### Pattern 1: Multi-Environment Pipeline

**Complete Bundle with Environment Variations:**

```yaml
# databricks.yml
bundle:
  name: customer-analytics

variables:
  environment:
    description: "Deployment environment"
  catalog:
    description: "Unity Catalog"
  min_workers:
    description: "Minimum cluster workers"
    default: 2
  max_workers:
    description: "Maximum cluster workers"
    default: 8

resources:
  jobs:
    customer_pipeline:
      name: "[${var.environment}] Customer Analytics Pipeline"

      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ./notebooks/ingest_customers
          new_cluster:
            num_workers: ${var.min_workers}
            spark_version: 13.3.x-scala2.12
            node_type_id: i3.xlarge

        - task_key: transform
          depends_on:
            - task_key: ingest
          notebook_task:
            notebook_path: ./notebooks/transform_customers
          new_cluster:
            autoscale:
              min_workers: ${var.min_workers}
              max_workers: ${var.max_workers}
            spark_version: 13.3.x-scala2.12
            node_type_id: i3.xlarge

        - task_key: aggregate
          depends_on:
            - task_key: transform
          notebook_task:
            notebook_path: ./notebooks/aggregate_metrics
          new_cluster:
            num_workers: ${var.min_workers}
            spark_version: 13.3.x-scala2.12
            node_type_id: i3.xlarge

targets:
  dev:
    variables:
      environment: dev
      catalog: dev_catalog
      min_workers: 2
      max_workers: 4

  prod:
    variables:
      environment: prod
      catalog: prod_catalog
      min_workers: 4
      max_workers: 16
```
### Pattern 2: Modular Configuration

**Split Configuration Across Files:**

```yaml
# databricks.yml
bundle:
  name: data-platform

include:
  - resources/jobs/*.yml
  - resources/pipelines/*.yml
  - resources/clusters/*.yml
```

```yaml
# resources/jobs/ingestion_jobs.yml
resources:
  jobs:
    ingest_customers:
      name: "[${bundle.environment}] Ingest Customers"
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ./notebooks/ingest_customers

    ingest_orders:
      name: "[${bundle.environment}] Ingest Orders"
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ./notebooks/ingest_orders
```

```yaml
# resources/pipelines/dlt_pipelines.yml
resources:
  pipelines:
    customer_pipeline:
      name: "[${bundle.environment}] Customer DLT Pipeline"
      target: ${var.catalog}.customer
      libraries:
        - notebook:
            path: ./pipelines/customer_dlt

    order_pipeline:
      name: "[${bundle.environment}] Order DLT Pipeline"
      target: ${var.catalog}.orders
      libraries:
        - notebook:
            path: ./pipelines/order_dlt
```
### Pattern 3: Python Deployment Script

**Automated Deployment:**

```python
"""Automated bundle deployment script."""
import subprocess
import sys


class BundleDeployer:
    """Deploy Databricks Asset Bundles."""

    def __init__(self, bundle_path: str):
        self.bundle_path = bundle_path

    def validate(self, target: str) -> bool:
        """Validate the bundle configuration."""
        print(f"Validating bundle for target: {target}")
        result = subprocess.run(
            ["databricks", "bundle", "validate", "-t", target],
            cwd=self.bundle_path,
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            print(f"Validation failed: {result.stderr}")
            return False
        print("Validation successful")
        return True

    def deploy(self, target: str, force: bool = False) -> bool:
        """Deploy the bundle to the target environment."""
        if not self.validate(target):
            return False

        print(f"Deploying bundle to {target}")
        cmd = ["databricks", "bundle", "deploy", "-t", target]
        if force:
            cmd.append("--force")

        result = subprocess.run(
            cmd,
            cwd=self.bundle_path,
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            print(f"Deployment failed: {result.stderr}")
            return False
        print(f"Deployment successful: {result.stdout}")
        return True

    def run_job(self, target: str, job_key: str) -> bool:
        """Run a specific job from the bundle."""
        print(f"Running job: {job_key} on {target}")
        result = subprocess.run(
            ["databricks", "bundle", "run", "-t", target, job_key],
            cwd=self.bundle_path,
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            print(f"Job run failed: {result.stderr}")
            return False
        print(f"Job started: {result.stdout}")
        return True

    def destroy(self, target: str, auto_approve: bool = False) -> bool:
        """Destroy bundle resources."""
        print(f"WARNING: Destroying bundle resources in {target}")
        cmd = ["databricks", "bundle", "destroy", "-t", target]
        if auto_approve:
            cmd.append("--auto-approve")

        result = subprocess.run(
            cmd,
            cwd=self.bundle_path,
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            print(f"Destroy failed: {result.stderr}")
            return False
        print("Bundle resources destroyed")
        return True


# Usage
if __name__ == "__main__":
    deployer = BundleDeployer("./my-bundle")

    # Deploy to development
    if deployer.deploy("dev"):
        deployer.run_job("dev", "daily_pipeline")

    # Deploy to production (requires approval)
    if len(sys.argv) > 1 and sys.argv[1] == "--prod":
        deployer.deploy("prod")
```
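Saved as `deploy.py` (a hypothetical filename), the script can be driven from a shell or a CI step:

```bash
# Validate, deploy to dev, and trigger the daily pipeline
python deploy.py

# Additionally deploy to production
python deploy.py --prod
```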
### Pattern 4: GitOps Integration

**GitHub Actions Workflow:**

```yaml
# .github/workflows/bundle-deploy.yml
name: Deploy Databricks Bundle

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
  workflow_dispatch:
    inputs:
      environment:
        description: 'Target environment'
        required: true
        type: choice
        options:
          - dev
          - staging
          - prod

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Install Databricks CLI
        run: |
          curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

      - name: Validate Bundle
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
        run: |
          cd bundle/
          databricks bundle validate -t dev

  deploy-dev:
    needs: validate
    if: github.ref == 'refs/heads/develop'
    runs-on: ubuntu-latest
    environment: development
    steps:
      - uses: actions/checkout@v3

      - name: Install Databricks CLI
        run: |
          curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

      - name: Deploy to Development
        env:
          DATABRICKS_HOST: ${{ secrets.DEV_DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DEV_DATABRICKS_TOKEN }}
        run: |
          cd bundle/
          databricks bundle deploy -t dev

  deploy-prod:
    needs: validate
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v3

      - name: Install Databricks CLI
        run: |
          curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

      - name: Deploy to Production
        env:
          DATABRICKS_HOST: ${{ secrets.PROD_DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.PROD_DATABRICKS_TOKEN }}
        run: |
          cd bundle/
          databricks bundle deploy -t prod
```
## Best Practices

### 1. Bundle Organization

- Keep bundle files under version control
- Use environment-specific overrides
- Separate resources into logical files
- Document variable purposes
- Include a README for bundle usage
### 2. Environment Management

```yaml
# Use consistent naming
targets:
  dev:
    mode: development   # Enables faster iteration
  staging:
    mode: production    # Production-like behavior
  prod:
    mode: production    # Full production settings
```
### 3. Variable Usage

```yaml
# Define reusable variables
variables:
  project_name:
    description: "Project identifier"
    default: "customer-analytics"

# Use variables consistently
resources:
  jobs:
    ${var.project_name}_job:
      name: "[${bundle.environment}] ${var.project_name}"
```
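Variables can also be overridden at deploy time rather than only per target; a sketch, assuming a CLI version that supports the `--var` flag on bundle commands:

```bash
# Override a variable for a one-off deployment
databricks bundle deploy -t dev --var="project_name=customer-analytics-sandbox"
```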
### 4. Testing Strategy

```bash
# Test the bundle locally
databricks bundle validate -t dev

# Deploy to dev for testing
databricks bundle deploy -t dev

# Run integration tests
databricks bundle run -t dev test_job

# Deploy to prod after validation
databricks bundle deploy -t prod
```
## Common Pitfalls to Avoid

**Don't:**

- Hard-code environment-specific values
- Skip validation before deployment
- Modify resources outside of bundles
- Use development mode in production
- Deploy without testing

**Do:**

- Use variables for environment differences
- Always validate before deploying
- Manage all resources through bundles
- Use production mode for prod
- Test in lower environments first
## Complete Examples

See the `/examples/` directory for:

- `complete_bundle_project/`: Full bundle structure
- `multi_workspace_deployment/`: Cross-workspace deployment
## Related Skills

- `delta-live-tables`: Deploy DLT pipelines
- `cicd-workflows`: Automate deployments
- `testing-patterns`: Test before deploying
- `data-products`: Deploy data products
## References

- Databricks Asset Bundles Docs
- Bundle Configuration Reference
- CLI Reference