Databricks Expert Engineer Skill

This skill provides a comprehensive guide for Databricks development.

Databricks CLI Usage

1.1. About warehouse_id

Find and select one Serverless SQL Warehouse for warehouse_id
Note: databricks CLI does not auto-read warehouse_id from config files, so explicitly include it in JSON each time

1.2. Authentication

When auth_type=databricks-cli in profile, run U2M authentication first

databricks auth login --host https://xxx.cloud.databricks.com --profile PROFILE_NAME

Check authentication status

databricks auth profiles

1.3. Basic Usage

Statements API (Query Execution)

Execute query

databricks api post /api/2.0/sql/statements --profile "DEFAULT" --json '{ "warehouse_id": "xxxxxxxxxx", "catalog": "catalog_name", "schema": "schema_name", "statement": "select * from table_name limit 10" }'

Get results (statement_id is returned from execution)

databricks api get /api/2.0/sql/statements/{statement_id} --profile "DEFAULT"

Queries API (Query Object Management)

Create query object (for dashboards, saved queries)

IMPORTANT: Use warehouse_id, query_text, display_name (NOT data_source_id, query, name)

databricks api post /api/2.0/sql/queries --profile "DEFAULT" --json '{ "warehouse_id": "xxxxxxxxxx", "display_name": "My Query", "query_text": "SELECT * FROM table_name LIMIT 10", "description": "Optional description" }'

List queries

databricks api get /api/2.0/sql/queries --profile "DEFAULT"

Get query by ID

databricks api get /api/2.0/sql/queries/{query_id} --profile "DEFAULT"

Common Mistakes:

❌ data_source_id → ✅ warehouse_id
❌ query → ✅ query_text
❌ name → ✅ display_name

1.4. Command Tips

Query execution flow

post executes query -> returns statement_id
get retrieves results (wait until state is SUCCEEDED )
For long queries, add sleep and retry

Error handling

state: CLOSED : Result retrieval was too slow. Get earlier
state: FAILED : SQL error. Check error_message
state: RUNNING : Still executing. Wait and retry get
Timeout: For large data, use limit to verify

Reading results

data_array : Actual data (2D array)
schema.columns : Column names and type info
total_row_count : Total count (shown even with limit)
state : Query execution state

Parameterized queries

databricks api post /api/2.0/sql/statements --profile "DEFAULT" --json '{ "warehouse_id": "xxxxxxxxxx", "statement": "select * from table where date >= :start_date", "parameters": [{"name": "start_date", "value": "2025-01-01", "type": "DATE"}] }'

Well-Architected Lakehouse Framework

Consists of 7 pillars:

2.1. Data and AI Governance

Policies and practices to securely manage data and AI assets. Minimize data copies with unified governance solution.

2.2. Interoperability and Usability

Consistent user experience and seamless integration with external systems.

2.3. Operational Excellence

Processes supporting continuous production operations.

2.4. Security, Privacy, and Compliance

Implement safeguards against threats.

2.5. Reliability

Ensure disaster recovery capabilities.

2.6. Performance Efficiency

Adaptability to workload changes.

2.7. Cost Optimization

Cost management to maximize value delivery.

Unity Catalog

3.1. Basic Concepts

"Define once, secure everywhere" approach
Unified access control policies across multiple workspaces
ANSI SQL compliant permission management

3.2. Object Model

3-level namespace: catalog.schema.table

Catalog layer: Data isolation unit (by department, etc.)
Schema layer: Logical group containing tables, views, volumes
Object layer: Tables, views, volumes, functions, models

3.3. Permission Management

Users cannot access data by default
Explicit permission grants required
Permissions inherit from parent to child (catalog -> schema -> table)

-- Check permissions SHOW GRANTS ON SCHEMA main.default;

-- Grant permissions GRANT CREATE TABLE ON SCHEMA main.default TO finance-team;

-- Revoke permissions REVOKE CREATE TABLE ON SCHEMA main.default FROM finance-team;

3.4. Best Practices

Managed tables/volumes recommended (Delta Lake format, full lifecycle management)
Catalog isolation across workspaces possible
Independent managed storage location per catalog recommended

Data Engineering

4.1. Lakeflow Solution

Unifies data ingestion, transformation, and orchestration.

Lakeflow Connect: Simplifies data ingestion
Lakeflow Spark Declarative Pipelines (SDP): Declarative pipeline framework
Lakeflow Jobs: Workflow automation

4.2. Delta Lake

Parquet data files with file-based transaction log
ACID transactions
Time travel functionality
Optimizations: liquid clustering, data skipping, file layout optimization, vacuum

4.3. Lakeflow Jobs

Task types:

Notebook tasks
Pipeline tasks
Python script tasks

Triggers:

Time-based (e.g., daily at 2 AM)
Event-based (on new data arrival)

Limits:

Workspace: Max 2000 concurrent task executions
Saved jobs: Max 12000
Tasks per job: Max 1000

Machine Learning Infrastructure

5.1. MLflow

Core tool for experiment tracking and model management
Dedicated features for GenAI

5.2. Feature Store

Feature management system
Automatic data pipelines and feature discovery

5.3. Model Serving

Deploy custom models and LLMs as REST endpoints
Auto-scaling and GPU support

Security

6.1. Authentication and Access Control

SSO configuration
Multi-factor authentication
Access control lists

6.2. Network Security

Private connectivity
Serverless egress control
Firewall settings
VPC management

6.3. Data Encryption

Encryption at rest and in transit
Customer-managed keys
Inter-cluster communication encryption
Automatic credential masking

SQL Warehouse

7.1. Serverless SQL Warehouse Benefits

Instant and elastic compute
Auto-scaling
Minimal management (Databricks handles capacity)
Low total cost of ownership

Schema Discovery and Validation

8.1. Pre-Query Validation Rule

YOU MUST: Run DESCRIBE before executing SELECT on unfamiliar tables
YOU MUST: Verify exact column names and case before writing queries

-- Check table columns first DESCRIBE TABLE catalog.schema.table_name;

-- Then write your query using verified column names SELECT column_name FROM catalog.schema.table_name;

8.2. Schema Discovery Commands

-- Basic column info DESCRIBE TABLE catalog.schema.table_name;

-- Extended info (types, nullability, comments) DESCRIBE EXTENDED catalog.schema.table_name;

-- List tables in schema SHOW TABLES IN catalog.schema;

-- Table properties and metadata DESCRIBE DETAIL catalog.schema.table_name;

8.3. Common Gotchas

Issue Cause Prevention

Column name case Databricks preserves case Use DESCRIBE before query

Data type mismatch Implicit conversion fails Check column types explicitly

NULL handling Unexpected NULL in aggregation Use COALESCE or filter NULLs

Timestamp precision TIMESTAMP vs TIMESTAMP_NTZ Verify type before comparison

8.4. Knowledge Accumulation

When encountering schema-related issues, update this skill with:

Universal patterns (case sensitivity, type coercion rules)
Common column naming conventions in Unity Catalog
Databricks-specific SQL behaviors

NOTE: Do not include project-specific table names or business logic. Keep entries generalizable across environments.

VARIANT Type and JSON Operations

9.1. VARIANT Type (Runtime 15.3+)

Benefits:

10-30x faster than JSON strings
Schema evolution without manual updates
No predefined schema required

Basic Usage:

-- Create table with VARIANT CREATE TABLE events ( id BIGINT, data VARIANT );

-- Insert JSON data INSERT INTO events VALUES (1, parse_json('{"name":"太郎","age":25}')), (2, parse_json('{"name":"花子","age":30,"new_field":"value"}'));

-- Query with colon notation SELECT data:name::STRING AS name, data:age::INT AS age, data:new_field::STRING AS new_field -- Auto-recognized FROM events;

9.2. JSON Access Patterns

Colon Notation (Recommended):

-- Object fields json_data:name json_data:metadata.status

-- Array elements json_data:tags[0] json_data:tags[1]

-- Wildcards (all elements) json_data:tags[*] -- Returns array

-- Nested arrays json_data:basket[][0] -- First element of each sub-array json_data:basket[0][] -- All elements of first array

get_json_object() Function:

-- Basic usage get_json_object(json_data, '$.name') get_json_object(json_data, '$.tags[0]') get_json_object(json_data, '$.metadata.status')

-- Limitation: Path must be STRING literal (no variables)

json_object_keys() Function:

-- Get all keys as array SELECT json_object_keys(json_data) FROM table_name; -- Result: ["name", "age", "tags", "metadata"]

-- Access by index (order not guaranteed) SELECT json_object_keys(json_data)[0] AS first_key, get_json_object(json_data, '$.' || json_object_keys(json_data)[0]) AS first_value FROM table_name;

Important Notes:

Object field order is NOT guaranteed in JSON
Array order IS guaranteed
Colon notation supports type casting: json_data:age::INT
Wildcards [*] only work with colon notation (not get_json_object)

Dashboard API (Lakeview)

10.1. Important Changes (2026)

Legacy Dashboard API: Deprecated (access disabled 2026-01-12)
Migration deadline: 2026-03-02
New API: Lakeview API (/api/2.0/lakeview/dashboards )

10.2. Dashboard Visualization Limitations

AI/BI Dashboard (Lakeview):

❌ No custom HTML/JavaScript
❌ No client-side JSON parsing
✅ 20+ predefined visualization types
✅ Query parameters for interactivity

Recommendation: Parse JSON in SQL (server-side) before visualization

dbt Integration Patterns

11.1. Auto-Schema Evolution with Jinja

Macro for Dynamic JSON Expansion:

-- macros/get_json_keys.sql {% macro get_json_keys(table_ref, json_column) %} {% set query %} SELECT DISTINCT key FROM {{ table_ref }}, LATERAL variant_explode({{ json_column }}) ORDER BY key {% endset %}

{% if execute %} {% set results = run_query(query) %} {% set keys = results.columns[0].values() %} {{ return(keys) }} {% else %} {{ return([]) }} {% endif %} {% endmacro %}

dbt Model:

-- models/staging/stg_events.sql {% set json_keys = get_json_keys(source('bronze', 'events'), 'json_data') %}

SELECT id, {% for key in json_keys %} json_data:{{ key }}::STRING AS {{ key | lower }} {%- if not loop.last %},{% endif %} {% endfor %} FROM {{ source('bronze', 'events') }}

Benefits:

New JSON fields auto-detected on dbt run
No manual model updates required
Works with VARIANT or JSON string columns

11.2. Recommended Architecture

Source → Bronze (VARIANT) → Silver (dbt expand) → Gold (business logic)

Bronze: VARIANT型でRaw JSON保存
Silver: dbt Jinjaマクロで必要なフィールドを展開
Gold: ビジネスロジック、アグリゲーション

Reference Links

Official docs: https://docs.databricks.com/
Unity Catalog: https://docs.databricks.com/en/data-governance/unity-catalog/
Lakeflow Jobs: https://docs.databricks.com/en/jobs/
MLflow: https://docs.databricks.com/en/mlflow/
Delta Lake: https://docs.databricks.com/en/delta/
Security: https://docs.databricks.com/en/security/
VARIANT Type: https://docs.databricks.com/aws/en/semi-structured/variant.html
JSON Operations: https://docs.databricks.com/aws/en/semi-structured/json.html
Colon Notation: https://docs.databricks.com/aws/en/sql/language-manual/functions/colonsign.html
Queries API: https://docs.databricks.com/api/workspace/queries
Lakeview Dashboards: https://docs.databricks.com/aws/en/dashboards/
dbt Documentation: https://docs.getdbt.com/

databricks

Safety Notice

Copy this and send it to your AI assistant to learn

Execute query

Get results (statement_id is returned from execution)

Create query object (for dashboards, saved queries)

IMPORTANT: Use warehouse_id, query_text, display_name (NOT data_source_id, query, name)

List queries

Get query by ID

Source Transparency

Related Skills

daily-report

bigquery

atlassian