databricks-spark-structured-streaming

Spark Structured Streaming

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "databricks-spark-structured-streaming" with this command: npx skills add databricks-solutions/ai-dev-kit/databricks-solutions-ai-dev-kit-databricks-spark-structured-streaming

Build production-ready streaming pipelines with Spark Structured Streaming. This skill is an index: each entry below points to a reference file with detailed patterns and best practices.

Quick Start

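from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Placeholder schema for the JSON payload; replace with your topic's actual fields.
schema = StructType([StructField("id", StringType())])

# Basic Kafka to Delta streaming
# `spark` is the active SparkSession (predefined in Databricks notebooks).
df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "topic")
    .load()
    # Kafka delivers `value` as bytes: cast to string, then parse the JSON.
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

query = (
    df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/Volumes/catalog/checkpoints/stream")
    .trigger(processingTime="30 seconds")
    .start("/delta/target_table")
)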
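Once the stream is started, the returned StreamingQuery reports per-batch metrics through `lastProgress`. The following is a minimal sketch of pulling out the values the checklist below asks you to monitor (input rate, batch duration). It assumes `lastProgress` is dict-like; `summarize_progress` and the sample values are hypothetical and only illustrate the shape of the data.

```python
from typing import Any, Mapping, Optional


def summarize_progress(progress: Optional[Mapping[str, Any]]) -> dict:
    """Extract a few health metrics from a StreamingQuery.lastProgress mapping."""
    if not progress:
        # lastProgress is None until the first micro-batch has completed.
        return {}
    duration = progress.get("durationMs", {})
    return {
        "batch_id": progress.get("batchId"),
        "input_rows": progress.get("numInputRows"),
        "input_rate": progress.get("inputRowsPerSecond"),
        "processed_rate": progress.get("processedRowsPerSecond"),
        # triggerExecution is the total time spent on the micro-batch.
        "batch_duration_ms": duration.get("triggerExecution"),
    }


# Hand-written sample shaped like lastProgress; not real output.
sample_progress = {
    "batchId": 7,
    "numInputRows": 500,
    "inputRowsPerSecond": 16.6,
    "processedRowsPerSecond": 416.7,
    "durationMs": {"triggerExecution": 1200},
}
summary = summarize_progress(sample_progress)
```

In a notebook this would typically be called as `summarize_progress(query.lastProgress)` on a schedule, or the same fields could be read inside a StreamingQueryListener.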

Core Patterns

| Pattern | Description | Reference |
|---|---|---|
| Kafka Streaming | Kafka to Delta, Kafka to Kafka, Real-Time Mode | See kafka-streaming.md |
| Stream Joins | Stream-stream joins, stream-static joins | See stream-stream-joins.md, stream-static-joins.md |
| Multi-Sink Writes | Write to multiple tables, parallel merges | See multi-sink-writes.md |
| Merge Operations | MERGE performance, parallel merges, optimizations | See merge-operations.md |
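As an illustration of the multi-sink pattern, here is a minimal sketch of a `foreachBatch` function that appends each micro-batch to several Delta tables. It relies on the Delta writer options `txnAppId` and `txnVersion` so that a replayed batch is skipped rather than written twice. The helper name `make_multi_sink_writer`, the application id, and the sink paths are hypothetical; the referenced multi-sink-writes.md may structure this differently.

```python
def make_multi_sink_writer(app_id, sink_paths):
    """Build a foreachBatch function that appends a micro-batch to every sink.

    app_id:     stable string identifying this stream (hypothetical value).
    sink_paths: mapping of sink name -> Delta table path (hypothetical paths).
    """

    def write_batch(batch_df, batch_id):
        # Cache the batch so it is not recomputed once per sink.
        batch_df.persist()
        try:
            for _name, path in sink_paths.items():
                (
                    batch_df.write
                    .format("delta")
                    # Delta records (txnAppId, txnVersion) in each table's log;
                    # a retry with the same pair is treated as already applied.
                    .option("txnAppId", app_id)
                    .option("txnVersion", batch_id)
                    .mode("append")
                    .save(path)
                )
        finally:
            batch_df.unpersist()

    return write_batch


writer = make_multi_sink_writer(
    "orders-stream",
    {"bronze": "/delta/bronze_orders", "audit": "/delta/orders_audit"},
)
# Attach to a stream with:
#   df.writeStream.foreachBatch(writer).option("checkpointLocation", ...).start()
```

Using the micro-batch id as `txnVersion` works because Structured Streaming replays a failed batch with the same id after a restart.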

Configuration

| Topic | Description | Reference |
|---|---|---|
| Checkpoints | Checkpoint management and best practices | See checkpoint-best-practices.md |
| Stateful Operations | Watermarks, state stores, RocksDB configuration | See stateful-operations.md |
| Trigger & Cost | Trigger selection, cost optimization, RTM | See trigger-and-cost-optimization.md |
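Trigger selection is largely a trade-off between latency and cost. The sketch below shows one way to encode that choice as keyword arguments for `DataStreamWriter.trigger()`. The helper `trigger_options` and its decision rule are assumptions for illustration, not the guidance in trigger-and-cost-optimization.md; Real-Time Mode is not covered, and `availableNow` requires Spark 3.3 or later.

```python
def trigger_options(max_latency_seconds=None):
    """Return keyword arguments for DataStreamWriter.trigger().

    With no latency target, process everything currently available and stop,
    which suits a scheduled job on a cluster that shuts down afterwards.
    Otherwise run continuously with a fixed micro-batch interval.
    """
    if max_latency_seconds is None:
        return {"availableNow": True}
    if max_latency_seconds <= 0:
        raise ValueError("max_latency_seconds must be positive")
    return {"processingTime": f"{int(max_latency_seconds)} seconds"}


batch_style = trigger_options()
continuous_style = trigger_options(30)
# Apply with: df.writeStream.trigger(**continuous_style)
```

Keeping the decision in one function makes it easy to run the same pipeline as a scheduled job in development and as an always-on stream in production.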

Best Practices

| Topic | Description | Reference |
|---|---|---|
| Production Checklist | Comprehensive best practices | See streaming-best-practices.md |

Production Checklist

  • Checkpoint location is persistent (UC volumes, not DBFS)

  • Unique checkpoint per stream

  • Fixed-size cluster (no autoscaling for streaming)

  • Monitoring configured (input rate, lag, batch duration)

  • Exactly-once verified (txnVersion/txnAppId)

  • Watermark configured for stateful operations

  • Left joins for stream-static (not inner)
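Some of these checks can be automated before deployment. The following is a minimal sketch, in plain Python, that validates three of the items above: checkpoints on a UC volume, one checkpoint per stream, and a watermark for stateful streams. `StreamConfig` and `check_streams` are hypothetical names, and the `/Volumes/` prefix test is a simplification of what "persistent" means.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StreamConfig:
    """Hypothetical description of one streaming query."""
    name: str
    checkpoint: str
    stateful: bool = False
    watermark_column: Optional[str] = None


def check_streams(streams: List[StreamConfig]) -> List[str]:
    """Return a list of checklist violations; an empty list means all checks pass."""
    problems = []
    # Normalize trailing slashes so "/a/b" and "/a/b/" count as the same location.
    counts = Counter(s.checkpoint.rstrip("/") for s in streams)
    for s in streams:
        if not s.checkpoint.startswith("/Volumes/"):
            problems.append(f"{s.name}: checkpoint is not on a UC volume")
        if counts[s.checkpoint.rstrip("/")] > 1:
            problems.append(f"{s.name}: checkpoint is shared with another stream")
        if s.stateful and not s.watermark_column:
            problems.append(f"{s.name}: stateful stream has no watermark")
    return problems


configs = [
    StreamConfig("orders", "/Volumes/catalog/checkpoints/stream"),
    StreamConfig("sessions", "dbfs:/checkpoints/sessions", stateful=True),
]
issues = check_streams(configs)
```

Cluster sizing, monitoring, and exactly-once verification depend on the workspace and are not covered by this sketch.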

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
