data-engineering-orchestration

Pipeline Orchestration

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "data-engineering-orchestration" with this command: npx skills add legout/data-platform-agent-skills/legout-data-platform-agent-skills-data-engineering-orchestration

Workflow orchestration tools for data pipelines: Prefect, Dagster, and dbt. These tools handle scheduling, dependency resolution, retries, monitoring, and state management for production data pipelines.

Quick Comparison

| Tool | Paradigm | Best For | Learning Curve |
|---|---|---|---|
| Prefect | Flow-based | Pythonic workflows, quick prototypes, cloud-first | Moderate |
| Dagster | Asset-based | Data asset lineage, reproducibility, type checking | Steeper |
| dbt | SQL transformations | Analytics engineering, ELT, data warehouses | Low (SQL-focused) |
| FlowerPower | Hamilton DAGs | Lightweight batch ETL, configuration-driven pipelines | Low-Moderate |

When to Use Which?

Prefect: You want Python code flexibility, Prefect Cloud UI, and quick setup. Good for general-purpose data pipelines, ETL, and API integrations.

Dagster: You care about data asset observability, type safety, and reproducibility. Good for complex data platforms with clear asset dependencies.

dbt: Your transformations are primarily SQL and you're building analytics marts in a data warehouse. Great for analytics engineering teams.

Skill Dependencies

Assumes familiarity with:

  • @data-engineering-core: Polars, DuckDB, PyArrow

  • @data-engineering-storage-remote-access: cloud storage for intermediate data

Related:

  • @data-engineering-quality: data validation integrated into orchestration

  • @data-engineering-observability: monitoring and tracing

  • @data-engineering-storage-lakehouse: Delta/Iceberg for state management

Detailed Guides

Prefect

See: @data-engineering-orchestration/prefect.md

  • Flows and tasks with decorators

  • Retries, caching, and parameters

  • Prefect Cloud (serverless) vs Prefect Server (self-hosted)

  • Deployment patterns

Dagster

See: @data-engineering-orchestration/dagster.md

  • Asset-based programming model

  • Materialization and partitions

  • Type checking with Dagster types

  • Sensors and schedules

  • Integration with data platforms

dbt (Data Build Tool)

See: @data-engineering-orchestration/dbt.md

  • Projects, models, tests, snapshots, seeds

  • Jinja templating and macros

  • Data testing (schema, cardinality, custom)

  • Documentation generation

  • Package management (dbt packages)

  • Adapters (DuckDB, Postgres, Snowflake, BigQuery, Spark)

FlowerPower (Lightweight Alternative)

FlowerPower is a lightweight DAG orchestration framework built on Apache Hamilton, ideal for batch ETL and data transformation scripts without the overhead of full orchestrators.

Key characteristics:

  • Hamilton-based: Define pipelines as Python functions; DAG auto-constructed

  • Configuration-driven: YAML files for parameters and execution settings

  • Lightweight: No database, no scheduler, no state persistence (batch-only)

  • Multiple executors: synchronous, threadpool, processpool, ray, dask

  • I/O plugins: Delta Lake, DuckDB, Polars, Pandas, S3, PostgreSQL, and more
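The Hamilton idea that FlowerPower builds on can be sketched in plain Python: each function's parameter names refer to other functions' outputs, so the DAG falls out of the signatures. This is an illustrative stdlib sketch (the function names and the tiny `run` resolver are invented here, not FlowerPower's or Hamilton's actual API):

```python
import inspect

# Hamilton-style "pipeline as functions": each function's parameter names
# refer to the outputs of other functions, so the DAG is implicit.
def raw_orders() -> list[dict]:
    return [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.5}]

def valid_orders(raw_orders: list[dict]) -> list[dict]:
    return [o for o in raw_orders if o["amount"] > 0]

def total_revenue(valid_orders: list[dict]) -> float:
    return sum(o["amount"] for o in valid_orders)

def run(target, funcs):
    """Tiny resolver: compute a target by recursively wiring parameter
    names to same-named functions (what Hamilton does at scale)."""
    fn = funcs[target]
    kwargs = {p: run(p, funcs) for p in inspect.signature(fn).parameters}
    return fn(**kwargs)

funcs = {f.__name__: f for f in (raw_orders, valid_orders, total_revenue)}
print(run("total_revenue", funcs))  # → 15.5
```

Because the graph is derived from signatures, renaming a parameter rewires the pipeline; there is no separate DAG definition to keep in sync.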

When to choose FlowerPower over Prefect/Dagster:

  • Simple batch pipelines (daily/hourly ETL)

  • Quick prototyping that can grow

  • Teams that prefer code-first (Python functions) over YAML/UI

  • No need for sophisticated scheduling, SLA tracking, or long-running state

When NOT to use:

  • Production 24/7 workflows requiring reliability guarantees

  • Complex dependency graphs with cross-dependencies

  • Need for built-in retry policies with circuit breakers

  • Workflows requiring checkpoints and state recovery

  • Multi-team orchestration with fine-grained permissions

FlowerPower limitations vs. Prefect/Dagster:

| Feature | Prefect/Dagster | FlowerPower |
|---|---|---|
| Scheduling | Native (cron, intervals) | External (cron/systemd) |
| State persistence | Database/cloud | None (ephemeral) |
| Retry policies | Configurable per task | Per-pipeline via YAML |
| Observability | Rich UI, lineage | Basic Hamilton UI |
| Production readiness | High | Moderate (batch jobs) |

Integration with data-engineering stack:

  • Uses Polars/DuckDB for DataFrame operations (@data-engineering-core)

  • Delta Lake for ACID table formats (@data-engineering-storage-lakehouse)

  • fsspec/S3 for cloud storage (@data-engineering-storage-remote-access)

  • Pandera for data validation (@data-engineering-quality)

  • Follows medallion architecture (@data-engineering-best-practices)

Skill reference: @flowerpower

  • Complete guide to FlowerPower with advanced production patterns (watermarks, data quality, incremental loads, cloud deployment).

Cloud Storage Integration

See: @data-engineering-orchestration/integrations/cloud-storage.md

  • dbt + S3/GCS via HTTPFS (DuckDB), aws_s3 extension (Postgres)

  • Configuration patterns for profiles.yml

  • Credential management best practices

Common Patterns

Retry Pattern (All Orchestrators)

  • Prefect: @task(retries=3, retry_delay_seconds=60)

  • Dagster: @asset(retry_policy=RetryPolicy(...))

  • dbt: --fail-fast flag plus custom macro retry logic
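All of these decorators implement the same underlying semantics: re-run the task on failure, wait between attempts, then re-raise. A stdlib sketch of that behavior (not any of these libraries' real implementation) might look like:

```python
import functools
import time

def with_retries(retries=3, delay_seconds=0.0):
    """Retry decorator sketching what orchestrator retry policies do:
    retry on any exception, sleep between attempts, re-raise at the end."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == retries:
                        raise  # exhausted: surface the failure
                    time.sleep(delay_seconds)
        return wrapper
    return decorator

calls = {"n": 0}

@with_retries(retries=3, delay_seconds=0)
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(flaky_extract())  # succeeds on the third attempt
```

Production orchestrators layer more on top of this (exponential backoff, retry-on-specific-exceptions, state persistence), but the core loop is the same.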

Idempotency

All orchestrators assume idempotent operations: running a pipeline twice should produce identical results. Design your INSERT, UPDATE, and MERGE operations to be idempotent.
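As an illustration using stdlib sqlite3 (the same keyed-upsert idea applies to MERGE statements in a warehouse): write by key instead of blindly appending, so a retried run overwrites rather than duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")

def load(rows):
    """Idempotent load: keyed upsert instead of blind INSERT, so a
    retried run converges to the same table state."""
    conn.executemany(
        "INSERT INTO orders (id, amount) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount",
        rows,
    )

batch = [(1, 10.0), (2, 5.5)]
load(batch)
load(batch)  # simulated retry: same result, no duplicate rows
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # → 2
```

A blind `INSERT INTO orders VALUES (...)` run twice would leave four rows; the keyed upsert leaves two regardless of how many times the task retries.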

State Management

  • Prefect: Flow run state persisted to database/cloud

  • Dagster: Asset materialization events tracked

  • dbt: Model run status in target/run_results.json; incremental models use SELECT + INSERT by default

Dependency Management

  • Prefect: dependencies inferred from data flow, or set explicitly (wait_for=[task1])

  • Dagster: asset dependencies declared via function parameters or @asset(deps=[...])

  • dbt: DAG built from ref() calls in models
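The dbt approach can be mimicked in a few lines of stdlib Python: scan each model's SQL for ref() calls, then topologically sort. The model names and SQL below are invented for illustration, not a real dbt project:

```python
import re
from graphlib import TopologicalSorter

# Hypothetical model SQL keyed by model name; {{ ref('...') }} calls
# declare dependencies, as in dbt.
models = {
    "stg_orders": "SELECT * FROM raw.orders",
    "stg_customers": "SELECT * FROM raw.customers",
    "fct_revenue": """
        SELECT * FROM {{ ref('stg_orders') }}
        JOIN {{ ref('stg_customers') }} USING (customer_id)
    """,
}

def build_dag(models):
    """Extract ref() targets from each model to form the dependency graph."""
    pattern = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")
    return {name: set(pattern.findall(sql)) for name, sql in models.items()}

# static_order() yields dependencies before dependents.
order = list(TopologicalSorter(build_dag(models)).static_order())
print(order)  # staging models first, fct_revenue last
```

This is why dbt needs no separate DAG definition: the graph is a by-product of how models reference each other.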

Production Recommendations

  • Version control everything: Code, configs, dbt models, Prefect/Dagster definitions

  • Test locally first: Use unit tests for transformation logic, integration tests for pipeline runs

  • Use environment variables for credentials (never hardcode)

  • Monitor pipeline runs: Prefect Cloud UI, Dagster Dagit, dbt Cloud or custom alerts

  • Alert on failures: Configure email/Slack/webhook notifications

  • Log aggregation: Send orchestrator logs to centralized system (Datadog, CloudWatch)

  • Idempotent writes: Avoid duplicate data on retries

  • Schema evolution: Handle schema changes gracefully (additive only preferred)
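For the credentials recommendation, a minimal fail-fast pattern looks like the sketch below; the variable name is illustrative, not any tool's convention:

```python
import os

def get_required_env(name: str) -> str:
    """Read a credential from the environment, failing fast with a clear
    message instead of letting None propagate into a connection string."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# For demonstration only: a real deployment sets this outside the code.
os.environ.setdefault("WAREHOUSE_PASSWORD", "example-secret")
password = get_required_env("WAREHOUSE_PASSWORD")
print("credentials loaded")
```

Failing at startup with the variable name is far easier to debug than an authentication error deep inside a pipeline run.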

References

  • Prefect Documentation

  • Dagster Documentation

  • dbt Documentation

  • dbt-labs/dbt-duckdb adapter

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Automation

  • data-science-eda

  • data-science-feature-engineering

  • data-engineering-core