ETL/ELT Patterns
Patterns for data extraction, loading, and transformation, including modern ELT approaches and pipeline design.
When to Use This Skill
- Choosing between ETL and ELT
- Designing data pipelines
- Implementing data transformations
- Building modern data stacks
- Handling data quality in pipelines
ETL vs ELT
ETL (Extract, Transform, Load)
┌─────────┐     ┌─────────────┐     ┌─────────────┐
│ Sources │ ──► │  Transform  │ ──► │  Warehouse  │
└─────────┘     │   Server    │     └─────────────┘
                └─────────────┘

(Transformation happens before loading)
Characteristics:
- Transform before load
- Requires ETL server/tool
- Schema-on-write
- Traditional approach
Best for:
- Complex transformations
- Limited target storage
- Strict data quality requirements
- Legacy systems
ELT (Extract, Load, Transform)
┌─────────┐     ┌─────────────┐     ┌─────────────┐
│ Sources │ ──► │   Target    │ ──► │  Transform  │
└─────────┘     │ (Load raw)  │     │ (in-place)  │
                └─────────────┘     └─────────────┘

(Transformation happens after loading)
Characteristics:
- Load first, transform later
- Uses target system's compute
- Schema-on-read
- Modern approach
Best for:
- Cloud data warehouses
- Flexible exploration
- Iterative development
- Large data volumes
Comparison
| Factor | ETL | ELT |
|---|---|---|
| Transform timing | Before load | After load |
| Compute location | Separate server | Target system |
| Raw data access | Limited | Full |
| Flexibility | Low | High |
| Latency | Higher | Lower |
| Cost model | ETL server + storage | Storage + target compute |
| Best for | Complex, pre-defined | Exploratory, iterative |
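The difference in ordering is easy to see in code. A minimal in-memory Python sketch, where the `source` list and `warehouse` dict are toy stand-ins for real systems rather than any actual client API:

# Minimal, in-memory sketch contrasting the two orderings; a real pipeline
# would use a warehouse client instead of these toy structures.
source = [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "3.0"}]
warehouse = {}  # table name -> rows, standing in for a cloud warehouse

def transform(rows):
    # example transformation: cast amount to float
    return [{**r, "amount": float(r["amount"])} for r in rows]

def etl():
    # ETL: transform on the pipeline side, load only the result
    warehouse["analytics.orders"] = transform(source)

def elt():
    # ELT: land raw data first, then transform inside the warehouse
    warehouse["raw.orders"] = source
    warehouse["analytics.orders"] = transform(warehouse["raw.orders"])

etl()
elt()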
Modern Data Stack
Extract:    Fivetran, Airbyte, Stitch, Custom
                │
                ▼
Load:       Cloud Warehouse (Snowflake, BigQuery, Redshift)
                │
                ▼
Transform:  dbt, Dataform, SQLMesh
                │
                ▼
Visualize:  Looker, Tableau, Metabase
dbt (Data Build Tool)
Core concepts:
- Models: SQL SELECT statements that define transformations
- Tests: Data quality assertions
- Documentation: Inline docs and lineage
- Macros: Reusable SQL snippets
Example model:

-- models/customers.sql
SELECT
    customer_id,
    first_name,
    last_name,
    order_count
FROM {{ ref('stg_customers') }}
LEFT JOIN {{ ref('customer_orders') }} USING (customer_id)
Pipeline Patterns
Full Refresh
Strategy: Drop and recreate entire table
Process:
- Extract all data from source
- Truncate target table
- Load all data
Pros: Simple, consistent, handles source deletes naturally (the table is rebuilt each run)
Cons: Slow and costly for large tables
Best for: Small dimension tables, reference data
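A minimal full-refresh sketch in Python, with SQLite standing in for the target warehouse; the table and column names are illustrative:

# Full refresh: drop, recreate, and reload the whole table in one
# transaction so readers never see a half-loaded state.
import sqlite3

def full_refresh(conn, rows):
    with conn:  # single transaction: commit on success, rollback on error
        conn.execute("DROP TABLE IF EXISTS dim_country")
        conn.execute("CREATE TABLE dim_country (code TEXT PRIMARY KEY, name TEXT)")
        conn.executemany("INSERT INTO dim_country VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
full_refresh(conn, [("US", "United States"), ("DE", "Germany")])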
Incremental Load
Strategy: Only process new/changed records
Process:
- Track high watermark (last processed timestamp/ID)
- Extract records > watermark
- Merge into target
Pros: Fast, efficient
Cons: More complex; can miss late-arriving updates and hard deletes
Best for: Large fact tables, event data
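A watermark-based sketch in Python, again with SQLite standing in for both source and target; `updated_at` is assumed here to be a monotonically increasing integer:

# Incremental load: track a high watermark and pull only newer rows.
import sqlite3

source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT, updated_at INTEGER)")
source.executemany("INSERT INTO events VALUES (?, ?, ?)",
                   [(1, "a", 100), (2, "b", 200)])

def incremental_load():
    # high watermark = newest timestamp already in the target (0 on first run)
    (wm,) = target.execute("SELECT COALESCE(MAX(updated_at), 0) FROM events").fetchone()
    new_rows = source.execute(
        "SELECT id, payload, updated_at FROM events WHERE updated_at > ?", (wm,)
    ).fetchall()
    with target:  # merge: INSERT OR REPLACE acts as a simple upsert on id
        target.executemany("INSERT OR REPLACE INTO events VALUES (?, ?, ?)", new_rows)

incremental_load()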
Change Data Capture (CDC)
Strategy: Capture all changes from source
Approaches:

┌────────────────────────────────────────────────┐
│ Log-based CDC (Debezium, AWS DMS)              │
│ - Reads database transaction log               │
│ - Captures inserts, updates, deletes           │
│ - No source table modification needed          │
└────────────────────────────────────────────────┘
┌────────────────────────────────────────────────┐
│ Trigger-based CDC                              │
│ - Database triggers on changes                 │
│ - Writes to change table                       │
│ - Adds load to source                          │
└────────────────────────────────────────────────┘
┌────────────────────────────────────────────────┐
│ Timestamp-based CDC                            │
│ - Query by updated_at timestamp                │
│ - Simple but misses hard deletes               │
└────────────────────────────────────────────────┘
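Whatever the capture mechanism, downstream code must apply the resulting change events to the target. A minimal Python sketch, assuming a simplified event shape (not Debezium's actual envelope format):

# Apply a CDC change stream to a target kept as a dict (pk -> row).
target = {}

def apply_change(event):
    op, key, row = event["op"], event["key"], event.get("row")
    if op in ("insert", "update"):
        target[key] = row          # upsert the new row image
    elif op == "delete":
        target.pop(key, None)      # hard deletes are captured too

for e in [
    {"op": "insert", "key": 1, "row": {"name": "Ada"}},
    {"op": "update", "key": 1, "row": {"name": "Ada L."}},
    {"op": "delete", "key": 1},
]:
    apply_change(e)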
Merge (Upsert) Pattern
-- Snowflake/BigQuery style MERGE
MERGE INTO target t
USING source s
  ON t.id = s.id
WHEN MATCHED THEN UPDATE SET
  t.name = s.name,
  t.updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (id, name, created_at)
  VALUES (s.id, s.name, CURRENT_TIMESTAMP);
Data Quality Patterns
Validation Gates
Pipeline with quality gates:
Extract → Validate → Load → Transform → Validate → Serve
              │                             │
              ▼                             ▼
         Quarantine                    Alert/Block
        (bad records)                (quality issue)
Quality Checks
Schema validation:
- Required fields present
- Data types match
- Field lengths within limits
Data validation:
- Null checks
- Range checks
- Format validation (dates, emails)
- Referential integrity
Statistical validation:
- Row count within expected range
- Value distributions normal
- No unexpected duplicates
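A minimal Python sketch of these checks over a batch of dicts; the field names, email regex, and thresholds are all illustrative choices:

# Schema, data, and statistical validation with a quarantine split.
import re

EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_batch(rows, expected_min=1, expected_max=100_000):
    good, bad = [], []
    for r in rows:
        errors = []
        if "customer_id" not in r:                       # required field
            errors.append("missing customer_id")
        if r.get("email") and not EMAIL.match(r["email"]):  # format check
            errors.append("bad email format")
        if not (0 <= r.get("age", 0) <= 130):            # range check
            errors.append("age out of range")
        (bad if errors else good).append((r, errors))
    # statistical validation: row count within range, no duplicate keys
    assert expected_min <= len(rows) <= expected_max, "row count anomaly"
    keys = [r.get("customer_id") for r in rows]
    assert len(keys) == len(set(keys)), "unexpected duplicates"
    return good, bad  # bad rows go to quarantine for review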
Data Contracts
Define expectations between producer and consumer:
{ "contract_version": "1.0", "schema": { "customer_id": {"type": "string", "required": true}, "email": {"type": "string", "format": "email"}, "created_at": {"type": "timestamp"} }, "quality": { "freshness": "< 1 hour", "completeness": "> 99%", "row_count": "10000-100000" } }
Pipeline Architecture
Batch Pipeline
Schedule-based processing:
┌─────────┐     ┌─────────┐     ┌─────────┐
│  Cron   │ ──► │  Spark  │ ──► │   DW    │
│ (daily) │     │  (ETL)  │     │         │
└─────────┘     └─────────┘     └─────────┘
Best for: Non-real-time workloads, large volumes
Tools: Airflow, Dagster, Prefect
Streaming Pipeline
Real-time processing:
┌─────────┐     ┌─────────┐     ┌─────────┐
│  Kafka  │ ──► │  Flink  │ ──► │   DW    │
│(events) │     │(process)│     │(stream) │
└─────────┘     └─────────┘     └─────────┘
Best for: Real-time analytics, event-driven systems
Tools: Kafka Streams, Flink, Spark Streaming
Lambda Architecture
Batch + Speed layers:
                  ┌─────────────────┐
            ┌───► │   Batch Layer   │ ────┐
            │     │ (comprehensive) │     │
┌─────────┐ │     └─────────────────┘     │   ┌─────────┐
│  Data   │─┤                             ├──►│ Serving │
│ Sources │ │     ┌─────────────────┐     │   │  Layer  │
└─────────┘ └───► │   Speed Layer   │ ────┘   └─────────┘
                  │   (real-time)   │
                  └─────────────────┘
Pros: Complete + real-time
Cons: Complex, duplicate logic
Kappa Architecture
Streaming only (reprocess from log):
┌─────────┐     ┌─────────┐     ┌─────────┐
│  Kafka  │ ──► │ Stream  │ ──► │ Serving │
│  (log)  │     │ Process │     │  Layer  │
└─────────┘     └─────────┘     └─────────┘
     │
     └── Replay for reprocessing
Pros: Simple, single codebase
Cons: Requires replayable log
Orchestration
DAG-Based Orchestration
Directed Acyclic Graph of tasks:
extract_a ──┐
            ├── transform ── load
extract_b ──┘
Tools: Airflow, Dagster, Prefect
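The DAG above could be expressed in Airflow (one of the tools listed) roughly as follows; the task bodies are placeholders, and the `schedule` keyword assumes Airflow 2.4+:

# Fan-in DAG: both extracts must finish before transform runs.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def noop():
    pass  # placeholder for real extract/transform/load logic

with DAG(
    dag_id="etl_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_a = PythonOperator(task_id="extract_a", python_callable=noop)
    extract_b = PythonOperator(task_id="extract_b", python_callable=noop)
    transform = PythonOperator(task_id="transform", python_callable=noop)
    load = PythonOperator(task_id="load", python_callable=noop)

    [extract_a, extract_b] >> transform >> load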
Orchestration Best Practices
- Idempotent tasks (safe to retry)
- Clear dependencies
- Appropriate granularity
- Monitoring and alerting
- Backfill support
- Parameterized runs
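The first and last items above combine naturally: a task parameterized by run date that deletes before inserting is safe to retry and to backfill. A Python sketch with SQLite standing in for the warehouse and illustrative names:

# Idempotent, parameterized load: reprocessing a day first deletes that
# day's partition, so re-runs never duplicate data.
import sqlite3

def load_day(conn, day, rows):
    with conn:
        conn.execute("DELETE FROM daily_sales WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_sales (day, amount) VALUES (?, ?)",
            [(day, amount) for amount in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, amount REAL)")
load_day(conn, "2024-01-01", [10.0, 12.5])
load_day(conn, "2024-01-01", [10.0, 12.5])  # retry: no duplicates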
Error Handling
Retry Strategies
Transient errors (network, timeout):
- Exponential backoff
- Max retry count
- Circuit breaker
Data errors (validation failure):
- Quarantine bad records
- Continue processing good records
- Alert for review
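A compact Python sketch of both strategies; the retry limits, delays, and the exception types treated as transient are illustrative choices:

# Exponential backoff for transient errors; quarantine-and-continue for
# data errors.
import time

def with_retries(fn, max_attempts=5, base_delay=1.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError):      # transient: retry
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)    # 1s, 2s, 4s, ...

def process_batch(rows, transform):
    good, quarantine = [], []
    for row in rows:
        try:
            good.append(transform(row))
        except ValueError as e:                      # data error: don't retry
            quarantine.append({"record": row, "error": str(e)})
    return good, quarantine                          # alert on quarantine size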
Dead Letter Queue
Failed records → DLQ → Manual review → Reprocess
Capture:
- Original record
- Error message
- Timestamp
- Retry count
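One way to represent such an entry, sketched as a Python dataclass; the shape is an assumption, not any specific queue's format:

# Dead-letter entry capturing the fields listed above.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DeadLetter:
    original_record: dict
    error_message: str
    failed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    retry_count: int = 0

dlq: list[DeadLetter] = []
dlq.append(DeadLetter({"id": 42}, "schema mismatch"))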
Best Practices
Pipeline Design
- Idempotent transformations
- Clear lineage tracking
- Appropriate checkpointing
- Graceful failure handling
- Comprehensive logging
- Data quality gates
Performance
- Partition data appropriately
- Incremental processing when possible
- Parallel extraction
- Efficient file formats (Parquet, ORC)
- Compression
- Resource sizing
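As one example of combining partitioning, an efficient columnar format, and compression, pandas can write a date-partitioned Parquet dataset via the pyarrow engine; this sketch assumes both libraries are installed, and the column names are illustrative:

# One directory per event_date, each holding compressed Parquet files,
# so date-filtered reads touch only the relevant partitions.
import pandas as pd

df = pd.DataFrame({
    "event_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "amount": [1.0, 2.5, 3.0],
})

df.to_parquet(
    "events/", engine="pyarrow",
    partition_cols=["event_date"], compression="snappy",
)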
Related Skills
- data-architecture: Data platform design
- stream-processing: Real-time processing
- ml-system-design: Feature engineering