etl-designer

ETL Designer

Design robust ETL/ELT pipelines for data processing.

Quick Start

Use Airflow for orchestration, implement idempotent operations, add error handling, monitor pipeline health.

Instructions

Airflow DAG Structure

from airflow import DAG from airflow.operators.python import PythonOperator from datetime import datetime, timedelta

default_args = { 'owner': 'data-team', 'retries': 3, 'retry_delay': timedelta(minutes=5), 'email_on_failure': True, 'email': ['alerts@company.com'] }

with DAG( 'etl_pipeline', default_args=default_args, schedule_interval='0 2 * * *', # Daily at 2 AM start_date=datetime(2024, 1, 1), catchup=False ) as dag:

extract = PythonOperator(
    task_id='extract_data',
    python_callable=extract_from_source
)

transform = PythonOperator(
    task_id='transform_data',
    python_callable=transform_data
)

load = PythonOperator(
    task_id='load_to_warehouse',
    python_callable=load_to_warehouse
)

extract >> transform >> load

Incremental Processing

def extract_incremental(last_run_date): query = f""" SELECT * FROM source_table WHERE updated_at > '{last_run_date}' """ return pd.read_sql(query, conn)

Error Handling

def safe_transform(data): try: transformed = transform_data(data) return transformed except Exception as e: logger.error(f"Transform failed: {e}") send_alert(f"Pipeline failed: {e}") raise

Best Practices

Make operations idempotent
Use incremental processing
Implement proper error handling
Add monitoring and alerts
Use data quality checks
Document pipeline logic

Safety Notice

Copy this and send it to your AI assistant to learn

Source Transparency

Related Skills

gcp-cost-optimizer

schema-designer

aws-cost-optimizer