Data Engineering Hub
Welcome to the comprehensive data engineering skill suite. This hub organizes all data engineering knowledge into logical, non-overlapping domains.
Skill Map
| Domain | Skills | When to Use |
|---|---|---|
| Core | @data-engineering-core | Polars, DuckDB, PyArrow fundamentals; ETL patterns; error handling; performance optimization |
| Storage | @data-engineering-storage-lakehouse | Delta Lake, Apache Iceberg, Apache Hudi |
| | @data-engineering-storage-remote-access | fsspec, pyarrow.fs, obstore; cloud access patterns |
| | @data-engineering-storage-authentication | AWS, GCP, Azure auth - IAM roles, managed identity, secrets management |
| | @data-engineering-storage-formats | Parquet optimizations, Lance, Zarr, Avro, ORC |
| Orchestration | @data-engineering-orchestration | Prefect, Dagster, dbt, workflow scheduling |
| Streaming | @data-engineering-streaming | Kafka, MQTT, NATS JetStream for real-time data |
| Quality | @data-engineering-quality | Great Expectations, Pandera for data validation |
| Observability | @data-engineering-observability | OpenTelemetry, Prometheus for pipeline monitoring |
| AI/ML | @data-engineering-ai-ml | Embeddings, vector databases, RAG pipelines |
| Best Practices | @data-engineering-best-practices | Medallion architecture, partitioning, file sizing, incremental loads, schema evolution, testing |
| Catalogs | @data-engineering-catalogs | Data catalog systems: Iceberg catalogs, DuckDB multi-source, Amundsen/DataHub/OpenMetadata |
Quick Reference: Core Stack
| Task | Recommended Tool |
|---|---|
| DataFrame operations | Polars (up to 10-50x faster than pandas, depending on workload) |
| SQL analytics | DuckDB (embedded OLAP, zero-copy Arrow integration) |
| Data interchange | PyArrow (Arrow format, zero-copy transfers) |
| Cloud storage access | fsspec (universal), pyarrow.fs (Arrow-native), obstore (high-performance) |
| Lakehouse format | Delta Lake (Spark ecosystem), Iceberg (engine-agnostic), Hudi (streaming CDC) |
| Orchestration | Prefect (Pythonic flows), Dagster (asset-based), dbt (SQL transformations) |
| Validation | Pandera (lightweight), Great Expectations (enterprise) |
Getting Started
New to Data Engineering?
Start with @data-engineering-core to learn the foundational libraries and patterns.
Working with Cloud Storage?
Go to @data-engineering-storage-remote-access for fsspec, pyarrow.fs, and obstore.
Building Data Lakes?
Explore @data-engineering-storage-lakehouse for ACID table formats.
Choosing a Data Catalog?
Check @data-engineering-catalogs for Iceberg catalogs, DuckDB multi-source patterns, and tool comparisons.
Production-Grade Pipelines?
Read @data-engineering-best-practices for medallion architecture, partitioning, schema evolution, and testing strategies.
Orchestrating Pipelines?
Check @data-engineering-orchestration for Prefect, Dagster, and dbt.
Production Monitoring?
See @data-engineering-observability for tracing and metrics.
AI/ML Data Pipelines?
Visit @data-engineering-ai-ml for embeddings, vector databases, and RAG.
Principles
- Lazy evaluation: Use Polars lazy frames and DuckDB query planning for performance
- Zero-copy data transfer: Leverage Arrow format for memory efficiency
- Pushdown optimization: Filter at storage layer to minimize data transfer
- Type safety: Use explicit schemas and type hints
- Resilience: Implement retries, circuit breakers, and proper error handling
- Observability: Instrument pipelines with traces and metrics
- Security: Never hardcode credentials; use IAM roles and environment variables
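The Resilience and Security principles above can be sketched with the standard library alone. This is a minimal illustration, not the suite's own implementation: `get_credential`, `with_retries`, `flaky_fetch`, and the `DEMO_API_TOKEN` variable are hypothetical names introduced here, and only retries with exponential backoff are shown (circuit breakers are out of scope for this sketch).

```python
import os
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class MissingCredentialError(RuntimeError):
    """Raised when a required credential is absent from the environment."""


def get_credential(name: str) -> str:
    """Read a secret from the environment instead of hardcoding it."""
    value = os.environ.get(name)
    if not value:
        raise MissingCredentialError(f"environment variable {name} is not set")
    return value


def with_retries(
    func: Callable[[], T],
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying transient failures with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            # The delay doubles on each attempt; a little jitter keeps
            # many workers from retrying at the same instant.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))
    raise RuntimeError("unreachable")


# Demo only: a hypothetical fetch that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient network error")
    return "payload"


result = with_retries(flaky_fetch, attempts=5, base_delay=0.01)

# Demo only: in a real pipeline the variable is set by the runtime
# (for example an IAM-backed secret store), never in source code.
os.environ.setdefault("DEMO_API_TOKEN", "placeholder-for-demo")
token = get_credential("DEMO_API_TOKEN")
```

Errors outside `retry_on` propagate immediately, so programming mistakes are not masked by retries; only failures that are plausibly transient are attempted again.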
Migration from Legacy Skills
This restructured suite replaces the previous split organization (data-engineering-* and remote-filesystems-*). All content has been consolidated to eliminate duplication and clarify ownership.
Legacy skill replacements:
- data-engineering-core → @data-engineering-core (plus specific integrations)
- data-engineering-lakehouse → @data-engineering-storage-lakehouse
- data-engineering-orchestration → @data-engineering-orchestration
- data-engineering-streaming → @data-engineering-streaming
- data-engineering-quality → @data-engineering-quality
- data-engineering-observability → @data-engineering-observability
- data-engineering-llm-pipelines → @data-engineering-ai-ml
- remote-filesystems-* → @data-engineering-storage-remote-access and integrations
All legacy skills remain functional but are deprecated. New content should be added to the new structure only.
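For tooling that still references legacy names, the replacement list above could be encoded as a small lookup. This is only a sketch: `resolve_skill` and the constant names are hypothetical helpers, and `remote-filesystems-example` is a made-up name standing in for anything matching the `remote-filesystems-*` pattern.

```python
# Legacy skill name -> current skill, taken from the replacement list above.
LEGACY_TO_CURRENT = {
    "data-engineering-core": "@data-engineering-core",
    "data-engineering-lakehouse": "@data-engineering-storage-lakehouse",
    "data-engineering-orchestration": "@data-engineering-orchestration",
    "data-engineering-streaming": "@data-engineering-streaming",
    "data-engineering-quality": "@data-engineering-quality",
    "data-engineering-observability": "@data-engineering-observability",
    "data-engineering-llm-pipelines": "@data-engineering-ai-ml",
}

# Every remote-filesystems-* skill maps to the same primary target.
REMOTE_FS_PREFIX = "remote-filesystems-"
REMOTE_FS_TARGET = "@data-engineering-storage-remote-access"


def resolve_skill(name: str) -> str:
    """Return the current skill for a legacy or already-current name."""
    if name.startswith("@"):
        return name  # already in the new naming scheme
    if name in LEGACY_TO_CURRENT:
        return LEGACY_TO_CURRENT[name]
    if name.startswith(REMOTE_FS_PREFIX):
        return REMOTE_FS_TARGET
    raise KeyError(f"unknown skill name: {name}")


# "remote-filesystems-example" is a hypothetical legacy name.
resolved = [
    resolve_skill("data-engineering-llm-pipelines"),
    resolve_skill("remote-filesystems-example"),
]
```

Unknown names raise `KeyError` rather than passing through silently, so a stale reference is caught at lookup time.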