Data Architecture
Modern data architecture patterns including data lakes, lakehouses, data mesh, and data platform design.
When to Use This Skill
- Choosing between data lake, warehouse, and lakehouse
- Designing a modern data platform
- Implementing data mesh principles
- Planning data storage strategy
- Understanding data architecture trade-offs
Data Architecture Evolution
Generation 1: Data Warehouse (1990s-2000s)
- Structured data only
- ETL into warehouse
- Star/snowflake schemas
- SQL-based analytics
Generation 2: Data Lake (2010s)
- All data types (structured, semi, unstructured)
- Schema-on-read
- Hadoop/HDFS based
- Cheap storage, complex processing
Generation 3: Lakehouse (2020s)
- Best of both: lake flexibility + warehouse features
- ACID transactions on lake
- Schema enforcement optional
- Unified analytics and ML
Architecture Comparison
Data Warehouse
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Sources   │ ──► │     ETL     │ ──► │  Warehouse  │
│ (Structured)│     │ (Transform) │     │ (Star/Snow) │
└─────────────┘     └─────────────┘     └─────────────┘
                                               │
                                               ▼
                                        ┌─────────────┐
                                        │     BI      │
                                        │  Analytics  │
                                        └─────────────┘
Characteristics:
- Schema-on-write
- Optimized for SQL queries
- Structured data only
- High data quality
- Expensive storage
Best for:
- Business intelligence
- Financial reporting
- Structured analytics
Data Lake
┌─────────────┐     ┌─────────────┐
│   Sources   │ ──► │  Data Lake  │
│    (All)    │     │    (Raw)    │
└─────────────┘     └─────────────┘
                           │
          ┌────────────────┼────────────────┐
          ▼                ▼                ▼
     ┌─────────┐      ┌─────────┐      ┌─────────┐
     │   ML    │      │   ETL   │      │  Spark  │
     │ Training│      │  to DW  │      │ Analysis│
     └─────────┘      └─────────┘      └─────────┘
Characteristics:
- Schema-on-read
- All data types
- Cheap storage
- Flexible processing
- Risk of "data swamp"
Best for:
- Data science/ML
- Unstructured data
- Experimental analysis
Data Lakehouse
┌─────────────┐     ┌─────────────────────────────────┐
│   Sources   │ ──► │         Data Lakehouse          │
│    (All)    │     │  ┌──────────────────────────┐   │
└─────────────┘     │  │      Metadata Layer      │   │
                    │  │   (Delta/Iceberg/Hudi)   │   │
                    │  └──────────────────────────┘   │
                    │  ┌──────────────────────────┐   │
                    │  │      Storage Layer       │   │
                    │  │     (Object Storage)     │   │
                    │  └──────────────────────────┘   │
                    └─────────────────────────────────┘
                                      │
                 ┌────────────────────┼────────────────────┐
                 ▼                    ▼                    ▼
            ┌─────────┐          ┌─────────┐          ┌─────────┐
            │   SQL   │          │   ML    │          │ Stream  │
            │   BI    │          │ Workload│          │ Process │
            └─────────┘          └─────────┘          └─────────┘
Characteristics:
- ACID transactions
- Schema evolution
- Time travel
- Unified batch/streaming
- Open formats
Best for:
- Unified analytics
- Both BI and ML
- Modern data platforms
Architecture Selection Guide
| Factor            | Warehouse  | Lake        | Lakehouse    |
|-------------------|------------|-------------|--------------|
| Data types        | Structured | All         | All          |
| Query performance | Excellent  | Poor-Medium | Good         |
| Data quality      | High       | Variable    | Configurable |
| Cost              | High       | Low         | Medium       |
| ML workloads      | Limited    | Excellent   | Excellent    |
| Real-time         | Limited    | Good        | Good         |
| Governance        | Strong     | Weak        | Strong       |
| Complexity        | Low        | High        | Medium       |
Decision Tree:
Is data mostly structured with BI focus?
├── Yes → Data Warehouse
└── No
    └── Need ML + BI on same data?
        ├── Yes → Lakehouse
        └── No
            └── Primarily ML/unstructured?
                ├── Yes → Data Lake
                └── No → Lakehouse
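The decision tree can also be read as a small function. The boolean inputs below are simplified, hypothetical flags standing in for the tree's questions; this is a sketch of the branching logic, not a complete selection framework.

```python
def recommend_architecture(structured_bi_focus: bool,
                           ml_and_bi_on_same_data: bool,
                           primarily_ml_unstructured: bool) -> str:
    """Encode the selection decision tree as nested conditionals."""
    if structured_bi_focus:
        return "Data Warehouse"
    if ml_and_bi_on_same_data:
        return "Lakehouse"
    if primarily_ml_unstructured:
        return "Data Lake"
    return "Lakehouse"  # default when no branch clearly applies

print(recommend_architecture(False, True, False))  # Lakehouse
```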
Lakehouse Technologies
Delta Lake (Databricks)
Features:
- ACID transactions
- Time travel (data versioning)
- Schema enforcement/evolution
- Unified batch/streaming
- Optimized performance (Z-ordering, compaction)
File format: Parquet + Delta log
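Conceptually, the Delta log is an ordered sequence of commits over Parquet data files, and a table version is reconstructed by replaying the log up to a commit; this is what makes time travel possible. A stdlib-only sketch of that replay, with simplified commit records rather than Delta's actual JSON format:

```python
# Simplified commits: each one adds/removes data files, like entries in a
# Delta transaction log (illustrative structure, not the real protocol).
log = [
    {"version": 0, "add": ["part-000.parquet"], "remove": []},
    {"version": 1, "add": ["part-001.parquet"], "remove": []},
    {"version": 2, "add": ["part-002.parquet"], "remove": ["part-000.parquet"]},
]

def files_at_version(log, version):
    """Replay the log up to `version` to reconstruct that table snapshot."""
    files = set()
    for commit in log:
        if commit["version"] > version:
            break
        files |= set(commit["add"])
        files -= set(commit["remove"])
    return sorted(files)

print(files_at_version(log, 1))  # ['part-000.parquet', 'part-001.parquet']
```

Reading version 2 sees the compacted/updated file set; reading version 1 "time travels" to the earlier snapshot, even though the underlying Parquet files are immutable.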
Apache Iceberg (Netflix)
Features:
- ACID transactions
- Hidden partitioning
- Schema evolution
- Time travel
- Vendor neutral
File format: Parquet/ORC/Avro + metadata
Apache Hudi (Uber)
Features:
- ACID transactions
- Incremental processing
- Record-level updates
- Time travel
- Optimized for streaming
File format: Parquet + Hudi metadata
Technology Comparison
| Feature          | Delta Lake         | Iceberg   | Hudi      |
|------------------|--------------------|-----------|-----------|
| ACID             | Yes                | Yes       | Yes       |
| Time travel      | Yes                | Yes       | Yes       |
| Schema evolution | Good               | Excellent | Good      |
| Streaming        | Excellent          | Good      | Excellent |
| Ecosystem        | Databricks-centric | Wide      | Wide      |
| Performance      | Excellent          | Excellent | Good      |
| Community        | Large              | Growing   | Medium    |
Data Mesh
Principles
Data Mesh = Decentralized data architecture
Four Principles:
1. Domain Ownership
   - Data owned by domain teams
   - Not by a centralized data team
2. Data as a Product
   - Treat data like a product
   - Quality, discoverability, usability
3. Self-Serve Platform
   - Platform enables domain teams
   - Reduces friction
4. Federated Governance
   - Global standards
   - Local implementation
Data Products
Data Product = Autonomous unit of data
Components:
┌──────────────────────────────────────┐
│            Data Product              │
│  ┌──────────┐  ┌──────────────────┐  │
│  │   Data   │  │     Metadata     │  │
│  │ (Tables) │  │  (Schema, docs)  │  │
│  └──────────┘  └──────────────────┘  │
│  ┌──────────┐  ┌──────────────────┐  │
│  │   Code   │  │       APIs       │  │
│  │  (ETL)   │  │  (Access layer)  │  │
│  └──────────┘  └──────────────────┘  │
│  ┌──────────────────────────────────┐│
│  │          Quality + SLAs          ││
│  └──────────────────────────────────┘│
└──────────────────────────────────────┘
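A data product's contract can be captured as structured metadata alongside the data itself. The field names below are illustrative, not a standard schema; a minimal sketch:

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Illustrative data-product contract: data + metadata + access + SLAs."""
    name: str
    owner_domain: str          # domain ownership, not a central team
    output_tables: list        # the data
    schema_docs_url: str       # metadata: schema and documentation
    api_endpoint: str          # access layer
    freshness_sla_hours: int   # SLA
    quality_checks: list = field(default_factory=list)

# Hypothetical example product owned by a "sales" domain team.
orders = DataProduct(
    name="orders",
    owner_domain="sales",
    output_tables=["orders_daily"],
    schema_docs_url="https://catalog.example.com/orders",
    api_endpoint="/data-products/orders/v1",
    freshness_sla_hours=24,
    quality_checks=["not_null(order_id)", "unique(order_id)"],
)
print(orders.owner_domain)  # sales
```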
Data Mesh vs Centralized
| Aspect           | Centralized         | Data Mesh        |
|------------------|---------------------|------------------|
| Ownership        | Central data team   | Domain teams     |
| Scaling          | Team bottleneck     | Scales with org  |
| Domain knowledge | Lost in translation | Preserved        |
| Governance       | Centralized         | Federated        |
| Implementation   | Uniform             | Heterogeneous    |
| Complexity       | Lower initially     | Higher initially |
Data Modeling Patterns
Star Schema
              ┌─────────────┐
              │  Dim_Time   │
              └──────┬──────┘
                     │
┌───────────┐        │        ┌────────────┐
│Dim_Product├────────┼────────┤Dim_Customer│
└───────────┘        │        └────────────┘
              ┌──────┴──────┐
              │ Fact_Sales  │
              └─────────────┘
Pros: Simple, fast queries
Cons: Denormalization, data redundancy
Best for: BI, reporting
Snowflake Schema
Normalized dimensions: Dim_Product → Dim_Subcategory → Dim_Category
Pros: Less redundancy
Cons: More joins, slower queries
Best for: Complex dimension hierarchies
Data Vault
Hub (business keys) ←→ Link (relationships) ←→ Satellite (attributes)
Pros: Auditable, flexible, scalable
Cons: Complex, steep learning curve
Best for: Enterprise data warehouses
Storage Layers
Bronze/Silver/Gold (Medallion Architecture)
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Bronze  │ ──► │ Silver  │ ──► │  Gold   │
│  (Raw)  │     │(Cleaned)│     │(Curated)│
└─────────┘     └─────────┘     └─────────┘
Bronze: Raw ingestion, append-only
Silver: Cleaned, validated, conformed
Gold: Business-level aggregates and features
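The same bronze → silver → gold flow, sketched on plain Python records to make each layer's role concrete. In practice these steps would run as Spark or SQL jobs over lakehouse tables; field names here are illustrative.

```python
# Bronze: raw, append-only ingestion (may contain bad or untyped records).
bronze = [
    {"order_id": "1", "amount": "100.0", "country": "us"},
    {"order_id": None, "amount": "50.0", "country": "de"},  # bad: missing key
    {"order_id": "2", "amount": "25.0", "country": "DE"},
]

# Silver: cleaned, validated, conformed (typed fields, bad rows dropped,
# country codes normalized to one casing).
silver = [
    {"order_id": int(r["order_id"]), "amount": float(r["amount"]),
     "country": r["country"].upper()}
    for r in bronze if r["order_id"] is not None
]

# Gold: business-level aggregate, ready for BI dashboards.
gold = {}
for r in silver:
    gold[r["country"]] = gold.get(r["country"], 0.0) + r["amount"]

print(gold)  # {'US': 100.0, 'DE': 25.0}
```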
Zones in Data Lake
Landing Zone: Raw files from sources Raw Zone: Structured raw data Curated Zone: Transformed, quality-checked Consumption Zone: Ready for analytics Sandbox Zone: Exploration and experimentation
Best Practices
Data Quality
Implement quality gates:
- Schema validation
- Null checks
- Range validation
- Referential integrity
- Freshness monitoring
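Quality gates usually run as predicate checks on a batch before it is promoted to the next layer. A stdlib-only sketch with hypothetical rules and field names (referential-integrity checks would need a second dataset, so they are omitted here):

```python
from datetime import datetime, timedelta, timezone

def run_quality_gates(rows, max_age_hours=24):
    """Return the names of failed gates; an empty list means the batch passes."""
    failures = []
    required = {"order_id", "amount", "updated_at"}
    # Schema validation: every row must carry the expected fields.
    if any(required - row.keys() for row in rows):
        failures.append("schema")
    # Null check on the business key.
    if any(row.get("order_id") is None for row in rows):
        failures.append("null_check")
    # Range validation: amounts must be non-negative.
    if any(row.get("amount", 0) < 0 for row in rows):
        failures.append("range")
    # Freshness monitoring: the newest record must be recent enough.
    if "schema" not in failures:
        newest = max(row["updated_at"] for row in rows)
        if datetime.now(timezone.utc) - newest > timedelta(hours=max_age_hours):
            failures.append("freshness")
    return failures

now = datetime.now(timezone.utc)
batch = [{"order_id": 1, "amount": 10.0, "updated_at": now}]
print(run_quality_gates(batch))  # []
```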
Governance
Key capabilities:
- Data catalog
- Lineage tracking
- Access control
- Privacy compliance
- Audit logging
Performance
Optimization techniques:
- Partitioning (by date, region)
- Clustering/Z-ordering
- Compaction
- Caching
- Materialized views
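Partitioning helps because engines can prune whole partitions from a scan using values encoded in the storage path (Hive-style layout). A sketch of that layout and the pruning logic, with hypothetical paths:

```python
from datetime import date

# Hive-style layout: one directory per partition value.
files = [
    "sales/date=2024-01-01/part-000.parquet",
    "sales/date=2024-01-02/part-000.parquet",
    "sales/date=2024-01-03/part-000.parquet",
]

def prune(files, start: date, end: date):
    """Partition pruning: keep only files whose partition is in [start, end],
    without reading any file contents."""
    kept = []
    for path in files:
        value = path.split("date=")[1].split("/")[0]
        if start <= date.fromisoformat(value) <= end:
            kept.append(path)
    return kept

print(prune(files, date(2024, 1, 2), date(2024, 1, 3)))
```

A query filtered to two days touches two directories instead of the whole table, which is why partitioning by a commonly filtered column (date, region) pays off.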
Related Skills
- etl-elt-patterns - Data transformation
- stream-processing - Real-time data
- database-scaling - Database patterns