Data Architecture
Modern data architecture patterns including data lakes, lakehouses, data mesh, and data platform design.
When to Use This Skill
- Choosing between data lake, warehouse, and lakehouse
- Designing a modern data platform
- Implementing data mesh principles
- Planning data storage strategy
- Understanding data architecture trade-offs
Data Architecture Evolution
Generation 1: Data Warehouse (1990s-2000s)
- Structured data only
- ETL into warehouse
- Star/snowflake schemas
- SQL-based analytics
Generation 2: Data Lake (2010s)
- All data types (structured, semi, unstructured)
- Schema-on-read
- Hadoop/HDFS based
- Cheap storage, complex processing
Generation 3: Lakehouse (2020s)
- Best of both: lake flexibility + warehouse features
- ACID transactions on lake
- Schema enforcement optional
- Unified analytics and ML
Architecture Comparison
Data Warehouse
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Sources   │ ──► │     ETL     │ ──► │  Warehouse  │
│ (Structured)│     │ (Transform) │     │ (Star/Snow) │
└─────────────┘     └─────────────┘     └─────────────┘
                                               │
                                               ▼
                                        ┌─────────────┐
                                        │     BI      │
                                        │  Analytics  │
                                        └─────────────┘
Characteristics:
- Schema-on-write
- Optimized for SQL queries
- Structured data only
- High data quality
- Expensive storage
Best for:
- Business intelligence
- Financial reporting
- Structured analytics
Data Lake
┌─────────────┐     ┌─────────────┐
│   Sources   │ ──► │  Data Lake  │
│    (All)    │     │    (Raw)    │
└─────────────┘     └─────────────┘
                           │
          ┌────────────────┼────────────────┐
          ▼                ▼                ▼
     ┌─────────┐      ┌─────────┐      ┌─────────┐
     │   ML    │      │   ETL   │      │  Spark  │
     │ Training│      │  to DW  │      │ Analysis│
     └─────────┘      └─────────┘      └─────────┘
Characteristics:
- Schema-on-read
- All data types
- Cheap storage
- Flexible processing
- Risk of "data swamp"
Best for:
- Data science/ML
- Unstructured data
- Experimental analysis
Data Lakehouse
┌─────────────┐     ┌─────────────────────────────────┐
│   Sources   │ ──► │         Data Lakehouse          │
│    (All)    │     │  ┌──────────────────────────┐   │
└─────────────┘     │  │      Metadata Layer      │   │
                    │  │   (Delta/Iceberg/Hudi)   │   │
                    │  └──────────────────────────┘   │
                    │  ┌──────────────────────────┐   │
                    │  │      Storage Layer       │   │
                    │  │     (Object Storage)     │   │
                    │  └──────────────────────────┘   │
                    └─────────────────────────────────┘
                                      │
                 ┌────────────────────┼────────────────────┐
                 ▼                    ▼                    ▼
            ┌─────────┐          ┌─────────┐          ┌─────────┐
            │   SQL   │          │   ML    │          │ Stream  │
            │   BI    │          │ Workload│          │ Process │
            └─────────┘          └─────────┘          └─────────┘
Characteristics:
- ACID transactions
- Schema evolution
- Time travel
- Unified batch/streaming
- Open formats
Best for:
- Unified analytics
- Both BI and ML
- Modern data platforms
Architecture Selection Guide
| Factor            | Warehouse  | Lake        | Lakehouse    |
|-------------------|------------|-------------|--------------|
| Data types        | Structured | All         | All          |
| Query performance | Excellent  | Poor-Medium | Good         |
| Data quality      | High       | Variable    | Configurable |
| Cost              | High       | Low         | Medium       |
| ML workloads      | Limited    | Excellent   | Excellent    |
| Real-time         | Limited    | Good        | Good         |
| Governance        | Strong     | Weak        | Strong       |
| Complexity        | Low        | High        | Medium       |
Decision Tree:
Is data mostly structured with BI focus?
├── Yes → Data Warehouse
└── No
    └── Need ML + BI on same data?
        ├── Yes → Lakehouse
        └── No
            └── Primarily ML/unstructured?
                ├── Yes → Data Lake
                └── No → Lakehouse
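The decision tree can also be read as a small function. The boolean inputs below are simplified, hypothetical flags standing in for the tree's questions; this is a sketch of the branching logic, not a complete selection framework.

```python
def recommend_architecture(structured_bi_focus: bool,
                           ml_and_bi_on_same_data: bool,
                           primarily_ml_unstructured: bool) -> str:
    """Encode the selection decision tree as nested conditionals."""
    if structured_bi_focus:
        return "Data Warehouse"
    if ml_and_bi_on_same_data:
        return "Lakehouse"
    if primarily_ml_unstructured:
        return "Data Lake"
    return "Lakehouse"  # default when no branch clearly applies

print(recommend_architecture(False, True, False))  # Lakehouse
```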
Lakehouse Technologies
Delta Lake (Databricks)
Features:
- ACID transactions
- Time travel (data versioning)
- Schema enforcement/evolution
- Unified batch/streaming
- Optimized performance (Z-ordering, compaction)
File format: Parquet + Delta log
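Conceptually, the Delta log is an ordered sequence of commits over Parquet data files, and a table version is reconstructed by replaying the log up to a commit; this is what makes time travel possible. A stdlib-only sketch of that replay, with simplified commit records rather than Delta's actual JSON format:

```python
# Simplified commits: each one adds/removes data files, like entries in a
# Delta transaction log (illustrative structure, not the real protocol).
log = [
    {"version": 0, "add": ["part-000.parquet"], "remove": []},
    {"version": 1, "add": ["part-001.parquet"], "remove": []},
    {"version": 2, "add": ["part-002.parquet"], "remove": ["part-000.parquet"]},
]

def files_at_version(log, version):
    """Replay the log up to `version` to reconstruct that table snapshot."""
    files = set()
    for commit in log:
        if commit["version"] > version:
            break
        files |= set(commit["add"])
        files -= set(commit["remove"])
    return sorted(files)

print(files_at_version(log, 1))  # ['part-000.parquet', 'part-001.parquet']
```

Reading version 2 sees the compacted/updated file set; reading version 1 "time travels" to the earlier snapshot, even though the underlying Parquet files are immutable.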
Apache Iceberg (Netflix)
Features:
- ACID transactions
- Hidden partitioning
- Schema evolution
- Time travel
- Vendor neutral
File format: Parquet/ORC/Avro + metadata
Apache Hudi (Uber)
Features:
- ACID transactions
- Incremental processing
- Record-level updates
- Time travel
- Optimized for streaming
File format: Parquet + Hudi metadata
Technology Comparison
| Feature          | Delta Lake         | Iceberg   | Hudi      |
|------------------|--------------------|-----------|-----------|
| ACID             | Yes                | Yes       | Yes       |
| Time travel      | Yes                | Yes       | Yes       |
| Schema evolution | Good               | Excellent | Good      |
| Streaming        | Excellent          | Good      | Excellent |
| Ecosystem        | Databricks-centric | Wide      | Wide      |
| Performance      | Excellent          | Excellent | Good      |
| Community        | Large              | Growing   | Medium    |
Data Mesh
Principles
Data Mesh = Decentralized data architecture
Four Principles:
1. Domain Ownership
   - Data owned by domain teams
   - Not by a centralized data team
2. Data as a Product
   - Treat data like a product
   - Quality, discoverability, usability
3. Self-Serve Platform
   - Platform enables domain teams
   - Reduces friction
4. Federated Governance
   - Global standards
   - Local implementation
Data Products
Data Product = Autonomous unit of data
Components:
┌──────────────────────────────────────┐
│            Data Product              │
│  ┌──────────┐  ┌──────────────────┐  │
│  │   Data   │  │     Metadata     │  │
│  │ (Tables) │  │  (Schema, docs)  │  │
│  └──────────┘  └──────────────────┘  │
│  ┌──────────┐  ┌──────────────────┐  │
│  │   Code   │  │       APIs       │  │
│  │  (ETL)   │  │  (Access layer)  │  │
│  └──────────┘  └──────────────────┘  │
│  ┌──────────────────────────────────┐│
│  │          Quality + SLAs          ││
│  └──────────────────────────────────┘│
└──────────────────────────────────────┘
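A data product's contract can be captured as structured metadata alongside the data itself. The field names below are illustrative, not a standard schema; a minimal sketch:

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Illustrative data-product contract: data + metadata + access + SLAs."""
    name: str
    owner_domain: str          # domain ownership, not a central team
    output_tables: list        # the data
    schema_docs_url: str       # metadata: schema and documentation
    api_endpoint: str          # access layer
    freshness_sla_hours: int   # SLA
    quality_checks: list = field(default_factory=list)

# Hypothetical example product owned by a "sales" domain team.
orders = DataProduct(
    name="orders",
    owner_domain="sales",
    output_tables=["orders_daily"],
    schema_docs_url="https://catalog.example.com/orders",
    api_endpoint="/data-products/orders/v1",
    freshness_sla_hours=24,
    quality_checks=["not_null(order_id)", "unique(order_id)"],
)
print(orders.owner_domain)  # sales
```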
Data Mesh vs Centralized
| Aspect           | Centralized         | Data Mesh        |
|------------------|---------------------|------------------|
| Ownership        | Central data team   | Domain teams     |
| Scaling          | Team bottleneck     | Scales with org  |
| Domain knowledge | Lost in translation | Preserved        |
| Governance       | Centralized         | Federated        |
| Implementation   | Uniform             | Heterogeneous    |
| Complexity       | Lower initially     | Higher initially |
Data Modeling Patterns
Star Schema
              ┌─────────────┐
              │  Dim_Time   │
              └──────┬──────┘
                     │
┌───────────┐        │        ┌────────────┐
│Dim_Product├────────┼────────┤Dim_Customer│
└───────────┘        │        └────────────┘
              ┌──────┴──────┐
              │ Fact_Sales  │
              └─────────────┘
Pros: Simple, fast queries
Cons: Denormalization, data redundancy
Best for: BI, reporting
Snowflake Schema
Normalized dimensions: Dim_Product → Dim_Subcategory → Dim_Category
Pros: Less redundancy
Cons: More joins, slower queries
Best for: Complex dimension hierarchies
Data Vault
Hub (business keys) ←→ Link (relationships) ←→ Satellite (attributes)
Pros: Auditable, flexible, scalable
Cons: Complex, steep learning curve
Best for: Enterprise data warehouses
Storage Layers
Bronze/Silver/Gold (Medallion Architecture)
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Bronze  │ ──► │ Silver  │ ──► │  Gold   │
│  (Raw)  │     │(Cleaned)│     │(Curated)│
└─────────┘     └─────────┘     └─────────┘
Bronze: Raw ingestion, append-only
Silver: Cleaned, validated, conformed
Gold: Business-level aggregates and features
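The same bronze → silver → gold flow, sketched on plain Python records to make each layer's role concrete. In practice these steps would run as Spark or SQL jobs over lakehouse tables; field names here are illustrative.

```python
# Bronze: raw, append-only ingestion (may contain bad or untyped records).
bronze = [
    {"order_id": "1", "amount": "100.0", "country": "us"},
    {"order_id": None, "amount": "50.0", "country": "de"},  # bad: missing key
    {"order_id": "2", "amount": "25.0", "country": "DE"},
]

# Silver: cleaned, validated, conformed (typed fields, bad rows dropped,
# country codes normalized to one casing).
silver = [
    {"order_id": int(r["order_id"]), "amount": float(r["amount"]),
     "country": r["country"].upper()}
    for r in bronze if r["order_id"] is not None
]

# Gold: business-level aggregate, ready for BI dashboards.
gold = {}
for r in silver:
    gold[r["country"]] = gold.get(r["country"], 0.0) + r["amount"]

print(gold)  # {'US': 100.0, 'DE': 25.0}
```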
Zones in Data Lake
Landing Zone: Raw files from sources Raw Zone: Structured raw data Curated Zone: Transformed, quality-checked Consumption Zone: Ready for analytics Sandbox Zone: Exploration and experimentation
Best Practices
Data Quality
Implement quality gates:
- Schema validation
- Null checks
- Range validation
- Referential integrity
- Freshness monitoring
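Quality gates usually run as predicate checks on a batch before it is promoted to the next layer. A stdlib-only sketch with hypothetical rules and field names (referential-integrity checks would need a second dataset, so they are omitted here):

```python
from datetime import datetime, timedelta, timezone

def run_quality_gates(rows, max_age_hours=24):
    """Return the names of failed gates; an empty list means the batch passes."""
    failures = []
    required = {"order_id", "amount", "updated_at"}
    # Schema validation: every row must carry the expected fields.
    if any(required - row.keys() for row in rows):
        failures.append("schema")
    # Null check on the business key.
    if any(row.get("order_id") is None for row in rows):
        failures.append("null_check")
    # Range validation: amounts must be non-negative.
    if any(row.get("amount", 0) < 0 for row in rows):
        failures.append("range")
    # Freshness monitoring: the newest record must be recent enough.
    if "schema" not in failures:
        newest = max(row["updated_at"] for row in rows)
        if datetime.now(timezone.utc) - newest > timedelta(hours=max_age_hours):
            failures.append("freshness")
    return failures

now = datetime.now(timezone.utc)
batch = [{"order_id": 1, "amount": 10.0, "updated_at": now}]
print(run_quality_gates(batch))  # []
```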
Governance
Key capabilities:
- Data catalog
- Lineage tracking
- Access control
- Privacy compliance
- Audit logging
Performance
Optimization techniques:
- Partitioning (by date, region)
- Clustering/Z-ordering
- Compaction
- Caching
- Materialized views
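Partitioning helps because engines can prune whole partitions from a scan using values encoded in the storage path (Hive-style layout). A sketch of that layout and the pruning logic, with hypothetical paths:

```python
from datetime import date

# Hive-style layout: one directory per partition value.
files = [
    "sales/date=2024-01-01/part-000.parquet",
    "sales/date=2024-01-02/part-000.parquet",
    "sales/date=2024-01-03/part-000.parquet",
]

def prune(files, start: date, end: date):
    """Partition pruning: keep only files whose partition is in [start, end],
    without reading any file contents."""
    kept = []
    for path in files:
        value = path.split("date=")[1].split("/")[0]
        if start <= date.fromisoformat(value) <= end:
            kept.append(path)
    return kept

print(prune(files, date(2024, 1, 2), date(2024, 1, 3)))
```

A query filtered to two days touches two directories instead of the whole table, which is why partitioning by a commonly filtered column (date, region) pays off.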
Related Skills
- etl-elt-patterns - Data transformation
- stream-processing - Real-time data
- database-scaling - Database patterns