data-warehousing

Snowflake, BigQuery, Redshift, dimensional modeling, and modern data warehouse architecture

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "data-warehousing" with this command: npx skills add pluginagentmarketplace/custom-plugin-data-engineer/pluginagentmarketplace-custom-plugin-data-engineer-data-warehousing

Data Warehousing

Production-grade data warehouse design with Snowflake, BigQuery, and dimensional modeling patterns.

Quick Start

-- Snowflake Modern Data Warehouse Setup
CREATE WAREHOUSE analytics_wh
    WITH WAREHOUSE_SIZE = 'MEDIUM'
    AUTO_SUSPEND = 300
    AUTO_RESUME = TRUE
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 4;

-- Create dimensional model
CREATE TABLE marts.fact_orders (
    order_key BIGINT AUTOINCREMENT PRIMARY KEY,
    date_key INT NOT NULL REFERENCES dim_date(date_key),
    customer_key INT NOT NULL,
    product_key INT NOT NULL,
    quantity INT NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    total_amount DECIMAL(12,2) NOT NULL,
    _loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
) CLUSTER BY (date_key);

-- Dimension with SCD Type 2
CREATE TABLE marts.dim_customer (
    customer_key INT AUTOINCREMENT PRIMARY KEY,
    customer_id VARCHAR(50) NOT NULL,
    customer_name VARCHAR(255),
    segment VARCHAR(50),
    valid_from DATE NOT NULL,
    valid_to DATE DEFAULT '9999-12-31',
    is_current BOOLEAN DEFAULT TRUE
);
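A fact load typically resolves surrogate keys by joining staging rows to the current dimension versions. A minimal sketch, assuming a staging.orders table and a dim_product dimension shaped like dim_customer (both hypothetical here):

-- Resolve surrogate keys against current dimension rows, then load the fact
INSERT INTO marts.fact_orders
    (date_key, customer_key, product_key, quantity, unit_price, total_amount)
SELECT
    TO_NUMBER(TO_CHAR(o.order_date, 'YYYYMMDD')) AS date_key,
    c.customer_key,
    p.product_key,
    o.quantity,
    o.unit_price,
    o.quantity * o.unit_price AS total_amount
FROM staging.orders o
JOIN marts.dim_customer c
    ON c.customer_id = o.customer_id AND c.is_current = TRUE
JOIN marts.dim_product p
    ON p.product_id = o.product_id AND p.is_current = TRUE;

Joining on is_current = TRUE ties each fact row to the dimension version that was active at load time.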

Core Concepts

1. Dimensional Modeling (Kimball)

-- Star Schema Design
-- Fact table: measurable business events
-- Dimension tables: context for analysis

-- Date dimension (conformed)
CREATE TABLE dim_date (
    date_key INT PRIMARY KEY,
    full_date DATE NOT NULL,
    day_of_week INT,
    day_name VARCHAR(10),
    month_num INT,
    month_name VARCHAR(10),
    quarter INT,
    year INT,
    is_weekend BOOLEAN,
    fiscal_year INT,
    fiscal_quarter INT
);
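To fill the conformed date dimension without manual inserts, Snowflake's row generator can enumerate a date range. A sketch for 2020-2029; the fiscal columns assume a calendar-aligned fiscal year, and DAYOFWEEK numbering depends on the WEEK_START session parameter:

-- Populate dim_date from a generated date range
INSERT INTO dim_date
SELECT
    TO_NUMBER(TO_CHAR(d, 'YYYYMMDD')) AS date_key,
    d AS full_date,
    DAYOFWEEK(d) AS day_of_week,
    DAYNAME(d) AS day_name,
    MONTH(d) AS month_num,
    MONTHNAME(d) AS month_name,
    QUARTER(d) AS quarter,
    YEAR(d) AS year,
    DAYNAME(d) IN ('Sat', 'Sun') AS is_weekend,
    YEAR(d) AS fiscal_year,
    QUARTER(d) AS fiscal_quarter
FROM (
    SELECT DATEADD(day, SEQ4(), '2020-01-01'::DATE) AS d
    FROM TABLE(GENERATOR(ROWCOUNT => 3653))
);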

-- SCD Type 2 pattern: the MERGE expires changed rows and inserts brand-new
-- customers; a second pass inserts the new version of the expired rows.
-- (If the compared columns are nullable, use IS DISTINCT FROM instead of !=.)
MERGE INTO dim_customer AS target
USING staging_customer AS source
ON target.customer_id = source.customer_id AND target.is_current = TRUE
WHEN MATCHED AND (
    target.customer_name != source.customer_name OR
    target.segment != source.segment
) THEN UPDATE SET valid_to = CURRENT_DATE - 1, is_current = FALSE
WHEN NOT MATCHED THEN INSERT (
    customer_id, customer_name, segment, valid_from
) VALUES (
    source.customer_id, source.customer_name, source.segment, CURRENT_DATE
);

-- Second pass: insert the current version of rows the MERGE just expired
INSERT INTO dim_customer (customer_id, customer_name, segment, valid_from)
SELECT s.customer_id, s.customer_name, s.segment, CURRENT_DATE
FROM staging_customer s
JOIN dim_customer d
    ON d.customer_id = s.customer_id
   AND d.is_current = FALSE
   AND d.valid_to = CURRENT_DATE - 1;

2. Snowflake Optimization

-- Clustering for performance
ALTER TABLE fact_orders CLUSTER BY (date_key, customer_key);
SELECT SYSTEM$CLUSTERING_INFORMATION('fact_orders');

-- Materialized views for aggregations
CREATE MATERIALIZED VIEW mv_daily_sales AS
SELECT date_key, SUM(total_amount) AS daily_revenue, COUNT(*) AS order_count
FROM fact_orders GROUP BY date_key;

-- Search optimization for selective point lookups (pick a column that
-- exists on the table and is queried with equality predicates)
ALTER TABLE fact_orders ADD SEARCH OPTIMIZATION ON EQUALITY(customer_key);

-- Time travel for debugging
SELECT * FROM fact_orders AT(TIMESTAMP => '2024-01-15 10:00:00'::TIMESTAMP);

-- Zero-copy cloning
CREATE TABLE fact_orders_dev CLONE fact_orders;
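Time travel also accepts a relative offset (in seconds), which is handy for checking what a load just changed. A sketch comparing current row counts against the state one hour ago:

-- How many rows did the last hour's loads add?
SELECT
    (SELECT COUNT(*) FROM fact_orders) AS current_rows,
    (SELECT COUNT(*) FROM fact_orders AT(OFFSET => -3600)) AS rows_one_hour_ago;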

3. BigQuery Patterns

-- Partitioned and clustered table
CREATE TABLE `project.dataset.fact_events`
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_type
OPTIONS (partition_expiration_days = 365, require_partition_filter = TRUE)
AS SELECT * FROM source_events;

-- Efficient query with partition pruning
SELECT event_type, COUNT(*) AS event_count
FROM `project.dataset.fact_events`
WHERE DATE(event_timestamp) BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY event_type;

-- BigQuery ML inline
CREATE OR REPLACE MODEL `project.dataset.churn_model`
OPTIONS (model_type = 'LOGISTIC_REG', input_label_cols = ['churned'])
AS SELECT tenure_months, monthly_spend, churned FROM customer_features;
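Once trained, the model scores rows directly in SQL via ML.PREDICT; BigQuery ML appends predicted_<label> columns to the input. A sketch, assuming customer_features also carries a customer_id:

-- Score customers with the trained churn model
SELECT customer_id, predicted_churned, predicted_churned_probs
FROM ML.PREDICT(
    MODEL `project.dataset.churn_model`,
    (SELECT customer_id, tenure_months, monthly_spend
     FROM `project.dataset.customer_features`)
);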

Tools & Technologies

| Tool           | Purpose              | Version (2025) |
|----------------|----------------------|----------------|
| Snowflake      | Cloud data warehouse | Latest         |
| BigQuery       | Serverless analytics | Latest         |
| Redshift       | AWS data warehouse   | Serverless     |
| Databricks SQL | Lakehouse analytics  | Latest         |
| dbt            | Transformation       | 1.7+           |
| Monte Carlo    | Data observability   | Latest         |

Troubleshooting Guide

| Issue      | Symptoms          | Root Cause          | Fix                      |
|------------|-------------------|---------------------|--------------------------|
| Slow query | Query timeout     | No clustering       | Add clustering key       |
| High cost  | Budget exceeded   | Oversized warehouse | Auto-suspend, right-size |
| Data skew  | Uneven processing | Poor partition key  | Choose better key        |
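To diagnose the high-cost case, per-warehouse credit consumption can be checked in Snowflake's ACCOUNT_USAGE share (which lags real time by up to a few hours). A sketch:

-- Credits consumed per warehouse over the last 7 days
SELECT warehouse_name, SUM(credits_used) AS credits_7d
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits_7d DESC;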

Best Practices

-- ✅ DO: Use surrogate keys
customer_key INT AUTOINCREMENT PRIMARY KEY

-- ✅ DO: Add audit columns
_loaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()

-- ✅ DO: Cluster on filter columns
CLUSTER BY (date_key)

-- ❌ DON'T: Use natural keys as PK
-- ❌ DON'T: SELECT * in production

Skill Certification Checklist:

  • Can design star/snowflake schemas
  • Can implement SCD Type 2 dimensions
  • Can optimize with clustering/partitioning
  • Can monitor and optimize costs

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

  • statistics-math
  • deep-learning
  • data-engineering