data-warehousing

Snowflake, BigQuery, Redshift, dimensional modeling, and modern data warehouse architecture

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "data-warehousing" with this command: npx skills add pluginagentmarketplace/custom-plugin-data-engineer/pluginagentmarketplace-custom-plugin-data-engineer-data-warehousing

Data Warehousing

Production-grade data warehouse design with Snowflake, BigQuery, and dimensional modeling patterns.

Quick Start

-- Snowflake Modern Data Warehouse Setup
CREATE WAREHOUSE analytics_wh
    WITH WAREHOUSE_SIZE = 'MEDIUM'
    AUTO_SUSPEND = 300
    AUTO_RESUME = TRUE
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 4;

-- Create dimensional model
CREATE TABLE marts.fact_orders (
    order_key BIGINT AUTOINCREMENT PRIMARY KEY,
    date_key INT NOT NULL REFERENCES dim_date(date_key),
    customer_key INT NOT NULL,
    product_key INT NOT NULL,
    quantity INT NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    total_amount DECIMAL(12,2) NOT NULL,
    _loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
) CLUSTER BY (date_key);

-- Dimension with SCD Type 2
CREATE TABLE marts.dim_customer (
    customer_key INT AUTOINCREMENT PRIMARY KEY,
    customer_id VARCHAR(50) NOT NULL,
    customer_name VARCHAR(255),
    segment VARCHAR(50),
    valid_from DATE NOT NULL,
    valid_to DATE DEFAULT '9999-12-31',
    is_current BOOLEAN DEFAULT TRUE
);
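A fact load typically resolves surrogate keys by joining staging rows to the current dimension versions. A minimal sketch, assuming a staging.orders table and a dim_product dimension shaped like dim_customer (both hypothetical here):

-- Resolve surrogate keys against current dimension rows, then load the fact
INSERT INTO marts.fact_orders
    (date_key, customer_key, product_key, quantity, unit_price, total_amount)
SELECT
    TO_NUMBER(TO_CHAR(o.order_date, 'YYYYMMDD')) AS date_key,
    c.customer_key,
    p.product_key,
    o.quantity,
    o.unit_price,
    o.quantity * o.unit_price AS total_amount
FROM staging.orders o
JOIN marts.dim_customer c
    ON c.customer_id = o.customer_id AND c.is_current = TRUE
JOIN marts.dim_product p
    ON p.product_id = o.product_id AND p.is_current = TRUE;

Joining on is_current = TRUE ties each fact row to the dimension version that was active at load time.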

Core Concepts

1. Dimensional Modeling (Kimball)

-- Star Schema Design
-- Fact table: measurable business events
-- Dimension tables: context for analysis

-- Date dimension (conformed)
CREATE TABLE dim_date (
    date_key INT PRIMARY KEY,
    full_date DATE NOT NULL,
    day_of_week INT,
    day_name VARCHAR(10),
    month_num INT,
    month_name VARCHAR(10),
    quarter INT,
    year INT,
    is_weekend BOOLEAN,
    fiscal_year INT,
    fiscal_quarter INT
);
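To fill the conformed date dimension without manual inserts, Snowflake's row generator can enumerate a date range. A sketch for 2020-2029; the fiscal columns assume a calendar-aligned fiscal year, and DAYOFWEEK numbering depends on the WEEK_START session parameter:

-- Populate dim_date from a generated date range
INSERT INTO dim_date
SELECT
    TO_NUMBER(TO_CHAR(d, 'YYYYMMDD')) AS date_key,
    d AS full_date,
    DAYOFWEEK(d) AS day_of_week,
    DAYNAME(d) AS day_name,
    MONTH(d) AS month_num,
    MONTHNAME(d) AS month_name,
    QUARTER(d) AS quarter,
    YEAR(d) AS year,
    DAYNAME(d) IN ('Sat', 'Sun') AS is_weekend,
    YEAR(d) AS fiscal_year,
    QUARTER(d) AS fiscal_quarter
FROM (
    SELECT DATEADD(day, SEQ4(), '2020-01-01'::DATE) AS d
    FROM TABLE(GENERATOR(ROWCOUNT => 3653))
);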

-- SCD Type 2 pattern: the MERGE expires changed rows and inserts brand-new
-- customers; a second pass inserts the new version of the expired rows.
-- (If the compared columns are nullable, use IS DISTINCT FROM instead of !=.)
MERGE INTO dim_customer AS target
USING staging_customer AS source
ON target.customer_id = source.customer_id AND target.is_current = TRUE
WHEN MATCHED AND (
    target.customer_name != source.customer_name OR
    target.segment != source.segment
) THEN UPDATE SET valid_to = CURRENT_DATE - 1, is_current = FALSE
WHEN NOT MATCHED THEN INSERT (
    customer_id, customer_name, segment, valid_from
) VALUES (
    source.customer_id, source.customer_name, source.segment, CURRENT_DATE
);

-- Second pass: insert the current version of rows the MERGE just expired
INSERT INTO dim_customer (customer_id, customer_name, segment, valid_from)
SELECT s.customer_id, s.customer_name, s.segment, CURRENT_DATE
FROM staging_customer s
JOIN dim_customer d
    ON d.customer_id = s.customer_id
   AND d.is_current = FALSE
   AND d.valid_to = CURRENT_DATE - 1;

2. Snowflake Optimization

-- Clustering for performance
ALTER TABLE fact_orders CLUSTER BY (date_key, customer_key);
SELECT SYSTEM$CLUSTERING_INFORMATION('fact_orders');

-- Materialized views for aggregations
CREATE MATERIALIZED VIEW mv_daily_sales AS
SELECT date_key, SUM(total_amount) AS daily_revenue, COUNT(*) AS order_count
FROM fact_orders GROUP BY date_key;

-- Search optimization for selective point lookups (pick a column that
-- exists on the table and is queried with equality predicates)
ALTER TABLE fact_orders ADD SEARCH OPTIMIZATION ON EQUALITY(customer_key);

-- Time travel for debugging
SELECT * FROM fact_orders AT(TIMESTAMP => '2024-01-15 10:00:00'::TIMESTAMP);

-- Zero-copy cloning
CREATE TABLE fact_orders_dev CLONE fact_orders;
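Time travel also accepts a relative offset (in seconds), which is handy for checking what a load just changed. A sketch comparing current row counts against the state one hour ago:

-- How many rows did the last hour's loads add?
SELECT
    (SELECT COUNT(*) FROM fact_orders) AS current_rows,
    (SELECT COUNT(*) FROM fact_orders AT(OFFSET => -3600)) AS rows_one_hour_ago;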

3. BigQuery Patterns

-- Partitioned and clustered table
CREATE TABLE `project.dataset.fact_events`
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_type
OPTIONS (partition_expiration_days = 365, require_partition_filter = TRUE)
AS SELECT * FROM source_events;

-- Efficient query with partition pruning
SELECT event_type, COUNT(*) AS event_count
FROM `project.dataset.fact_events`
WHERE DATE(event_timestamp) BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY event_type;

-- BigQuery ML inline
CREATE OR REPLACE MODEL `project.dataset.churn_model`
OPTIONS (model_type = 'LOGISTIC_REG', input_label_cols = ['churned'])
AS SELECT tenure_months, monthly_spend, churned FROM customer_features;
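Once trained, the model scores rows directly in SQL via ML.PREDICT; BigQuery ML appends predicted_<label> columns to the input. A sketch, assuming customer_features also carries a customer_id:

-- Score customers with the trained churn model
SELECT customer_id, predicted_churned, predicted_churned_probs
FROM ML.PREDICT(
    MODEL `project.dataset.churn_model`,
    (SELECT customer_id, tenure_months, monthly_spend
     FROM `project.dataset.customer_features`)
);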

Tools & Technologies

| Tool           | Purpose              | Version (2025) |
|----------------|----------------------|----------------|
| Snowflake      | Cloud data warehouse | Latest         |
| BigQuery       | Serverless analytics | Latest         |
| Redshift       | AWS data warehouse   | Serverless     |
| Databricks SQL | Lakehouse analytics  | Latest         |
| dbt            | Transformation       | 1.7+           |
| Monte Carlo    | Data observability   | Latest         |

Troubleshooting Guide

| Issue      | Symptoms          | Root Cause          | Fix                      |
|------------|-------------------|---------------------|--------------------------|
| Slow query | Query timeout     | No clustering       | Add clustering key       |
| High cost  | Budget exceeded   | Oversized warehouse | Auto-suspend, right-size |
| Data skew  | Uneven processing | Poor partition key  | Choose better key        |
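To diagnose the high-cost case, per-warehouse credit consumption can be checked in Snowflake's ACCOUNT_USAGE share (which lags real time by up to a few hours). A sketch:

-- Credits consumed per warehouse over the last 7 days
SELECT warehouse_name, SUM(credits_used) AS credits_7d
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits_7d DESC;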

Best Practices

-- ✅ DO: Use surrogate keys
customer_key INT AUTOINCREMENT PRIMARY KEY

-- ✅ DO: Add audit columns
_loaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()

-- ✅ DO: Cluster on filter columns
CLUSTER BY (date_key)

-- ❌ DON'T: Use natural keys as PK
-- ❌ DON'T: SELECT * in production

Skill Certification Checklist:

  • Can design star/snowflake schemas
  • Can implement SCD Type 2 dimensions
  • Can optimize with clustering/partitioning
  • Can monitor and optimize costs

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

  • statistics-math
  • deep-learning
  • data-engineering