clickhouse-materialized-views

Guides for implementing real-time data aggregation and transformation pipelines using ClickHouse materialized views. Use when building real-time dashboards, streaming analytics, metrics aggregation, event processing, or continuous data transformation. Covers incremental aggregation, State/Merge functions, chaining views, schema evolution, and troubleshooting streaming pipelines.

Installation

Install the skill with: npx skills add dawiddutoit/custom-claude/dawiddutoit-custom-claude-clickhouse-materialized-views

Table of Contents

  1. Purpose
  2. Quick Start
  3. Instructions
  4. Examples & Patterns
  5. Requirements
  6. References

ClickHouse Materialized Views Skill

Purpose

Materialized views in ClickHouse are insert-time triggers that automatically transform and aggregate incoming data in real time. Unlike materialized views in traditional databases, they process only new data as it arrives, enabling efficient streaming analytics, real-time dashboards, and continuous aggregation without scheduled batch jobs.

When to Use This Skill

Use when asked to:

  • Build real-time data aggregation pipelines ("create real-time metrics")
  • Implement streaming analytics or continuous aggregation
  • Create pre-aggregated tables for fast dashboard queries
  • Set up automatic data transformation as events arrive
  • Chain multi-level aggregations (hourly → daily → monthly)

Do NOT use when:

  • Data is batch-processed infrequently (use scheduled jobs instead)
  • Aggregation logic changes often (materialized views are harder to modify)
  • Source data is small and queries are already fast

Quick Start

Create a simple real-time aggregation pipeline:

-- Step 1: Create source table (raw events)
CREATE TABLE events (
    user_id UInt32,
    event_type String,
    timestamp DateTime,
    revenue Decimal(10, 2)
) ENGINE = MergeTree()
ORDER BY (user_id, timestamp);

-- Step 2: Create destination table (aggregated)
CREATE TABLE daily_revenue (
    date Date,
    total_revenue Decimal(18, 2),
    event_count UInt64
) ENGINE = SummingMergeTree()
ORDER BY date;

-- Step 3: Create materialized view (automatic aggregation)
CREATE MATERIALIZED VIEW daily_revenue_mv TO daily_revenue AS
SELECT
    toDate(timestamp) as date,
    SUM(revenue) as total_revenue,
    COUNT() as event_count
FROM events
GROUP BY date;

-- Now every INSERT into events automatically updates daily_revenue
INSERT INTO events VALUES (12345, 'purchase', '2024-01-15 10:30:00', 99.99);
SELECT * FROM daily_revenue;  -- Result: 2024-01-15 | 99.99 | 1
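
Note: SummingMergeTree folds rows with the same key only during background merges, so after many inserts the table may briefly hold several partial rows per date. Aggregate at read time to get exact results:

-- Collapses any not-yet-merged partial rows
SELECT
    date,
    SUM(total_revenue) as total_revenue,
    SUM(event_count) as event_count
FROM daily_revenue
GROUP BY date;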

Instructions

Step 1: Choose Your Aggregation Pattern

Materialized views support three main patterns depending on your use case:

Incremental Views (Most Common)

  • Process only new data as it arrives
  • Best for streaming event data and continuous updates
  • Pair with SummingMergeTree so rows with the same key are summed during background merges

Refreshable Views

  • Recalculate entire view on a schedule (e.g., hourly)
  • Best for complex queries that need full recalculation
  • Syntax: CREATE MATERIALIZED VIEW ... REFRESH EVERY 1 HOUR AS ...
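
For example, a minimal refreshable view (a sketch; daily_summary is an assumed destination table, and refreshable views require a recent ClickHouse release):

CREATE MATERIALIZED VIEW daily_summary_mv
REFRESH EVERY 1 HOUR TO daily_summary AS
SELECT
    toDate(timestamp) as date,
    COUNT() as event_count
FROM events
GROUP BY date;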

Populate-Based Views

  • Backfill existing data when view is created
  • Warning: Can be expensive on large tables, and POPULATE cannot be combined with the TO clause
  • Syntax: CREATE MATERIALIZED VIEW ... POPULATE AS ...

See examples/aggregation-patterns.md for complete pattern examples.

Step 2: Select the Right Destination Engine

The destination table engine determines how your aggregated data behaves:

SummingMergeTree (Simple Aggregates)

  • Use for additive aggregates such as COUNT and SUM (for AVG, store the sum and count separately and divide at query time)
  • Automatically sums columns during background merges
  • No duplicate rows in final results
  • Example: Hourly sales totals, daily revenue

AggregatingMergeTree (Complex Aggregates)

  • Use with State/Merge functions for uniq, quantile, topK
  • Stores intermediate aggregation states
  • Combine with uniqState(), quantileState(), topKState() in views
  • Query with uniqMerge(), quantileMerge(), topKMerge() functions
  • Example: Unique user counts, percentile calculations

ReplacingMergeTree (Deduplication)

  • Use when you need to replace/update existing rows
  • Uses an optional version column to decide which record is the latest during merges
  • Example: User profile updates, entity state snapshots
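
A minimal sketch (table and column names are illustrative):

-- After merges, only the row with the highest updated_at survives per user_id
CREATE TABLE user_profiles (
    user_id UInt32,
    email String,
    updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;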

MergeTree (Raw Events)

  • Use when storing raw events without aggregation
  • No automatic deduplication
  • Example: Event audit trail, raw clickstream data

See references/engine-comparison.md for detailed comparison.

Step 3: Design Your Aggregation Query

Follow these principles:

GROUP BY Cardinality

  • Keep cardinality moderate (not millions of unique combinations)
  • Example: Good = GROUP BY date, product_id | Bad = GROUP BY user_id, timestamp
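
To sanity-check a candidate key before committing, estimate its cardinality (a sketch; assumes the events table carries a product_id column):

-- Approximate number of distinct GROUP BY combinations
SELECT uniq(toDate(timestamp), product_id) as key_cardinality
FROM events;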

ORDER BY for Query Patterns

  • Design ORDER BY for your most common query filters
  • If queries filter by date first: ORDER BY (date, product_id)
  • If queries filter by product first: ORDER BY (product_id, date)

Include Necessary Columns

  • Include columns needed for post-aggregation filtering
  • Store raw values alongside aggregates when needed for flexibility
  • Example: Store timestamp and revenue along with hourly aggregates

Use Conditional Aggregation

  • Use countIf(), sumIf(), avgIf() for event filtering
  • Filter inside the aggregate function rather than creating a separate view per condition
  • Example: countIf(event_type = 'purchase') as purchase_count
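
Putting these principles together, a sketch (again assuming events carries a product_id column):

CREATE TABLE hourly_product_stats (
    hour DateTime,
    product_id UInt32,
    purchase_count UInt64,
    purchase_revenue Decimal(18, 2)
) ENGINE = SummingMergeTree()
ORDER BY (hour, product_id);  -- matches dashboards that filter by time first

CREATE MATERIALIZED VIEW hourly_product_stats_mv TO hourly_product_stats AS
SELECT
    toStartOfHour(timestamp) as hour,
    product_id,
    countIf(event_type = 'purchase') as purchase_count,
    sumIf(revenue, event_type = 'purchase') as purchase_revenue
FROM events
GROUP BY hour, product_id;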

See examples/aggregation-patterns.md for SQL patterns.

Step 4: Implement State/Merge Functions for Complex Aggregates

When using AggregatingMergeTree, use special State/Merge function pairs:

In Materialized View (State Functions)

CREATE MATERIALIZED VIEW user_stats_mv TO user_stats AS
SELECT
    toDate(timestamp) as date,
    uniqState(user_id) as unique_users,
    quantileState(0.95)(response_time) as p95_latency
FROM events
GROUP BY date;
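
The destination table must declare matching AggregateFunction columns; a sketch (the UInt32 and Float64 argument types are assumptions about the source schema):

CREATE TABLE user_stats (
    date Date,
    unique_users AggregateFunction(uniq, UInt32),
    p95_latency AggregateFunction(quantile(0.95), Float64)
) ENGINE = AggregatingMergeTree()
ORDER BY date;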

In Queries (Merge Functions)

SELECT
    date,
    uniqMerge(unique_users) as total_unique_users,
    quantileMerge(0.95)(p95_latency) as latency_p95
FROM user_stats
GROUP BY date;

Common function pairs:

  • uniq() → uniqState() / uniqMerge()
  • sum() → sumState() / sumMerge()
  • avg() → avgState() / avgMerge()
  • quantile(0.95) → quantileState(0.95) / quantileMerge(0.95)
  • topK(10) → topKState(10) / topKMerge(10)

See references/state-merge-reference.md for all function pairs.

Step 5: Chain Views for Multi-Level Aggregation

Build hierarchical aggregations to reduce redundant computation:

-- Level 1: Raw events (source)
CREATE TABLE events (...) ENGINE = MergeTree() ORDER BY (user_id, timestamp);

-- Level 2: Hourly aggregation
CREATE TABLE hourly_stats (...) ENGINE = SummingMergeTree() ORDER BY (hour, product_id);
CREATE MATERIALIZED VIEW hourly_stats_mv TO hourly_stats AS
SELECT
    toStartOfHour(timestamp) as hour,
    product_id,
    COUNT() as event_count,
    SUM(revenue) as total_revenue
FROM events
GROUP BY hour, product_id;

-- Level 3: Daily aggregation (chain from hourly, not raw events)
CREATE TABLE daily_stats (...) ENGINE = SummingMergeTree() ORDER BY (date, product_id);
CREATE MATERIALIZED VIEW daily_stats_mv TO daily_stats AS
SELECT
    toDate(hour) as date,
    product_id,
    SUM(event_count) as event_count,
    SUM(total_revenue) as total_revenue
FROM hourly_stats
GROUP BY date, product_id;

Benefits:

  • Reduces redundant computation (don't re-process raw events for daily stats)
  • Each level optimized for different query patterns
  • Easier to debug and modify intermediate aggregations

Step 6: Handle Schema Evolution and Versioning

When modifying aggregation logic:

-- Create new versioned view with updated schema
CREATE MATERIALIZED VIEW hourly_stats_v2_mv TO hourly_stats_v2 AS
SELECT
    toStartOfHour(timestamp) as hour,
    product_id,
    COUNT() as sales_count,
    SUM(revenue) as total_revenue,
    COUNT(DISTINCT user_id) as unique_customers  -- New column (to stay mergeable, prefer uniqState + AggregatingMergeTree)
FROM orders
GROUP BY hour, product_id;

-- Migrate queries gradually to v2
-- Then drop old view:
DROP VIEW hourly_stats_v1_mv;
DROP TABLE hourly_stats_v1;

Prefer this versioned replacement over ALTER TABLE on a view's destination: changing columns out from under the view's SELECT can break inserts or silently lose data.

Step 7: Monitor and Optimize View Performance

Check View Execution Performance

-- Per-view execution statistics are recorded in system.query_views_log
SELECT
    view_name,
    view_duration_ms,
    read_rows,
    peak_memory_usage
FROM system.query_views_log
WHERE view_duration_ms > 1000
ORDER BY view_duration_ms DESC;
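
Failed view executions land in the same log; a sketch for surfacing recent errors (assumes the query_views_log table is enabled in the server configuration):

SELECT
    view_name,
    exception,
    event_time
FROM system.query_views_log
WHERE status = 'ExceptionWhileProcessing'
ORDER BY event_time DESC
LIMIT 10;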

Common Performance Issues and Solutions

  • High memory usage: Reduce GROUP BY cardinality or add WHERE filters
  • Slow execution: Simplify aggregation logic or chain views instead of complex single view
  • Data lag: Increase kafka_max_block_size if consuming from Kafka, as sketched below
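
A sketch of where that setting lives, on a Kafka-engine source table (broker, topic, and group names are illustrative):

CREATE TABLE events_kafka (
    user_id UInt32,
    event_type String,
    timestamp DateTime,
    revenue Decimal(10, 2)
) ENGINE = Kafka
SETTINGS kafka_broker_list = 'localhost:9092',
         kafka_topic_list = 'events',
         kafka_group_name = 'clickhouse_events',
         kafka_format = 'JSONEachRow',
         kafka_max_block_size = 1048576;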

See references/troubleshooting.md for debugging guide.

Step 8: Backfill Historical Data

When adding new views to existing data:

Option 1: Direct Insert (Safest)

-- Backfill historical data directly to destination
INSERT INTO hourly_stats
SELECT
    toStartOfHour(timestamp) as hour,
    product_id,
    COUNT() as sales_count,
    SUM(revenue) as total_revenue
FROM orders
WHERE timestamp < '2024-01-01'
GROUP BY hour, product_id;
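
After the backfill, a quick consistency check (illustrative; column names follow the examples above):

-- Source and destination totals should match for the backfilled range
SELECT
    (SELECT SUM(revenue) FROM orders WHERE timestamp < '2024-01-01') as src_total,
    (SELECT SUM(total_revenue) FROM hourly_stats WHERE hour < '2024-01-01') as dst_total;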

Option 2: Use POPULATE (implicit destination only)

-- POPULATE backfills at creation time but cannot be combined with TO;
-- it only works when the view owns its implicit destination table
CREATE MATERIALIZED VIEW hourly_stats_mv
ENGINE = SummingMergeTree()
ORDER BY (hour, product_id)
POPULATE AS
SELECT ... FROM orders;

⚠️ Avoid POPULATE on large source tables (can be very slow).

Examples & Patterns

The following patterns are common use cases for materialized views:

Real-Time Dashboard Metrics

Pre-aggregate metrics to enable instant dashboard queries:

CREATE MATERIALIZED VIEW dashboard_metrics_mv TO dashboard_metrics AS
SELECT
    toStartOfHour(timestamp) as hour,
    COUNT() as total_events,
    uniq(user_id) as unique_users,  -- approximate; use uniqState + AggregatingMergeTree if counts must merge across inserts
    countIf(event_type = 'purchase') as purchases
FROM events
GROUP BY hour;

User Segmentation

Automatically segment users based on behavior:

CREATE MATERIALIZED VIEW user_segments_mv TO user_segments AS
SELECT
    user_id,
    COUNT() as event_count,
    SUM(revenue) as lifetime_value,
    CASE
        WHEN lifetime_value > 1000 THEN 'vip'
        WHEN lifetime_value > 100 THEN 'regular'
        ELSE 'casual'
    END as segment
FROM events
GROUP BY user_id;

Note: each insert block is aggregated independently, so event_count and lifetime_value here are per-block values; for true lifetime totals the destination engine must combine rows (for example, AggregatingMergeTree with state columns).

Multi-Destination Routing

Route data to multiple aggregations:

-- High-value orders
CREATE MATERIALIZED VIEW high_value_orders_mv TO high_value_orders AS
SELECT * FROM orders WHERE total_amount > 1000;

-- All orders
CREATE MATERIALIZED VIEW all_orders_mv TO all_orders AS
SELECT * FROM orders;

Complete Examples

For comprehensive, production-ready examples covering real-world scenarios, see the examples/ directory:

  • Real-time dashboards and metrics aggregation
  • User segmentation and cohort analysis
  • Funnel analysis and conversion tracking
  • Time-series downsampling and rollups
  • Kafka streaming pipelines with end-to-end examples

Requirements

  • ClickHouse: incremental materialized views and State/Merge combinators are available in all maintained releases
  • ClickHouse 23.12+: refreshable materialized views (experimental when introduced)
  • Python client (for querying views): pip install clickhouse-driver
  • Kafka/Redpanda: For streaming pipelines (optional)

References

Engine Comparison

references/engine-comparison.md - Detailed comparison of ClickHouse table engines for materialized views:

  • SummingMergeTree vs AggregatingMergeTree: When to use each for aggregation
  • ReplacingMergeTree: For deduplication and entity updates
  • Engine characteristics: Memory usage, merge behavior, query performance
  • Decision matrix: Choosing the right engine for your use case

State/Merge Functions Reference

references/state-merge-reference.md - Complete reference for all State/Merge function pairs:

  • Function pair mapping (State ↔ Merge)
  • Supported aggregation functions
  • Parameter specifications and usage
  • Performance considerations
  • Examples for each function

Kafka Streaming Pipeline

examples/kafka-streaming-pipeline.md - Real-world Kafka integration patterns:

  • Consuming from Kafka with Kafka table engine
  • Chaining materialized views with Kafka sources
  • Error handling and data validation
  • Complete end-to-end pipeline example
  • Monitoring Kafka consumption in ClickHouse

Aggregation Patterns

examples/aggregation-patterns.md - SQL pattern examples and common scenarios:

  • Real-time dashboards and metrics
  • User segmentation and cohort analysis
  • Funnel analysis and conversion tracking
  • Time-series downsampling and rollups
  • Copy-paste ready SQL examples

Complex Aggregates with State/Merge

examples/state-merge-aggregates.md - Advanced aggregation techniques:

  • Using uniq, quantile, and topK in views
  • Multi-stage aggregation with State/Merge
  • Accuracy vs performance tradeoffs
  • Real-world performance tuning examples
  • Working with approximate functions

Troubleshooting Guide

references/troubleshooting.md - Comprehensive debugging guide for common issues:

  • Performance problems and optimization strategies
  • Data correctness and consistency issues
  • Schema evolution and migration challenges
  • Monitoring and health checks
  • Common error messages and solutions
