ClickHouse Materialized Views

Load when creating Materialized Views for real-time aggregation, ETL pipelines, or query optimization.

Prerequisite: Understand MergeTree engine variants in clickhouse-schema-design skill.

Reference Documentation

Search terms: materialized view, MV, real-time aggregation, AggregateFunction, -State, -Merge, SummingMergeTree, pre-aggregation, incremental

Critical Rules

[CRITICAL]

MVs are triggers, not caches. They process INSERT data, not query results.
Use correct engine. AggregatingMergeTree for complex aggregates, SummingMergeTree for simple counters.
Query with -Merge functions or argMax. Aggregation completes at query time, not insert time.
Chained MVs see the pre-aggregated block, NOT the merged table state. A downstream MV (A→B→C) receives the block just inserted into B's target table — already grouped by the upstream MV's SELECT, but not merged with existing data. This means each insert batch is processed independently and -MergeState is required when chaining AggregatingMergeTree MVs.

[HIGH]

MV sees INSERT only. No backfill; existing data must be inserted manually.
ORDER BY in target must match GROUP BY in MV. Otherwise aggregation won't work properly.

[MEDIUM]

Avoid too many MVs on one source table when possible. Each MV adds overhead to every INSERT.

How Materialized Views Work

Source Table ──INSERT──► MV Transform ──► Target Table
                │
                └─ MV executes SELECT for each inserted block

Key insight: The MV's SELECT query runs on each INSERT batch. Results go to the target table. The MV does NOT query historical data.

Pattern 1: Real-Time Counters (SummingMergeTree)

Best for simple sums and counts that need real-time updates.

-- Source: raw events
CREATE TABLE events (
    event_time DateTime,
    tenant_id UInt32,
    event_type String,
    user_id UInt64
) ENGINE = MergeTree()
ORDER BY (tenant_id, event_time);

-- Target: daily counters
CREATE TABLE daily_event_counts (
    date Date,
    tenant_id UInt32,
    event_type LowCardinality(String),
    event_count UInt64,
    unique_users UInt64
)
ENGINE = SummingMergeTree()
ORDER BY (tenant_id, date, event_type);

-- MV: transform inserts
CREATE MATERIALIZED VIEW daily_event_counts_mv
TO daily_event_counts AS
SELECT
    toDate(event_time) AS date,
    tenant_id,
    event_type,
    count() AS event_count,
    uniq(user_id) AS unique_users  -- WARNING: Not additive!
FROM events
GROUP BY date, tenant_id, event_type;

Warning: uniq() in SummingMergeTree is not accurate—sums don't equal unique counts. Use AggregatingMergeTree for unique counts.

Pattern 2: Complex Aggregates (AggregatingMergeTree)

For accurate uniq, quantiles, or any non-additive aggregate.

-- Target: uses AggregateFunction types
CREATE TABLE user_metrics_agg (
    date Date,
    tenant_id UInt32,
    total_events AggregateFunction(sum, UInt64),
    unique_users AggregateFunction(uniq, UInt64),
    p95_duration AggregateFunction(quantile(0.95), Float32)
)
ENGINE = AggregatingMergeTree()
ORDER BY (tenant_id, date);

-- MV: use -State functions
CREATE MATERIALIZED VIEW user_metrics_mv
TO user_metrics_agg AS
SELECT
    toDate(event_time) AS date,
    tenant_id,
    sumState(1) AS total_events,
    uniqState(user_id) AS unique_users,
    quantileState(0.95)(duration_ms) AS p95_duration
FROM events
GROUP BY date, tenant_id;

-- Query: use -Merge functions
SELECT
    date,
    tenant_id,
    sumMerge(total_events) AS events,
    uniqMerge(unique_users) AS users,
    quantileMerge(0.95)(p95_duration) AS p95
FROM user_metrics_agg
WHERE tenant_id = 1
GROUP BY date, tenant_id;

Key pattern: -State to insert, -Merge to query.

Pattern 3: Data Transformation Pipeline

Transform/enrich data as it arrives.

-- Source: raw JSON logs
CREATE TABLE raw_logs (
    timestamp DateTime,
    raw_json String
) ENGINE = MergeTree()
ORDER BY timestamp;

-- Target: parsed structured data
CREATE TABLE parsed_logs (
    timestamp DateTime,
    level LowCardinality(String),
    service LowCardinality(String),
    message String,
    trace_id String
)
ENGINE = MergeTree()
ORDER BY (service, timestamp);

-- MV: parse JSON on insert
CREATE MATERIALIZED VIEW parsed_logs_mv
TO parsed_logs AS
SELECT
    timestamp,
    JSONExtractString(raw_json, 'level') AS level,
    JSONExtractString(raw_json, 'service') AS service,
    JSONExtractString(raw_json, 'message') AS message,
    JSONExtractString(raw_json, 'trace_id') AS trace_id
FROM raw_logs;

Pattern 4: Last Value Tracking (ReplacingMergeTree)

Track latest state per entity.

-- Target: latest user state
CREATE TABLE user_latest_state (
    user_id UInt64,
    last_seen DateTime,
    last_action LowCardinality(String),
    total_actions UInt64
)
ENGINE = ReplacingMergeTree(last_seen)
ORDER BY user_id;

-- MV: update on each event
CREATE MATERIALIZED VIEW user_state_mv
TO user_latest_state AS
SELECT
    user_id,
    max(event_time) AS last_seen,
    argMax(event_type, event_time) AS last_action,
    count() AS total_actions
FROM events
GROUP BY user_id;

Query with argMax pattern (avoid FINAL on large tables):

SELECT
    user_id,
    argMax(last_seen, last_seen) AS last_seen,
    argMax(last_action, last_seen) AS last_action,
    argMax(total_actions, last_seen) AS total_actions
FROM user_latest_state
WHERE user_id = 123
GROUP BY user_id;

Backfilling Existing Data

MVs don't process existing data. Backfill manually:

-- Insert historical data into target table
INSERT INTO daily_event_counts
SELECT
    toDate(event_time) AS date,
    tenant_id,
    event_type,
    count() AS event_count,
    uniq(user_id) AS unique_users
FROM events
WHERE event_time < '2024-01-01'  -- Before MV was created
GROUP BY date, tenant_id, event_type;

MV Management

Check MV Status

-- List all MVs
SELECT name, engine, create_table_query
FROM system.tables
WHERE engine = 'MaterializedView';

-- Check target table size
SELECT
    table,
    formatReadableSize(sum(bytes_on_disk)) AS size,
    sum(rows) AS rows
FROM system.parts
WHERE active AND table = 'daily_event_counts'
GROUP BY table;

Pause/Resume MV

-- Pause (stop processing inserts)
ALTER TABLE events DETACH MATERIALIZED VIEW daily_event_counts_mv;

-- Resume
ALTER TABLE events ATTACH MATERIALIZED VIEW daily_event_counts_mv;

Modify MV

MVs cannot be altered. Drop and recreate:

DROP VIEW daily_event_counts_mv;
-- Optionally truncate target: TRUNCATE TABLE daily_event_counts;
CREATE MATERIALIZED VIEW daily_event_counts_mv TO daily_event_counts AS ...;
-- Backfill if needed

Common Pitfalls

Pitfall	Problem	Solution
Using `uniq()` with SummingMergeTree	Sums don't equal uniques	Use AggregatingMergeTree with `uniqState/uniqMerge`
Forgetting argMax or -Merge	Incomplete aggregation results	Use `argMax` pattern for Replacing/Collapsing, `-Merge` for Aggregating
No backfill after MV creation	Missing historical data	Manually INSERT historical aggregates
MV on wrong table	Inserts to wrong source ignored	Ensure MV is on the table receiving INSERTs
Too many MVs on one source	Slow inserts	Consider fewer MVs or async processing
Assuming chain is cheaper than parallel	Wrong topology choice	Chain and parallel have nearly identical resource cost — choose topology based on query granularity needs, not performance
Chained MV reads final table state	Wrong aggregation logic, missing data	Downstream MVs see the pre-aggregated block, not merged state — use `-MergeState` when chaining AggregatingMergeTree

Decision Tree

Need to aggregate data at query time?
│
├─ Yes, and data changes frequently → Query raw data
│
├─ Yes, but queries are slow → Create MV
│  │
│  ├─ Simple sums/counts only?
│  │  └─ SummingMergeTree
│  │
│  ├─ Need uniq, quantile, or complex aggregates?
│  │  └─ AggregatingMergeTree with -State/-Merge
│  │
│  └─ Need latest value per key?
│     └─ ReplacingMergeTree with argMax pattern
│
└─ Need to transform/parse data on insert?
   └─ MV with regular MergeTree target

Complete Example: Multi-Level Aggregation

Chain MVs when each level serves different query granularity (hourly → daily). The downstream MV sees the pre-aggregated block (output of the upstream MV's SELECT), not the merged table state — use -MergeState to handle partial aggregation states.

Cost note: Chain (A→B→C) vs parallel (A→B, A→C) topologies have nearly identical resource consumption. The bottleneck is total data written to target tables, not the topology. Choose based on query needs, not performance assumptions.

-- Level 1: Raw events (source)
CREATE TABLE events (...) ENGINE = MergeTree() ORDER BY ...;

-- Level 2: Hourly aggregates
CREATE TABLE hourly_stats (
    hour DateTime,
    tenant_id UInt32,
    events AggregateFunction(sum, UInt64),
    users AggregateFunction(uniq, UInt64)
) ENGINE = AggregatingMergeTree() ORDER BY (tenant_id, hour);

CREATE MATERIALIZED VIEW hourly_mv TO hourly_stats AS
SELECT
    toStartOfHour(event_time) AS hour,
    tenant_id,
    sumState(1) AS events,
    uniqState(user_id) AS users
FROM events GROUP BY hour, tenant_id;

-- Level 3: Daily aggregates (from hourly)
-- daily_mv receives the pre-aggregated block inserted into hourly_stats
-- (already grouped by hour/tenant with -State values), but NOT merged
-- with existing hourly_stats data. sumMergeState/uniqMergeState handles
-- this: merges partial -State values within the block, then re-wraps as -State.
CREATE TABLE daily_stats (
    date Date,
    tenant_id UInt32,
    events AggregateFunction(sum, UInt64),
    users AggregateFunction(uniq, UInt64)
) ENGINE = AggregatingMergeTree() ORDER BY (tenant_id, date);

CREATE MATERIALIZED VIEW daily_mv TO daily_stats AS
SELECT
    toDate(hour) AS date,
    tenant_id,
    sumMergeState(events) AS events,   -- Merge partial states from block, then re-State
    uniqMergeState(users) AS users
FROM hourly_stats GROUP BY date, tenant_id;

Query any level:

-- Fast daily query
SELECT date, sumMerge(events), uniqMerge(users)
FROM daily_stats WHERE tenant_id = 1 GROUP BY date;

-- Drill down to hourly
SELECT hour, sumMerge(events), uniqMerge(users)
FROM hourly_stats WHERE tenant_id = 1 AND toDate(hour) = today() GROUP BY hour;

Troubleshooting

Always ask for user confirmation before creating/modifying MVs or target tables.

Wrong Aggregation Results

Problem: Counts are too high, uniq values don't match raw data, aggregates seem doubled

Diagnose:

-- Compare MV result vs raw query
SELECT count() FROM target_table;
SELECT count() FROM source_table WHERE <same_filters>;

-- Check for duplicate keys in target
SELECT date, tenant_id, count() AS rows
FROM target_table
GROUP BY date, tenant_id
HAVING rows > 1;

Solutions:

Cause	Fix
Using `uniq()` with SummingMergeTree	Switch to AggregatingMergeTree with `uniqState`/`uniqMerge`
Forgetting `-Merge` in query	Always use `sumMerge()`, `uniqMerge()` for AggregatingMergeTree
Forgetting `argMax` for ReplacingMergeTree	Use argMax pattern: `SELECT key, argMax(col, version) ... GROUP BY key`
Duplicate inserts to source	Deduplicate source or use ReplacingMergeTree for target

-- BAD: uniq in SummingMergeTree (sums don't work)
ENGINE = SummingMergeTree()
-- SELECT uniq(user_id) AS users  -- Wrong!

-- GOOD: AggregatingMergeTree with State/Merge
ENGINE = AggregatingMergeTree()
-- MV: uniqState(user_id) AS users
-- Query: uniqMerge(users)

MV Not Updating / Missing Data

Problem: Target table not receiving new data, counts stuck at old values

Diagnose:

-- Check if MV is attached
SELECT name, engine FROM system.tables WHERE engine = 'MaterializedView';

-- Check target table recent data
SELECT max(date), count() FROM target_table;

-- Verify source table is receiving inserts
SELECT max(event_time), count() FROM source_table WHERE event_time > now() - INTERVAL 1 HOUR;

Solutions:

Cause	Fix
MV detached	`ALTER TABLE source ATTACH MATERIALIZED VIEW mv_name`
MV on wrong source table	Drop MV, recreate with correct source
Historical data not backfilled	Manually INSERT aggregated historical data
Inserts going to different table	Ensure app inserts to the MV's source table

-- Backfill historical data
INSERT INTO target_table
SELECT
    toDate(event_time) AS date,
    tenant_id,
    sumState(1) AS events,
    uniqState(user_id) AS users
FROM source_table
WHERE event_time < '2024-01-01'  -- Before MV existed
GROUP BY date, tenant_id;

Slow Inserts After Adding MV

Problem: INSERT performance degraded after creating MV, insert latency increased

Diagnose:

-- Check MVs on this table
SELECT name, as_select FROM system.tables
WHERE engine = 'MaterializedView' AND as_select LIKE '%source_table%';

-- Check MV query complexity
EXPLAIN SELECT ... FROM source_table ...;  -- Use MV's SELECT query

Solutions:

Cause	Fix
Too many MVs on one source	Consolidate MVs or use async inserts
Complex MV query (JOINs, heavy transforms)	Simplify MV, move complexity to query time
MV target table has wrong ORDER BY	Match target ORDER BY to MV's GROUP BY

-- Example: MV groups by (tenant_id, date)
-- Target table ORDER BY should match
CREATE TABLE target (...)
ENGINE = AggregatingMergeTree()
ORDER BY (tenant_id, date);  -- Matches GROUP BY in MV

Target Table Growing Too Large

Problem: MV target table larger than expected, not aggregating properly

Diagnose:

-- Check rows per key (should be 1 after merge for same GROUP BY)
SELECT date, tenant_id, count() AS rows
FROM target_table
GROUP BY date, tenant_id
ORDER BY rows DESC
LIMIT 10;

Solutions:

Cause	Fix
ORDER BY doesn't match GROUP BY	Recreate target with ORDER BY matching MV's GROUP BY
Background merges haven't run	Wait for automatic merge, or use argMax in query
Wrong engine (MergeTree instead of Summing/Aggregating)	Recreate with correct engine

-- ORDER BY must match GROUP BY columns for proper aggregation
-- MV: GROUP BY (tenant_id, date, event_type)
-- Target: ORDER BY (tenant_id, date, event_type)  -- Must match!