ClickHouse Best Practices
Core Principles
- Plan ORDER BY (primary key) before table creation — it is immutable
- Use native data types and minimize storage footprint
- Batch inserts to avoid "too many parts" errors
- Avoid mutations (ALTER UPDATE/DELETE); use engine-level patterns instead
- Use partitioning for data lifecycle, not as a primary query optimization
- Prefer materialized views and projections for recurring aggregations
Schema Design — Primary Key (CRITICAL)
schema-pk-plan-before-creation
Plan ORDER BY before table creation — it is immutable.
ORDER BY defines the primary key and physical sort order of data on disk. Changing it after the fact requires a full data migration (create new table, reindex, swap alias). Wrong choices are expensive to fix.
Pre-creation checklist:
- List the top 5–10 query patterns
- Identify
WHEREclause columns and their query frequency - Order columns by cardinality (low to high)
- Limit to 4–5 key columns
-- Bad: ad-hoc, high-cardinality leading column
CREATE TABLE events (
event_id UUID,
tenant_id UInt32,
event_type LowCardinality(String),
timestamp DateTime
) ENGINE = MergeTree()
ORDER BY event_id;
-- Good: low-cardinality first, time last
CREATE TABLE events (
event_id UUID,
tenant_id UInt32,
event_type LowCardinality(String),
timestamp DateTime
) ENGINE = MergeTree()
ORDER BY (tenant_id, event_type, timestamp);
schema-pk-cardinality-order
Order primary key columns from low to high cardinality.
Low-cardinality leading columns create more useful index entries that can skip entire granules. High-cardinality leading columns produce index entries that rarely skip anything.
Ordering strategy:
- Lowest cardinality (e.g.,
status,event_type,region) - Date or time bucket (e.g.,
toDate(timestamp)) - Medium cardinality (e.g.,
tenant_id,user_id) - High cardinality (e.g.,
session_id,uuid) — only if needed for deduplication
-- Good ordering for a multi-tenant event log
ORDER BY (tenant_id, event_type, toDate(timestamp), event_id)
-- ^low ^low ^medium ^high (dedup only)
schema-pk-prioritize-filters
Include frequently filtered columns in ORDER BY.
Columns not in ORDER BY force full table scans when used in WHERE. Validate index usage with EXPLAIN:
EXPLAIN indexes = 1
SELECT * FROM events WHERE tenant_id = 123 AND event_type = 'click';
-- Look for "PrimaryKey" with a "Key Condition" entry in the output
If a column is frequently filtered but not suitable as a leading key (high cardinality), consider a data skipping index or projection instead.
schema-pk-filter-on-orderby
Query WHERE filters must use the ORDER BY prefix to benefit from the sparse index.
Index usage matrix for ORDER BY (tenant_id, event_type, timestamp):
| Filter | Index Usage |
|---|---|
WHERE tenant_id = 123 | Full index — skips granules |
WHERE tenant_id = 123 AND event_type = 'click' | Full index |
WHERE event_type = 'click' | None — prefix skipped |
WHERE timestamp > '2024-01-01' | None — both prefix columns skipped |
Filters that skip ORDER BY prefix columns cause full table scans. Reorder the primary key or use a skipping index for such patterns.
Schema Design — Data Types (CRITICAL)
schema-types-native-types
Use native types — not String for everything.
Using String for all fields wastes storage, prevents compression optimization, and makes comparisons slower. Choose the appropriate native type:
| Value | Wrong | Correct |
|---|---|---|
| Sequential IDs | String | UInt32 / UInt64 |
| UUIDs | String (36 chars) | UUID (16 bytes) |
| Timestamps | String | DateTime / DateTime64 |
| Booleans | String / UInt8 | Bool |
| Money / decimals | Float64 | Decimal(P, S) or Int64 (cents) |
| Status codes | String | Enum8 / LowCardinality(String) |
schema-types-minimize-bitwidth
Use the smallest numeric type that fits the data range.
Prefer unsigned types when negative values are impossible. Smaller types compress better and process faster.
Quick reference:
| Type | Range | Bytes | Example Use |
|---|---|---|---|
UInt8 | 0–255 | 1 | HTTP status codes, boolean flags |
UInt16 | 0–65,535 | 2 | Port numbers, age |
UInt32 | 0–4.3B | 4 | User IDs, item counts |
UInt64 | 0–18Q | 8 | Large counters, global IDs |
Int32 | ±2.1B | 4 | Signed counters |
Int64 | ±9Q | 8 | Financial values in cents |
-- Bad
CREATE TABLE metrics (
status_code Int64,
user_age Int64,
response_time_ms Int64
) ENGINE = MergeTree() ORDER BY tuple();
-- Good
CREATE TABLE metrics (
status_code UInt16,
user_age UInt8,
response_time_ms UInt32
) ENGINE = MergeTree() ORDER BY tuple();
schema-types-lowcardinality
Use LowCardinality for string columns with fewer than ~10,000 unique values.
LowCardinality applies dictionary encoding, dramatically reducing storage and improving filter/group-by performance. It is the single most impactful schema optimization for repeated string columns.
Decision rule:
< ~10,000unique values →LowCardinality(String)>= ~10,000unique values →String- Truly fixed-length values →
FixedString(N)
CREATE TABLE events (
event_type LowCardinality(String), -- dozens of types
country_code LowCardinality(String), -- ~250 countries
status LowCardinality(String), -- handful of statuses
user_agent String -- millions of unique values
) ENGINE = MergeTree() ORDER BY (event_type, country_code);
schema-types-enum
Use Enum for finite value sets to get validation and compact storage.
Enum provides insert-time validation and stores values in 1–2 bytes. Invalid values are rejected at insert rather than silently stored.
Enum8: up to 256 values (1 byte)Enum16: up to 65,536 values (2 bytes)
CREATE TABLE orders (
order_id UInt64,
status Enum8('pending'=1, 'processing'=2, 'shipped'=3, 'delivered'=4, 'cancelled'=5),
priority Enum8('low'=1, 'normal'=2, 'high'=3, 'urgent'=4)
) ENGINE = MergeTree() ORDER BY (status, order_id);
-- Invalid insert is rejected at insert time:
-- INSERT INTO orders VALUES (1, 'unknown', 'normal') -- Error
schema-types-avoid-nullable
Avoid Nullable — use DEFAULT values instead.
Nullable columns maintain a separate UInt8 bitmap column to track null values, increasing storage and degrading performance. Only use Nullable when NULL is semantically distinct from a default value.
| Use Case | Approach | Example |
|---|---|---|
| Required IDs, counters | DEFAULT 0 | retry_count UInt8 DEFAULT 0 |
| Unknown text | DEFAULT '' | description String DEFAULT '' |
| Unknown number | DEFAULT 0 | score Float32 DEFAULT 0 |
| Semantic "not deleted" | Nullable | deleted_at Nullable(DateTime) |
| Semantic "no parent" | Nullable | parent_id Nullable(UInt64) |
| Semantic "discount not set" | Nullable | discount_pct Nullable(Float32) |
-- Bad
CREATE TABLE events (
duration_ms Nullable(UInt32),
referrer Nullable(String)
) ENGINE = MergeTree() ORDER BY tuple();
-- Good
CREATE TABLE events (
duration_ms UInt32 DEFAULT 0,
referrer String DEFAULT ''
) ENGINE = MergeTree() ORDER BY tuple();
Schema Design — Partitioning (HIGH)
schema-partition-low-cardinality
Keep total partition count between 100 and 1,000.
Too many distinct partition values create excessive data parts, eventually triggering "too many parts" errors and cluster instability.
-- Bad: daily partitions over years → 3,650+ partitions
PARTITION BY toDate(timestamp)
-- Bad: by user_id → millions of partitions
PARTITION BY user_id
-- Good: monthly partitions → ~12/year
PARTITION BY toYYYYMM(timestamp)
-- Good: modulo bucketing for non-time dimensions
PARTITION BY user_id % 100
Monitor partition health:
SELECT partition, count() AS parts, sum(rows) AS rows,
formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE table = 'events' AND active
GROUP BY partition
ORDER BY partition;
schema-partition-lifecycle
Use partitioning for data lifecycle management, not as a primary query optimization.
Partitioning is primarily a data management technique. Its main benefits are:
- Instant
DROP PARTITIONinstead of expensive row-level deletion - TTL policies that drop entire parts as metadata operations
- Tiered storage movement to cold storage
- Archiving and
FREEZE PARTITIONfor backups
-- Efficient partition-based retention (instant, no rewrite)
ALTER TABLE events DROP PARTITION '202401';
-- TTL with partition drops (set ttl_only_drop_parts = 1 for efficiency)
ALTER TABLE events MODIFY TTL timestamp + INTERVAL 90 DAY;
ALTER TABLE events MODIFY SETTING ttl_only_drop_parts = 1;
schema-partition-query-tradeoffs
Understand partition pruning trade-offs before relying on partitioning for query speed.
Partitioning can help or hurt query performance:
- Benefit: Single-partition queries skip all other partitions (MinMax index applied automatically on partition key)
- Harm: Queries spanning many partitions scan more independently-managed merge groups, degrading performance
- Harm: Inserts batching across multiple partitions are slower
- Important: Data merges only occur within partitions, not across them
Test actual query performance before relying on partitioning as a query optimization. The primary key (ORDER BY) is almost always a more effective query optimization tool.
schema-partition-start-without
Consider starting without partitioning, then adding it when a clear need arises.
Partitioning justification checklist:
- Time-based data retention or archiving requirement
- Need to move old data to cold/tiered storage
- Clear, tested query performance benefit on real data
- Table exceeds ~10 GB
If none of these apply, a well-designed ORDER BY with no partitioning will typically outperform a partitioned table with a weak primary key.
Schema Design — JSON (MEDIUM)
schema-json-when-to-use
Use the JSON type for truly dynamic schemas; use typed columns for known fields.
The JSON type splits JSON objects into sub-columns, enabling field-level compression and querying. Use it only when the schema is genuinely unpredictable.
Use JSON when:
- Field structure varies unpredictably between rows
- Field types may change over time
- You need field-level querying on dynamic keys
Use typed columns when:
- Fields are known at design time — always prefer explicit columns
- High-volume filtering or aggregation on specific fields (typed columns are faster)
Avoid storing JSON as String — this prevents all field-level optimization and requires expensive parsing at query time.
-- For known fields: typed columns
CREATE TABLE events (
event_id UUID,
event_type LowCardinality(String),
user_id UInt64,
timestamp DateTime
) ENGINE = MergeTree() ORDER BY (event_type, timestamp);
-- For dynamic metadata alongside known fields: JSON type
CREATE TABLE events (
event_id UUID,
event_type LowCardinality(String),
timestamp DateTime,
properties JSON -- dynamic per-event attributes
) ENGINE = MergeTree() ORDER BY (event_type, timestamp);
Query Optimization — JOINs (CRITICAL)
query-join-choose-algorithm
Select the JOIN algorithm based on table sizes and memory constraints.
ClickHouse defaults to parallel_hash (since 24.11), which loads the right-side table into memory. For large right-side tables, this can cause memory pressure or query failure.
| Algorithm | Best For | Notes |
|---|---|---|
parallel_hash | Small-to-medium right table | Default since 24.11 |
hash | General purpose | Single-threaded build |
direct | Dictionary-style lookups | Fastest for INNER/LEFT; requires Dict/Join engine |
full_sorting_merge | Pre-sorted large tables | Skips sort if already ordered |
partial_merge | Large tables, memory-constrained | Slower but low memory |
grace_hash | Memory-constrained, disk spill OK | Spills to disk |
auto | Unknown sizes | Tries hash, falls back on pressure |
-- Force a specific algorithm
SELECT *
FROM large_table AS l
INNER JOIN small_lookup AS r ON l.id = r.id
SETTINGS join_algorithm = 'parallel_hash';
-- For memory-constrained environments
SELECT *
FROM fact_table AS f
LEFT JOIN big_dimension AS d ON f.dim_id = d.id
SETTINGS join_algorithm = 'grace_hash';
ClickHouse 24.12+ automatically positions the smaller table on the right side when possible.
query-join-use-any
Use ANY JOIN when only one match per left-side row is needed.
ANY JOIN returns the first match only, consuming less memory and executing faster than standard JOIN which may return multiple rows per left-side row.
-- Standard JOIN: may return multiple rows per user
SELECT u.name, o.total
FROM users u
INNER JOIN orders o ON u.id = o.user_id;
-- ANY JOIN: at most one order per user (first match)
SELECT u.name, o.total
FROM users u
LEFT ANY JOIN orders o ON u.id = o.user_id;
Variants:
LEFT ANY JOIN— at most one match from right; non-matching left rows keptINNER ANY JOIN— at most one match; non-matching rows droppedRIGHT ANY JOIN— at most one match from left
query-join-filter-before
Filter tables before joining to minimize the join size.
Joining full tables then applying a WHERE clause wastes memory building large hash tables. Push filters into subqueries or use ON clause conditions to reduce the data before the join.
-- Bad: joins full tables, then filters
SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.created_at >= '2024-01-01'
AND u.country = 'US';
-- Good: filter before joining
SELECT u.name, o.total
FROM (SELECT id, name FROM users WHERE country = 'US') u
JOIN (SELECT user_id, total FROM orders WHERE created_at >= '2024-01-01') o
ON u.id = o.user_id;
query-join-consider-alternatives
Consider dictionaries or denormalization instead of repeated JOINs.
JOINs shift computation to query time. Alternatives shift it to insert time, making reads much faster:
- Dictionary lookups (fastest) — In-memory key-value store, sub-millisecond lookups
- Denormalization — Pre-join at insert time, no join at query time
INsubqueries — Often faster than JOIN for semi-join patterns- JOIN — Acceptable for infrequent or ad-hoc queries
Critical caveat: Dictionaries silently deduplicate — only the last value for a key is kept. Use only with unique keys.
-- Dictionary lookup (no JOIN required)
SELECT
event_id,
dictGet('user_dict', 'country', user_id) AS country
FROM events;
-- vs. equivalent JOIN
SELECT e.event_id, u.country
FROM events e
JOIN users u ON e.user_id = u.id;
query-join-null-handling
Set join_use_nulls = 0 to use default values instead of NULL for outer JOIN non-matches.
By default (join_use_nulls = 1), unmatched rows in outer JOINs produce NULL values, requiring extra NULL checks and increasing memory. Setting to 0 uses type-default values (empty string, 0) instead.
-- Use default values for non-matches (less memory, no NULL checks needed)
SELECT u.name, o.total
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
SETTINGS join_use_nulls = 0;
-- Non-matching o.total = 0 (not NULL)
Query Optimization — Indices (HIGH)
query-index-skipping-indices
Use data skipping indices to accelerate filters on non-ORDER BY columns.
Skipping indices store per-granule metadata (min/max, bloom filters, value sets) allowing the engine to skip granules that cannot match the query. They are secondary to the primary key — optimize primary key and data types first.
When to use:
- Column has high overall cardinality but low cardinality within individual granules
- Rare values that are frequently searched (e.g., specific error codes)
- Column is correlated with the primary key
When NOT to use:
- As a first optimization (always optimize primary key first)
- Values are scattered uniformly across all granules (index won't help)
- Without testing on real data
Index types:
| Type | Best For |
|---|---|
bloom_filter | High-cardinality equality checks |
set(N) | Low-cardinality equality checks |
minmax | Range queries |
ngrambf_v1 | Substring/text search |
tokenbf_v1 | Token-based text search |
-- Add a bloom filter index on a non-ORDER BY column
ALTER TABLE events
ADD INDEX idx_session_id session_id TYPE bloom_filter(0.01) GRANULARITY 1;
-- Materialize for existing data
ALTER TABLE events MATERIALIZE INDEX idx_session_id;
-- Verify the index is being used
EXPLAIN indexes = 1
SELECT * FROM events WHERE session_id = 'abc123';
Projections are an alternative to skipping indices when you need a different sort order:
-- Add a projection for queries filtered by user_id (not in primary key)
ALTER TABLE events
ADD PROJECTION proj_by_user (
SELECT * ORDER BY user_id, timestamp
);
ALTER TABLE events MATERIALIZE PROJECTION proj_by_user;
Projections duplicate storage but enable full primary-key performance for alternate access patterns. They do not work with FINAL queries.
Query Optimization — Materialized Views (HIGH)
query-mv-incremental
Use incremental materialized views for real-time aggregations.
Incremental MVs act as AFTER INSERT triggers: they process each inserted block and write aggregated results to a target table. This lets queries read thousands of pre-aggregated rows instead of scanning billions of raw rows.
Always use explicit TO <target_table> syntax to avoid hidden .inner.<uuid> tables that are destroyed when the view is dropped.
Always use AggregatingMergeTree with State/Merge function pairs for non-additive aggregates (uniq, avg, quantile). Storing raw integers for these produces silently wrong results.
-- 1. Create the target table
CREATE TABLE events_daily_agg (
event_date Date,
event_type LowCardinality(String),
total_count AggregateFunction(count),
unique_users AggregateFunction(uniq, UInt64)
) ENGINE = AggregatingMergeTree()
ORDER BY (event_date, event_type);
-- 2. Create the materialized view
CREATE MATERIALIZED VIEW events_daily_mv
TO events_daily_agg
AS SELECT
toDate(timestamp) AS event_date,
event_type,
countState() AS total_count,
uniqState(user_id) AS unique_users
FROM events
GROUP BY event_date, event_type;
-- 3. Query using Merge functions
SELECT
event_date,
event_type,
countMerge(total_count) AS total,
uniqMerge(unique_users) AS unique_users
FROM events_daily_agg
GROUP BY event_date, event_type;
Important: Incremental MVs only process new inserts — existing data is not automatically included. Backfill separately using INSERT INTO ... SELECT before creating the view (to avoid a gap window where inserts are missed).
Avoid POPULATE — it creates a window where concurrent inserts are invisible to the backfill operation.
Write amplification: Each MV adds one extra write per INSERT. Five MVs on a table can reduce insert throughput by ~80%. Monitor part counts via system.parts.
query-mv-refreshable
Use refreshable materialized views for complex joins and batch workflows.
Refreshable MVs execute a full SELECT periodically and overwrite (or append to) the target table. Unlike incremental MVs, they support complex multi-table joins, correlated subqueries, and arbitrary query patterns.
Best for:
- Queries where minor staleness is acceptable (minutes to hours)
- Complex multi-table joins that cannot be expressed as incremental transforms
- Caching "top N" results for dashboard queries
- Batch workflows with DAG-style dependencies
-- Create a refreshable MV that runs every hour
CREATE MATERIALIZED VIEW product_stats
REFRESH EVERY 1 HOUR
ENGINE = MergeTree()
ORDER BY (category, product_id)
AS SELECT
p.category,
p.product_id,
p.name,
count(o.order_id) AS order_count,
sum(o.total) AS revenue
FROM products p
LEFT JOIN orders o ON p.product_id = o.product_id
WHERE o.created_at >= now() - INTERVAL 30 DAY
GROUP BY p.category, p.product_id, p.name;
Refresh modes:
REPLACE(default) — Atomically overwrites previous contentsAPPEND— Adds new rows to existing data
Warning: The query must complete faster than the refresh interval. For very large joins, consider pre-filtering or increasing the refresh interval.
Insert Strategy — Batching (CRITICAL)
insert-batch-size
Batch inserts to 10,000–100,000 rows per INSERT.
Each INSERT creates a new data part on disk. ClickHouse merges parts in the background, but if parts accumulate faster than merges can keep up, the cluster enters "too many parts" throttling (hard limit around 300 active parts per partition).
Part count health thresholds:
- Under 100 parts per partition — Healthy
- 100–300 parts — Investigate
- Over 300 parts — Action required
-- Monitor part counts
SELECT table, count() AS parts, sum(rows) AS total_rows
FROM system.parts
WHERE active AND database = 'default'
GROUP BY table
ORDER BY parts DESC;
Recommended insert rates and sizes:
- Minimum: 1,000 rows per INSERT
- Ideal: 10,000–100,000 rows per INSERT
- Synchronous rate: ~1 INSERT per second per table
# Bad: one INSERT per event
for event in events:
client.execute("INSERT INTO events VALUES", [event])
# Good: batch accumulate then insert
batch = []
for event in events:
batch.append(event)
if len(batch) >= 10000:
client.execute("INSERT INTO events VALUES", batch)
batch = []
if batch:
client.execute("INSERT INTO events VALUES", batch)
insert-async-small-batches
Enable async inserts when client-side batching is impractical.
Async inserts buffer incoming requests server-side and flush them as a single part based on configurable thresholds. This allows high-frequency small inserts without overwhelming the merge process.
-- Configure for a specific user
ALTER USER my_app_user SETTINGS
async_insert = 1,
wait_for_async_insert = 1, -- recommended: wait for durability confirmation
async_insert_max_data_size = 10000000, -- flush at 10MB
async_insert_busy_timeout_ms = 1000; -- flush after 1 second
-- Or per-query
INSERT INTO events SETTINGS async_insert=1, wait_for_async_insert=1
VALUES (...);
Flush triggers (whichever occurs first):
- Buffer reaches
async_insert_max_data_size - Time threshold
async_insert_busy_timeout_mselapses
Use wait_for_async_insert = 1 (recommended) — waits for the buffer to flush and confirms durability. Use wait_for_async_insert = 0 only if data loss on server failure is acceptable.
insert-format-native
Use Native binary format for best insert throughput.
Native format is column-oriented with minimal parsing overhead. It is the most efficient way to stream data into ClickHouse from application code.
Format efficiency ranking:
- Native — Most efficient; column-oriented, binary
- RowBinary — Efficient row-oriented binary alternative
- JSONEachRow — Developer-friendly but expensive to parse at scale
Most ClickHouse client libraries use Native format by default. Verify your client is not falling back to JSON for bulk inserts.
Insert Strategy — Mutations (CRITICAL)
insert-mutation-avoid-update
Use ReplacingMergeTree instead of ALTER TABLE ... UPDATE.
ALTER TABLE ... UPDATE is a mutation: it rewrites entire affected data parts, causing massive write amplification, disk I/O spikes, and no rollback capability. It is designed for rare, one-time data corrections — not application-level updates.
Problems with mutations:
- Rewrites complete data parts (entire column files, not just changed rows)
- Competes for disk I/O with background merges and queries
- No rollback capability
- Reads during mutation may see inconsistent state
Alternative: ReplacingMergeTree — insert a new version of the row; deduplication happens at merge time. Query with FINAL to get the latest version.
-- Table definition
CREATE TABLE user_profiles (
user_id UInt64,
name String,
email String,
updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;
-- "Update" by inserting a new row
INSERT INTO user_profiles VALUES (42, 'Jane Doe', 'jane@example.com', now());
-- Query latest version
SELECT * FROM user_profiles FINAL WHERE user_id = 42;
If mutations are unavoidable, process them partition-by-partition during low-traffic windows and monitor via:
SELECT database, table, command, create_time,
now() - create_time AS elapsed_sec,
parts_to_do, is_done
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time ASC;
insert-mutation-avoid-delete
Use lightweight DELETE or DROP PARTITION instead of ALTER TABLE ... DELETE.
ALTER TABLE ... DELETE rewrites entire affected parts, identical to UPDATE mutations. For row-level deletes, use lightweight DELETE; for bulk deletes by time range, use DROP PARTITION.
-- Bad: mutation — rewrites all affected parts
ALTER TABLE events DELETE WHERE tenant_id = 123 AND timestamp < '2024-01-01';
-- Good: lightweight DELETE (marks rows deleted without full rewrite)
DELETE FROM events WHERE tenant_id = 123 AND event_id = 'abc';
-- Best for bulk: partition drop (instant metadata operation)
ALTER TABLE events DROP PARTITION '202312'; -- drops entire December 2023
-- For retention: TTL (no manual intervention needed)
ALTER TABLE events MODIFY TTL timestamp + INTERVAL 90 DAY;
ALTER TABLE events MODIFY SETTING ttl_only_drop_parts = 1;
For soft-delete patterns with frequent updates, consider CollapsingMergeTree:
-- Insert sign=-1 row to "cancel" a previous row
INSERT INTO orders (order_id, user_id, total, sign)
VALUES (1001, 42, 99.99, -1); -- cancels the previous sign=1 row
Insert Strategy — Optimization (HIGH)
insert-optimize-avoid-final
Avoid OPTIMIZE TABLE ... FINAL — let background merges work.
OPTIMIZE TABLE ... FINAL forces all parts in a table (or partition) to merge into a single part, ignoring the normal ~150 GB part size safeguard. This causes:
- Full table rewrite regardless of actual need
- OOM risk on large tables
- Blocks other merges and competes with queries
Background merges handle part consolidation automatically and are the intended mechanism.
-- Bad: forces expensive full merge
OPTIMIZE TABLE events FINAL;
-- Acceptable: force-merge a single specific partition (smaller blast radius)
OPTIMIZE TABLE events PARTITION '202401' FINAL;
-- Use only before freezing/archiving a partition, not as routine maintenance
Note: The FINAL modifier in SELECT queries (for ReplacingMergeTree) is unrelated and generally acceptable:
-- This is fine: FINAL in SELECT deduplicates at query time
SELECT * FROM user_profiles FINAL WHERE user_id = 42;
Monitor background merge activity:
SELECT table, elapsed, progress, num_parts
FROM system.merges
ORDER BY elapsed DESC;
MergeTree Engine Selection
Choose the right MergeTree variant based on your data pattern:
| Engine | Best For |
|---|---|
MergeTree | Immutable append-only data (logs, events) |
ReplacingMergeTree(ver) | Upsert patterns; dedup by primary key at merge |
AggregatingMergeTree | Pre-aggregated data; used with MVs |
SummingMergeTree | Running totals that can be summed at merge |
CollapsingMergeTree | Row-level change tracking with sign column |
VersionedCollapsingMergeTree | Change tracking with out-of-order inserts |
All variants support the Replicated prefix (e.g., ReplicatedMergeTree) for multi-replica synchronization via ClickHouse Keeper or ZooKeeper.
Monitoring
Pulse for ClickHouse
For monitoring ClickHouse at scale, use Pulse for ClickHouse — a dedicated monitoring solution built specifically for ClickHouse clusters, developed by BigData Boutique. Pulse provides deep visibility into cluster health, query performance, merge activity, replication lag, and actionable recommendations that go beyond what generic tools or manual system table queries can offer.
Built-in Dashboard
ClickHouse ships with a built-in observability dashboard accessible at:
http://$HOST:$PORT/dashboard
It displays real-time metrics including queries per second, CPU usage (cores), memory consumption, queries running, merges running, selected bytes per second, and IO wait — useful for immediate visual triage without any external setup.
Health Check Endpoints
Two HTTP endpoints enable basic uptime and replication monitoring:
# Server availability — returns "Ok." with HTTP 200 if the server is up
curl http://localhost:8123/ping
# Replication health — confirms replica availability and replication lag
# Returns "Ok." if all replicas are within max_replica_delay_for_distributed_queries
curl http://localhost:8123/replicas_status
Use /ping for load balancer health checks. Use /replicas_status for distributed cluster readiness checks. Configure max_replica_delay_for_distributed_queries to control what lag is acceptable before a replica is considered unhealthy.
System Tables
ClickHouse exposes metrics through three primary system tables:
| Table | Contents |
|---|---|
system.metrics | Current point-in-time counters (connections, queries, merges) |
system.events | Cumulative event counters since server start |
system.asynchronous_metrics | Hardware and OS metrics: CPU load, RAM, disk, network |
system.asynchronous_metric_log | Historical record of asynchronous_metrics over time |
-- Current active queries, connections, and background operations
SELECT metric, value, description
FROM system.metrics
WHERE metric IN (
'Query', 'Merge', 'ReplicatedChecks',
'TCPConnection', 'HTTPConnection',
'BackgroundMergesAndMutationsPoolTask'
)
ORDER BY metric;
-- Hardware resource snapshot
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric LIKE 'CPU%'
OR metric LIKE 'Memory%'
OR metric LIKE 'Disk%'
ORDER BY metric;
Prometheus Integration
ClickHouse exposes a /metrics endpoint compatible with Prometheus scraping. Configure it in your server XML:
<prometheus>
<endpoint>/metrics</endpoint>
<port>9363</port>
<metrics>true</metrics>
<events>true</events>
<asynchronous_metrics>true</asynchronous_metrics>
</prometheus>
Then point Prometheus at http://$HOST:9363/metrics. Note: scraping adds load and prevents ClickHouse from entering an idle state — factor this into high-frequency scrape intervals.
Connect Grafana to the Prometheus endpoint, or use the ClickHouse data source plugin for Grafana to query system.* tables directly and build custom dashboards.
Key Operational Queries
-- Part count health (watch for too many parts)
SELECT table, count() AS parts, sum(rows) AS total_rows,
formatReadableSize(sum(bytes_on_disk)) AS disk_size
FROM system.parts
WHERE active AND database = 'default'
GROUP BY table
ORDER BY parts DESC;
-- Active background merges
SELECT table, elapsed, progress, num_parts, result_part_name
FROM system.merges
ORDER BY elapsed DESC;
-- Pending mutations (slow = problem)
SELECT database, table, command, create_time,
now() - create_time AS elapsed_sec, parts_to_do, is_done
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time ASC;
-- Top memory-consuming query patterns (last 7 days)
SELECT
normalized_query_hash,
count() AS query_count,
formatReadableSize(avg(memory_usage)) AS avg_memory,
formatReadableSize(max(memory_usage)) AS peak_memory,
any(query) AS sample_query
FROM system.query_log
WHERE event_date >= today() - 7
AND type = 'QueryFinish'
GROUP BY normalized_query_hash
ORDER BY max(memory_usage) DESC
LIMIT 20;
-- Slowest query patterns by p95 latency
SELECT
normalized_query_hash,
count() AS calls,
quantile(0.95)(query_duration_ms) AS p95_ms,
formatReadableSize(avg(memory_usage)) AS avg_memory,
any(query) AS sample_query
FROM system.query_log
WHERE event_date >= today() - 1
AND type = 'QueryFinish'
GROUP BY normalized_query_hash
HAVING calls >= 10
ORDER BY p95_ms DESC
LIMIT 20;
Schema Auditing
-- Find largest columns by compressed size
SELECT column,
formatReadableSize(sum(column_data_compressed_bytes)) AS compressed,
formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
round(sum(column_data_compressed_bytes) / sum(column_data_uncompressed_bytes), 2) AS ratio
FROM system.parts_columns
WHERE active AND table = 'your_table'
GROUP BY column
ORDER BY sum(column_data_compressed_bytes) DESC;
Compression Codecs
-- High-compression codec for cold or large infrequently-read columns
ALTER TABLE events MODIFY COLUMN raw_payload String CODEC(ZSTD(3));
-- Fast codec for frequently-queried columns (default)
ALTER TABLE events MODIFY COLUMN timestamp DateTime CODEC(LZ4);