Diagnostics
Run all queries from checks.sql (cluster-wide) one by one and produce a decision-ready report.
Triage Order (Mandatory)
-
Check current active merges from system.merges .
-
Check merge success/failure trend from system.part_log .
-
Check table-level status (merge_ok , merge_failed , last success/failure).
-
Check merge reason + algorithm (merge_reason , merge_algorithm ).
-
Check merge RAM now + historical peak RAM (system.merges.memory_usage , system.part_log.peak_memory_usage ).
-
Check part-count offenders with split:
-
database = 'system'
-
database != 'system'
-
Check relevant settings, including TTL merge concurrency.
Decision Rules
Use one of these final verdicts explicitly:
-
PROVED : cluster-wide merge stop (no successful merges in the selected window).
-
DECLINED : cluster-wide stop is false.
-
PARTIAL : merges are blocked for specific table(s) while others still merge.
Additional rules:
-
If some tables still have successful merges in same timeframe, do not report global merge stop.
-
If a target table has merge_ok = 0 with repeated MEMORY_LIMIT_EXCEEDED , report table-level block.
-
If merges are 100% Horizontal , state that planner selected horizontal merges (do not say vertical is disabled unless settings prove it).
-
If max part count is driven by system.* tables, call out alert-source mismatch to avoid misattribution to business tables.
Problem-Specific Investigation
"Too Many Parts" Investigation
For a specific table, run ad-hoc checks (time-bound and limited):
select toStartOfMinute(event_time) as minute, countIf(event_type = 'NewPart') as new_parts, countIf(event_type = 'MergeParts') as merges, countIf(event_type = 'MergeParts') - countIf(event_type = 'NewPart') as net_reduction from system.part_log where database = '{database}' and table = '{table}' and event_time > now() - interval 1 hour group by minute order by minute desc limit 60
If net_reduction is negative consistently, inserts outpace merges.
TTL merge pressure and merge-size settings snapshot
Check system.merge_tree_settings if modified Suggest changing (reducing or increasing) in case of a problem as remediation.
-
max_parts_to_merge_at_once
-
max_bytes_to_merge_at_max_space_in_pool
-
max_bytes_to_merge_at_min_space_in_pool
-
enable_vertical_merge_algorithm
-
vertical_merge_algorithm_min_rows_to_activate
-
vertical_merge_algorithm_min_columns_to_activate
-
max_number_of_merges_with_ttl_in_pool
-
max_replicated_merges_with_ttl_in_queue
-
parts_to_delay_insert
-
parts_to_throw_insert
Structural Fix Guidance (When Settings Are Not Enough)
Call out anti-patterns explicitly:
-
Single hot partition (partition_id='all' )
-
Heavy TTL ... GROUP BY ... SET ... on a hot ingestion table
-
Persistent large horizontal merge attempts with OOM failures
Recommended long-term direction:
-
Add time-based partitioning
-
Move heavy rollup logic from TTL path to MV/batch table
-
Keep base-table TTL simple (delete-oriented)
Ad-Hoc Query Guidelines
Required Safeguards
-- Always include LIMIT limit 100
-- Always time-bound historical queries where event_time >= now() - interval 24 hour
-- For part_log, always filter event_type where event_type in ('NewPart', 'MergeParts', 'MutatePart')
Avoid
-
select * from system.part_log
-
Unbounded scans on *_log tables
-
Large joins in-context (aggregate in SQL)
Cross-Module Triggers
Finding Load Module Reason
High memory during merges / OOM altinity-expert-clickhouse-memory
Memory limits and pressure
Slow merges + normal disk altinity-expert-clickhouse-schema
ORDER BY/partitioning anti-patterns
Slow merges + high disk IO altinity-expert-clickhouse-storage
Storage bottleneck
Merges blocked by mutations altinity-expert-clickhouse-mutations
Mutation backlog
Replication lag + merge issues altinity-expert-clickhouse-replication
Queue/replica bottlenecks
Final Report Sections (Mandatory)
-
Environment header
-
Global vs table-specific merge status
-
Current and peak merge RAM
-
Merge reason and merge algorithm findings
-
Max-part offenders (system vs non-system )
-
Current settings and recommended deltas
-
Immediate mitigation and structural remediation