Description
Problem
The ClickHouse Operator collects metrics from ClickHouse nodes including per-CPU system metrics like:
metric.OSGuestTimeCPU{N}
metric.OSIOWaitTimeCPU{N}
metric.OSUserTimeCPU{N}
metric.CPUFrequencyMHz_{N}
etc.
Because each of these metrics is emitted once per core, machines with high core counts generate an enormous number of metrics that provide limited value.
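To make the naming pattern concrete, here is a minimal Python sketch of how per-CPU metrics could be recognized by name. It assumes the names look exactly as listed above, with `{N}` replaced by a core index. The functions `is_per_cpu_metric` and `drop_per_cpu` are hypothetical helpers for illustration, not part of the operator.

```python
import re

# Assumption: per-CPU metric names follow the patterns listed above, i.e.
# "metric.OS<Something>CPU<N>" or "metric.CPUFrequencyMHz_<N>".
PER_CPU_PATTERN = re.compile(
    r"^metric\.(?:OS[A-Za-z]+CPU\d+|CPUFrequencyMHz_\d+)$"
)


def is_per_cpu_metric(name: str) -> bool:
    """Return True if the metric name carries a per-core index suffix."""
    return PER_CPU_PATTERN.match(name) is not None


def drop_per_cpu(names: list[str]) -> list[str]:
    """Keep only the metrics that are not per-CPU."""
    return [n for n in names if not is_per_cpu_metric(n)]


if __name__ == "__main__":
    sample = [
        "metric.OSUserTimeCPU0",
        "metric.OSIOWaitTimeCPU379",
        "metric.CPUFrequencyMHz_12",
        "metric.OSUserTime",
        "metric.Query",
    ]
    print(drop_per_cpu(sample))
```

A name-based filter like this would leave host-wide aggregates such as `metric.OSUserTime` untouched, because they have no trailing core index.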
Our Setup
15 shards × 2 replicas = 30 ClickHouse nodes
~380 CPU cores per node
Each node exports ~5,000 CPU metrics
Total: ~150,000 CPU metrics across the cluster
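The arithmetic behind these figures can be checked with a short Python sketch. All inputs are the approximate values quoted above, and the per-core count is only a derived estimate.

```python
# Approximate figures taken from the setup described above.
shards = 15
replicas = 2
cores_per_node = 380
cpu_metrics_per_node = 5_000

nodes = shards * replicas                             # 30 nodes
total_cpu_metrics = nodes * cpu_metrics_per_node      # 150,000 across the cluster
metrics_per_core = cpu_metrics_per_node / cores_per_node  # roughly 13 per core

if __name__ == "__main__":
    print(f"nodes: {nodes}")
    print(f"total CPU metrics: {total_cpu_metrics}")
    print(f"per-core metric kinds (estimate): {metrics_per_core:.1f}")
```

Under these numbers, each core contributes roughly 13 distinct metrics, so the total scales linearly with core count.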
Impact
/metrics endpoint response time: ~8 seconds
/metrics response size: ~40 MB
~95% of all metrics are these per-CPU metrics
Why These Metrics Are Less Useful
As the ClickHouse documentation states:
"This is a system-wide metric, it includes all the processes on the host machine, not just clickhouse-server."
These metrics reflect the entire host, not ClickHouse specifically, making them less actionable for ClickHouse monitoring.
