Commit 6083d8e

docs(clickhouse): add telemetry cold-tier and sharding setup guide
1 parent 4bd33b7 commit 6083d8e

2 files changed

Lines changed: 328 additions & 0 deletions

HelmChart/Docs/Clickhouse.md

Lines changed: 2 additions & 0 deletions
@@ -1,5 +1,7 @@
 ### Clickhouse Ops
 
+For telemetry cold-tier and app-level sharding setup, see [ClickhouseTelemetryColdTierAndSharding.md](./ClickhouseTelemetryColdTierAndSharding.md).
+
 To access clickhouse use port forwarding in kubernetes
 
 ```
HelmChart/Docs/ClickhouseTelemetryColdTierAndSharding.md

Lines changed: 326 additions & 0 deletions
@@ -0,0 +1,326 @@
# ClickHouse Telemetry Cold Tier and Sharding

This document explains how to enable telemetry cold-tier storage and app-level sharding in self-hosted OneUptime.

## What this feature does

### Cold tier
When cold tier is enabled, telemetry tables can:
- keep recent data on the local ClickHouse disk
- move older parts to an object-storage-backed ClickHouse disk such as `s3_cold`
- delete data later using the existing retention-based delete path

### Sharding
When telemetry sharding is enabled, OneUptime:
- keeps the existing local telemetry table names as the source of schema, TTL, and mutations
- creates `...Distributed` wrapper tables for telemetry reads and writes
- routes telemetry reads and writes through those distributed wrappers

This makes a multi-shard ClickHouse topology meaningful at the app layer, because telemetry reads and inserts are spread across the shards.

---

## Prerequisites

Use this only if all of the following are true:

1. You run ClickHouse under the Altinity operator path.
2. Your ClickHouse deployment exposes:
   - a cluster name
   - replicated local tables
   - a Keeper / ZooKeeper-compatible coordinator
3. For cold tier, your ClickHouse deployment also exposes:
   - a disk such as `s3_cold`
   - a storage policy such as `tiered`
4. You can inject environment variables into the OneUptime **app** and **worker** containers.

> Recommended assumption: use this for new topology rollouts or fresh telemetry datasets. Historical telemetry backfill into a new shard layout is a separate task.

---

## 1. Enable operator-managed ClickHouse

Cold tier and meaningful sharding both assume the operator-managed ClickHouse path, not the legacy single-StatefulSet deployment.

Example:

```yaml
clickhouseOperator:
  altinity:
    enabled: true
    cluster:
      shardsCount: 2
      replicasCount: 2
    keeper:
      enabled: true
      replicas: 3
```

Notes:
- `shardsCount` greater than 1 enables horizontal distribution.
- `replicasCount` enables replicated local tables.
- `keeper` is required when you use replicated local tables.

If you only want cold tier without multi-shard routing, keep:

```yaml
cluster:
  shardsCount: 1
  replicasCount: 2
```

---

## 2. Add the cold-tier ClickHouse disk and storage policy

You must provide the ClickHouse disk and storage policy yourself; the app only consumes them.

Example `files` entry for the Altinity operator values:

```yaml
clickhouseOperator:
  altinity:
    files:
      config.d/storage-s3-cold.xml: |
        <clickhouse>
          <storage_configuration>
            <disks>
              <s3_cold>
                <type>s3</type>
                <endpoint>https://s3.ap-northeast-2.amazonaws.com/your-bucket/oneuptime/</endpoint>
                <use_environment_credentials>1</use_environment_credentials>
                <metadata_path>/var/lib/clickhouse/disks/s3_cold/</metadata_path>
              </s3_cold>
            </disks>
            <policies>
              <tiered>
                <volumes>
                  <default>
                    <disk>default</disk>
                  </default>
                  <s3_cold>
                    <disk>s3_cold</disk>
                  </s3_cold>
                </volumes>
              </tiered>
            </policies>
          </storage_configuration>
        </clickhouse>
```

If you use IAM roles / workload identity instead of static S3 credentials, configure the ClickHouse pod template or service account so that the ClickHouse pods can read and write the backing bucket.

---

## 3. Inject the OneUptime environment variables

These variables must be present in both the **app** and **worker** runtimes.

### Cold tier

```bash
CLICKHOUSE_COLD_TIER_ENABLED=true
CLICKHOUSE_COLD_TIER_STORAGE_POLICY=tiered
CLICKHOUSE_COLD_TIER_VOLUME=s3_cold
CLICKHOUSE_COLD_TIER_METRICS_DAYS=7
CLICKHOUSE_COLD_TIER_LOGS_DAYS=7
CLICKHOUSE_COLD_TIER_TRACES_DAYS=3
```

Meaning:
- `CLICKHOUSE_COLD_TIER_ENABLED`: turns cold-tier DDL and reconcile on
- `CLICKHOUSE_COLD_TIER_STORAGE_POLICY`: the `storage_policy` applied to the local tables
- `CLICKHOUSE_COLD_TIER_VOLUME`: the volume used in `TO VOLUME ...`
- `CLICKHOUSE_COLD_TIER_*_DAYS`: age in days after which parts move to the cold volume, set per signal (metrics, logs, traces)

### Sharding

```bash
CLICKHOUSE_TELEMETRY_SHARDING_ENABLED=true
CLICKHOUSE_CLUSTER_NAME=ou
```

Meaning:
- `CLICKHOUSE_TELEMETRY_SHARDING_ENABLED`: turns on distributed-table routing for telemetry
- `CLICKHOUSE_CLUSTER_NAME`: must match the ClickHouse cluster name exposed in your `remote_servers` / operator topology

> The public Helm chart in this repository does not automatically expose these env vars yet. Inject them using your own deployment overlay, Helm customization, or platform-specific env wiring.

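
Because the chart does not wire these variables for you, a typo or a missing value is easy to overlook until boot. The following is a minimal preflight sketch, not part of OneUptime: the function names are hypothetical, and the rule that each `*_DAYS` value must be a positive whole number is an assumption about what the app accepts.

```bash
# Hypothetical preflight helpers; not part of OneUptime.

# Report a variable whose value is empty.
need_value() {
  if [ -z "$2" ]; then
    echo "missing: $1"
    return 1
  fi
}

# Report a variable whose value is not made of digits only, or is 0.
# Assumption: the app expects a positive whole number of days.
need_days() {
  case "$2" in
    '' | *[!0-9]* | 0)
      echo "not a positive integer: $1"
      return 1
      ;;
  esac
}

# Print one line per problem; return non-zero if any problem was found.
check_telemetry_env() {
  rc=0
  if [ "${CLICKHOUSE_COLD_TIER_ENABLED:-false}" = "true" ]; then
    need_value CLICKHOUSE_COLD_TIER_STORAGE_POLICY "${CLICKHOUSE_COLD_TIER_STORAGE_POLICY:-}" || rc=1
    need_value CLICKHOUSE_COLD_TIER_VOLUME "${CLICKHOUSE_COLD_TIER_VOLUME:-}" || rc=1
    need_days CLICKHOUSE_COLD_TIER_METRICS_DAYS "${CLICKHOUSE_COLD_TIER_METRICS_DAYS:-}" || rc=1
    need_days CLICKHOUSE_COLD_TIER_LOGS_DAYS "${CLICKHOUSE_COLD_TIER_LOGS_DAYS:-}" || rc=1
    need_days CLICKHOUSE_COLD_TIER_TRACES_DAYS "${CLICKHOUSE_COLD_TIER_TRACES_DAYS:-}" || rc=1
  fi
  if [ "${CLICKHOUSE_TELEMETRY_SHARDING_ENABLED:-false}" = "true" ]; then
    need_value CLICKHOUSE_CLUSTER_NAME "${CLICKHOUSE_CLUSTER_NAME:-}" || rc=1
  fi
  return "$rc"
}
```

Calling `check_telemetry_env` in the same environment as the app or worker container prints one line per problem and returns a non-zero status if any were found, so it can gate a rollout step.
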
---

## 4. Rollout behavior

### Cold tier behavior
With cold tier enabled:
- new local telemetry tables are created with:
  - `storage_policy = 'tiered'`
  - TTL clauses containing `TO VOLUME 's3_cold'`
- existing local telemetry tables are reconciled at boot under the shared migration advisory lock

That reconcile updates:
- table settings via `MODIFY SETTING storage_policy = ...`
- table TTL via `MODIFY TTL ...`

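
As a rough illustration, the reconcile for one table corresponds to statements of the following shape. This is a sketch, not the statements OneUptime actually issues: `print_cold_tier_alters` is a hypothetical helper, and the TTL expression passed in (for example a `time` column) is an assumption, because this guide does not name the timestamp columns.

```bash
# Hypothetical helper: prints ALTER statements of the kind the reconcile
# step is described as running. It only prints; it does not connect to
# ClickHouse.
print_cold_tier_alters() {
  table="$1"    # for example oneuptime.MetricItemV3
  ts_expr="$2"  # assumed expression that evaluates to Date or DateTime
  days="$3"     # for example the value of CLICKHOUSE_COLD_TIER_METRICS_DAYS
  policy="${CLICKHOUSE_COLD_TIER_STORAGE_POLICY:-tiered}"
  volume="${CLICKHOUSE_COLD_TIER_VOLUME:-s3_cold}"

  # Switch the table to the tiered storage policy.
  printf "ALTER TABLE %s MODIFY SETTING storage_policy = '%s';\n" "$table" "$policy"
  # Move parts to the cold volume once they are older than the threshold.
  printf "ALTER TABLE %s MODIFY TTL %s + INTERVAL %s DAY TO VOLUME '%s';\n" \
    "$table" "$ts_expr" "$days" "$volume"
}
```

For example, `print_cold_tier_alters oneuptime.MetricItemV3 time 7` prints the two statements for the metrics table. Note that `MODIFY TTL` sets the whole table TTL expression, so a real reconcile would also have to carry over any delete clause the table already has.
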
### Sharding behavior
With sharding enabled:
- local telemetry tables remain the source of:
  - schema
  - TTL
  - mutations
  - reconcile
- OneUptime creates and uses `...Distributed` wrappers for telemetry reads and writes

That means:
- reads and inserts fan out across shards
- `ALTER TABLE`, TTL, and mutation ownership stay on the local tables

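
To make the wrapper idea concrete, the sketch below prints DDL for a `Distributed` table that points at a local table. It is an illustration under assumptions, not the DDL OneUptime generates: `print_distributed_ddl` is a hypothetical helper, and `rand()` as the sharding key is an assumption, since this guide does not state which key the app uses.

```bash
# Hypothetical helper: prints DDL for a Distributed wrapper over a local table.
# It only prints; it does not connect to ClickHouse.
print_distributed_ddl() {
  cluster="$1"      # must match CLICKHOUSE_CLUSTER_NAME, for example ou
  database="$2"     # for example oneuptime
  local_table="$3"  # for example MetricItemV3

  # The wrapper copies its column list from the local table (AS ...) and
  # stores no data itself; the engine forwards reads and inserts to the
  # local table on each shard of the cluster.
  printf "CREATE TABLE IF NOT EXISTS %s.%sDistributed ON CLUSTER '%s'\n" \
    "$database" "$local_table" "$cluster"
  printf "AS %s.%s\n" "$database" "$local_table"
  # Assumption: rand() as the sharding key spreads inserts evenly.
  printf "ENGINE = Distributed('%s', '%s', '%s', rand());\n" \
    "$cluster" "$database" "$local_table"
}
```

`print_distributed_ddl ou oneuptime MetricItemV3` prints a statement whose engine arguments name the cluster, the database, and the original local table, which is the shape to look for when verifying the wrappers.
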
---

## 5. Recommended enablement order

### Cold tier only
1. enable operator-managed ClickHouse
2. expose `s3_cold` disk and `tiered` policy
3. inject `CLICKHOUSE_COLD_TIER_*` env vars
4. deploy
5. verify

### Sharding + cold tier together
1. enable operator-managed ClickHouse
2. set `shardsCount > 1`
3. keep `replicasCount >= 1`
4. expose `s3_cold` disk and `tiered` policy
5. inject both:
   - `CLICKHOUSE_COLD_TIER_*`
   - `CLICKHOUSE_TELEMETRY_SHARDING_ENABLED=true`
   - `CLICKHOUSE_CLUSTER_NAME=<cluster>`
6. deploy
7. verify

---

## 6. Verification

After deployment, verify the storage capability first.

### Check storage policy

```sql
SELECT *
FROM system.storage_policies
WHERE policy_name = 'tiered';
```

### Check disk

```sql
SELECT *
FROM system.disks
WHERE name = 's3_cold';
```

### Check a local telemetry table

```sql
SHOW CREATE TABLE oneuptime.MetricItemV3;
SHOW CREATE TABLE oneuptime.LogItemV3;
SHOW CREATE TABLE oneuptime.SpanItemV3;
```

Expect to see:
- a replicated local engine when sharding / replication is enabled
- a TTL containing `TO VOLUME 's3_cold'`
- `storage_policy = 'tiered'`

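
The three expectations above can also be checked mechanically. The sketch below is a hypothetical helper, not part of OneUptime: it reads `SHOW CREATE TABLE` output on stdin and reports which expected fragments are absent. It assumes the engine name contains `Replicated` and that the policy and volume names come from the `CLICKHOUSE_COLD_TIER_*` variables, defaulting to `tiered` and `s3_cold`.

```bash
# Hypothetical helper: reads a CREATE TABLE statement on stdin and prints
# one line per expected fragment that is not present.
check_local_ddl() {
  ddl="$(cat)"
  policy="${CLICKHOUSE_COLD_TIER_STORAGE_POLICY:-tiered}"
  volume="${CLICKHOUSE_COLD_TIER_VOLUME:-s3_cold}"
  rc=0
  for fragment in \
    "Replicated" \
    "TO VOLUME '${volume}'" \
    "storage_policy = '${policy}'"
  do
    # Plain substring match; the quoted pattern is taken literally.
    case "$ddl" in
      *"$fragment"*) ;;
      *)
        echo "missing: $fragment"
        rc=1
        ;;
    esac
  done
  return "$rc"
}
```

One way to feed it is `clickhouse-client --query "SHOW CREATE TABLE oneuptime.MetricItemV3 FORMAT TSVRaw" | check_local_ddl`. The `TSVRaw` format matters here because the default tab-separated output escapes single quotes, which would make the quoted fragments fail to match.
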
### Check the distributed wrapper

```sql
SHOW CREATE TABLE oneuptime.MetricItemV3Distributed;
SHOW CREATE TABLE oneuptime.LogItemV3Distributed;
SHOW CREATE TABLE oneuptime.SpanItemV3Distributed;
```

Expect to see:
- `ENGINE = Distributed(...)`
- the configured cluster name
- the original local table name as the target

### Check part placement after TTL materialization

```sql
SELECT database, table, disk_name, path
FROM system.parts
WHERE database = 'oneuptime'
  AND table = 'MetricItemV3'
  AND active;
```

Once data is older than the configured `*_DAYS` threshold, its parts should move to `s3_cold`, and `disk_name` should report `s3_cold` for them.

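
When a table has many parts, a per-disk summary is easier to read than the full part list. The sketch below prints such a query; `print_parts_by_disk_query` is a hypothetical helper, and the aggregation is a suggestion built on the same `system.parts` table rather than something OneUptime ships.

```bash
# Hypothetical helper: prints a query that summarizes active parts per disk.
# It only prints; pipe the output to clickhouse-client to run it.
print_parts_by_disk_query() {
  database="$1"  # for example oneuptime
  table="$2"     # for example MetricItemV3

  printf "SELECT disk_name, count() AS parts, formatReadableSize(sum(bytes_on_disk)) AS size\n"
  printf "FROM system.parts\n"
  printf "WHERE database = '%s'\n" "$database"
  printf "  AND table = '%s'\n" "$table"
  printf "  AND active\n"
  printf "GROUP BY disk_name\n"
  printf "ORDER BY disk_name;\n"
}
```

For example, `print_parts_by_disk_query oneuptime MetricItemV3 | clickhouse-client` runs it against the node you are connected to. `system.parts` only describes that node, so on a multi-shard cluster the check has to be repeated per node.
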
---

## 7. Important limitations

This feature does **not** include:
- historical telemetry backfill into a new shard topology
- migration of old local telemetry datasets into a newly sharded layout
- automatic chart values for the new app env vars in the public Helm chart

Recommended production assumption:
- treat sharded telemetry rollout as a fresh topology cutover
- handle historical backfill separately if needed

---

## 8. Minimal example

```yaml
clickhouseOperator:
  altinity:
    enabled: true
    cluster:
      shardsCount: 2
      replicasCount: 2
    keeper:
      enabled: true
      replicas: 3
    files:
      config.d/storage-s3-cold.xml: |
        <clickhouse>
          <storage_configuration>
            <disks>
              <s3_cold>
                <type>s3</type>
                <endpoint>https://s3.ap-northeast-2.amazonaws.com/your-bucket/oneuptime/</endpoint>
                <use_environment_credentials>1</use_environment_credentials>
                <metadata_path>/var/lib/clickhouse/disks/s3_cold/</metadata_path>
              </s3_cold>
            </disks>
            <policies>
              <tiered>
                <volumes>
                  <default>
                    <disk>default</disk>
                  </default>
                  <s3_cold>
                    <disk>s3_cold</disk>
                  </s3_cold>
                </volumes>
              </tiered>
            </policies>
          </storage_configuration>
        </clickhouse>
```

Inject these env vars into both app and worker:

```bash
CLICKHOUSE_COLD_TIER_ENABLED=true
CLICKHOUSE_COLD_TIER_STORAGE_POLICY=tiered
CLICKHOUSE_COLD_TIER_VOLUME=s3_cold
CLICKHOUSE_COLD_TIER_METRICS_DAYS=7
CLICKHOUSE_COLD_TIER_LOGS_DAYS=7
CLICKHOUSE_COLD_TIER_TRACES_DAYS=3
CLICKHOUSE_TELEMETRY_SHARDING_ENABLED=true
CLICKHOUSE_CLUSTER_NAME=ou
```
