We're dogfooding the work in #10366, and @paudmir found an inconsistency. They recorded the sizes of the tables before changing the policy, set it to one day (from 30), and then kept tracking the table sizes. The sizes haven't changed as much as you'd expect.
Looking at the `omdb` output, it looks like setting the policy worked. As of this moment:
```
root@oxz_switch0:~# omdb clickhouse-admin retention-policy
note: clickhouse-admin URL not specified. Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using clickhouse-admin URL http://[fd00:1122:3344:107::4]:8888
Retention: 1 days
root@oxz_switch0:~# omdb clickhouse-admin database-usage
note: clickhouse-admin URL not specified. Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using clickhouse-admin URL http://[fd00:1122:3344:107::4]:8888
Last usage
Started: 2026-05-08T02:54:26.759Z
Completed: 2026-05-08T02:54:26.762Z
TABLE_NAME N_BYTES N_ROWS
jmcarp.measurements_cumulativei64_doubledelta 1.52 GiB 358116000
jmcarp.measurements_cumulativeu64 0 B 0
jmcarp.measurements_cumulativeu64_baseline 744.75 MiB 73433606
jmcarp.measurements_cumulativeu64_combined_delta 354.89 MiB 73433606
jmcarp.measurements_cumulativeu64_datum_delta 622.11 MiB 73433606
jmcarp.measurements_cumulativeu64_datum_doubledelta 632.18 MiB 73433606
jmcarp.measurements_cumulativeu64_datum_zstd1 665.90 MiB 73433606
jmcarp.measurements_cumulativeu64_lowcardinality 736.36 MiB 73433606
jmcarp.measurements_cumulativeu64_partitioned 0 B 0
jmcarp.measurements_cumulativeu64_start_time_dd 743.09 MiB 73433606
jmcarp.measurements_cumulativeu64_start_time_delta 742.85 MiB 73433606
jmcarp.measurements_cumulativeu64_timestamp_dd 528.46 MiB 73433606
jmcarp.measurements_cumulativeu64_timestamp_delta 477.53 MiB 73433606
jmcarp.measurements_f32_baseline 715.21 MiB 88938784
jmcarp.measurements_f32_datum_doubledelta 678.61 MiB 88938784
jmcarp.measurements_f32_datum_gorilla 659.63 MiB 88938784
jmcarp.measurements_f32_datum_zstd1 655.51 MiB 88938784
jmcarp.measurements_f32_timestamp_dd 497.66 MiB 88938784
jmcarp.measurements_f32_timestamp_delta 441.59 MiB 88938784
oximeter.fields_bool 0 B 0
oximeter.fields_i16 98.23 KiB 18672
oximeter.fields_i32 0 B 0
oximeter.fields_i64 0 B 0
oximeter.fields_i8 0 B 0
oximeter.fields_ipaddr 38.15 KiB 3212
oximeter.fields_string 1.59 MiB 207340
oximeter.fields_u16 82.62 KiB 8671
oximeter.fields_u32 687.93 KiB 69997
oximeter.fields_u64 1.14 KiB 16
oximeter.fields_u8 100.07 KiB 18686
oximeter.fields_uuid 1.49 MiB 172171
oximeter.measurements_bool 3.36 MiB 441184
oximeter.measurements_bytes 0 B 0
oximeter.measurements_cumulativef32 0 B 0
oximeter.measurements_cumulativef64 0 B 0
oximeter.measurements_cumulativei64 129.81 MiB 15919950
oximeter.measurements_cumulativeu64 2.82 GiB 291355741
oximeter.measurements_f32 3.06 GiB 387420693
oximeter.measurements_f64 56.39 KiB 3423
oximeter.measurements_histogramf32 0 B 0
oximeter.measurements_histogramf64 0 B 0
oximeter.measurements_histogrami16 0 B 0
oximeter.measurements_histogrami32 0 B 0
oximeter.measurements_histogrami64 0 B 0
oximeter.measurements_histogrami8 0 B 0
oximeter.measurements_histogramu16 0 B 0
oximeter.measurements_histogramu32 0 B 0
oximeter.measurements_histogramu64 15.42 GiB 393367432
oximeter.measurements_histogramu8 0 B 0
oximeter.measurements_i16 0 B 0
oximeter.measurements_i32 0 B 0
oximeter.measurements_i64 2.28 MiB 254439
oximeter.measurements_i8 0 B 0
oximeter.measurements_string 0 B 0
oximeter.measurements_u16 0 B 0
oximeter.measurements_u32 0 B 0
oximeter.measurements_u64 3.01 GiB 484829762
oximeter.measurements_u8 0 B 0
oximeter.timeseries_schema 5.51 KiB 159
oximeter.version 508 B 13
system.asynchronous_metric_log 178.72 MiB 159945821
system.metric_log 747.72 MiB 2499180
system.query_log 123.32 GiB 32182426
Last error
None
```
Ignore those `jmcarp` database tables. That's a separate bug.
So what gives? I tried to set the policy again, this time watching the clickhouse-admin server logs. That's the program doing the work for us. We see this:
```
01:43:40.324Z ERRO clickhouse-admin-single (ServerContext): failed to set retention policy on table
error = Telemetry database unavailable: SQL query timed out after 30.000618437s
file = oximeter/db/src/client/mod.rs:1003
id = 77aecabd-96a0-47fd-b774-b913924ed90c
table_name = oximeter.measurements_f32
01:44:10.328Z ERRO clickhouse-admin-single (ServerContext): failed to set retention policy on table
error = Telemetry database unavailable: SQL query timed out after 30.001495462s
file = oximeter/db/src/client/mod.rs:1003
id = 77aecabd-96a0-47fd-b774-b913924ed90c
table_name = oximeter.measurements_f64
```
So it's made its way up to the `measurements_f32` table, where it first failed, and most of the tables after that failed too. What does the ClickHouse log show?
The log file here is `/pool/ext/b6bdfdaf-9c0d-4b74-926c-49ff3ed05562/crypt/debug/oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759/oxide-clickhouse:default.log.1778204701`.
That shows a bunch of successful attempts to set the TTL, and then
```
2026.05.08 01:43:10.322697 [ 456 ] {1b23a14b-21dd-4a23-92dd-d31a88acbc4a} <Trace> ContextAccess (default): Access granted: ALTER TTL ON oximeter.measurements_f32
2026.05.08 01:43:10.324246 [ 456 ] {1b23a14b-21dd-4a23-92dd-d31a88acbc4a} <Information> oximeter.measurements_f32 (10e7391e-79a8-4148-8e53-054c88e8e576): Added mutation: mutation_8159251.txt
2026.05.08 01:43:10.324290 [ 456 ] {1b23a14b-21dd-4a23-92dd-d31a88acbc4a} <Information> oximeter.measurements_f32 (10e7391e-79a8-4148-8e53-054c88e8e576): Waiting mutation: mutation_8159251.txt
```
followed by a whole bunch of other work. We can see that the server created a file to represent the mutation that will apply the TTL, and is then waiting for it. When does that complete? It takes so long that we're actually into the next log file, which is `/pool/ext/b6bdfdaf-9c0d-4b74-926c-49ff3ed05562/crypt/debug/oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759/oxide-clickhouse:default.log.1778205601`. Looking for that query ID `1b23a14b-21dd-4a23-92dd-d31a88acbc4a`, we find it here:
```
2026.05.08 01:52:05.800418 [ 456 ] {1b23a14b-21dd-4a23-92dd-d31a88acbc4a} <Information> oximeter.measurements_f32 (10e7391e-79a8-4148-8e53-054c88e8e576): Mutation mutation_8159251.txt done
2026.05.08 01:52:05.800508 [ 191 ] {} <Debug> MemoryTracker: Peak memory usage background process to apply mutate/merge in table: 12.88 MiB.
2026.05.08 01:52:05.800660 [ 456 ] {1b23a14b-21dd-4a23-92dd-d31a88acbc4a} <Debug> TCPHandler: Processed in 535.447900916 sec.
```
So the query completed after about 9 minutes, which is much longer than our 30s client-side timeout. That is not great, but it could continue to work correctly. Except for this:
(from `omicron/oximeter/db/src/client/mod.rs`, lines 966 to 995 at `eb2ff2d`)

```rust
pub async fn set_retention_policy(
    &self,
    policy: RetentionPolicy,
    replicated: bool,
) -> Result<(), Error> {
    let mut handle = self.claim_connection().await?;
    let tables =
        self.list_retention_policy_tables(&mut handle, replicated).await?;
    let mut res = Ok(());
    for table in tables.iter() {
        if let Err(e) = self
            .set_one_table_retention_policy(
                &mut handle,
                &policy,
                table,
                replicated,
            )
            .await
        {
            error!(
                self.log,
                "failed to set retention policy on table";
                "table_name" => table,
                "error" => InlineErrorChain::new(&e),
            );
            res = Err(e);
        }
    }
    res
}
```
We're claiming a connection up front, and then using it for the requests that set the TTL on all tables. We're still in the clear, sort of, because the retention-policy task is still running and the server is still handling our requests and applying them. We see more successful attempts to set the TTL until this:
```
2026.05.08 01:52:05.878246 [ 456 ] {} <Trace> TCP-Session: 61d8d8a4-f583-4fa6-b85a-ba1836cf484b Creating query context from session context, user_id: 94309d50-4f52-5250-31bd-74fecac179db, parent context user: default
2026.05.08 01:52:05.878382 [ 401018 ] {} <Trace> TCPHandlerFactory: TCP Request. Address: [fd00:1122:3344:107::4]:49983
2026.05.08 01:52:05.878385 [ 456 ] {3d929e40-91d1-4eae-b537-1bd5ea23da52} <Debug> executeQuery: (from [fd00:1122:3344:107::4]:56353) ALTER TABLE oximeter.measurements_histogrami64 MODIFY TTL toDateTime(timestamp) + toIntervalDay(1) (stage: Complete)
2026.05.08 01:52:05.878496 [ 456 ] {3d929e40-91d1-4eae-b537-1bd5ea23da52} <Trace> ContextAccess (default): Access granted: ALTER TTL ON oximeter.measurements_histogrami64
2026.05.08 01:52:05.878501 [ 401018 ] {} <Debug> TCPHandler: Connected oximeter version 23.8.0, revision: 54465, database: default, user: default.
2026.05.08 01:52:05.878554 [ 401018 ] {} <Debug> TCP-Session: a6a89af0-1c24-4a00-9a57-0f53ee7c5f36 Authenticating user 'default' from [fd00:1122:3344:107::4]:49983
2026.05.08 01:52:05.878590 [ 401018 ] {} <Debug> TCP-Session: a6a89af0-1c24-4a00-9a57-0f53ee7c5f36 Authenticated with global context as user 94309d50-4f52-5250-31bd-74fecac179db
2026.05.08 01:52:05.878616 [ 401018 ] {} <Debug> TCP-Session: a6a89af0-1c24-4a00-9a57-0f53ee7c5f36 Creating session context with user_id: 94309d50-4f52-5250-31bd-74fecac179db
2026.05.08 01:52:05.880968 [ 456 ] {3d929e40-91d1-4eae-b537-1bd5ea23da52} <Information> oximeter.measurements_histogrami64 (da18660f-b6c5-4b7f-8b72-a46b261f1ecd): Added mutation: mutation_10.txt
2026.05.08 01:52:05.881044 [ 456 ] {3d929e40-91d1-4eae-b537-1bd5ea23da52} <Information> oximeter.measurements_histogrami64 (da18660f-b6c5-4b7f-8b72-a46b261f1ecd): Waiting mutation: mutation_10.txt
2026.05.08 01:52:05.881072 [ 456 ] {3d929e40-91d1-4eae-b537-1bd5ea23da52} <Information> oximeter.measurements_histogrami64 (da18660f-b6c5-4b7f-8b72-a46b261f1ecd): Mutation mutation_10.txt done
2026.05.08 01:52:05.881350 [ 456 ] {3d929e40-91d1-4eae-b537-1bd5ea23da52} <Debug> TCPHandler: Processed in 0.003204128 sec.
2026.05.08 01:52:05.881469 [ 456 ] {3d929e40-91d1-4eae-b537-1bd5ea23da52} <Error> Client: TimeoutSetter: Can't reset timeouts: Poco::Exception. Code: 1000, e.code() = 22, Invalid argument, Stack trace (when copying this message, always include the lines below):
(version 23.8.7.1)
2026.05.08 01:52:05.882354 [ 456 ] {} <Trace> TCP-Session: 61d8d8a4-f583-4fa6-b85a-ba1836cf484b Creating query context from session context, user_id: 94309d50-4f52-5250-31bd-74fecac179db, parent context user: default
2026.05.08 01:52:05.882610 [ 456 ] {6321cc8f-fa9f-44b3-bcbf-0ddb10fc64fe} <Error> TCPHandler: Code: 1000. DB::Exception: Invalid argument. (POCO_EXCEPTION), Stack trace (when copying this message, always include the lines below):
0. StackTrace::tryCapture() @ 0x000000000fd4b977 in /opt/oxide/clickhouse/clickhouse
1. DB::Exception::Exception(DB::Exception::CreateFromPocoTag, Poco::Exception const&) @ 0x000000000fd1e5a9 in /opt/oxide/clickhouse/clickhouse
2. std::__detail::_MakeUniq<DB::Exception>::__single_object std::make_unique<DB::Exception, DB::Exception::CreateFromPocoTag, Poco::Exception const&>(DB::Exception::CreateFromPocoTag&&, Poco::Exception const&) @ 0x0000000015855448 in /opt/oxide/clickhouse/clickhouse
3. DB::TCPHandler::runImpl() @ 0x0000000015843aa9 in /opt/oxide/clickhouse/clickhouse
4. DB::TCPHandler::run() @ 0x000000001585317c in /opt/oxide/clickhouse/clickhouse
5. Poco::Net::TCPServerConnection::start() @ 0x00000000191bb7ec in /opt/oxide/clickhouse/clickhouse
6. Poco::Net::TCPServerDispatcher::run() @ 0x00000000191bbcc1 in /opt/oxide/clickhouse/clickhouse
7. Poco::PooledThread::run() @ 0x0000000019220bd2 in /opt/oxide/clickhouse/clickhouse
8. Poco::ThreadImpl::runnableEntry(void*) @ 0x000000001921e685 in /opt/oxide/clickhouse/clickhouse
9. _thrp_setup @ 0x00000000001171c7 in /system/lib/amd64/libc.so.1
2026.05.08 01:52:05.882809 [ 456 ] {6321cc8f-fa9f-44b3-bcbf-0ddb10fc64fe} <Warning> TCPHandler: Client has gone away.
2026.05.08 01:52:05.882836 [ 456 ] {6321cc8f-fa9f-44b3-bcbf-0ddb10fc64fe} <Debug> TCPHandler: Processed in 0.000609382 sec.
2026.05.08 01:52:05.882858 [ 456 ] {} <Debug> TCPHandler: Done processing connection.
2026.05.08 01:52:05.882884 [ 456 ] {} <Debug> TCP-Session: 61d8d8a4-f583-4fa6-b85a-ba1836cf484b Logout, user_id: 94309d50-4f52-5250-31bd-74fecac179db
```
So that's the same TCP session, and we're now up to the table `measurements_histogrami64`. We actually do set the TTL, but it looks like the server gets an error after that trying to set a socket timeout. I'm not sure why. But after that, we see no more attempts to `ALTER TABLE` at all.
I'm really unclear on how this got wedged so badly, and why ClickHouse received only another 7 requests when we sent another 13 (up to `measurements_string`). Regardless, we clearly could have (1) handled this more gracefully, and (2) made it more obvious that it didn't actually work. As was noted during review, we're swallowing this error and only logging it, and then only reporting the retention policy from one table, even if setting all of them failed.
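One guess at the mechanism (speculation on my part, not something these logs prove): our 30s client-side timeout just stops waiting for the response, but ClickHouse finishes the mutation anyway and eventually writes its result onto the same TCP connection. Since we keep reusing that one claimed connection, the next query can end up reading the previous query's stale response, and the two ends of the connection drift out of sync, which would be consistent with both the server's socket-timeout error and the requests that apparently never arrived. A minimal sketch of the hazard, with a made-up `Connection` type standing in for the real client internals:

```rust
use tokio::time::{timeout, Duration};

// Illustrative stand-ins only; these are not the real oximeter client types.
struct Connection;

#[derive(Debug)]
enum Error {
    QueryTimeout,
}

impl Connection {
    // Pretend this writes the query to the socket and then reads the
    // server's response back off the same socket.
    async fn execute(&mut self, _sql: &str) -> Result<(), Error> {
        Ok(())
    }
}

async fn run_query(conn: &mut Connection, sql: &str) -> Result<(), Error> {
    match timeout(Duration::from_secs(30), conn.execute(sql)).await {
        Ok(res) => res,
        // Dropping the timed-out future does not cancel the server-side
        // query. ClickHouse keeps applying the mutation and later writes a
        // response that nothing ever reads. If `conn` is reused for the
        // next query, the client reads that stale response first, and the
        // two sides disagree about the protocol state from then on.
        Err(_elapsed) => Err(Error::QueryTimeout),
    }
}
```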
We should:
- claim a new connection here for each table, letting the old one recycle / reset (see the sketch after this list)
- return all the errors, rather than misleadingly claiming that it worked
- list, or at least check, the policies on each table when fetching the policy, and note that at least some of them have not been applied
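To illustrate the first two bullets, here is a minimal sketch of what `set_retention_policy` could look like. The per-table `claim_connection()` calls and the aggregate `Error::RetentionPolicy` variant are assumptions for illustration, not the existing API:

```rust
pub async fn set_retention_policy(
    &self,
    policy: RetentionPolicy,
    replicated: bool,
) -> Result<(), Error> {
    let mut handle = self.claim_connection().await?;
    let tables =
        self.list_retention_policy_tables(&mut handle, replicated).await?;
    // Return the listing connection before starting the long-running
    // ALTERs, so we never reuse a connection across queries that might
    // time out.
    drop(handle);

    let mut failed = Vec::new();
    for table in tables.iter() {
        // Claim a fresh connection per table: if this table's query times
        // out, the damage is confined to a connection we never touch again.
        let mut handle = match self.claim_connection().await {
            Ok(handle) => handle,
            Err(e) => {
                failed.push((table.clone(), e));
                continue;
            }
        };
        if let Err(e) = self
            .set_one_table_retention_policy(
                &mut handle, &policy, table, replicated,
            )
            .await
        {
            error!(
                self.log,
                "failed to set retention policy on table";
                "table_name" => table,
                "error" => InlineErrorChain::new(&e),
            );
            failed.push((table.clone(), e));
        }
    }
    if failed.is_empty() {
        Ok(())
    } else {
        // Hypothetical variant that carries every failure, so callers see
        // exactly which tables were not updated instead of a single
        // swallowed error.
        Err(Error::RetentionPolicy { failed })
    }
}
```

Claiming per-table means more pool churn, but these ALTERs can run for minutes anyway, and it confines a timed-out query's damage to a connection we never reuse.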
The good news here is that we can...try again.