We're dogfooding the work in #10366, and @paudmir found an inconsistency. They recorded the sizes of the tables before changing the policy, set it to one day (from 30), and then kept tracking the table sizes. The sizes haven't changed as much as you'd expect.
Looking at the `omdb` output, it looks like setting the policy worked. As of this moment:
```
root@oxz_switch0:~# omdb clickhouse-admin retention-policy
note: clickhouse-admin URL not specified. Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using clickhouse-admin URL http://[fd00:1122:3344:107::4]:8888
Retention: 1 days
root@oxz_switch0:~# omdb clickhouse-admin database-usage
note: clickhouse-admin URL not specified. Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using clickhouse-admin URL http://[fd00:1122:3344:107::4]:8888
Last usage
Started: 2026-05-08T02:54:26.759Z
Completed: 2026-05-08T02:54:26.762Z
TABLE_NAME N_BYTES N_ROWS
jmcarp.measurements_cumulativei64_doubledelta 1.52 GiB 358116000
jmcarp.measurements_cumulativeu64 0 B 0
jmcarp.measurements_cumulativeu64_baseline 744.75 MiB 73433606
jmcarp.measurements_cumulativeu64_combined_delta 354.89 MiB 73433606
jmcarp.measurements_cumulativeu64_datum_delta 622.11 MiB 73433606
jmcarp.measurements_cumulativeu64_datum_doubledelta 632.18 MiB 73433606
jmcarp.measurements_cumulativeu64_datum_zstd1 665.90 MiB 73433606
jmcarp.measurements_cumulativeu64_lowcardinality 736.36 MiB 73433606
jmcarp.measurements_cumulativeu64_partitioned 0 B 0
jmcarp.measurements_cumulativeu64_start_time_dd 743.09 MiB 73433606
jmcarp.measurements_cumulativeu64_start_time_delta 742.85 MiB 73433606
jmcarp.measurements_cumulativeu64_timestamp_dd 528.46 MiB 73433606
jmcarp.measurements_cumulativeu64_timestamp_delta 477.53 MiB 73433606
jmcarp.measurements_f32_baseline 715.21 MiB 88938784
jmcarp.measurements_f32_datum_doubledelta 678.61 MiB 88938784
jmcarp.measurements_f32_datum_gorilla 659.63 MiB 88938784
jmcarp.measurements_f32_datum_zstd1 655.51 MiB 88938784
jmcarp.measurements_f32_timestamp_dd 497.66 MiB 88938784
jmcarp.measurements_f32_timestamp_delta 441.59 MiB 88938784
oximeter.fields_bool 0 B 0
oximeter.fields_i16 98.23 KiB 18672
oximeter.fields_i32 0 B 0
oximeter.fields_i64 0 B 0
oximeter.fields_i8 0 B 0
oximeter.fields_ipaddr 38.15 KiB 3212
oximeter.fields_string 1.59 MiB 207340
oximeter.fields_u16 82.62 KiB 8671
oximeter.fields_u32 687.93 KiB 69997
oximeter.fields_u64 1.14 KiB 16
oximeter.fields_u8 100.07 KiB 18686
oximeter.fields_uuid 1.49 MiB 172171
oximeter.measurements_bool 3.36 MiB 441184
oximeter.measurements_bytes 0 B 0
oximeter.measurements_cumulativef32 0 B 0
oximeter.measurements_cumulativef64 0 B 0
oximeter.measurements_cumulativei64 129.81 MiB 15919950
oximeter.measurements_cumulativeu64 2.82 GiB 291355741
oximeter.measurements_f32 3.06 GiB 387420693
oximeter.measurements_f64 56.39 KiB 3423
oximeter.measurements_histogramf32 0 B 0
oximeter.measurements_histogramf64 0 B 0
oximeter.measurements_histogrami16 0 B 0
oximeter.measurements_histogrami32 0 B 0
oximeter.measurements_histogrami64 0 B 0
oximeter.measurements_histogrami8 0 B 0
oximeter.measurements_histogramu16 0 B 0
oximeter.measurements_histogramu32 0 B 0
oximeter.measurements_histogramu64 15.42 GiB 393367432
oximeter.measurements_histogramu8 0 B 0
oximeter.measurements_i16 0 B 0
oximeter.measurements_i32 0 B 0
oximeter.measurements_i64 2.28 MiB 254439
oximeter.measurements_i8 0 B 0
oximeter.measurements_string 0 B 0
oximeter.measurements_u16 0 B 0
oximeter.measurements_u32 0 B 0
oximeter.measurements_u64 3.01 GiB 484829762
oximeter.measurements_u8 0 B 0
oximeter.timeseries_schema 5.51 KiB 159
oximeter.version 508 B 13
system.asynchronous_metric_log 178.72 MiB 159945821
system.metric_log 747.72 MiB 2499180
system.query_log 123.32 GiB 32182426
Last error
None
```
Ignore those `jmcarp` database tables. That's a separate bug.
So what gives? I tried to set the policy again, this time watching the clickhouse-admin server logs. That's the program doing the work for us. We see this:
```
01:43:40.324Z ERRO clickhouse-admin-single (ServerContext): failed to set retention policy on table
error = Telemetry database unavailable: SQL query timed out after 30.000618437s
file = oximeter/db/src/client/mod.rs:1003
id = 77aecabd-96a0-47fd-b774-b913924ed90c
table_name = oximeter.measurements_f32
01:44:10.328Z ERRO clickhouse-admin-single (ServerContext): failed to set retention policy on table
error = Telemetry database unavailable: SQL query timed out after 30.001495462s
file = oximeter/db/src/client/mod.rs:1003
id = 77aecabd-96a0-47fd-b774-b913924ed90c
table_name = oximeter.measurements_f64
```
So it's made its way up to the `measurements_f32` table, where it first failed, and most of the tables after that failed too. What does the ClickHouse log show?
The log file here is `/pool/ext/b6bdfdaf-9c0d-4b74-926c-49ff3ed05562/crypt/debug/oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759/oxide-clickhouse:default.log.1778204701`.
That shows a bunch of successful attempts to set the TTL, and then
```
2026.05.08 01:43:10.322697 [ 456 ] {1b23a14b-21dd-4a23-92dd-d31a88acbc4a} <Trace> ContextAccess (default): Access granted: ALTER TTL ON oximeter.measurements_f32
2026.05.08 01:43:10.324246 [ 456 ] {1b23a14b-21dd-4a23-92dd-d31a88acbc4a} <Information> oximeter.measurements_f32 (10e7391e-79a8-4148-8e53-054c88e8e576): Added mutation: mutation_8159251.txt
2026.05.08 01:43:10.324290 [ 456 ] {1b23a14b-21dd-4a23-92dd-d31a88acbc4a} <Information> oximeter.measurements_f32 (10e7391e-79a8-4148-8e53-054c88e8e576): Waiting mutation: mutation_8159251.txt
```
followed by a whole bunch of other work. We can see that the server created a file to represent the mutation that will apply the TTL, and is then waiting for it. When does that complete? It takes so long that we're actually into the next log file, which is `/pool/ext/b6bdfdaf-9c0d-4b74-926c-49ff3ed05562/crypt/debug/oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759/oxide-clickhouse:default.log.1778205601`. Looking for that query ID `1b23a14b-21dd-4a23-92dd-d31a88acbc4a`, we find it here:
```
2026.05.08 01:52:05.800418 [ 456 ] {1b23a14b-21dd-4a23-92dd-d31a88acbc4a} <Information> oximeter.measurements_f32 (10e7391e-79a8-4148-8e53-054c88e8e576): Mutation mutation_8159251.txt done
2026.05.08 01:52:05.800508 [ 191 ] {} <Debug> MemoryTracker: Peak memory usage background process to apply mutate/merge in table: 12.88 MiB.
2026.05.08 01:52:05.800660 [ 456 ] {1b23a14b-21dd-4a23-92dd-d31a88acbc4a} <Debug> TCPHandler: Processed in 535.447900916 sec.
```
So the query completed after about 9 minutes, which is much longer than our 30s client-side timeout. That is not great, but it could continue to work correctly. Except for this:
(from `omicron/oximeter/db/src/client/mod.rs`, lines 966 to 995 at `eb2ff2d`)

```rust
pub async fn set_retention_policy(
    &self,
    policy: RetentionPolicy,
    replicated: bool,
) -> Result<(), Error> {
    let mut handle = self.claim_connection().await?;
    let tables =
        self.list_retention_policy_tables(&mut handle, replicated).await?;
    let mut res = Ok(());
    for table in tables.iter() {
        if let Err(e) = self
            .set_one_table_retention_policy(
                &mut handle,
                &policy,
                table,
                replicated,
            )
            .await
        {
            error!(
                self.log,
                "failed to set retention policy on table";
                "table_name" => table,
                "error" => InlineErrorChain::new(&e),
            );
            res = Err(e);
        }
    }
    res
}
```
We're claiming a connection up front, and then using it for the requests that set the TTL on all tables. We're still in the clear, sort of, because the retention-policy task is still running and the server is still handling our requests and applying them. We see more successful attempts to set the TTL until this:
```
2026.05.08 01:52:05.878246 [ 456 ] {} <Trace> TCP-Session: 61d8d8a4-f583-4fa6-b85a-ba1836cf484b Creating query context from session context, user_id: 94309d50-4f52-5250-31bd-74fecac179db, parent context user: default
2026.05.08 01:52:05.878382 [ 401018 ] {} <Trace> TCPHandlerFactory: TCP Request. Address: [fd00:1122:3344:107::4]:49983
2026.05.08 01:52:05.878385 [ 456 ] {3d929e40-91d1-4eae-b537-1bd5ea23da52} <Debug> executeQuery: (from [fd00:1122:3344:107::4]:56353) ALTER TABLE oximeter.measurements_histogrami64 MODIFY TTL toDateTime(timestamp) + toIntervalDay(1) (stage: Complete)
2026.05.08 01:52:05.878496 [ 456 ] {3d929e40-91d1-4eae-b537-1bd5ea23da52} <Trace> ContextAccess (default): Access granted: ALTER TTL ON oximeter.measurements_histogrami64
2026.05.08 01:52:05.878501 [ 401018 ] {} <Debug> TCPHandler: Connected oximeter version 23.8.0, revision: 54465, database: default, user: default.
2026.05.08 01:52:05.878554 [ 401018 ] {} <Debug> TCP-Session: a6a89af0-1c24-4a00-9a57-0f53ee7c5f36 Authenticating user 'default' from [fd00:1122:3344:107::4]:49983
2026.05.08 01:52:05.878590 [ 401018 ] {} <Debug> TCP-Session: a6a89af0-1c24-4a00-9a57-0f53ee7c5f36 Authenticated with global context as user 94309d50-4f52-5250-31bd-74fecac179db
2026.05.08 01:52:05.878616 [ 401018 ] {} <Debug> TCP-Session: a6a89af0-1c24-4a00-9a57-0f53ee7c5f36 Creating session context with user_id: 94309d50-4f52-5250-31bd-74fecac179db
2026.05.08 01:52:05.880968 [ 456 ] {3d929e40-91d1-4eae-b537-1bd5ea23da52} <Information> oximeter.measurements_histogrami64 (da18660f-b6c5-4b7f-8b72-a46b261f1ecd): Added mutation: mutation_10.txt
2026.05.08 01:52:05.881044 [ 456 ] {3d929e40-91d1-4eae-b537-1bd5ea23da52} <Information> oximeter.measurements_histogrami64 (da18660f-b6c5-4b7f-8b72-a46b261f1ecd): Waiting mutation: mutation_10.txt
2026.05.08 01:52:05.881072 [ 456 ] {3d929e40-91d1-4eae-b537-1bd5ea23da52} <Information> oximeter.measurements_histogrami64 (da18660f-b6c5-4b7f-8b72-a46b261f1ecd): Mutation mutation_10.txt done
2026.05.08 01:52:05.881350 [ 456 ] {3d929e40-91d1-4eae-b537-1bd5ea23da52} <Debug> TCPHandler: Processed in 0.003204128 sec.
2026.05.08 01:52:05.881469 [ 456 ] {3d929e40-91d1-4eae-b537-1bd5ea23da52} <Error> Client: TimeoutSetter: Can't reset timeouts: Poco::Exception. Code: 1000, e.code() = 22, Invalid argument, Stack trace (when copying this message, always include the lines below):
(version 23.8.7.1)
2026.05.08 01:52:05.882354 [ 456 ] {} <Trace> TCP-Session: 61d8d8a4-f583-4fa6-b85a-ba1836cf484b Creating query context from session context, user_id: 94309d50-4f52-5250-31bd-74fecac179db, parent context user: default
2026.05.08 01:52:05.882610 [ 456 ] {6321cc8f-fa9f-44b3-bcbf-0ddb10fc64fe} <Error> TCPHandler: Code: 1000. DB::Exception: Invalid argument. (POCO_EXCEPTION), Stack trace (when copying this message, always include the lines below):
0. StackTrace::tryCapture() @ 0x000000000fd4b977 in /opt/oxide/clickhouse/clickhouse
1. DB::Exception::Exception(DB::Exception::CreateFromPocoTag, Poco::Exception const&) @ 0x000000000fd1e5a9 in /opt/oxide/clickhouse/clickhouse
2. std::__detail::_MakeUniq<DB::Exception>::__single_object std::make_unique<DB::Exception, DB::Exception::CreateFromPocoTag, Poco::Exception const&>(DB::Exception::CreateFromPocoTag&&, Poco::Exception const&) @ 0x0000000015855448 in /opt/oxide/clickhouse/clickhouse
3. DB::TCPHandler::runImpl() @ 0x0000000015843aa9 in /opt/oxide/clickhouse/clickhouse
4. DB::TCPHandler::run() @ 0x000000001585317c in /opt/oxide/clickhouse/clickhouse
5. Poco::Net::TCPServerConnection::start() @ 0x00000000191bb7ec in /opt/oxide/clickhouse/clickhouse
6. Poco::Net::TCPServerDispatcher::run() @ 0x00000000191bbcc1 in /opt/oxide/clickhouse/clickhouse
7. Poco::PooledThread::run() @ 0x0000000019220bd2 in /opt/oxide/clickhouse/clickhouse
8. Poco::ThreadImpl::runnableEntry(void*) @ 0x000000001921e685 in /opt/oxide/clickhouse/clickhouse
9. _thrp_setup @ 0x00000000001171c7 in /system/lib/amd64/libc.so.1
2026.05.08 01:52:05.882809 [ 456 ] {6321cc8f-fa9f-44b3-bcbf-0ddb10fc64fe} <Warning> TCPHandler: Client has gone away.
2026.05.08 01:52:05.882836 [ 456 ] {6321cc8f-fa9f-44b3-bcbf-0ddb10fc64fe} <Debug> TCPHandler: Processed in 0.000609382 sec.
2026.05.08 01:52:05.882858 [ 456 ] {} <Debug> TCPHandler: Done processing connection.
2026.05.08 01:52:05.882884 [ 456 ] {} <Debug> TCP-Session: 61d8d8a4-f583-4fa6-b85a-ba1836cf484b Logout, user_id: 94309d50-4f52-5250-31bd-74fecac179db
```
So that's the same TCP session, and we're now up to the table `measurements_histogrami64`. We actually do set the TTL, but it looks like the server gets an error after that trying to set a socket timeout. I'm not sure why. But after that, we see no more attempts to `ALTER TABLE` at all.
I'm really unclear on how this got wedged so badly, and why ClickHouse received only another 7 requests when we sent another 13 (up to `measurements_string`). Regardless, we clearly could have (1) handled this more gracefully, and (2) made it more obvious that it didn't actually work. As was noted during review, we're swallowing this error and only logging it, and then only reporting the retention policy from one table, even if setting all of them failed.
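One guess at the mechanism (speculation on my part, not something these logs prove): our 30s client-side timeout just stops waiting for the response, but ClickHouse finishes the mutation anyway and eventually writes its result onto the same TCP connection. Since we keep reusing that one claimed connection, the next query can end up reading the previous query's stale response, and the two ends of the connection drift out of sync, which would be consistent with both the server's socket-timeout error and the requests that apparently never arrived. A minimal sketch of the hazard, with a made-up `Connection` type standing in for the real client internals:

```rust
use tokio::time::{timeout, Duration};

// Illustrative stand-ins only; these are not the real oximeter client types.
struct Connection;

#[derive(Debug)]
enum Error {
    QueryTimeout,
}

impl Connection {
    // Pretend this writes the query to the socket and then reads the
    // server's response back off the same socket.
    async fn execute(&mut self, _sql: &str) -> Result<(), Error> {
        Ok(())
    }
}

async fn run_query(conn: &mut Connection, sql: &str) -> Result<(), Error> {
    match timeout(Duration::from_secs(30), conn.execute(sql)).await {
        Ok(res) => res,
        // Dropping the timed-out future does not cancel the server-side
        // query. ClickHouse keeps applying the mutation and later writes a
        // response that nothing ever reads. If `conn` is reused for the
        // next query, the client reads that stale response first, and the
        // two sides disagree about the protocol state from then on.
        Err(_elapsed) => Err(Error::QueryTimeout),
    }
}
```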
We should:
- claim a new connection here for each table, letting the old one recycle / reset (see the sketch after this list)
- return all the errors, rather than misleadingly claiming that it worked
- list, or at least check, the policies on each table when fetching the policy, and note that at least some of them have not been applied
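To illustrate the first two bullets, here is a minimal sketch of what `set_retention_policy` could look like. The per-table `claim_connection()` calls and the aggregate `Error::RetentionPolicy` variant are assumptions for illustration, not the existing API:

```rust
pub async fn set_retention_policy(
    &self,
    policy: RetentionPolicy,
    replicated: bool,
) -> Result<(), Error> {
    let mut handle = self.claim_connection().await?;
    let tables =
        self.list_retention_policy_tables(&mut handle, replicated).await?;
    // Return the listing connection before starting the long-running
    // ALTERs, so we never reuse a connection across queries that might
    // time out.
    drop(handle);

    let mut failed = Vec::new();
    for table in tables.iter() {
        // Claim a fresh connection per table: if this table's query times
        // out, the damage is confined to a connection we never touch again.
        let mut handle = match self.claim_connection().await {
            Ok(handle) => handle,
            Err(e) => {
                failed.push((table.clone(), e));
                continue;
            }
        };
        if let Err(e) = self
            .set_one_table_retention_policy(
                &mut handle, &policy, table, replicated,
            )
            .await
        {
            error!(
                self.log,
                "failed to set retention policy on table";
                "table_name" => table,
                "error" => InlineErrorChain::new(&e),
            );
            failed.push((table.clone(), e));
        }
    }
    if failed.is_empty() {
        Ok(())
    } else {
        // Hypothetical variant that carries every failure, so callers see
        // exactly which tables were not updated instead of a single
        // swallowed error.
        Err(Error::RetentionPolicy { failed })
    }
}
```

Claiming per-table means more pool churn, but these ALTERs can run for minutes anyway, and it confines a timed-out query's damage to a connection we never reuse.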
The good news here is that we can...try again.