Too big timeout specified DFATAL spew on every BackfillIndex RPC — proxy check (7200s) contradicts release default (24h)
Every RPC issued through the BackfillIndex client retry path unconditionally trips a LOG(DFATAL) sanity check in Proxy::PrepareCall (proxy.cc:170-172). The check rejects controller timeouts > 7200s (hardcoded), but the release default of --backfill_index_client_rpc_timeout_ms is 86400000 ms (24h), set in client.cc:239-254 — 12× the threshold. The two locations directly contradict each other: the comment at client.cc:239 explicitly acknowledges the proxy check ("debug build has 1h timeout limitation: 'Too big timeout specified'") but only lowers the default for debug builds. Release ships at the over-threshold value, and the scenario it triggers — large backfills taking more than 2h — is fully plausible in production.
Call path
YBClient::BackfillIndex
→ WaitForBackfillIndexToFinish
→ RetryUntilShutdown
→ RetryFunc
→ GetTableSchema RPC (repeated)
RpcRetrier::PrepareController (rpc.cc:284-287) sets the controller timeout to deadline - now, which stays > 7200s for the first ~22h of any backfill. LOG(DFATAL) is unthrottled, so every retry produces a line.
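The ~22h figure can be checked with a small sketch. It assumes the deadline is simply the start time plus the 24h release default and that each retry passes the full remaining time to the controller. The constant and function names are hypothetical, not identifiers from the YB source.

```cpp
#include <cassert>
#include <chrono>

using std::chrono::seconds;

// Hypothetical constants mirroring the numbers quoted in this report.
constexpr seconds kProxyMaxTimeout{7200};      // hardcoded limit in the proxy check
constexpr seconds kBackfillRpcTimeout{86400};  // release default, 86400000 ms

// Timeout handed to the controller once `elapsed` has passed since the
// backfill started (assumption: deadline = start + kBackfillRpcTimeout).
inline seconds RemainingTimeout(seconds elapsed) {
  assert(elapsed >= seconds{0});
  return kBackfillRpcTimeout - elapsed;
}

// True while a retry issued at `elapsed` would exceed the proxy limit.
inline bool TripsProxyCheck(seconds elapsed) {
  return RemainingTimeout(elapsed) > kProxyMaxTimeout;
}

// Length of the window, measured from the start of the backfill, during
// which every retry exceeds the limit: 86400s - 7200s = 79200s (22h).
inline seconds SpewWindow() {
  return kBackfillRpcTimeout - kProxyMaxTimeout;
}
```

Under these assumptions, a retry at any point in the first 79200s carries a timeout above 7200s, so the number of log lines scales with the retry rate over that window.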
Impact
Hundreds of thousands of ERROR-level lines per master/tserver per backfill cycle. The spew drowns out other diagnostics in the same window and bloats .ERROR.* files. Severity is also misleading: DFATAL implies a programming error, but here the condition follows directly from YB's own documented release default and is fully under YB's control.
Suggested fixes (smallest → largest)
1. Rate-limit at the source. Replace LOG(DFATAL) at proxy.cc:171 with YB_LOG_EVERY_N_SECS(DFATAL, 60). One-line hotfix; stops the spew today.
2. Update the 7200s literal. Raise it to reflect the largest legitimate caller (e.g. 24h), or make it a flag. The literal predates the release bump of kDefaultBackfillIndexClientRpcTimeoutMs to 24h, so it is simply stale.
3. Cap the per-RPC controller timeout in the BackfillIndex retry path. GetTableSchemaRpc and friends don't need the overall 24h backfill deadline; only the outer RetryFunc does. Structurally correct fix; the proxy check exists for good reasons.
(1) is the one-line hotfix. (3) is the durable answer.
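A minimal sketch of the two fixes called out above, written against std::chrono only so it stands alone. The 60s cap, the PerRpcTimeout helper, and the LogThrottle class are hypothetical illustrations, not existing YB code. In the real tree the throttling would come from YB_LOG_EVERY_N_SECS, and the cap would have to be applied wherever the retry path sets the controller timeout.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;
using std::chrono::seconds;

// Hypothetical per-RPC cap, chosen well below the 7200s proxy limit.
constexpr seconds kPerRpcTimeoutCap{60};

// Durable fix: each RPC gets min(time left until the overall deadline, cap).
// The overall 24h deadline still bounds the outer retry loop.
inline Clock::duration PerRpcTimeout(Clock::time_point deadline,
                                     Clock::time_point now) {
  const Clock::duration remaining = deadline - now;
  if (remaining <= Clock::duration::zero()) {
    return Clock::duration::zero();  // deadline already passed
  }
  return std::min<Clock::duration>(remaining, kPerRpcTimeoutCap);
}

// Hotfix idea: allow at most one log line per interval.
// Not thread-safe; a real implementation would need synchronization.
class LogThrottle {
 public:
  explicit LogThrottle(Clock::duration interval) : interval_(interval) {}

  // Returns true if a line may be emitted at `now`, and records the emission.
  bool ShouldLog(Clock::time_point now) {
    if (has_logged_ && now - last_ < interval_) {
      return false;
    }
    last_ = now;
    has_logged_ = true;
    return true;
  }

 private:
  Clock::duration interval_;
  Clock::time_point last_{};
  bool has_logged_ = false;
};
```

With the cap in place the controller timeout never approaches the proxy limit, so the check keeps its value as a guard against genuinely oversized timeouts. The throttle only reduces log volume and leaves the over-threshold timeout unchanged.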
Jira Link: DB-21352