Too big timeout specified DFATAL spew during BackfillIndex #31559

**Too big timeout specified DFATAL spew on every BackfillIndex RPC — proxy check (7200s) contradicts release default (24h)**
Every RPC issued through the BackfillIndex client retry path unconditionally trips a LOG(DFATAL) sanity check in Proxy::PrepareCall ([proxy.cc:170-172](https://github.com/yugabyte/yugabyte-db/blob/master/src/yb/rpc/proxy.cc#L170-L172)). The check rejects controller timeouts > 7200s (hardcoded), but the release default of --backfill_index_client_rpc_timeout_ms is 86400000 ms (24h), set in [client.cc:239-254](https://github.com/yugabyte/yugabyte-db/blob/master/src/yb/client/client.cc#L239-L254) — 12× the threshold. The two locations directly contradict each other: the comment at client.cc:239 explicitly acknowledges the proxy check ("debug build has 1h timeout limitation: 'Too big timeout specified'") but only lowers the default for debug builds. Release ships at the over-threshold value, and the scenario it triggers — large backfills taking more than 2h — is fully plausible in production.
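The mismatch can be checked with a few lines of arithmetic. The sketch below uses only the two values quoted in this issue (7200 s and 86400000 ms); the constant and function names are illustrative and are not the identifiers used in proxy.cc or client.cc.

```cpp
#include <chrono>

using namespace std::chrono_literals;

// Values quoted in this issue. The names are hypothetical, chosen for illustration.
constexpr std::chrono::seconds kProxyMaxTimeout = 7200s;                   // hardcoded limit in the proxy check
constexpr std::chrono::milliseconds kReleaseBackfillTimeout = 86400000ms;  // release default of the flag

// Mirrors the shape of the sanity check: true when a timeout would trip it.
constexpr bool TripsProxyCheck(std::chrono::milliseconds timeout) {
  return timeout > kProxyMaxTimeout;
}

// How many times larger the release default is than the limit.
constexpr long long OvershootFactor() {
  return kReleaseBackfillTimeout /
         std::chrono::duration_cast<std::chrono::milliseconds>(kProxyMaxTimeout);
}

static_assert(TripsProxyCheck(kReleaseBackfillTimeout),
              "release default exceeds the proxy limit");
static_assert(OvershootFactor() == 12, "24h is 12 times 2h");
```

Both static assertions hold at compile time, which is the contradiction in miniature: the shipped default can never pass the check.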

**Call path**
```
YBClient::BackfillIndex
  → WaitForBackfillIndexToFinish
    → RetryUntilShutdown
      → RetryFunc
        → GetTableSchema RPC (repeated)
```
RpcRetrier::PrepareController ([`rpc.cc:284-287`](https://github.com/yugabyte/yugabyte-db/blob/master/src/yb/rpc/rpc.cc#L284-L287)) sets the controller timeout to deadline - now, which stays > 7200s for the first ~22h of any backfill. LOG(DFATAL) is unthrottled, so every retry produces a line.
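The ~22h figure follows from modeling the controller timeout as deadline - now. This is a rough sketch of that arithmetic only; the helper names are hypothetical, and the real retrier code may compute the value differently in detail.

```cpp
#include <chrono>

using std::chrono::seconds;

// Remaining timeout a retry would hand to the controller, modeled as
// deadline - now with deadline = start + total_timeout.
constexpr seconds RemainingTimeout(seconds total_timeout, seconds elapsed) {
  return total_timeout - elapsed;
}

// Length of the window during which every retry still exceeds `limit`.
constexpr seconds OverLimitWindow(seconds total_timeout, seconds limit) {
  return total_timeout > limit ? total_timeout - limit : seconds(0);
}

// 24h overall deadline against the 7200s limit: 79200s, i.e. 22h.
static_assert(OverLimitWindow(seconds(86400), seconds(7200)) == seconds(79200),
              "window is 22h");
```

Under this model, a retry issued one hour into the backfill still carries a 23h timeout, so it trips the check just like the first one.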

**Impact**

Hundreds of thousands of ERROR-level lines per master/tserver per backfill cycle. The spew drowns out other diagnostics in the same window and bloats .ERROR.* files. The severity is also misleading: DFATAL implies a programming error in the caller, but here the over-limit timeout comes directly from YB's own shipped default.

**Suggested fixes (smallest → largest)**

1. Rate-limit at the source. Replace LOG(DFATAL) at proxy.cc:171 with YB_LOG_EVERY_N_SECS(DFATAL, 60). One-line hotfix; stops the spew today.
2. Update the 7200s literal. Raise it to reflect the largest legitimate caller (e.g. 24h), or make it a flag. The literal predates the release bump of kDefaultBackfillIndexClientRpcTimeoutMs to 24h — it's just stale.
3. Cap the per-RPC controller timeout in the BackfillIndex retry path. GetTableSchemaRpc and friends don't need the overall 24h backfill deadline — only the outer RetryFunc does. Structurally correct fix; the proxy check exists for good reasons.
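For fix 1, the general idea of time-based throttling can be sketched as follows. This is not how YB_LOG_EVERY_N_SECS is implemented; `LogThrottle` is a hypothetical, single-threaded helper that only illustrates the effect of allowing at most one line per interval.

```cpp
#include <chrono>

// Hypothetical helper (not part of the YB codebase): permits one log line
// per `interval`. Not thread-safe; a real logging macro would need to be.
class LogThrottle {
 public:
  using Clock = std::chrono::steady_clock;

  explicit LogThrottle(Clock::duration interval) : interval_(interval) {}

  // Returns true if a line should be emitted at time `now`.
  bool ShouldLog(Clock::time_point now) {
    if (has_logged_ && now - last_ < interval_) {
      return false;  // Still inside the quiet period: suppress.
    }
    last_ = now;
    has_logged_ = true;
    return true;
  }

 private:
  Clock::duration interval_;
  Clock::time_point last_{};
  bool has_logged_ = false;
};
```

With a 60-second interval, a retry loop that fires many times per minute would emit one line per minute instead of one per retry.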

(1) is the one-line hotfix. (3) is the durable answer.
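A minimal sketch of what fix 3 could look like, assuming the retry path can clamp the timeout it hands to each RPC controller. The function name and the 60-second cap are assumptions made for illustration; neither comes from the existing code.

```cpp
#include <algorithm>
#include <chrono>

using std::chrono::milliseconds;

// Hypothetical per-RPC cap; the value is an assumption, not an existing flag.
constexpr milliseconds kPerRpcTimeoutCap{60000};

// Per-RPC controller timeout: the time left until the overall backfill
// deadline, clamped to [0, cap]. The outer retry loop keeps the full deadline.
constexpr milliseconds PerRpcTimeout(milliseconds remaining_until_deadline,
                                     milliseconds cap = kPerRpcTimeoutCap) {
  return std::max(milliseconds(0), std::min(remaining_until_deadline, cap));
}

// With 24h left, an individual RPC would still only get the cap.
static_assert(PerRpcTimeout(milliseconds(86400000)) == milliseconds(60000),
              "capped well below the 7200s proxy limit");
```

Because the clamp applies only to individual RPCs, the overall backfill can still run for the full 24h while no single controller timeout approaches the proxy limit.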

Jira Link: [DB-21352](https://yugabyte.atlassian.net/browse/DB-21352)
