Skip to content

Too big timeout specified DFATAL spew during BackfillIndex #31559

@yugabyte-ci

Description

@yugabyte-ci

Too big timeout specified DFATAL spew on every BackfillIndex RPC — proxy check (7200s) contradicts release default (24h)
Every RPC issued through the BackfillIndex client retry path unconditionally trips a LOG(DFATAL) sanity check in Proxy::PrepareCall (proxy.cc:170-172). The check rejects controller timeouts > 7200s (hardcoded), but the release default of --backfill_index_client_rpc_timeout_ms is 86400000 ms (24h), set in client.cc:239-254 — 12× the threshold. The two locations directly contradict each other: the comment at client.cc:239 explicitly acknowledges the proxy check ("debug build has 1h timeout limitation: 'Too big timeout specified'") but only lowers the default for debug builds. Release ships at the over-threshold value, and the scenario it triggers — large backfills taking more than 2h — is fully plausible in production.
Call path

YBClient::BackfillIndex
  → WaitForBackfillIndexToFinish
    → RetryUntilShutdown
      → RetryFunc
        → GetTableSchema RPC (repeated)

RpcRetrier::PrepareController (rpc.cc:284-287) sets the controller timeout to deadline - now, which stays > 7200s for the first ~22h of any backfill. LOG(DFATAL) is unthrottled, so every retry produces a line.
Impact
Hundreds of thousands of ERROR-level lines per master/tserver per backfill cycle. The spew drowns out other diagnostics in the same window and bloats .ERROR.* files. Severity is also misleading — DFATAL implies a programming error, but this is fully under YB's control given the documented defaults.

Suggested fixes (smallest → largest)

  1. Rate-limit at the source. Replace LOG(DFATAL) at proxy.cc:171 with YB_LOG_EVERY_N_SECS(DFATAL, 60). One-line hotfix; stops the spew today.
  2. Update the 7200s literal. Raise it to reflect the largest legitimate caller (e.g. 24h), or make it a flag. The literal predates the release bump of kDefaultBackfillIndexClientRpcTimeoutMs to 24h — it's just stale.
  3. Cap the per-RPC controller timeout in the BackfillIndex retry path. GetTableSchemaRpc and friends don't need the overall 24h backfill deadline — only the outer RetryFunc does. Structurally correct fix; the proxy check exists for good reasons.

(1) is the one-line hotfix. (3) is the durable answer.

Jira Link: DB-21352

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions