Description
Background
When a microservice starts and a TSO leader transfer (switchover) happens (/tso/api/v1/primary/transfer), TiDB requests can experience a noticeably long interruption. The switchover itself is typically quick, but the kv-client retry/backoff behavior treats these transient events like longer failures, which stretches the interruption seen by upstream services.
Problem
The retry/backoff intervals in tikv/client-go are long enough that a short TSO switchover results in extended observable downtime for TiDB requests. Treating switchover as ordinary failover (with long backoff) amplifies impact unnecessarily.
Expect
client-go should distinguish TSO switchover (short, expected leader transfer) from failover (longer outage). When a switchover is detected, the client should use a fast-retry policy (short intervals, bounded attempts) to reduce upstream interruption time. For other errors, existing backoff behavior remains.
Add detection and a fast-retry path in the client retry logic:
If an error/response from PD/TSO indicates a leader transfer (e.g., explicit error code, gRPC status + metadata, or other deterministic signal), switch to a SwitchoverFastRetry mode: short sleep (e.g., 50–200ms) and limited attempts.
Preferable long-term improvement: expose an unambiguous signal from PD/TSO for leader transfer (e.g., an error code or metadata). The client can rely on the signal to decide fast-retry confidently.