feature: reduce long request interruption during TSO switchover by distinguishing switchover vs failover in retry logic #1787

@King-Dylan

Background

When a microservice starts and a TSO leader transfer (switchover) is triggered via /tso/api/v1/primary/transfer, TiDB requests can be interrupted for a noticeably long time. The switchover itself is typically quick, but the kv-client retry/backoff logic treats this transient event like a longer failure, which stretches the interruption seen by upstream services.

Problem

The retry/backoff intervals in tikv/client-go are long enough that a short TSO switchover results in extended observable downtime for TiDB requests. Treating switchover as ordinary failover (with long backoff) amplifies impact unnecessarily.

Expected behavior

client-go should distinguish a TSO switchover (a short, expected leader transfer) from a failover (a longer outage). When a switchover is detected, the client should use a fast-retry policy (short intervals, bounded attempts) to reduce upstream interruption time. For all other errors, the existing backoff behavior remains unchanged.

Add detection and a fast-retry path to the client retry logic:
If an error or response from PD/TSO indicates a leader transfer (e.g., an explicit error code, a gRPC status plus metadata, or another deterministic signal), switch to a SwitchoverFastRetry mode: short sleeps (e.g., 50–200ms) and a limited number of attempts.
Preferred long-term improvement: expose an unambiguous signal from PD/TSO for leader transfer (e.g., an error code or metadata), so the client can decide on fast-retry with confidence.
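
To make the proposal concrete, here is a rough sketch of how the retry path could branch on a switchover signal. It is not existing client-go code: `ErrTSOSwitchover`, `retryPolicy`, `policyFor`, and `retry` are hypothetical names, the sentinel error stands in for whatever signal PD/TSO ends up exposing, and the default backoff values are placeholders rather than client-go's real settings. Only the 50–200ms range comes from the suggestion above.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// ErrTSOSwitchover is a hypothetical sentinel standing in for the
// (not yet defined) PD/TSO signal for a planned leader transfer.
var ErrTSOSwitchover = errors.New("tso: leader transfer in progress")

// retryPolicy describes how long to sleep between attempts and
// how many attempts are allowed in total.
type retryPolicy struct {
	base        time.Duration // sleep before the first retry
	max         time.Duration // upper bound for any single sleep
	maxAttempts int           // total attempts, including the first call
}

var (
	// Fast path for switchover: 50ms doubling up to 200ms, bounded attempts.
	switchoverFastRetry = retryPolicy{base: 50 * time.Millisecond, max: 200 * time.Millisecond, maxAttempts: 10}
	// Placeholder values for the ordinary failover path (assumed, not client-go's).
	defaultBackoff = retryPolicy{base: 500 * time.Millisecond, max: 3 * time.Second, maxAttempts: 10}
)

// policyFor selects the fast-retry policy only when the error is
// (or wraps) the switchover signal; everything else keeps the default.
func policyFor(err error) retryPolicy {
	if errors.Is(err, ErrTSOSwitchover) {
		return switchoverFastRetry
	}
	return defaultBackoff
}

// delay returns the sleep before retry number attempt (0-based):
// base doubled once per previous attempt, capped at max.
func (p retryPolicy) delay(attempt int) time.Duration {
	d := p.base
	for i := 0; i < attempt && d < p.max; i++ {
		d *= 2
	}
	if d > p.max {
		d = p.max
	}
	return d
}

// retry calls op until it succeeds, the context is cancelled, or the
// policy chosen for the latest error runs out of attempts.
func retry(ctx context.Context, op func() error) error {
	for attempt := 0; ; attempt++ {
		err := op()
		if err == nil {
			return nil
		}
		p := policyFor(err)
		if attempt+1 >= p.maxAttempts {
			return err
		}
		select {
		case <-time.After(p.delay(attempt)):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```

A caller would wrap the TSO request in `retry(ctx, func() error { ... })` and return (or wrap with `%w`) `ErrTSOSwitchover` when the leader-transfer signal is seen. The policy is chosen per error, so a failure that starts as a switchover but later turns into a different error falls back to the default backoff on the next attempt.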

Labels: contribution, first-time-contributor
