Description

This PR adds an optional round-robin leader transfer mechanism to op-conductor, configurable via the --raft.round-robin-leader-transfer flag.

Problem

The current leader transfer in op-conductor relies on Raft's default LeadershipTransfer(), which picks the follower with the most up-to-date log as the next leader. This selection is unaware of application-level health (op-node and the execution layer).
As a result, unhealthy nodes can keep handing leadership to each other in a loop:

  1. Node A becomes leader but is unhealthy → transfers to Node B (which has the most complete log)
  2. Node B becomes leader but is also unhealthy → transfers back to Node A (which now has the most complete log)
  3. Healthy Node C is never elected because its log is slightly behind

Solution

Introduce a deterministic round-robin leader transfer that cycles through all voters in sorted order (by ServerID):

  • Node A → Node B → Node C → Node A → ...

This ensures that even if only one node in an N-voter cluster is healthy, it will become the leader after at most N-1 transfers.
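The rotation order can be sketched in a few lines of Go. This is only an illustration of the idea, not the code in this PR: nextVoter is a hypothetical helper, and server IDs are modelled as plain strings rather than raft.ServerID values.

```go
package main

import (
	"fmt"
	"sort"
)

// nextVoter returns the voter that follows self in ServerID-sorted order,
// wrapping around at the end of the list. It reports false when there is
// no other voter to hand leadership to.
// Hypothetical helper for illustration; not the PR's implementation.
func nextVoter(voters []string, self string) (string, bool) {
	if len(voters) < 2 {
		return "", false
	}
	// Sort a copy so every node derives the same order without mutating input.
	sorted := append([]string(nil), voters...)
	sort.Strings(sorted)

	// Index of the first ID >= self; step past self if it is present.
	i := sort.SearchStrings(sorted, self)
	if i < len(sorted) && sorted[i] == self {
		i++
	}
	return sorted[i%len(sorted)], true
}

func main() {
	voters := []string{"node-c", "node-a", "node-b"}
	leader := "node-a"
	// Walk one full cycle of the rotation.
	for range voters {
		next, ok := nextVoter(voters, leader)
		if !ok {
			break
		}
		fmt.Printf("%s -> %s\n", leader, next)
		leader = next
	}
}
```

Because every node sorts the same voter set the same way, all nodes agree on the successor of any given leader without extra coordination.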

Changes

  • Added --raft.round-robin-leader-transfer flag (default: false, backward compatible)
  • Added transferLeaderRoundRobin() function that:
    • Sorts all voters by ServerID for consistent ordering across nodes
    • Transfers to the next node in sorted order
    • If transfer fails (e.g., target has stale logs), tries the next node
    • Handles "leadership transfer in progress" gracefully
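The fallback behaviour listed above (try the next node when a transfer is rejected) might look roughly like the following. This is a hedged sketch, not the transferLeaderRoundRobin() in this PR: candidatesAfter, transferRoundRobin, errStaleLog, and the transfer callback are all hypothetical stand-ins, IDs are plain strings, and the "leadership transfer in progress" case is not modelled.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// errStaleLog is a hypothetical error standing in for a rejected transfer.
var errStaleLog = errors.New("target log too stale")

// candidatesAfter returns every voter except self, in sorted order,
// starting with the voter that follows self and wrapping around.
func candidatesAfter(voters []string, self string) []string {
	sorted := append([]string(nil), voters...)
	sort.Strings(sorted)
	start := sort.SearchStrings(sorted, self)
	out := make([]string, 0, len(sorted))
	for k := 0; k < len(sorted); k++ {
		id := sorted[(start+k)%len(sorted)]
		if id != self {
			out = append(out, id)
		}
	}
	return out
}

// transferRoundRobin tries each candidate in rotation order and returns
// the first target that accepts leadership. The transfer callback is an
// assumption; a real caller would wrap the Raft library's transfer call.
func transferRoundRobin(voters []string, self string, transfer func(target string) error) (string, error) {
	lastErr := errors.New("no other voters")
	for _, target := range candidatesAfter(voters, self) {
		if err := transfer(target); err != nil {
			lastErr = fmt.Errorf("transfer to %s: %w", target, err)
			continue // rejected, so move on to the next voter
		}
		return target, nil
	}
	return "", fmt.Errorf("leadership transfer failed: %w", lastErr)
}

func main() {
	voters := []string{"node-a", "node-b", "node-c", "node-d"}
	// Simulated transfer: node-b rejects because its log is stale.
	transfer := func(target string) error {
		if target == "node-b" {
			return errStaleLog
		}
		return nil
	}
	target, err := transferRoundRobin(voters, "node-a", transfer)
	fmt.Println(target, err)
}
```

Passing the transfer step in as a callback keeps the ordering logic testable without a running Raft cluster.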

Tests

Manual testing was performed with a 4-node conductor cluster where 2 nodes were intentionally made unhealthy. With round-robin enabled, leadership eventually transferred to a healthy node. Without round-robin, leadership kept bouncing between the unhealthy nodes.

Additional context

This feature is opt-in and disabled by default to maintain backward compatibility. When disabled, the behavior is identical to the current implementation.
The round-robin approach trades Raft's "most up-to-date log" optimization for guaranteed leader rotation. In practice, this is acceptable because:

  • All voters in a healthy cluster should have similar log states
  • The primary goal of leader transfer in op-conductor is to find a healthy sequencer, not to optimize for log completeness
  • If a target node's log is too stale, Raft will reject the transfer and we try the next node

Metadata

  • Related to sequencer high-availability and failover reliability

@KyrinCode KyrinCode requested review from a team as code owners December 29, 2025 09:35
@KyrinCode KyrinCode changed the title from "Fix conductor leader loop switch by adding round-robin leadership transfer" to "feat(conductor): fix leader loop switch by adding round-robin leadership transfer" Dec 29, 2025