Skip to content

feat: fail over to the highest-offset replica#249

Open
melancholictheory wants to merge 2 commits into
valkey-io:mainfrom
melancholictheory:feat/failover-highest-offset
Open

feat: fail over to the highest-offset replica#249
melancholictheory wants to merge 2 commits into
valkey-io:mainfrom
melancholictheory:feat/failover-highest-offset

Conversation

@melancholictheory

Copy link
Copy Markdown

Summary

When the operator proactively fails a primary over before rolling it, it currently promotes the first synced replica in discovery order, which can be the one furthest behind.

A graceful CLUSTER FAILOVER holds writes on the primary until the target replica catches up, so promoting the most caught-up replica shortens that write pause and narrows the exposure if the primary dies mid-failover. This is improvement #1 from the discussion in #231.

Features / Behaviour Changes

  • Proactive failover picks the synced replica with the highest replication offset instead of the first one in discovery order.

Implementation

  • (*NodeState).ReplicationOffset() reads slave_repl_offset from the node's INFO replication, which getNodeState already collects into node.Info, so there are no extra queries.
  • highestOffsetReplica chooses the furthest-ahead replica. Replicas with no available offset sort last and ties keep discovery order, so the result is stable and never nil for a non-empty input.

Limitations

Testing

  • Unit tests for highestOffsetReplica: greatest offset wins, a missing offset sorts last, ties keep discovery order, single replica.
  • make test and make lint pass locally.

Checklist

  • This Pull Request is related to one issue.
  • Commit message explains what changed and why
  • Tests are added or updated.
  • Documentation files are updated.
  • I have run pre-commit locally (ran make test and make lint instead)

When proactively failing a primary over before a roll, pick the synced
replica with the greatest replication offset instead of the first one in
discovery order. A graceful CLUSTER FAILOVER holds writes on the primary
until the target replica catches up, so promoting the furthest-ahead
replica minimises that write pause and the exposure if the primary dies
mid-failover.

The offset is read from slave_repl_offset in each node's INFO replication,
which the operator already collects, so no extra queries are needed.

Signed-off-by: melancholictheory <selimvhorst@gmail.com>
@greptile-apps

greptile-apps Bot commented Jun 12, 2026

Copy link
Copy Markdown

Greptile Summary

This PR improves the proactive failover logic in the Valkey cluster operator by selecting the replica with the highest replication offset instead of simply picking the first one in discovery order, which reduces the write-pause window during a graceful CLUSTER FAILOVER.

  • Adds ReplicationOffset() on NodeState that reads slave_repl_offset from the pre-collected INFO map, returning -1 when unavailable, so no extra network queries are needed.
  • Adds highestOffsetReplica() in failover.go with a stable tie-breaking (discovery order) and graceful handling of missing offsets, and wires it into proactiveFailover in place of the old replicas[0] selection.
  • Unit tests cover all key branches: greatest offset wins, missing offset sorts last, tie-breaking, single replica, and the all-missing-offset fallback.

Confidence Score: 5/5

Safe to merge — the change is narrowly scoped to how a target replica is selected, the logic is correct, and all documented edge cases are tested.

The implementation is a straightforward linear scan with a stable fallback. It reuses already-collected Info data, so there are no new network calls or concurrency concerns. The five unit tests fully cover the documented contracts. The caller contract (replicas is non-empty) is enforced before highestOffsetReplica is reached, so the replicas[0] initialisation is safe.

No files require special attention.

Important Files Changed

Filename Overview
internal/valkey/clusterstate.go Adds ReplicationOffset() method to NodeState; reads slave_repl_offset from the already-populated Info map and parses it safely with strconv.ParseInt, returning -1 on any failure.
internal/controller/failover.go Adds highestOffsetReplica() helper and replaces the replicas[0] selection in proactiveFailover with it. Logic is correct with stable discovery-order tie-breaking.
internal/controller/failover_test.go Adds TestHighestOffsetReplica with five sub-tests covering all documented behavioural contracts.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[proactiveFailover called] --> B[highestOffsetReplica]
    B --> C["best = replicas[0], bestOffset = ReplicationOffset()"]
    C --> D{More replicas?}
    D -- yes --> E["offset = replica.ReplicationOffset()"]
    E --> F{offset > bestOffset?}
    F -- yes --> G["best = replica, bestOffset = offset"]
    F -- no --> D
    G --> D
    D -- no --> H[return best replica]
    H --> I[Issue CLUSTER FAILOVER to target]
    I --> J{Poll: role == master?}
    J -- yes --> K[Failover complete]
    J -- timeout --> L[Return timeout error]
    J -- ctx done --> M[Return ctx error]
Loading

Reviews (2): Last reviewed commit: "test: cover all-offsets-unavailable fail..." | Re-trigger Greptile

Comment thread internal/controller/failover_test.go
Assert highestOffsetReplica returns the first replica in discovery order
when no replica exposes slave_repl_offset, documenting the fallback
contract and guarding the -1 sentinel against regression.

Signed-off-by: melancholictheory <selimvhorst@gmail.com>
Comment on lines +89 to 90
target := highestOffsetReplica(replicas)
log.Info("initiating proactive failover", "shard", shard.Id, "target", target.Address)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when there is no available replica?

best, bestOffset = replica, offset
}
}
return best

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some similar code also in #244, could these align to prevent duplication?

best, bestOffset = replica, offset
}
}
return best

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when offset=0 (replica is connected but is in replicating state)? We wouldn't want to failover to that replica.

In my scripts for our helm deployment, I look at this criteria to determine if the replica is healthy:

  1. status=online
  2. offset>0
  3. lag<=1

Would it be useful to have something similar here?

@melancholictheory

Copy link
Copy Markdown
Author

thanks for the review. going through the three.

on the no-available-replica case: this path doesn't reach highestOffsetReplica. findFailoverShard returns nil when there are no synced replicas, so proactiveFailover is only ever called with a non-empty slice, and a shard with no synced replica just rolls the primary without a proactive failover (same as today). i left the function with that precondition rather than a nil return, but happy to add an explicit guard plus comment if you'd rather it be defensive at the boundary.

on the overlap with #244: agree, we're both picking the highest-offset replica (promoteOrphanedReplicas does the same for the TAKEOVER path), so it makes sense to have one primitive. i'd move the selection next to ReplicationOffset in the valkey package so the proactive-failover and the recovery path share it. happy to either pull it out here and have #244 rebase onto it, or rebase onto #244 if that lands first, whichever is easier for you and @bjosv.

on offset=0: good catch, that's the real gap. a replica that's link-up but still at offset 0 hasn't taken the dataset yet, so promoting it would lose data, and the graceful failover would probably just time out waiting for it to catch up. mapping your three criteria onto what the operator already has:

  • status=online is roughly the synced filter we apply today (master_link_status:up in GetSyncedReplicas).
  • offset>0 is directly available, it's the same slave_repl_offset this PR already reads, so i can gate on it right away.
  • lag<=1 lives on the primary's INFO (the slaveN ...,lag= line), or as a byte-delta from master_repl_offset - slave_repl_offset on the replica. doable, just a bit more plumbing.

since the graceful CLUSTER FAILOVER already holds writes until the target catches up, offset>0 is the one that actually prevents data loss (don't promote an empty replica); lag mostly affects how long that write pause is, which picking the highest offset already minimises. so my inclination is offset>0 as a hard filter in the shared primitive, with lag as a follow-up if you want the full health triplet.

one question on the hard filter: if no replica clears offset>0, would you rather skip the proactive failover and just roll (letting the cluster do an unplanned failover if it needs to), or keep it best-effort? i'd lean skip-and-roll.

happy to push the offset>0 gate and the shared primitive once we settle where it should live.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data-safety enhancement New feature or request

Projects

Development

Successfully merging this pull request may close these issues.

2 participants