feat: fail over to the highest-offset replica by melancholictheory · Pull Request #249 · valkey-io/valkey-operator

melancholictheory · 2026-06-12T11:38:03Z

Summary

When the operator proactively fails a primary over before rolling it, it currently promotes the first synced replica in discovery order, which can be the one furthest behind.

A graceful CLUSTER FAILOVER holds writes on the primary until the target replica catches up, so promoting the most caught-up replica shortens that write pause and narrows the exposure if the primary dies mid-failover. This is improvement #1 from the discussion in #231.

Features / Behaviour Changes

Proactive failover picks the synced replica with the highest replication offset instead of the first one in discovery order.

Implementation

(*NodeState).ReplicationOffset() reads slave_repl_offset from the node's INFO replication, which getNodeState already collects into node.Info, so there are no extra queries.
highestOffsetReplica chooses the furthest-ahead replica. Replicas with no available offset sort last and ties keep discovery order, so the result is stable and never nil for a non-empty input.
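
For readers following along, the two pieces described above can be sketched roughly as follows. This is a minimal illustration rather than the PR's actual diff: the `NodeState` struct is pared down to the fields the sketch needs, and the `map[string]string` type for `Info` is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// NodeState is a pared-down stand-in for the operator's type. Only the
// fields this sketch needs are modelled, and the field types are assumed.
type NodeState struct {
	Address string
	Info    map[string]string // parsed INFO output, keyed by field name
}

// ReplicationOffset returns slave_repl_offset from the already-collected
// INFO data, or -1 when the field is missing or not a valid integer.
func (n *NodeState) ReplicationOffset() int64 {
	raw, ok := n.Info["slave_repl_offset"]
	if !ok {
		return -1
	}
	offset, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return -1
	}
	return offset
}

// highestOffsetReplica returns the replica with the greatest offset.
// The strict ">" keeps the earliest replica on ties, and the -1 sentinel
// makes a replica without an offset lose to any replica that reports one.
// Precondition: replicas is non-empty.
func highestOffsetReplica(replicas []*NodeState) *NodeState {
	best := replicas[0]
	bestOffset := best.ReplicationOffset()
	for _, replica := range replicas[1:] {
		if offset := replica.ReplicationOffset(); offset > bestOffset {
			best, bestOffset = replica, offset
		}
	}
	return best
}

func main() {
	replicas := []*NodeState{
		{Address: "10.0.0.1:6379", Info: map[string]string{}},
		{Address: "10.0.0.2:6379", Info: map[string]string{"slave_repl_offset": "1200"}},
		{Address: "10.0.0.3:6379", Info: map[string]string{"slave_repl_offset": "1200"}},
	}
	fmt.Println(highestOffsetReplica(replicas).Address) // prints 10.0.0.2:6379
}
```

In the example, the first replica has no offset and sorts last, while the second and third tie, so the earlier one in discovery order wins.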

Limitations

It does not yet assert that the chosen replica is on the latest ValkeyNode spec (improvement #2 in #231). The sequential roll already promotes replicas before the primary, so in practice the target is on the new spec, but an explicit check would turn that ordering invariant into a hard guarantee. Happy to follow up with it.

Testing

Unit tests for highestOffsetReplica: greatest offset wins, a missing offset sorts last, ties keep discovery order, single replica.
make test and make lint pass locally.

Checklist

This Pull Request is related to one issue.
Commit message explains what changed and why
Tests are added or updated.
Documentation files are updated.
I have run pre-commit locally (ran make test and make lint instead)

When proactively failing a primary over before a roll, pick the synced replica with the greatest replication offset instead of the first one in discovery order. A graceful CLUSTER FAILOVER holds writes on the primary until the target replica catches up, so promoting the furthest-ahead replica minimises that write pause and the exposure if the primary dies mid-failover. The offset is read from slave_repl_offset in each node's INFO replication, which the operator already collects, so no extra queries are needed.

Signed-off-by: melancholictheory <selimvhorst@gmail.com>

greptile-apps · 2026-06-12T11:40:36Z

Greptile Summary

This PR improves the proactive failover logic in the Valkey cluster operator by selecting the replica with the highest replication offset instead of simply picking the first one in discovery order, which reduces the write-pause window during a graceful CLUSTER FAILOVER.

Adds ReplicationOffset() on NodeState that reads slave_repl_offset from the pre-collected INFO map, returning -1 when unavailable, so no extra network queries are needed.
Adds highestOffsetReplica() in failover.go with a stable tie-breaking (discovery order) and graceful handling of missing offsets, and wires it into proactiveFailover in place of the old replicas[0] selection.
Unit tests cover all key branches: greatest offset wins, missing offset sorts last, tie-breaking, single replica, and the all-missing-offset fallback.

Confidence Score: 5/5

Safe to merge — the change is narrowly scoped to how a target replica is selected, the logic is correct, and all documented edge cases are tested.

The implementation is a straightforward linear scan with a stable fallback. It reuses already-collected Info data, so there are no new network calls or concurrency concerns. The five unit tests fully cover the documented contracts. The caller contract (replicas is non-empty) is enforced before highestOffsetReplica is reached, so the replicas[0] initialisation is safe.

No files require special attention.

Important Files Changed

Filename	Overview
internal/valkey/clusterstate.go	Adds ReplicationOffset() method to NodeState; reads slave_repl_offset from the already-populated Info map and parses it safely with strconv.ParseInt, returning -1 on any failure.
internal/controller/failover.go	Adds highestOffsetReplica() helper and replaces the replicas[0] selection in proactiveFailover with it. Logic is correct with stable discovery-order tie-breaking.
internal/controller/failover_test.go	Adds TestHighestOffsetReplica with five sub-tests covering all documented behavioural contracts.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[proactiveFailover called] --> B[highestOffsetReplica]
    B --> C["best = replicas[0], bestOffset = ReplicationOffset()"]
    C --> D{More replicas?}
    D -- yes --> E["offset = replica.ReplicationOffset()"]
    E --> F{offset > bestOffset?}
    F -- yes --> G["best = replica, bestOffset = offset"]
    F -- no --> D
    G --> D
    D -- no --> H[return best replica]
    H --> I[Issue CLUSTER FAILOVER to target]
    I --> J{Poll: role == master?}
    J -- yes --> K[Failover complete]
    J -- timeout --> L[Return timeout error]
    J -- ctx done --> M[Return ctx error]

Reviews (2): Last reviewed commit: "test: cover all-offsets-unavailable fail..."

Assert highestOffsetReplica returns the first replica in discovery order when no replica exposes slave_repl_offset, documenting the fallback contract and guarding the -1 sentinel against regression.

Signed-off-by: melancholictheory <selimvhorst@gmail.com>

jdheyburn · 2026-06-12T11:49:37Z

+	target := highestOffsetReplica(replicas)
 	log.Info("initiating proactive failover", "shard", shard.Id, "target", target.Address)


What happens when there is no available replica?

jdheyburn · 2026-06-12T11:49:50Z

+			best, bestOffset = replica, offset
+		}
+	}
+	return best


There is some similar code also in #244, could these align to prevent duplication?

jdheyburn · 2026-06-12T11:52:43Z

+			best, bestOffset = replica, offset
+		}
+	}
+	return best


What happens when offset=0 (the replica is connected but is still in the replicating state)? We wouldn't want to fail over to that replica.

In my scripts for our Helm deployment, I look at these criteria to determine whether the replica is healthy:

status=online

offset>0

lag<=1

Would it be useful to have something similar here?
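
To make the three criteria concrete, here is a minimal Go sketch that evaluates them against one `slaveN` entry from a primary's INFO replication output (the `ip=...,port=...,state=...,offset=...,lag=...` value). The type and function names are hypothetical, and how the operator would obtain the line is not shown.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// replicaHealth holds the fields of one slaveN entry that matter for
// the health check. Offset and Lag are -1 when the entry omits them.
type replicaHealth struct {
	State  string
	Offset int64
	Lag    int64
}

// parseReplicaLine parses the value part of a slaveN entry, for example
// "ip=10.0.0.2,port=6379,state=online,offset=1200,lag=0".
// Unknown keys are ignored so extra fields do not break parsing.
func parseReplicaLine(value string) (replicaHealth, error) {
	h := replicaHealth{Offset: -1, Lag: -1}
	for _, pair := range strings.Split(value, ",") {
		key, val, ok := strings.Cut(pair, "=")
		if !ok {
			continue
		}
		switch key {
		case "state":
			h.State = val
		case "offset":
			n, err := strconv.ParseInt(val, 10, 64)
			if err != nil {
				return h, fmt.Errorf("parse offset %q: %w", val, err)
			}
			h.Offset = n
		case "lag":
			n, err := strconv.ParseInt(val, 10, 64)
			if err != nil {
				return h, fmt.Errorf("parse lag %q: %w", val, err)
			}
			h.Lag = n
		}
	}
	return h, nil
}

// healthy applies the three criteria: state=online, offset>0, lag<=1.
// A missing lag (-1) is treated as unhealthy.
func (h replicaHealth) healthy() bool {
	return h.State == "online" && h.Offset > 0 && h.Lag >= 0 && h.Lag <= 1
}

func main() {
	for _, line := range []string{
		"ip=10.0.0.2,port=6379,state=online,offset=1200,lag=0",
		"ip=10.0.0.3,port=6379,state=online,offset=0,lag=0",
	} {
		h, err := parseReplicaLine(line)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Println(line, "healthy:", h.healthy())
	}
}
```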

melancholictheory · 2026-06-12T12:00:27Z

thanks for the review. going through the three.

on the no-available-replica case: this path doesn't reach highestOffsetReplica. findFailoverShard returns nil when there are no synced replicas, so proactiveFailover is only ever called with a non-empty slice, and a shard with no synced replica just rolls the primary without a proactive failover (same as today). i left the function with that precondition rather than a nil return, but happy to add an explicit guard plus comment if you'd rather it be defensive at the boundary.

on the overlap with #244: agree, we're both picking the highest-offset replica (promoteOrphanedReplicas does the same for the TAKEOVER path), so it makes sense to have one primitive. i'd move the selection next to ReplicationOffset in the valkey package so the proactive-failover and the recovery path share it. happy to either pull it out here and have #244 rebase onto it, or rebase onto #244 if that lands first, whichever is easier for you and @bjosv.

on offset=0: good catch, that's the real gap. a replica that's link-up but still at offset 0 hasn't taken the dataset yet, so promoting it would lose data, and the graceful failover would probably just time out waiting for it to catch up. mapping your three criteria onto what the operator already has:

status=online is roughly the synced filter we apply today (master_link_status:up in GetSyncedReplicas).
offset>0 is directly available, it's the same slave_repl_offset this PR already reads, so i can gate on it right away.
lag<=1 lives on the primary's INFO (the slaveN ...,lag= line), or as a byte-delta from master_repl_offset - slave_repl_offset on the replica. doable, just a bit more plumbing.

since the graceful CLUSTER FAILOVER already holds writes until the target catches up, offset>0 is the one that actually prevents data loss (don't promote an empty replica); lag mostly affects how long that write pause is, which picking the highest offset already minimises. so my inclination is offset>0 as a hard filter in the shared primitive, with lag as a follow-up if you want the full health triplet.

one question on the hard filter: if no replica clears offset>0, would you rather skip the proactive failover and just roll (letting the cluster do an unplanned failover if it needs to), or keep it best-effort? i'd lean skip-and-roll.

happy to push the offset>0 gate and the shared primitive once we settle where it should live.
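
One possible shape for the proposed offset>0 gate with skip-and-roll is sketched below. Nothing here is settled in the thread: the `replica` type and `pickFailoverTarget` name are hypothetical, and the sketch assumes offsets have already been read, with -1 meaning unavailable.

```go
package main

import "fmt"

// replica is a hypothetical, minimal view of a synced replica.
// Offset is -1 when no replication offset is available.
type replica struct {
	Address string
	Offset  int64
}

// pickFailoverTarget returns the replica with the greatest offset among
// those with offset > 0. ok is false when no replica qualifies, which
// the caller would treat as "skip the proactive failover and just roll".
// Ties keep discovery order because the comparison is strict.
func pickFailoverTarget(replicas []replica) (target replica, ok bool) {
	var bestOffset int64 // starts at 0, so only offsets > 0 can qualify
	for _, r := range replicas {
		if r.Offset > bestOffset {
			target, bestOffset, ok = r, r.Offset, true
		}
	}
	return target, ok
}

func main() {
	candidates := []replica{
		{Address: "10.0.0.1:6379", Offset: 0},  // link up, dataset not yet loaded
		{Address: "10.0.0.2:6379", Offset: -1}, // offset unavailable
	}
	if target, ok := pickFailoverTarget(candidates); ok {
		fmt.Println("failing over to", target.Address)
	} else {
		fmt.Println("no eligible replica: skipping proactive failover")
	}
}
```

Returning a boolean instead of falling back to the first replica also removes the non-empty precondition, since an empty slice simply yields `ok == false`.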

greptile-apps Bot reviewed Jun 12, 2026

Comment thread internal/controller/failover_test.go

jdheyburn reviewed Jun 12, 2026

jdheyburn added this to valkey-operator 0.3.0 Jun 12, 2026

github-project-automation Bot moved this to Todo in valkey-operator 0.3.0 Jun 12, 2026

jdheyburn added the enhancement and data-safety labels Jun 12, 2026

jdheyburn mentioned this pull request Jun 12, 2026

fix: recover cluster when majority of primaries are lost #244 (Open, 5 tasks)

melancholictheory wants to merge 2 commits into valkey-io:main from melancholictheory:feat/failover-highest-offset
