feat: fail over to the highest-offset replica#249
Conversation
When proactively failing a primary over before a roll, pick the synced replica with the greatest replication offset instead of the first one in discovery order. A graceful CLUSTER FAILOVER holds writes on the primary until the target replica catches up, so promoting the furthest-ahead replica minimises that write pause and the exposure if the primary dies mid-failover. The offset is read from slave_repl_offset in each node's INFO replication, which the operator already collects, so no extra queries are needed. Signed-off-by: melancholictheory <selimvhorst@gmail.com>
|
| Filename | Overview |
|---|---|
| internal/valkey/clusterstate.go | Adds ReplicationOffset() method to NodeState; reads slave_repl_offset from the already-populated Info map and parses it safely with strconv.ParseInt, returning -1 on any failure. |
| internal/controller/failover.go | Adds highestOffsetReplica() helper and replaces the replicas[0] selection in proactiveFailover with it. Logic is correct with stable discovery-order tie-breaking. |
| internal/controller/failover_test.go | Adds TestHighestOffsetReplica with five sub-tests covering all documented behavioural contracts. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[proactiveFailover called] --> B[highestOffsetReplica]
B --> C["best = replicas[0], bestOffset = ReplicationOffset()"]
C --> D{More replicas?}
D -- yes --> E["offset = replica.ReplicationOffset()"]
E --> F{offset > bestOffset?}
F -- yes --> G["best = replica, bestOffset = offset"]
F -- no --> D
G --> D
D -- no --> H[return best replica]
H --> I[Issue CLUSTER FAILOVER to target]
I --> J{Poll: role == master?}
J -- yes --> K[Failover complete]
J -- timeout --> L[Return timeout error]
J -- ctx done --> M[Return ctx error]
Reviews (2): Last reviewed commit: "test: cover all-offsets-unavailable fail..." | Re-trigger Greptile
Assert highestOffsetReplica returns the first replica in discovery order when no replica exposes slave_repl_offset, documenting the fallback contract and guarding the -1 sentinel against regression. Signed-off-by: melancholictheory <selimvhorst@gmail.com>
| target := highestOffsetReplica(replicas) | ||
| log.Info("initiating proactive failover", "shard", shard.Id, "target", target.Address) |
There was a problem hiding this comment.
What happens when there is no available replica?
| best, bestOffset = replica, offset | ||
| } | ||
| } | ||
| return best |
There was a problem hiding this comment.
There is some similar code also in #244, could these align to prevent duplication?
| best, bestOffset = replica, offset | ||
| } | ||
| } | ||
| return best |
There was a problem hiding this comment.
What happens when offset=0 (replica is connected but is in replicating state)? We wouldn't want to failover to that replica.
In my scripts for our helm deployment, I look at this criteria to determine if the replica is healthy:
status=onlineoffset>0lag<=1
Would it be useful to have something similar here?
|
thanks for the review. going through the three. on the no-available-replica case: this path doesn't reach highestOffsetReplica. findFailoverShard returns nil when there are no synced replicas, so proactiveFailover is only ever called with a non-empty slice, and a shard with no synced replica just rolls the primary without a proactive failover (same as today). i left the function with that precondition rather than a nil return, but happy to add an explicit guard plus comment if you'd rather it be defensive at the boundary. on the overlap with #244: agree, we're both picking the highest-offset replica (promoteOrphanedReplicas does the same for the TAKEOVER path), so it makes sense to have one primitive. i'd move the selection next to ReplicationOffset in the valkey package so the proactive-failover and the recovery path share it. happy to either pull it out here and have #244 rebase onto it, or rebase onto #244 if that lands first, whichever is easier for you and @bjosv. on offset=0: good catch, that's the real gap. a replica that's link-up but still at offset 0 hasn't taken the dataset yet, so promoting it would lose data, and the graceful failover would probably just time out waiting for it to catch up. mapping your three criteria onto what the operator already has:
since the graceful CLUSTER FAILOVER already holds writes until the target catches up, offset>0 is the one that actually prevents data loss (don't promote an empty replica); lag mostly affects how long that write pause is, which picking the highest offset already minimises. so my inclination is offset>0 as a hard filter in the shared primitive, with lag as a follow-up if you want the full health triplet. one question on the hard filter: if no replica clears offset>0, would you rather skip the proactive failover and just roll (letting the cluster do an unplanned failover if it needs to), or keep it best-effort? i'd lean skip-and-roll. happy to push the offset>0 gate and the shared primitive once we settle where it should live. |
Summary
When the operator proactively fails a primary over before rolling it, it currently promotes the first synced replica in discovery order, which can be the one furthest behind.
A graceful
CLUSTER FAILOVERholds writes on the primary until the target replica catches up, so promoting the most caught-up replica shortens that write pause and narrows the exposure if the primary dies mid-failover. This is improvement #1 from the discussion in #231.Features / Behaviour Changes
Implementation
(*NodeState).ReplicationOffset()readsslave_repl_offsetfrom the node'sINFO replication, whichgetNodeStatealready collects intonode.Info, so there are no extra queries.highestOffsetReplicachooses the furthest-ahead replica. Replicas with no available offset sort last and ties keep discovery order, so the result is stable and never nil for a non-empty input.Limitations
Testing
highestOffsetReplica: greatest offset wins, a missing offset sorts last, ties keep discovery order, single replica.make testandmake lintpass locally.Checklist
make testandmake lintinstead)