fix(operator): persist reapply-on-reboot reset before advancing boot id #261
ayuskauskas wants to merge 1 commit
With REAPPLY_ON_REBOOT=true, a reboot of a node under heavy controller churn could be detected and then silently lost. TrackReboots persisted the per-node state reset with a full Update that lost an optimistic-concurrency race against unrelated label/annotation/status writes, yet it advanced the node's boot id anyway, marking the reboot handled. The node kept its stale "complete" state and the package was never reapplied (status went unknown -> complete with no pod scheduled). Quiet nodes were unaffected because their Update did not conflict.

Persist the reset via a strategic-merge Patch (not resourceVersion-gated, matching the rest of the reconcile) so unrelated node churn no longer conflicts, and advance NodeBootIds only after that write succeeds so a failed reset leaves the reboot pending to be re-detected and retried.

Also fix Reset() deleting the cordon annotation with a key missing the Skyhook name, and invalidate the in-memory nodeState cache on reset.

Adds a deterministic envtest reproduction that drives a real apiserver 409 via out-of-band node churn (fails on the old Update path, passes on the Patch path).

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
No actionable comments were generated in the recent review.
Walkthrough

This PR fixes a bug in the reapply-on-reboot feature where concurrent node modifications could cause the operator to lose write races. The root cause is an incorrect key construction in …

Estimated code review effort: 3 (Moderate), ~20 minutes

Pre-merge checks: 4 passed
Problem
With REAPPLY_ON_REBOOT=true, rebooting a node that hosts many controllers (high churn of pods / node events / label + annotation updates) caused the operator to detect the reboot, transition unknown -> complete, and never schedule the reapply pod. Quiet nodes worked fine.

Root cause
TrackReboots (internal/controller/skyhook_controller.go) handled a reboot as two independent, non-atomic writes:

1. node.Reset() clears the nodeState_<name> annotation, then the node is persisted with a full r.Update.
2. Status.NodeBootIds[node] is advanced and the Skyhook status is written independently.

On a busy node the node's resourceVersion churns constantly, so the full Update loses the optimistic-concurrency race and returns 409 Conflict (there was no retry on this path). But the boot-id advance still committed, marking the reboot handled forever, while the node kept its stale complete nodeState annotation, which State() reads as the source of truth on the next reconcile. Result: unknown -> complete, no pod.

Fix (defense in depth)
1. Persist the reset via Patch(client.StrategicMergeFrom), matching SaveNodesAndSkyhook. Merge patches are not resourceVersion-gated, so unrelated node churn no longer conflicts and the reset lands on busy nodes.
2. Advance NodeBootIds only after that write succeeds. If the reset fails for any reason, the boot-id is left unchanged so the reboot is re-detected and retried next reconcile instead of being silently consumed.
3. Reset(): fix the cordon annotation delete key (was cordon_, missing the Skyhook name) and invalidate the in-memory nodeState cache.

Test
Adds a deterministic envtest reproduction that drives a real apiserver 409 via out-of-band node churn (no mocked/injected error). It fails on the old Update path (stale complete annotation survives) and passes on the Patch path (annotation cleared, boot-id advanced).

Verification
make lint: 0 issues.

Docs / contracts
No docs/ page covers reapply-on-reboot, and no CLI-visible annotation/status shapes changed, so no doc or CLI update is required. CHANGELOG.md updated under [Unreleased].

🤖 Generated with Claude Code
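The ordering rule in the fix (advance the boot id only after the reset has been persisted) can be sketched in isolation. This is an illustrative model under stated assumptions, not the controller's implementation: `reconciler`, `handledBootIDs`, `resetNode`, `trackReboot`, and `demo` are hypothetical names, and the injected write failure only stands in for any error the real node write might return.

```go
package main

import (
	"errors"
	"fmt"
)

// reconciler is a hypothetical stand-in for the operator's state; none of
// these names are taken from the actual controller.
type reconciler struct {
	handledBootIDs map[string]string       // node name -> last boot id handled
	resetNode      func(name string) error // persists the per-node reset
}

// trackReboot reports whether a reboot was detected and fully handled.
func (r *reconciler) trackReboot(nodeName, observedBootID string) (bool, error) {
	if r.handledBootIDs[nodeName] == observedBootID {
		return false, nil // no reboot since the last reconcile
	}
	if err := r.resetNode(nodeName); err != nil {
		// Leave the recorded boot id untouched so the next reconcile
		// sees the same mismatch and retries the reset.
		return false, fmt.Errorf("reset node %s: %w", nodeName, err)
	}
	// Only now is the reboot marked as handled.
	r.handledBootIDs[nodeName] = observedBootID
	return true, nil
}

// demo fails the first reset write and lets the second one succeed.
func demo() (firstErr error, pendingAfterFailure, handledOnRetry bool) {
	attempts := 0
	r := &reconciler{
		handledBootIDs: map[string]string{"node-a": "boot-1"},
		resetNode: func(string) error {
			attempts++
			if attempts == 1 {
				return errors.New("transient write failure")
			}
			return nil
		},
	}
	_, firstErr = r.trackReboot("node-a", "boot-2")
	pendingAfterFailure = r.handledBootIDs["node-a"] == "boot-1"
	handledOnRetry, _ = r.trackReboot("node-a", "boot-2")
	return firstErr, pendingAfterFailure, handledOnRetry
}

func main() {
	err, pending, handled := demo()
	fmt.Println("first attempt error:", err)
	fmt.Println("reboot still pending after failure:", pending)
	fmt.Println("reboot handled on retry:", handled)
}
```

The design choice is that the boot id acts as the acknowledgement of the reboot, so it should be the last thing written; any earlier failure then leaves the reboot detectable rather than consumed.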