Skip to content

fix(operator): persist reapply-on-reboot reset before advancing boot id#261

Open
ayuskauskas wants to merge 1 commit into
mainfrom
worktree-investigate-reapply-reboot
Open

fix(operator): persist reapply-on-reboot reset before advancing boot id#261
ayuskauskas wants to merge 1 commit into
mainfrom
worktree-investigate-reapply-reboot

Conversation

@ayuskauskas
Copy link
Copy Markdown
Collaborator

Problem

With REAPPLY_ON_REBOOT=true, rebooting a node that hosts many controllers (high churn of pods / node events / label + annotation updates) caused the operator to detect the reboot, transition unknown -> complete, and never schedule the reapply pod. Quiet nodes worked fine.

Root cause

TrackReboots (internal/controller/skyhook_controller.go) handled a reboot as two independent, non-atomic writes:

  1. node.Reset() clears the nodeState_<name> annotation, then the node is persisted with a full r.Update.
  2. Status.NodeBootIds[node] is advanced and the Skyhook status is written independently.

On a busy node the node's resourceVersion churns constantly, so the full Update loses the optimistic-concurrency race and returns 409 Conflict (there was no retry on this path). But the boot-id advance still committed, marking the reboot handled forever, while the node kept its stale complete nodeState annotation — which State() reads as the source of truth on the next reconcile. Result: unknown -> complete, no pod.

Fix (defense in depth)

  • Persist the reset via a strategic-merge Patch (client.StrategicMergeFrom), matching SaveNodesAndSkyhook. Merge patches are not resourceVersion-gated, so unrelated node churn no longer conflicts — the reset lands on busy nodes.
  • Advance NodeBootIds only after that write succeeds. If the reset fails for any reason, the boot-id is left unchanged so the reboot is re-detected and retried next reconcile instead of being silently consumed.
  • Housekeeping in Reset(): fix the cordon annotation delete key (was cordon_, missing the Skyhook name) and invalidate the in-memory nodeState cache.

Test

Adds a deterministic envtest reproduction that drives a real apiserver 409 via out-of-band node churn (no mocked/injected error). It fails on the old Update path (stale complete annotation survives) and passes on the Patch path (annotation cleared, boot-id advanced).

Verification

  • New regression test: red on buggy code, green on fix.
  • Full controller suite: 163/163 pass. Wrapper suite: 68/68 pass.
  • make lint: 0 issues.

Docs / contracts

No docs/ page covers reapply-on-reboot, and no CLI-visible annotation/status shapes changed, so no doc or CLI update is required. CHANGELOG.md updated under [Unreleased].

🤖 Generated with Claude Code

With REAPPLY_ON_REBOOT=true, a reboot of a node under heavy controller
churn could be detected and then silently lost. TrackReboots persisted
the per-node state reset with a full Update that lost an optimistic-
concurrency race against unrelated label/annotation/status writes, yet
advanced the node's boot id anyway — marking the reboot handled. The node
kept its stale "complete" state and the package was never reapplied
(status went unknown -> complete with no pod scheduled). Quiet nodes were
unaffected because their Update did not conflict.

Persist the reset via a strategic-merge Patch (not resourceVersion-gated,
matching the rest of the reconcile) so unrelated node churn no longer
conflicts, and advance NodeBootIds only after that write succeeds so a
failed reset leaves the reboot pending to be re-detected and retried.

Also fix Reset() deleting the cordon annotation with a key missing the
Skyhook name, and invalidate the in-memory nodeState cache on reset.

Adds a deterministic envtest reproduction that drives a real apiserver
409 via out-of-band node churn (fails on the old Update path, passes on
the Patch path).

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 29, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: b2640008-3945-48cc-99e4-6559cf242c45

📥 Commits

Reviewing files that changed from the base of the PR and between b264019 and b9be3da.

📒 Files selected for processing (4)
  • operator/CHANGELOG.md
  • operator/internal/controller/skyhook_controller.go
  • operator/internal/controller/skyhook_controller_test.go
  • operator/internal/wrapper/node.go

📝 Walkthrough

Walkthrough

This PR fixes a bug in the reapply-on-reboot feature where concurrent node modifications could cause the operator to lose write races. The root cause is an incorrect key construction in Reset() that was dropping the cordon annotation. The fix corrects the annotation keys in Reset(), clears the in-memory cache to prevent stale state, switches TrackReboots to use a guarded strategic merge patch instead of unconditional full Updates, and defers boot-ID advancement until the reset persists successfully. A new regression test validates the fix under optimistic-concurrency load.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title accurately and concisely describes the primary change: persisting the reapply-on-reboot reset before advancing the boot ID, which is the core fix addressing the bug where reboots were being lost on busy nodes.
Description check ✅ Passed The pull request description is comprehensive and directly related to the changeset, clearly explaining the problem (reapply-on-reboot failures on busy nodes), root cause (non-atomic writes causing resourceVersion conflicts), the fix (strategic-merge Patch and sequenced writes), the test (deterministic envtest), and verification results.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch worktree-investigate-reapply-reboot

Comment @coderabbitai help to get the list of available commands and usage tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants