
fix: add retry logic for one time instruction #226


Open · wants to merge 1 commit into main
pkg/k8splan/watcher.go (2 changes: 1 addition & 1 deletion)
@@ -215,10 +215,10 @@ func (w *watcher) start(ctx context.Context, strictVerify bool) {
 	logrus.Infof("Detected first start, force-applying one-time instruction set")
 	needsApplied = true
 	hasRunOnce = true
+	secret.Data[appliedChecksumKey] = []byte("")
Collaborator commented:

@HarrisonWAffel I believe this will work, but I think we need to address ResetFailureCountOnStartup if we are changing the default behavior. Do you think we would be fine removing it from the plan and then removing it from Rancher? What about re-running failed plans at startup? I'm asking because these seem to be mostly Windows cases, so I need a refresher on the context (I remember there is some interesting behavior regarding potentially cyclic/competing services).

There is also the fact that this introduces a change to the system-agent-upgrader plan: the latest plan will be re-run during the upgrade. That's not a bug, but it is definitely worth noting, since plans may be running during agent upgrades (already a possibility now, but it becomes much more likely that something shows up in the UI).

@HarrisonWAffel (Contributor) commented on Apr 7, 2025:

In v2.10.0 the planner was updated to reattempt the Windows install plans multiple times before marking them as failed, as there can be transient issues that are not representative of a true plan failure. The problem I encountered was that if a plan failed to apply 3 times but then succeeded, it would only be reattempted two times after the next reboot, as the persisted failure count would still be equal to 3.
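
A minimal sketch of the arithmetic described above (the `maxFailures` budget, the helper name, and the literal `"failure-count"` key are illustrative assumptions, not the actual system-agent code, which reads the value via the `failureCountKey` constant):

```go
package main

import (
	"fmt"
	"strconv"
)

// remainingAttempts is an illustrative stand-in for the watcher's retry
// accounting: the persisted failure count eats into the total budget.
func remainingAttempts(secretData map[string][]byte, maxFailures int) int {
	count, _ := strconv.Atoi(string(secretData["failure-count"]))
	return maxFailures - count
}

func main() {
	secretData := map[string][]byte{
		// The plan failed 3 times before eventually succeeding, and the
		// count was never reset, so the stale value survives the reboot.
		"failure-count": []byte("3"),
	}
	fmt.Println(remainingAttempts(secretData, 5)) // 2: only 2 of 5 attempts left

	// ResetFailureCountOnStartup zeroes the count on first start,
	// restoring the full retry budget.
	secretData["failure-count"] = []byte("0")
	fmt.Println(remainingAttempts(secretData, 5)) // 5
}
```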

Handling this situation was complicated by an issue in rke2 which resulted in Calico HNS namespaces being deleted each time rke2 was restarted (typically via the one-time instruction). In that case, the plan should not be reattempted: if it was, the node might eventually be marked as available even though some behavior (like deleting pods) would be completely broken. The solution was to introduce this field and conditionally set it based on the cluster's k8s version.
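
A hypothetical illustration of that version gate (this is not Rancher's planner code; `fixedVersion` and `shouldResetFailureCount` are invented names, and the version boundary is assumed):

```go
package plan

import "golang.org/x/mod/semver"

// fixedVersion stands in for the (assumed) first rke2 release that no
// longer deletes Calico HNS namespaces on restart.
const fixedVersion = "v1.28.0"

// shouldResetFailureCount gates ResetFailureCountOnStartup: clusters still
// on a buggy version must not get fresh retries at startup, or a broken
// node could eventually be marked available.
func shouldResetFailureCount(k8sVersion string) bool {
	return semver.Compare(k8sVersion, fixedVersion) >= 0
}
```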

I think we still need to consider that situation. The rke2 fix was delivered in August of last year, so users have had plenty of time to upgrade, but removing this field and changing the default behavior could silently break some existing clusters. I would be in support of doing that for 2.12 and communicating it in the release notes.

The existing change for applied-checksum shouldn't run into the above issue, though.

 	// Plans which have previously succeeded but need to be force applied
 	// should continue to respect the specified failure count.
 	if cp.Plan.ResetFailureCountOnStartup {
-		secret.Data[appliedChecksumKey] = []byte("")
 		secret.Data[failureCountKey] = []byte("0")
 	}
 }
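
A minimal sketch of why blanking the stored checksum force-applies the plan, assuming the watcher re-applies whenever the stored value differs from the current plan's checksum (`needsApply` and the literal `"applied-checksum"` key are illustrative stand-ins for the real watcher logic and the `appliedChecksumKey` constant):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// planChecksum hashes the raw plan bytes.
func planChecksum(plan []byte) string {
	sum := sha256.Sum256(plan)
	return hex.EncodeToString(sum[:])
}

// needsApply reports whether the plan must be (re)applied: any mismatch
// between the stored and current checksums triggers an apply.
func needsApply(secretData map[string][]byte, plan []byte) bool {
	return string(secretData["applied-checksum"]) != planChecksum(plan)
}

func main() {
	plan := []byte(`{"instructions": []}`)
	secretData := map[string][]byte{
		"applied-checksum": []byte(planChecksum(plan)), // already applied
	}
	fmt.Println(needsApply(secretData, plan)) // false: checksums match

	// On first start the watcher blanks the stored checksum, guaranteeing
	// a mismatch and forcing the one-time instruction set to run again.
	secretData["applied-checksum"] = []byte("")
	fmt.Println(needsApply(secretData, plan)) // true: force re-apply
}
```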