fix: add retry logic for one time instruction #226
Open
pratikjagrut wants to merge 1 commit into rancher:main from pratikjagrut:retry.onetime.instruction
@HarrisonWAffel I believe this will work, but I think we need to address ResetFailureCountOnStartup if we are changing the default behavior. Do you think we would be fine removing it from the plan and then removing it from Rancher? What about re-running failed plans at startup? Asking because these seem like mostly Windows cases, so I need a refresher on the context (I remember there is some interesting behavior regarding potentially cyclic/competing services).

There is also the fact that this introduces a change to the system-agent-upgrader plan, which will re-run the latest plan during the upgrade. Not a bug, but definitely worth noting, since plans may be running during agent upgrades (already a possibility now, but much more likely something shows up in the UI).
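For readers without the context, here is a rough Go sketch of the kind of startup behavior being debated. The NodePlan struct, its fields, and onStartup are hypothetical stand-ins for illustration only; they are not the actual rancher/system-agent types.

```go
package main

import "fmt"

// NodePlan is a hypothetical, simplified stand-in for a node plan; the real
// plan type differs. It only illustrates the flag under discussion.
type NodePlan struct {
	MaxFailures                int  // how many times the agent may retry a failed plan
	ResetFailureCountOnStartup bool // whether the failure count is zeroed when the agent restarts
}

// onStartup sketches how a persisted failure count might be handled when the
// agent process (re)starts. If the flag is removed and resetting becomes the
// default, every restart hands the plan a fresh retry budget.
func onStartup(plan NodePlan, persistedFailureCount int) int {
	if plan.ResetFailureCountOnStartup {
		return 0 // fresh retry budget after reboot/agent restart
	}
	return persistedFailureCount // keep counting across restarts
}

func main() {
	plan := NodePlan{MaxFailures: 3, ResetFailureCountOnStartup: false}
	fmt.Println("failure count after restart:", onStartup(plan, 3)) // 3: no retries left
}
```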
In v2.10.0 the planner was updated to reattempt the Windows install plans multiple times before marking them as failed, as there can be transient issues that are not representative of a true plan failure. The problem I encountered was that, if a plan failed to apply 3 times but then succeeded, it would only be reattempted two times after the next reboot, as the failure count would still be equal to 3.

Handling this situation was complicated by an issue in rke2 which resulted in Calico HNS namespaces being deleted each time rke2 was restarted (typically via the one-time instruction). In that case, the plan should not be reattempted; if it was, the node might eventually be marked as available even though some behavior (like deleting pods) would be completely broken. The solution was to introduce this field and conditionally set it based on the cluster's k8s version.

I think we still need to consider that situation. The rke2 fix was delivered in August of last year, so users have had plenty of time to upgrade, but removing this field and changing the default behavior could potentially silently break some existing clusters. I would be in support of doing that for 2.12 and communicating it in the release notes.

The existing change for applied-checksum shouldn't run into the above issue though.
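To make the retry-budget problem described above concrete, here is a small illustrative sketch. The names (attemptPlan, applyOnce, maxFailures) are hypothetical and do not mirror the actual planner code; they only show how a stale failure count shrinks the number of attempts available after a reboot.

```go
package main

import (
	"errors"
	"fmt"
)

// attemptPlan sketches the retry loop described above: a plan may fail up to
// maxFailures times before being marked failed. This is an illustration of
// the failure-count discussion, not the real planner implementation.
func attemptPlan(applyOnce func() error, failureCount, maxFailures int) (int, error) {
	for failureCount < maxFailures {
		if err := applyOnce(); err != nil {
			failureCount++
			continue
		}
		// Success: if failureCount is NOT reset here (or on startup), the stale
		// value persists and shrinks the retry budget after the next reboot.
		return failureCount, nil
	}
	return failureCount, errors.New("plan marked as failed")
}

func main() {
	// First boot: the plan fails 3 times, then succeeds; the count stays at 3.
	calls := 0
	count, _ := attemptPlan(func() error {
		calls++
		if calls <= 3 {
			return errors.New("transient failure")
		}
		return nil
	}, 0, 5)
	fmt.Println("persisted failure count:", count) // 3

	// After reboot, without a reset the same plan only gets 2 of its 5 attempts.
	_, err := attemptPlan(func() error { return errors.New("transient failure") }, count, 5)
	fmt.Println("after reboot:", err) // marked failed after only two reattempts
}
```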