-
Notifications
You must be signed in to change notification settings - Fork 182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: autorecover from stuck situations #354
Open
gerrnot
wants to merge
5
commits into
helm:main
Choose a base branch
from
gerrnot:feat/hip-0019
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
5 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,107 @@ | ||
--- | ||
hip: 9999 | ||
title: "Autorecover from stuck situations" | ||
authors: [ "Gernot Feichter <[email protected]>" ] | ||
created: "2024-07-12" | ||
type: "feature" | ||
status: "draft" | ||
helm-version: 3 | ||
--- | ||
|
||
## Abstract | ||
|
||
The idea is to simplify the handling for both manual users and CI/CD pipelines, | ||
to auto-recover from a state of stuck deployments, which is currently not possible unless users implement | ||
boilerplate code around their helm invocations. | ||
|
||
## Motivation | ||
|
||
If a helm deployment fails, I want to be able to retry it, | ||
ideally by running the same command again to keep things simple. | ||
|
||
There are two known situations how the user can run into such a situation where a retry will NOT work: | ||
1. A helm upgrade/install process is killed while the release is in state `PENDING-UPGRADE` or `PENDING-INSTALL`. | ||
2. The initial helm release installation (as performed via `helm upgrade --install`) is in state `FAILED`. | ||
|
||
Known Workarounds that should become OBSOLETE when this HIP is implemented to recover from such situations: | ||
1. `kubectl delete secret '<the name of the secret where helm stores release information>'.` (Not possible if you don't want to lose all history) | ||
2. `helm delete` your release. (Not possible if you don't want to lose all history) | ||
3. `helm rollback` your release. (Not possible if it is the first installation) | ||
|
||
## Rationale | ||
|
||
The proposed solution uses a locking mechanism that is stored in k8s, such that all clients know whether a helm | ||
release is locked by themselves or not and for how long the lock is valid. | ||
|
||
It uses existing helm parameters like the --timeout parameter (which defaults to 5m) to determine for how long a helm release | ||
may be stuck in a pending state. | ||
|
||
## Specification | ||
|
||
The --timout parameter gets a deeper meaning. | ||
Previously the --timout parameter only had an effect on the helm process running on the respective client. | ||
After implementation, the --timout parameter will be stored in the helm release object (secret) in k8s and | ||
have an indirect impact on possible parallel processes. | ||
|
||
`helm ls -a` shows two new columns, regular `helm ls` does NOT show those: | ||
- `LOCKED TILL` | ||
datetime calculated by the helm client: current time + timeout parameter value | ||
Originally k8s server time was intended as "current time", but since helm exclusively uses the | ||
client time everywhere else, we do not change that via this HIP, such a refactoring would need | ||
to be performed via a separate HIP against the entire codebase. | ||
- `SESSION ID` | ||
|
||
Unique, random session id generated by the client. | ||
|
||
Furthermore, if the helm client process gets killed (SIGTERM), it tries to clear the LOCKED TILL value, | ||
SESSION ID and sets the release into a failed state before terminating in order to free the lock. | ||
|
||
## Backwards compatibility | ||
|
||
It is assumed that the helm release object as stored in k8s will not break | ||
older clients if new fields are added while existing fields are untouched. | ||
|
||
Backwards compatibility will be tested during implementation! | ||
|
||
## Security implications | ||
|
||
The proposed solution should not have an impact on security. | ||
|
||
## How to teach this | ||
|
||
Since the way that helm is invoked is not altered, there will not be much to teach here. | ||
The usage of the timeout parameter is encouraged, but since the default timeout is already 5m, not even that | ||
needs to be encouraged. | ||
|
||
It should just reduce the amount of frustration when dealing with pending and failed helm releases. | ||
|
||
A retry of a failed command should just work (assuming the retry happens when no other client has a valid lock). | ||
|
||
## Reference implementation | ||
|
||
helm: https://github.com/gerrnot/helm/tree/feat/autorecover-from-stuck-situations | ||
|
||
acceptance-testing: https://github.com/gerrnot/acceptance-testing/tree/feat/autorecover-from-stuck-situations | ||
|
||
## Rejected ideas | ||
|
||
None | ||
|
||
## Open issues | ||
|
||
- [ ] HIP status `accepted' | ||
- [x] Reference implementation | ||
- [x] Test for concurrent upgrade (valid lock should still block concurrent upgrade attempts) | ||
- [x] Test for upgrading from pending state | ||
- [x] Test for upgrading from failed state | ||
- [ ] Decision: Helm ls -> which flag should show the new fields `LOCKED TILL` and `SESSION ID`? | ||
- [ ] Decision: k8s Lease object vs helm relesae secret for storing the `LOCKED TILL` and `SESSION ID` | ||
- [x] Backwards compatibility check (part of acceptance tests repo, looking good already, even when storing the state in the release object) | ||
|
||
## References | ||
|
||
https://github.com/helm/helm/issues/7476 | ||
|
||
https://github.com/rancher/rancher/issues/44530 | ||
|
||
https://github.com/helm/helm/issues/11863 |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that would not be fully backwards compatible. Other programs that depend on Helm may parse the output from
helm ls -a
and expect the number of columns to be unchanged.Not sure how bad that is, though. Every change is a breaking change to someone, and I think this seems like a reasonable change. And parsing the default (table) output is not robust anyway, if they do parse the output they should probably use
-o json
or-o yaml
instead.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am very open to change the behaviour regarding
-a
, since-a
traditionally means all releases (in all states), rather than the colums displayed.Maybe rather tie it to the existing flag
--debug
or add a new flag? I actually would prefer that, maybe call it--concurrency
? It's really just that most users will never need to see it...@lindhe : What do you think?
Sidenote: There is also the
--short
flag, but that false per default, so we cannot use that and having a long flag could yield to funny situations, e.g. specifying short and long....There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm impartial to changing it or not. If this feature can get added with the modification to
-a
, I think that's a good change. But if that causes any roadblocks I do not think it's worth it.I think adding a new flag sounds a bit convoluted. If it cannot be part of
-a
by default , then it might make sense to tie into--debug
like you suggest. But I would like to get some input from Helm maintainers before going too far with the specifics of that implementation. If they agree that it is OK to introduce a new column by default, that's probably the cleanest solution.