
feat: autorecover from stuck situations #354

Open · wants to merge 5 commits into base: `main`
107 changes: 107 additions & 0 deletions hips/hip-9999.md
---
hip: 9999
title: "Autorecover from stuck situations"
authors: [ "Gernot Feichter <[email protected]>" ]
created: "2024-07-12"
type: "feature"
status: "draft"
helm-version: 3
---

## Abstract

The idea is to simplify the handling for both manual users and CI/CD pipelines
by auto-recovering from stuck deployments, which is currently not possible unless users implement
boilerplate code around their helm invocations.

## Motivation

If a helm deployment fails, I want to be able to retry it,
ideally by running the same command again to keep things simple.

There are two known situations in which the user ends up in a state where a retry will NOT work:
1. A helm upgrade/install process is killed while the release is in state `PENDING-UPGRADE` or `PENDING-INSTALL`.
2. The initial helm release installation (as performed via `helm upgrade --install`) is in state `FAILED`.

Known workarounds, which should become obsolete once this HIP is implemented:
1. `kubectl delete secret '<the name of the secret where helm stores release information>'` (not an option if you do not want to lose the release history)
2. `helm delete` your release (not an option if you do not want to lose the release history)
3. `helm rollback` your release (not possible if it is the first installation)

## Rationale

The proposed solution uses a locking mechanism whose state is stored in Kubernetes, so that every client
knows whether a helm release is locked, whether it holds the lock itself, and for how long the lock is valid.

It reuses existing helm parameters such as `--timeout` (which defaults to 5m) to determine how long a helm
release may remain stuck in a pending state.

## Specification

The `--timeout` parameter gains a deeper meaning.
Previously, `--timeout` only had an effect on the helm process running on the respective client.
After implementation, the `--timeout` value will be stored in the helm release object (secret) in Kubernetes
and thereby have an indirect impact on possible parallel processes.

`helm ls -a` shows two new columns; a regular `helm ls` does NOT show them:
> **Reviewer:** I believe that would not be fully backwards compatible. Other programs that depend on Helm may parse the output of `helm ls -a` and expect the number of columns to be unchanged.
>
> Not sure how bad that is, though. Every change is a breaking change to someone, and this seems like a reasonable change. Besides, parsing the default (table) output is not robust anyway; anyone parsing the output should probably use `-o json` or `-o yaml` instead.

> **Author:** I am very open to changing the behaviour regarding `-a`, since `-a` traditionally means all releases (in all states) rather than which columns are displayed.
>
> Maybe we should rather tie it to the existing `--debug` flag, or add a new flag? I would actually prefer that; perhaps call it `--concurrency`? Most users will simply never need to see these columns.
>
> @lindhe: What do you think?
>
> Side note: there is also the `--short` flag, but it is false by default, so we cannot use it, and a corresponding long flag could lead to odd situations, e.g. specifying both short and long.

> **Reviewer:** I have no strong opinion either way. If this feature can be added with the modification to `-a`, I think that is a good change, but if it causes any roadblocks it is not worth it.
>
> Adding a new flag sounds a bit convoluted. If the columns cannot be part of `-a` by default, it might make sense to tie them to `--debug` as you suggest. But I would like input from the Helm maintainers before going too far with the specifics of the implementation. If they agree it is OK to introduce new columns by default, that is probably the cleanest solution.

- `LOCKED TILL`: a datetime calculated by the helm client as current time + the `--timeout` value.
  Kubernetes server time was originally intended as the "current time", but since helm exclusively uses
  client time everywhere else, this HIP does not change that; such a refactoring would have to be
  performed via a separate HIP against the entire codebase.
- `SESSION ID`: a unique, random session ID generated by the client.

Furthermore, if the helm client process is terminated (SIGTERM), it tries to clear the `LOCKED TILL` and
`SESSION ID` values and set the release into a failed state before exiting, in order to free the lock.
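
The SIGTERM cleanup can be sketched as follows. This is a minimal, Unix-only Go sketch under assumed names: `clearLock` is a hypothetical helper standing in for the real update of the release secret, here operating on an in-memory map:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// clearLock stands in for the real cleanup: clearing LOCKED TILL and
// SESSION ID in the release secret and marking the release failed.
// Hypothetical helper; the actual storage calls are out of scope here.
func clearLock(lock map[string]string) {
	delete(lock, "lockedTill")
	delete(lock, "sessionId")
	lock["status"] = "failed"
}

func main() {
	lock := map[string]string{
		"lockedTill": "2024-07-12T00:05:00Z",
		"sessionId":  "abc123",
	}

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	// Simulate the client being killed mid-upgrade (Unix only).
	go syscall.Kill(os.Getpid(), syscall.SIGTERM)

	<-sigs          // on SIGTERM...
	clearLock(lock) // ...free the lock before terminating
	fmt.Println(lock["status"]) // prints "failed"
}
```

Note that a SIGKILL cannot be caught, which is exactly why the `LOCKED TILL` expiry above is still needed as a fallback.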

## Backwards compatibility

It is assumed that adding new fields to the helm release object stored in Kubernetes will not break
older clients, as long as existing fields are left untouched.

Backwards compatibility will be tested during implementation.

## Security implications

The proposed solution should not have an impact on security.

## How to teach this

Since the way helm is invoked is not altered, there will not be much to teach here.
Use of the `--timeout` parameter is encouraged, but since it already defaults to 5m,
even that needs no special emphasis.

This feature should simply reduce the frustration of dealing with pending and failed helm releases.

A retry of a failed command should just work (assuming the retry happens when no other client has a valid lock).

## Reference implementation

helm: https://github.com/gerrnot/helm/tree/feat/autorecover-from-stuck-situations

acceptance-testing: https://github.com/gerrnot/acceptance-testing/tree/feat/autorecover-from-stuck-situations

## Rejected ideas

None

## Open issues

- [ ] HIP status `accepted`
- [x] Reference implementation
- [x] Test for concurrent upgrade (valid lock should still block concurrent upgrade attempts)
- [x] Test for upgrading from pending state
- [x] Test for upgrading from failed state
- [ ] Decision: `helm ls` -> which flag should show the new fields `LOCKED TILL` and `SESSION ID`?
- [ ] Decision: k8s Lease object vs. helm release secret for storing `LOCKED TILL` and `SESSION ID`
- [x] Backwards compatibility check (part of acceptance tests repo, looking good already, even when storing the state in the release object)

## References

https://github.com/helm/helm/issues/7476

https://github.com/rancher/rancher/issues/44530

https://github.com/helm/helm/issues/11863