Description
This is a backport issue for #4572, automatically created via GitHub Actions workflow initiated by @0xavi0
Original issue body:
SURE-11213
Issue description
During a Longhorn upgrade managed by Fleet, Fleet unexpectedly initiated a Helm uninstall of the existing Longhorn release instead of performing an in-place upgrade.
The uninstall was blocked by a protective uninstall hook in the Longhorn Helm chart, preventing data loss. However, this left the Helm release in a permanent uninstalling state.
Fleet had already deleted all previous Helm revisions, making rollback impossible. As a result, the Longhorn deployment was stuck and could not be upgraded further without manual intervention.
This behavior occurred without any explicit request to uninstall Longhorn.
Business impact:
- Fleet initiated a Helm uninstall of Longhorn during an upgrade attempt.
- Helm release status became permanently stuck in uninstalling.
- `helm history` showed only a single revision, stuck in the `uninstalling` state:

```
REVISION  STATUS        DESCRIPTION
1         uninstalling  Deletion in progress (or silently failed)
```
- All previous Helm revisions were removed by Fleet.
- Helm rollback was not possible due to missing revision history.
- Longhorn's uninstall hook blocked the uninstall and prevented data loss. Without it, the uninstall would have led to data loss.
- Upgrade to Longhorn 1.10.1 could not proceed while release remained in uninstalling.
Repro steps:
Reproduction is not straightforward through normal upgrade procedures. However, during testing in a lab environment, the following scenario produced a failure state similar to the one observed in the customer environment and led to identification of the documented workaround:
- Longhorn is installed and managed by Fleet (e.g., version 1.10.1).
- An upgrade to an incorrect or non-existent Longhorn chart version (e.g., 1.10.2) is configured in fleet.yaml.
- Fleet attempts the upgrade and fails due to chart/version not found.
- The version in fleet.yaml is reverted back to a valid version (e.g., 1.10.1).
- Fleet then initiates a Helm uninstall of the existing release instead of recovering.
- Helm release becomes stuck in uninstalling state with no rollback path.
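For reference, the version pin in question lives in the bundle's fleet.yaml. A minimal sketch of the relevant section (the repo URL, release name, and namespace here are illustrative assumptions, not taken from the affected environment):

```yaml
# Hypothetical fleet.yaml excerpt; repo, releaseName, and namespace are assumptions.
defaultNamespace: longhorn-system
helm:
  releaseName: longhorn
  repo: https://charts.longhorn.io
  chart: longhorn
  version: 1.10.1   # reverting this from a non-existent 1.10.2 preceded the uninstall
```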
Workaround:
Is a workaround available and implemented? Yes.
- Pause the git repo
- Scale down fleet agent to 0
- Change the Helm release secret:

```bash
#!/bin/bash
# Get the secret and decode it properly
SECRET_NAME="sh.helm.release.v1.longhorn.v1"  # Replace with actual name
NAMESPACE="longhorn-system"                   # Replace with actual namespace

# Extract and decode the release data
kubectl get secret $SECRET_NAME -n $NAMESPACE -o jsonpath='{.data.release}' | base64 -d | base64 -d | gunzip > release.json

# View the release info
jq '.' release.json

# Mark the release as deployed and clear the deletion markers
jq '.info.status = "deployed" | .info.deleted = "" | del(.info.deletion_timestamp)' release.json > release-modified.json

# Re-encode the modified release
gzip -c release-modified.json | base64 | base64 -w0 > release-encoded.txt

# Create patch.json
cat > patch.json <<EOF
{
  "data": {
    "release": "$(cat release-encoded.txt)"
  },
  "metadata": {
    "labels": {
      "status": "deployed"
    }
  }
}
EOF

kubectl patch secret $SECRET_NAME -n $NAMESPACE --type=merge --patch-file=patch.json
```
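The double base64 plus gunzip pipeline above mirrors how Helm stores release data in its secret: the record is gzipped and base64-encoded, and `kubectl`'s jsonpath output adds a second base64 layer. As a sanity check before patching a real cluster, the encode/decode round trip can be exercised locally on a stand-in payload (the JSON below is illustrative, not a real release record):

```shell
#!/bin/sh
# Stand-in for a decoded release record; a real one comes from the
# kubectl jsonpath extraction shown above.
printf '{"info":{"status":"uninstalling"}}' > release.json

# Re-encode the way the workaround does: gzip, then two base64 layers.
gzip -c release.json | base64 | base64 -w0 > release-encoded.txt

# Decoding must reverse the layers in the same order and recover the JSON.
base64 -d < release-encoded.txt | base64 -d | gunzip > roundtrip.json

cmp -s release.json roundtrip.json && echo "round trip ok"
```

If the round trip does not reproduce the original file byte for byte, the re-encoded secret would corrupt the release record, so this is worth verifying before running the `kubectl patch`.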
Actual behavior
Fleet triggers a Helm uninstall during an upgrade attempt.
Expected behavior
Fleet must never initiate a Helm uninstall of an existing release during an upgrade unless explicitly instructed to do so.
Additional notes:
- The issue was observed across multiple clusters using the same Fleet bundle.
- Clusters that still had Helm revision history could be recovered via rollback, indicating inconsistent Fleet behavior.
- Given Longhorn’s role as a storage provider, this behavior represents a critical risk and should be explicitly prevented or guarded.