[backport v2.13] [SURE-11213] Fleet initiates unintended Longhorn uninstall during longhorn upgrade #4637

@rancherbot

This is a backport issue for #4572, automatically created via GitHub Actions workflow initiated by @0xavi0

Original issue body:

SURE-11213

Issue description

During a Longhorn upgrade managed by Fleet, Fleet unexpectedly initiated a Helm uninstall of the existing Longhorn release instead of performing an in-place upgrade.

The uninstall was blocked by a protective uninstall hook in the Longhorn Helm chart, preventing data loss. However, this left the Helm release in a permanent uninstalling state.

Fleet had already deleted all previous Helm revisions, making rollback impossible. As a result, the Longhorn deployment was stuck and could not be upgraded further without manual intervention.

This behavior occurred without any explicit request to uninstall Longhorn.

Business impact:

  • Fleet initiated a Helm uninstall of Longhorn during an upgrade attempt.
  • Helm release status became permanently stuck in uninstalling.
  • helm history showed only a single revision in uninstalling state:
    REVISION  STATUS         DESCRIPTION
    1         uninstalling   Deletion in progress (or silently failed)
  • All previous Helm revisions were removed by Fleet.
  • Helm rollback was not possible due to missing revision history.
  • The Longhorn uninstall hook blocked the uninstall; without it, the uninstall would have led to data loss.
  • Upgrade to Longhorn 1.10.1 could not proceed while release remained in uninstalling.
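
To check whether a cluster is in the stuck state described above, something along these lines may help. It is a minimal sketch: the release name `longhorn` and namespace `longhorn-system` are assumptions taken from the workaround script in this issue, and `is_uninstalling` is a hypothetical helper name, not part of Helm or Fleet.

```shell
#!/bin/bash
# Return success (0) if the named release is currently in the "uninstalling" state.
# Usage: is_uninstalling <release> <namespace>
is_uninstalling() {
  local release="$1" namespace="$2"
  # --uninstalling limits the listing to releases that are being uninstalled,
  # -q prints release names only, and --filter takes a regular expression.
  [ -n "$(helm list -n "$namespace" --uninstalling -q --filter "^${release}\$")" ]
}

# Example call (release and namespace names are assumptions):
# if is_uninstalling longhorn longhorn-system; then
#   helm history longhorn -n longhorn-system
# fi
```

If the check succeeds and `helm history` shows only a single revision, the release matches the symptoms listed above.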

Repro steps:

Reproduction is not straightforward through normal upgrade procedures; however, during testing in a lab environment, the following scenario resulted in a similar failure state as observed in the customer environment and led to identification of the documented workaround:

  • Longhorn is installed and managed by Fleet (e.g., version 1.10.1).
  • An upgrade to an incorrect or non-existent Longhorn chart version (e.g., 1.10.2) is configured in fleet.yaml.
  • Fleet attempts the upgrade and fails due to chart/version not found.
  • The version in fleet.yaml is reverted back to a valid version (e.g., 1.10.1).
  • Fleet then initiates a Helm uninstall of the existing release instead of recovering.
  • Helm release becomes stuck in uninstalling state with no rollback path.
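
The version flip in the steps above could be scripted roughly as follows. This is a sketch under assumptions: the issue does not show the actual fleet.yaml, so the chart repository URL, release name, and namespace below are illustrative, and `write_fleet_yaml` is a hypothetical helper.

```shell
#!/bin/bash
# Write a minimal fleet.yaml that pins the Longhorn chart to the given version.
# Usage: write_fleet_yaml <chart-version> [output-file]
# The repo URL, release name, and namespace are assumptions, not values from the issue.
write_fleet_yaml() {
  local version="$1" out="${2:-fleet.yaml}"
  cat > "$out" <<EOF
defaultNamespace: longhorn-system
helm:
  releaseName: longhorn
  repo: https://charts.longhorn.io
  chart: longhorn
  version: "${version}"
EOF
}

# Example sequence for a lab environment only:
# write_fleet_yaml 1.10.2   # non-existent version; commit, push, wait for the failure
# write_fleet_yaml 1.10.1   # revert to the valid version; commit and push again
```

Each call only rewrites the file; committing and pushing to the Git repository watched by Fleet is left to the operator.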

Workaround:

Is a workaround available and implemented? Yes.

  1. Pause the git repo
  2. Scale down fleet agent to 0
  3. Change helm secret:
#!/bin/bash
# Get the secret and decode it properly
SECRET_NAME="sh.helm.release.v1.longhorn.v1" # Replace with actual name
NAMESPACE="longhorn-system" # Replace with actual namespace

# Extract and decode the release data (base64 twice, then gzip)
kubectl get secret "$SECRET_NAME" -n "$NAMESPACE" -o jsonpath='{.data.release}' | base64 -d | base64 -d | gunzip > release.json

# View the release info
jq '.' release.json

# Mark the release as deployed and clear the deletion fields
jq '.info.status = "deployed" | .info.deleted = "" | del(.info.deletion_timestamp)' release.json > release-modified.json

# Re-compress and double-encode the release (-w0 keeps each base64 output on a single line)
gzip -c release-modified.json | base64 -w0 | base64 -w0 > release-encoded.txt

# Create patch.json with the encoded release and the updated status label
cat > patch.json <<EOF
{
 "data": {
  "release": "$(cat release-encoded.txt)"
 },
 "metadata": {
  "labels": {
   "status": "deployed"
  }
 }
}
EOF

# Apply the patch to the Helm release secret
kubectl patch secret "$SECRET_NAME" -n "$NAMESPACE" --type=merge --patch-file=patch.json
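
Steps 1 and 2 of the workaround are listed without commands. One possible way to carry them out, and to confirm the result of step 3, is sketched below. The function names are hypothetical, and the `fleet-default` and `cattle-fleet-system` namespaces, as well as the assumption that the agent runs as a Deployment named `fleet-agent`, may not match every Fleet version or setup.

```shell
#!/bin/bash
# Step 1: pause the GitRepo (run against the cluster where the GitRepo resource lives).
# Usage: pause_gitrepo <gitrepo-name> [namespace]
pause_gitrepo() {
  local name="$1" namespace="${2:-fleet-default}"
  kubectl patch gitrepo "$name" -n "$namespace" --type=merge -p '{"spec":{"paused":true}}'
}

# Step 2: scale the Fleet agent to zero (run against the affected downstream cluster).
# Usage: scale_down_agent [namespace]
scale_down_agent() {
  local namespace="${1:-cattle-fleet-system}"
  kubectl scale deployment fleet-agent -n "$namespace" --replicas=0
}

# After step 3: print the status stored inside the patched Helm release secret.
# Usage: stored_release_status <secret-name> <namespace>
stored_release_status() {
  local secret="$1" namespace="$2"
  kubectl get secret "$secret" -n "$namespace" -o jsonpath='{.data.release}' \
    | base64 -d | base64 -d | gunzip | jq -r '.info.status'
}
```

After the patch, `stored_release_status sh.helm.release.v1.longhorn.v1 longhorn-system` would be expected to print `deployed` if the secret was updated as intended.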

Actual behavior

Fleet triggers a Helm uninstall during an upgrade attempt.

Expected behavior

Fleet must never initiate a Helm uninstall for Longhorn during an upgrade unless explicitly instructed to do so.

Additional notes:

  • The issue was observed across multiple clusters using the same Fleet bundle.
  • Clusters that still had Helm revision history could be recovered via rollback, indicating inconsistent Fleet behavior.
  • Given Longhorn’s role as a storage provider, this behavior represents a critical risk and should be explicitly prevented or guarded.

Metadata

Status

✅ Done
