Fix stale bundle error when cluster is offline after bad commit by thardeck · Pull Request #4780 · rancher/fleet

thardeck · 2026-03-10T08:34:34Z

When a GitRepo contains a YAML parse error and the cluster agent is offline, the bundle's Ready condition retains the error message even after a fix commit is pushed. Three interdependent changes are needed to clear the stale state.

deployer.go: Add MalformedYAMLError to the deployErrToStatus regex. Helm v4 changed the error format from "YAML parse error" to "MalformedYAMLError"; without this match the error is routed to the Deployed condition instead of Installed, bypassing the staleness guard entirely.
summary.go: In MessageFromDeployment, skip the Installed condition message when AppliedDeploymentID differs from Spec.DeploymentID, so a stale error from a superseded apply attempt is not surfaced. Unit tests added for all three condition precedence cases.
target.go: Add effectiveDeployment so state and message compare against t.DeploymentID (the ID the controller is about to write) rather than the stale Spec.DeploymentID still held in the cached BundleDeployment. The bundle controller calls SetReadyConditions before updating BD specs, so the summary.go guard would otherwise never trigger while the agent is offline. Uses a shallow struct copy;
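The interaction between the summary.go guard and effectiveDeployment can be sketched as follows. This is a minimal illustration, not the code from the PR: the types are simplified stand-ins for Fleet's BundleDeployment, the `InstalledMessage` field and both lower-case function names are hypothetical, and only the `DeploymentID` / `AppliedDeploymentID` comparison is taken from the description above.

```go
package main

import "fmt"

// Simplified stand-ins for Fleet's BundleDeployment types. The field names
// DeploymentID and AppliedDeploymentID follow the PR description; everything
// else is an assumption made for this sketch.
type Spec struct{ DeploymentID string }

type Status struct {
	AppliedDeploymentID string
	InstalledMessage    string // hypothetical: message carried by the Installed condition
}

type BundleDeployment struct {
	Spec   Spec
	Status Status
}

// messageFromDeployment mirrors the guard described for summary.go: the
// Installed message is only surfaced when it belongs to the current spec.
func messageFromDeployment(bd *BundleDeployment) string {
	if bd == nil || bd.Status.AppliedDeploymentID != bd.Spec.DeploymentID {
		return "" // message stems from a superseded apply attempt
	}
	return bd.Status.InstalledMessage
}

// effectiveDeployment returns a shallow copy whose Spec.DeploymentID is the ID
// the controller is about to write. The cached object is left untouched.
func effectiveDeployment(cached *BundleDeployment, targetID string) *BundleDeployment {
	if cached == nil {
		return nil
	}
	cp := *cached // shallow struct copy
	cp.Spec.DeploymentID = targetID
	return &cp
}

func main() {
	// The offline agent last applied the bad commit, and the cache still holds
	// the spec that was written for that commit.
	cached := &BundleDeployment{
		Spec:   Spec{DeploymentID: "bad-commit"},
		Status: Status{AppliedDeploymentID: "bad-commit", InstalledMessage: "MalformedYAMLError: ..."},
	}
	fmt.Printf("cached:    %q\n", messageFromDeployment(cached))
	fmt.Printf("effective: %q\n", messageFromDeployment(effectiveDeployment(cached, "fix-commit")))
}
```

Evaluated against the cached object, the IDs still match and the stale message is returned. Evaluated against the effective copy, the IDs differ and the guard yields an empty message, which is why the two changes only work together.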

Refers #594

Copilot

Pull request overview

This PR addresses a Fleet status staleness issue where Bundles can keep showing an old “YAML parse” error when a downstream cluster agent is offline and a subsequent commit fixes the repo content. The fix aligns error classification with Helm v4, avoids surfacing stale Installed messages, and ensures the controller compares status against the deployment ID it is about to write (not the cached BD spec).

Changes:

Update agent error-to-status mapping to recognize Helm v4 MalformedYAMLError as an Installed-condition error (preventing stale message routing).
Suppress Installed condition messages in summaries when they belong to a superseded deployment attempt (deployment ID mismatch).
In target status computation, evaluate state/message against an “effective” deployment ID that matches what the controller will write, and add an e2e regression test plus a minimal chart asset.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

File	Description
internal/cmd/controller/target/target.go	Computes state/message using an effective deployment ID to avoid stale cached BD spec issues while the agent is offline.
internal/cmd/controller/summary/summary.go	Prevents stale `Installed` messages from being surfaced when `AppliedDeploymentID` doesn’t match the current spec deployment ID.
internal/cmd/agent/deployer/deployer.go	Extends deploy error classification to include Helm v4 `MalformedYAMLError` as a YAML parse/install error.
e2e/single-cluster/status_test.go	Adds an e2e regression test simulating offline agent + bad commit + fix commit to ensure stale errors clear.
e2e/assets/single-cluster/offline-bundle-stuck/templates/configmap.yaml	Minimal chart content for the new e2e scenario.
e2e/assets/single-cluster/offline-bundle-stuck/fleet.yaml	Minimal Fleet config for the new e2e scenario.
e2e/assets/single-cluster/offline-bundle-stuck/Chart.yaml	Minimal Helm chart metadata for the new e2e scenario.

After the labels change, the controller uses effectiveDeployment to compute WaitApplied=1 (the new deployment ID hasn't been applied yet). The test was checking WaitApplied==0 without simulating the agent re-applying the updated bundle deployment. Split the assertion into three steps: wait for the BD spec change, simulate the agent applying the new deployment, then assert that the bundle shows WaitApplied=0.
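The three test steps can be pictured with a small model of the WaitApplied count. This is only a sketch under the assumption that a target counts as WaitApplied while the agent's applied ID differs from the wanted ID. The `target` type and `waitApplied` function are hypothetical and are not the integration test itself.

```go
package main

import "fmt"

// target is a simplified stand-in for a bundle target; the field names are
// assumptions chosen to match the wording of the commit message.
type target struct {
	DeploymentID        string // ID the controller wants deployed
	AppliedDeploymentID string // ID the agent last reported as applied
}

// waitApplied counts targets whose agent has not yet applied the wanted ID.
func waitApplied(targets []target) int {
	n := 0
	for _, t := range targets {
		if t.AppliedDeploymentID != t.DeploymentID {
			n++
		}
	}
	return n
}

func main() {
	targets := []target{{DeploymentID: "v1", AppliedDeploymentID: "v1"}}

	// Step 1: a label change produces a new deployment ID in the BD spec.
	targets[0].DeploymentID = "v2"
	fmt.Println("after spec change:", waitApplied(targets))

	// Step 2: simulate the agent applying the new deployment.
	targets[0].AppliedDeploymentID = targets[0].DeploymentID

	// Step 3: only now can the test expect the count to drop back to 0.
	fmt.Println("after agent apply:", waitApplied(targets))
}
```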

weyfonk

Looking mostly good, with a couple of questions on my end :)
This seems complementary with #2933.

Keeping stale errors in bundle statuses would make the existence of an issue visible, but would not help users understand what the issue might be.

Resolved merge conflict in integrationtests/controller/bundle/status_test.go. The main branch refactored the test to use DescribeTable, while implement_594 added the fix for effectiveDeployment. Both are now integrated. The test splits the assertion into three steps: wait for the BD spec change, simulate the agent applying, then verify WaitApplied==0.

When Helm v4 uses server-side apply with strict field validation, the API server rejects resources containing unknown fields with an error message containing "unknown field" rather than "error validating data". The previous regex only matched the client-side OpenAPI schema validation format, so SSA rejections were returned to the reconciler as real errors (Deployed=False, AppliedDeploymentID not updated). This caused the stale error suppression added in #4780 to not take effect when an offline cluster had a prior bad commit followed by a fix commit. Add "(unknown field)" to the deployErrToStatus pattern list so that both client-side and server-side field validation failures are stored in the Installed condition (AppliedDeploymentID updated, Deployed=True), allowing the existing ID-mismatch guard in MessageFromDeployment to suppress the stale message once a fix commit has been pushed. Refers #594

* Add integration test for stale unknown-field error suppression. Tests that when the agent's deployErrToStatus routes an "unknown field" error into the Installed condition (Deployed=True, AppliedDeploymentID set to the failing commit's ID), the bundle controller correctly suppresses the stale error once the spec advances to a new DeploymentID while the cluster stays offline.
* Move deployErrPattern to package-level var. Compiling the regex on every call to deployErrToStatus wastes CPU when deploy errors trigger repeated reconciliations. The pattern is static, so a single compilation at init time is sufficient.

thardeck self-assigned this Mar 10, 2026

thardeck requested a review from a team as a code owner March 10, 2026 08:34

thardeck added this to Fleet Mar 10, 2026

Copilot AI review requested due to automatic review settings March 10, 2026 08:34

Copilot started reviewing on behalf of thardeck March 10, 2026 08:35

Copilot AI reviewed Mar 10, 2026

Comment thread internal/cmd/controller/target/target.go Outdated

Comment thread internal/cmd/controller/summary/summary.go

Comment thread e2e/single-cluster/status_test.go

kkaempf added this to the v2.14.1 milestone Mar 10, 2026

kkaempf added kind/bug area/helm labels Mar 10, 2026

thardeck force-pushed the implement_594 branch from 59e3a92 to 833f378 Compare March 10, 2026 09:29

thardeck moved this to 👀 In review in Fleet Mar 10, 2026

thardeck force-pushed the implement_594 branch from 833f378 to 32fe08b Compare March 10, 2026 09:47

weyfonk reviewed Mar 13, 2026

Comment thread internal/cmd/controller/target/target.go

Comment thread integrationtests/controller/bundle/status_test.go Outdated

thardeck force-pushed the implement_594 branch from 5cb3514 to 3bdb930 Compare March 13, 2026 15:30

weyfonk approved these changes Mar 16, 2026

thardeck merged commit ff94719 into main Mar 16, 2026
22 checks passed

thardeck deleted the implement_594 branch March 16, 2026 09:28

github-project-automation Bot moved this from 👀 In review to ✅ Done in Fleet Mar 16, 2026

thardeck mentioned this pull request Mar 16, 2026

[v0.15] Fix stale bundle error when cluster is offline after bad commit (#4780) #4823

Merged

mmartin24 mentioned this pull request Apr 15, 2026

If any clusters are offline, bundle status can get stuck #594

Open

thardeck added a commit that referenced this pull request Apr 15, 2026

Revert "Fix stale bundle error when cluster is offline after bad commit (#4780) (#4823)"

5b1d68e

This reverts commit 9b1d80b.

thardeck mentioned this pull request Apr 15, 2026

[v0.15] Revert stale bundle error fix (#4823) #4985

Merged

thardeck added a commit that referenced this pull request Apr 16, 2026

Revert "Fix stale bundle error when cluster is offline after bad commit (#4780) (#4823)" (#4985)

6b800be

This reverts commit 9b1d80b.

thardeck merged 3 commits into main from implement_594
