Skip to content

Fix stale bundle error when cluster is offline after bad commit#4780

Merged
thardeck merged 3 commits into
mainfrom
implement_594
Mar 16, 2026
Merged

Fix stale bundle error when cluster is offline after bad commit#4780
thardeck merged 3 commits into
mainfrom
implement_594

Conversation

@thardeck
Copy link
Copy Markdown
Collaborator

@thardeck thardeck commented Mar 10, 2026

When a GitRepo contains a YAML parse error and the cluster agent is offline, the bundle's Ready condition retains the error message even after a fix commit is pushed. Three interdependent changes are needed to clear the stale state.

  • deployer.go: Add MalformedYAMLError to the deployErrToStatus regex. Helm v4 changed the error format from "YAML parse error" to "MalformedYAMLError"; without this match the error is routed to the Deployed condition instead of Installed, bypassing the staleness guard entirely.
  • summary.go: In MessageFromDeployment, skip the Installed condition message when AppliedDeploymentID differs from Spec.DeploymentID, so a stale error from a superseded apply attempt is not surfaced. Unit tests added for all three condition precedence cases.
  • target.go: Add effectiveDeployment so state and message compare against t.DeploymentID (the ID the controller is about to write) rather than the stale Spec.DeploymentID still held in the cached BundleDeployment. The bundle controller calls SetReadyConditions before updating BD specs, so the summary.go guard would otherwise never trigger while the agent is offline. Uses a shallow struct copy;

Refers #594

@thardeck thardeck self-assigned this Mar 10, 2026
@thardeck thardeck requested a review from a team as a code owner March 10, 2026 08:34
@thardeck thardeck added this to Fleet Mar 10, 2026
Copilot AI review requested due to automatic review settings March 10, 2026 08:34
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a Fleet status staleness issue where Bundles can keep showing an old “YAML parse” error when a downstream cluster agent is offline and a subsequent commit fixes the repo content. The fix aligns error classification with Helm v4, avoids surfacing stale Installed messages, and ensures the controller compares status against the deployment ID it is about to write (not the cached BD spec).

Changes:

  • Update agent error-to-status mapping to recognize Helm v4 MalformedYAMLError as an Installed-condition error (preventing stale message routing).
  • Suppress Installed condition messages in summaries when they belong to a superseded deployment attempt (deployment ID mismatch).
  • In target status computation, evaluate state/message against an “effective” deployment ID that matches what the controller will write, and add an e2e regression test plus a minimal chart asset.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
internal/cmd/controller/target/target.go Computes state/message using an effective deployment ID to avoid stale cached BD spec issues while the agent is offline.
internal/cmd/controller/summary/summary.go Prevents stale Installed messages from being surfaced when AppliedDeploymentID doesn’t match the current spec deployment ID.
internal/cmd/agent/deployer/deployer.go Extends deploy error classification to include Helm v4 MalformedYAMLError as a YAML parse/install error.
e2e/single-cluster/status_test.go Adds an e2e regression test simulating offline agent + bad commit + fix commit to ensure stale errors clear.
e2e/assets/single-cluster/offline-bundle-stuck/templates/configmap.yaml Minimal chart content for the new e2e scenario.
e2e/assets/single-cluster/offline-bundle-stuck/fleet.yaml Minimal Fleet config for the new e2e scenario.
e2e/assets/single-cluster/offline-bundle-stuck/Chart.yaml Minimal Helm chart metadata for the new e2e scenario.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/cmd/controller/target/target.go Outdated
Comment thread internal/cmd/controller/summary/summary.go
Comment thread e2e/single-cluster/status_test.go
@kkaempf kkaempf added this to the v2.14.1 milestone Mar 10, 2026
@thardeck thardeck moved this to 👀 In review in Fleet Mar 10, 2026
When a GitRepo contains a YAML parse error and the cluster agent is
offline, the bundle's Ready condition retains the error message even
after a fix commit is pushed. Three interdependent changes are needed.

deployer.go: Add MalformedYAMLError to the deployErrToStatus regex.
Helm v4 changed the error format from "YAML parse error" to
"MalformedYAMLError"; without this match the error is routed to the
Deployed condition instead of Installed, bypassing the staleness guard.

summary.go: In MessageFromDeployment, skip the Installed condition
message when AppliedDeploymentID differs from Spec.DeploymentID, so a
stale error from a superseded apply attempt is not surfaced.

target.go: Add effectiveDeployment so state and message compare against
t.DeploymentID (the ID the controller is about to write) rather than the
stale Spec.DeploymentID still held in the cached BundleDeployment. The
bundle controller calls SetReadyConditions before updating BD specs, so
the summary.go guard would otherwise never trigger while the agent is
offline.
After labels change the controller uses effectiveDeployment to compute
WaitApplied=1 (the new deployment ID hasn't been applied yet). The test
was checking WaitApplied==0 without simulating the agent re-applying the
updated bundle deployment. Split the assertion into three steps: wait for
BD spec change, simulate agent applying the new deployment, then assert
the bundle shows WaitApplied=0.
Copy link
Copy Markdown
Contributor

@weyfonk weyfonk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking mostly good, with a couple of questions on my end :)
This seems complementary with #2933.

Keeping stale errors in bundle statuses would make the existence of an issue visible, but do not help users understand what the issue might be.

Comment thread internal/cmd/controller/target/target.go
Comment thread integrationtests/controller/bundle/status_test.go Outdated
Resolved merge conflict in integrationtests/controller/bundle/status_test.go.
The main branch refactored the test to use DescribeTable, while implement_594
added the fix for effectiveDeployment. Both are now integrated: the test
splits the assertion into three steps - wait for BD spec change, simulate
agent applying, then verify WaitApplied==0.
@thardeck thardeck merged commit ff94719 into main Mar 16, 2026
22 checks passed
@thardeck thardeck deleted the implement_594 branch March 16, 2026 09:28
@github-project-automation github-project-automation Bot moved this from 👀 In review to ✅ Done in Fleet Mar 16, 2026
thardeck added a commit that referenced this pull request Mar 16, 2026
* Fix stale bundle error when cluster is offline after bad commit

When a GitRepo contains a YAML parse error and the cluster agent is
offline, the bundle's Ready condition retains the error message even
after a fix commit is pushed. Three interdependent changes are needed.

deployer.go: Add MalformedYAMLError to the deployErrToStatus regex.
Helm v4 changed the error format from "YAML parse error" to
"MalformedYAMLError"; without this match the error is routed to the
Deployed condition instead of Installed, bypassing the staleness guard.

summary.go: In MessageFromDeployment, skip the Installed condition
message when AppliedDeploymentID differs from Spec.DeploymentID, so a
stale error from a superseded apply attempt is not surfaced.

target.go: Add effectiveDeployment so state and message compare against
t.DeploymentID (the ID the controller is about to write) rather than the
stale Spec.DeploymentID still held in the cached BundleDeployment. The
bundle controller calls SetReadyConditions before updating BD specs, so
the summary.go guard would otherwise never trigger while the agent is
offline.

* Fix integration test after effectiveDeployment change

After labels change the controller uses effectiveDeployment to compute
WaitApplied=1 (the new deployment ID hasn't been applied yet). The test
was checking WaitApplied==0 without simulating the agent re-applying the
updated bundle deployment. Split the assertion into three steps: wait for
BD spec change, simulate agent applying the new deployment, then assert
the bundle shows WaitApplied=0.
thardeck added a commit that referenced this pull request Mar 20, 2026
… (#4823)

* Fix stale bundle error when cluster is offline after bad commit

When a GitRepo contains a YAML parse error and the cluster agent is
offline, the bundle's Ready condition retains the error message even
after a fix commit is pushed. Three interdependent changes are needed.

deployer.go: Add MalformedYAMLError to the deployErrToStatus regex.
Helm v4 changed the error format from "YAML parse error" to
"MalformedYAMLError"; without this match the error is routed to the
Deployed condition instead of Installed, bypassing the staleness guard.

summary.go: In MessageFromDeployment, skip the Installed condition
message when AppliedDeploymentID differs from Spec.DeploymentID, so a
stale error from a superseded apply attempt is not surfaced.

target.go: Add effectiveDeployment so state and message compare against
t.DeploymentID (the ID the controller is about to write) rather than the
stale Spec.DeploymentID still held in the cached BundleDeployment. The
bundle controller calls SetReadyConditions before updating BD specs, so
the summary.go guard would otherwise never trigger while the agent is
offline.

* Fix integration test after effectiveDeployment change

After labels change the controller uses effectiveDeployment to compute
WaitApplied=1 (the new deployment ID hasn't been applied yet). The test
was checking WaitApplied==0 without simulating the agent re-applying the
updated bundle deployment. Split the assertion into three steps: wait for
BD spec change, simulate agent applying the new deployment, then assert
the bundle shows WaitApplied=0.
thardeck added a commit that referenced this pull request Apr 10, 2026
When Helm v4 uses server-side apply with strict field validation, the
API server rejects resources containing unknown fields with an error
message containing "unknown field" rather than "error validating
data". The previous regex only matched the client-side OpenAPI schema
validation format, so SSA rejections were returned to the reconciler
as real errors (Deployed=False, AppliedDeploymentID not updated). This
caused the stale error suppression added in #4780 to not take effect
when an offline cluster had a prior bad commit followed by a fix commit.

Add "(unknown field)" to the deployErrToStatus pattern list so that
both client-side and server-side field validation failures are stored in
the Installed condition (AppliedDeploymentID updated, Deployed=True),
allowing the existing ID-mismatch guard in MessageFromDeployment to
suppress the stale message once a fix commit has been pushed.

Refers #594
thardeck added a commit that referenced this pull request Apr 14, 2026
* Catch unknown-field errors in deployErrToStatus

When Helm v4 uses server-side apply with strict field validation, the
API server rejects resources containing unknown fields with an error
message containing "unknown field" rather than "error validating
data". The previous regex only matched the client-side OpenAPI schema
validation format, so SSA rejections were returned to the reconciler
as real errors (Deployed=False, AppliedDeploymentID not updated). This
caused the stale error suppression added in #4780 to not take effect
when an offline cluster had a prior bad commit followed by a fix commit.

Add "(unknown field)" to the deployErrToStatus pattern list so that
both client-side and server-side field validation failures are stored in
the Installed condition (AppliedDeploymentID updated, Deployed=True),
allowing the existing ID-mismatch guard in MessageFromDeployment to
suppress the stale message once a fix commit has been pushed.

Refers #594

* Add integration test for stale unknown-field error suppression

Tests that when the agent's deployErrToStatus routes an "unknown field"
error into the Installed condition (Deployed=True, AppliedDeploymentID
set to the failing commit's ID), the bundle controller correctly
suppresses the stale error once the spec advances to a new DeploymentID
while the cluster stays offline.

* Move deployErrPattern to package-level var

Compiling the regex on every call to deployErrToStatus wastes CPU when
deploy errors trigger repeated reconciliations. The pattern is static,
so a single compilation at init time is sufficient.
thardeck added a commit that referenced this pull request Apr 15, 2026
thardeck added a commit that referenced this pull request Apr 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

4 participants