Fix stale bundle error when cluster is offline after bad commit#4780
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses a Fleet status staleness issue where Bundles can keep showing an old “YAML parse” error when a downstream cluster agent is offline and a subsequent commit fixes the repo content. The fix aligns error classification with Helm v4, avoids surfacing stale Installed messages, and ensures the controller compares status against the deployment ID it is about to write (not the cached BD spec).
Changes:
- Update agent error-to-status mapping to recognize Helm v4
MalformedYAMLErroras anInstalled-condition error (preventing stale message routing). - Suppress
Installedcondition messages in summaries when they belong to a superseded deployment attempt (deployment ID mismatch). - In target status computation, evaluate state/message against an “effective” deployment ID that matches what the controller will write, and add an e2e regression test plus a minimal chart asset.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/cmd/controller/target/target.go | Computes state/message using an effective deployment ID to avoid stale cached BD spec issues while the agent is offline. |
| internal/cmd/controller/summary/summary.go | Prevents stale Installed messages from being surfaced when AppliedDeploymentID doesn’t match the current spec deployment ID. |
| internal/cmd/agent/deployer/deployer.go | Extends deploy error classification to include Helm v4 MalformedYAMLError as a YAML parse/install error. |
| e2e/single-cluster/status_test.go | Adds an e2e regression test simulating offline agent + bad commit + fix commit to ensure stale errors clear. |
| e2e/assets/single-cluster/offline-bundle-stuck/templates/configmap.yaml | Minimal chart content for the new e2e scenario. |
| e2e/assets/single-cluster/offline-bundle-stuck/fleet.yaml | Minimal Fleet config for the new e2e scenario. |
| e2e/assets/single-cluster/offline-bundle-stuck/Chart.yaml | Minimal Helm chart metadata for the new e2e scenario. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
When a GitRepo contains a YAML parse error and the cluster agent is offline, the bundle's Ready condition retains the error message even after a fix commit is pushed. Three interdependent changes are needed. deployer.go: Add MalformedYAMLError to the deployErrToStatus regex. Helm v4 changed the error format from "YAML parse error" to "MalformedYAMLError"; without this match the error is routed to the Deployed condition instead of Installed, bypassing the staleness guard. summary.go: In MessageFromDeployment, skip the Installed condition message when AppliedDeploymentID differs from Spec.DeploymentID, so a stale error from a superseded apply attempt is not surfaced. target.go: Add effectiveDeployment so state and message compare against t.DeploymentID (the ID the controller is about to write) rather than the stale Spec.DeploymentID still held in the cached BundleDeployment. The bundle controller calls SetReadyConditions before updating BD specs, so the summary.go guard would otherwise never trigger while the agent is offline.
After labels change the controller uses effectiveDeployment to compute WaitApplied=1 (the new deployment ID hasn't been applied yet). The test was checking WaitApplied==0 without simulating the agent re-applying the updated bundle deployment. Split the assertion into three steps: wait for BD spec change, simulate agent applying the new deployment, then assert the bundle shows WaitApplied=0.
weyfonk
left a comment
There was a problem hiding this comment.
Looking mostly good, with a couple of questions on my end :)
This seems complementary with #2933.
Keeping stale errors in bundle statuses would make the existence of an issue visible, but do not help users understand what the issue might be.
Resolved merge conflict in integrationtests/controller/bundle/status_test.go. The main branch refactored the test to use DescribeTable, while implement_594 added the fix for effectiveDeployment. Both are now integrated: the test splits the assertion into three steps - wait for BD spec change, simulate agent applying, then verify WaitApplied==0.
* Fix stale bundle error when cluster is offline after bad commit When a GitRepo contains a YAML parse error and the cluster agent is offline, the bundle's Ready condition retains the error message even after a fix commit is pushed. Three interdependent changes are needed. deployer.go: Add MalformedYAMLError to the deployErrToStatus regex. Helm v4 changed the error format from "YAML parse error" to "MalformedYAMLError"; without this match the error is routed to the Deployed condition instead of Installed, bypassing the staleness guard. summary.go: In MessageFromDeployment, skip the Installed condition message when AppliedDeploymentID differs from Spec.DeploymentID, so a stale error from a superseded apply attempt is not surfaced. target.go: Add effectiveDeployment so state and message compare against t.DeploymentID (the ID the controller is about to write) rather than the stale Spec.DeploymentID still held in the cached BundleDeployment. The bundle controller calls SetReadyConditions before updating BD specs, so the summary.go guard would otherwise never trigger while the agent is offline. * Fix integration test after effectiveDeployment change After labels change the controller uses effectiveDeployment to compute WaitApplied=1 (the new deployment ID hasn't been applied yet). The test was checking WaitApplied==0 without simulating the agent re-applying the updated bundle deployment. Split the assertion into three steps: wait for BD spec change, simulate agent applying the new deployment, then assert the bundle shows WaitApplied=0.
… (#4823) * Fix stale bundle error when cluster is offline after bad commit When a GitRepo contains a YAML parse error and the cluster agent is offline, the bundle's Ready condition retains the error message even after a fix commit is pushed. Three interdependent changes are needed. deployer.go: Add MalformedYAMLError to the deployErrToStatus regex. Helm v4 changed the error format from "YAML parse error" to "MalformedYAMLError"; without this match the error is routed to the Deployed condition instead of Installed, bypassing the staleness guard. summary.go: In MessageFromDeployment, skip the Installed condition message when AppliedDeploymentID differs from Spec.DeploymentID, so a stale error from a superseded apply attempt is not surfaced. target.go: Add effectiveDeployment so state and message compare against t.DeploymentID (the ID the controller is about to write) rather than the stale Spec.DeploymentID still held in the cached BundleDeployment. The bundle controller calls SetReadyConditions before updating BD specs, so the summary.go guard would otherwise never trigger while the agent is offline. * Fix integration test after effectiveDeployment change After labels change the controller uses effectiveDeployment to compute WaitApplied=1 (the new deployment ID hasn't been applied yet). The test was checking WaitApplied==0 without simulating the agent re-applying the updated bundle deployment. Split the assertion into three steps: wait for BD spec change, simulate agent applying the new deployment, then assert the bundle shows WaitApplied=0.
When Helm v4 uses server-side apply with strict field validation, the API server rejects resources containing unknown fields with an error message containing "unknown field" rather than "error validating data". The previous regex only matched the client-side OpenAPI schema validation format, so SSA rejections were returned to the reconciler as real errors (Deployed=False, AppliedDeploymentID not updated). This caused the stale error suppression added in #4780 to not take effect when an offline cluster had a prior bad commit followed by a fix commit. Add "(unknown field)" to the deployErrToStatus pattern list so that both client-side and server-side field validation failures are stored in the Installed condition (AppliedDeploymentID updated, Deployed=True), allowing the existing ID-mismatch guard in MessageFromDeployment to suppress the stale message once a fix commit has been pushed. Refers #594
* Catch unknown-field errors in deployErrToStatus When Helm v4 uses server-side apply with strict field validation, the API server rejects resources containing unknown fields with an error message containing "unknown field" rather than "error validating data". The previous regex only matched the client-side OpenAPI schema validation format, so SSA rejections were returned to the reconciler as real errors (Deployed=False, AppliedDeploymentID not updated). This caused the stale error suppression added in #4780 to not take effect when an offline cluster had a prior bad commit followed by a fix commit. Add "(unknown field)" to the deployErrToStatus pattern list so that both client-side and server-side field validation failures are stored in the Installed condition (AppliedDeploymentID updated, Deployed=True), allowing the existing ID-mismatch guard in MessageFromDeployment to suppress the stale message once a fix commit has been pushed. Refers #594 * Add integration test for stale unknown-field error suppression Tests that when the agent's deployErrToStatus routes an "unknown field" error into the Installed condition (Deployed=True, AppliedDeploymentID set to the failing commit's ID), the bundle controller correctly suppresses the stale error once the spec advances to a new DeploymentID while the cluster stays offline. * Move deployErrPattern to package-level var Compiling the regex on every call to deployErrToStatus wastes CPU when deploy errors trigger repeated reconciliations. The pattern is static, so a single compilation at init time is sufficient.
When a GitRepo contains a YAML parse error and the cluster agent is offline, the bundle's Ready condition retains the error message even after a fix commit is pushed. Three interdependent changes are needed to clear the stale state.
deployer.go: AddMalformedYAMLErrorto thedeployErrToStatusregex. Helm v4 changed the error format from "YAML parse error" to "MalformedYAMLError"; without this match the error is routed to theDeployedcondition instead ofInstalled, bypassing the staleness guard entirely.summary.go: InMessageFromDeployment, skip theInstalledcondition message whenAppliedDeploymentIDdiffers fromSpec.DeploymentID, so a stale error from a superseded apply attempt is not surfaced. Unit tests added for all three condition precedence cases.target.go: AddeffectiveDeploymentsostateandmessagecompare againstt.DeploymentID(the ID the controller is about to write) rather than the staleSpec.DeploymentIDstill held in the cachedBundleDeployment. The bundle controller callsSetReadyConditionsbefore updating BD specs, so thesummary.goguard would otherwise never trigger while the agent is offline. Uses a shallow struct copy;Refers #594