fix(e2e): resolve recommend-mode Chainsaw intermittent timeout #92
Merged
The verify-status step asserted Ready=False/InsufficientData, a transient state the operator passes through before collecting Prometheus data. With minimumDataPoints=1 and a 15s scrape interval, the operator can transition to Ready=True/Monitoring within seconds, causing the static assert to time out waiting for a condition that has already passed.

Replace the static assert with a script-based poll that accepts either InsufficientData or Monitoring as valid states. Add a verify-no-resizes step to confirm Recommend mode does not apply changes.

Closes #87
Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca>
…labels

The pr-size workflow failed with HTTP 401 on gh pr edit --add-label because label management uses the Issues API, which needs issues:write permission (not just pull-requests:write). Also replaced 6 sequential gh pr edit calls with a single REST API PUT that fetches current labels, filters out size/* labels, and sets the new one in one call.

Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca>
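The filter-and-replace step described in this commit can be sketched as a small shell function. This is only an illustration of the idea, not the workflow's actual code: the function name `replace_size_label` and the one-label-per-line input format are assumptions made for the example.

```shell
# replace_size_label NEW_LABEL
# Hypothetical helper: reads the current labels from stdin (one per line),
# drops every existing size/* label, and appends NEW_LABEL at the end.
replace_size_label() {
  new_label=$1
  while IFS= read -r label; do
    case "$label" in
      size/*|"") ;;                      # skip old size labels and blank lines
      *) printf '%s\n' "$label" ;;       # keep every other label unchanged
    esac
  done
  printf '%s\n' "$new_label"
}
```

The resulting list is what a single call to the GitHub REST endpoint `PUT /repos/{owner}/{repo}/issues/{issue_number}/labels` would receive, since that endpoint replaces the full label set and therefore needs the non-size labels passed back in.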
91f4bed to 72fab4f
This was referenced May 27, 2026
SebTardif added a commit that referenced this pull request May 27, 2026
* docs: add Chainsaw stable-state assertion convention to AGENTS.md

Chainsaw tests that assert transient operator states (InsufficientData) race with the reconcile loop when minimumDataPoints is low. Add testing convention to prefer stable state assertions or script-based polls that accept multiple valid states.

Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca>

* fix: stabilize Chainsaw tests and add govulncheck to CI gate

Replace static InsufficientData assertions in observe-mode and opt-out Chainsaw tests with script-based polls that accept either InsufficientData or Monitoring. This prevents the same transient-state race fixed in recommend-mode (PR #92): with minimumDataPoints=1, the operator can transition past InsufficientData before the assert evaluates.

Add govulncheck as a CI gate job so PRs are blocked when known vulnerabilities exist in the dependency tree. The existing govulncheck in security.yaml continues to run on the weekly schedule.

Closes #93
Closes #72
Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca>

---------

Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca>
Problem
The `recommend-mode` Chainsaw test intermittently times out during the `verify-status` assertion step. The static assert waits for `Ready=False, reason=InsufficientData`, but with `minimumDataPoints: 1` and a 15s Prometheus scrape interval, the operator can transition to `Ready=True, reason=Monitoring` within seconds of policy creation. The assert then spends the full 2-minute timeout waiting for a state that has already passed.

From the failing nightly run on K8s v1.35:
Root Cause
The test asserted a transient state (`InsufficientData`) rather than a stable state. With `minimumDataPoints: 1`, the operator needs only one Prometheus scrape to have enough data. The race between "assert polls for InsufficientData" and "operator transitions to Monitoring" is timing-dependent. On 5 consecutive nightly passes (20 K8s-version runs), the timing was favorable. On the next run, it was not.
Fix
- Replace the static assert with a script-based poll that accepts either `InsufficientData` or `Monitoring` as valid states (both prove the policy discovered the workload and the controller reconciled it)
- Add a `verify-no-resizes` step that asserts `workloads.resized: 0`, confirming Recommend mode does not apply changes (stronger test coverage than before)

Closes #87
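For readers who want to see the shape of a multi-state poll, here is a minimal sketch in POSIX shell. It is not the script from this PR: the function names `is_accepted_reason` and `poll_reason`, the attempt count, and the interval are all assumptions chosen for illustration. The command that reads the current reason is passed in as an argument, so the loop itself makes no assumption about the resource kind or name.

```shell
# Hypothetical helper: succeeds when the Ready condition's reason is one of
# the two states the fix treats as valid.
is_accepted_reason() {
  case "$1" in
    InsufficientData|Monitoring) return 0 ;;
    *) return 1 ;;
  esac
}

# poll_reason ATTEMPTS INTERVAL CMD [ARGS...]
# Hypothetical helper: runs CMD repeatedly until it prints an accepted reason
# or the attempts are exhausted. CMD is expected to print the reason on stdout.
poll_reason() {
  attempts=$1
  interval=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # A failing CMD (for example, the resource does not exist yet) counts
    # as "no reason yet" rather than aborting the poll.
    reason=$("$@" 2>/dev/null) || reason=""
    if is_accepted_reason "$reason"; then
      echo "accepted: $reason"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timed out, last reason: ${reason:-<none>}" >&2
  return 1
}
```

In a real test step, the command would typically be a `kubectl get` with a JSONPath filter such as `-o jsonpath='{.status.conditions[?(@.type=="Ready")].reason}'`, with the resource kind, name, and namespace filled in for the policy under test. Because the poll succeeds on either state, it no longer matters whether the first check lands before or after the operator's first Prometheus scrape.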