fix(e2e): resolve recommend-mode Chainsaw intermittent timeout #92

Merged: SebTardif merged 2 commits into main from fix/chainsaw-recommend-mode-flake on May 27, 2026

@SebTardif

Problem

The recommend-mode Chainsaw test intermittently times out during the verify-status assertion step. The static assert waits for Ready=False, reason=InsufficientData, but with minimumDataPoints: 1 and a 15s Prometheus scrape interval, the operator can transition to Ready=True, reason=Monitoring within seconds of policy creation. When that happens before the assert first evaluates, the assert spends the full 2-minute timeout waiting for a state the policy has already left.

From the failing nightly run on K8s v1.35:

status.conditions[0].reason: Invalid value: "Monitoring": Expected value: "InsufficientData"
status.conditions[0].status: Invalid value: "True": Expected value: "False"

Root Cause

The test asserted a transient state (InsufficientData) rather than a stable state. With minimumDataPoints: 1, the operator only needs one Prometheus scrape to have enough data, so whether the assert observes InsufficientData before the operator transitions to Monitoring depends on timing. Across 5 consecutive passing nightlies (20 K8s-version runs), the timing was favorable. On the next run, it was not.
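
To make the race concrete, here is a small deterministic model of it. This is a sketch only: the function names, the one-second poll interval, and the transition times are illustrative assumptions, not values taken from the operator or from Chainsaw.

```python
def reason_at(t, transition_at):
    """Hypothetical model: InsufficientData until transition_at, then Monitoring."""
    return "InsufficientData" if t < transition_at else "Monitoring"


def assert_succeeds(accepted, first_poll, transition_at, timeout=120.0, interval=1.0):
    """Return True if any poll within the timeout observes an accepted reason."""
    t = first_poll
    while t <= timeout:
        if reason_at(t, transition_at) in accepted:
            return True
        t += interval
    return False


# Favorable timing: the first poll lands before the operator transitions.
print(assert_succeeds({"InsufficientData"}, first_poll=1.0, transition_at=5.0))
# Unfavorable timing: the operator transitioned before the first poll,
# so the transient state is never observed and the assert times out.
print(assert_succeeds({"InsufficientData"}, first_poll=3.0, transition_at=2.0))
# Accepting both states removes the dependence on timing.
print(assert_succeeds({"InsufficientData", "Monitoring"}, first_poll=3.0, transition_at=2.0))
```

In this model the single-state assert passes or fails purely on whether the first poll beats the transition, which matches the pattern of several green nightlies followed by one red one.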

Fix

  • Replace the static Chainsaw assert with a script-based poll that accepts either InsufficientData or Monitoring as valid states (both prove the policy discovered the workload and the controller reconciled it)
  • Add a verify-no-resizes step that asserts workloads.resized: 0, confirming Recommend mode does not apply changes (stronger test coverage than before)
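
The logic of the two steps above can be sketched as follows. This is not the script from the PR, which is not shown here. The `fetch_policy` callable, the poll interval, and the `status.workloads.resized` field path are assumptions made for illustration; only the two accepted reasons, the 2-minute timeout, and the `workloads.resized: 0` expectation come from the description.

```python
import time

# Either reason shows that the policy discovered the workload and was reconciled.
ACCEPTED_REASONS = {"InsufficientData", "Monitoring"}


def ready_reason(policy):
    """Return the reason of the Ready condition, or None if it is not set yet."""
    for cond in policy.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond.get("reason")
    return None


def poll_until_accepted(fetch_policy, timeout=120.0, interval=2.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_policy() until the Ready reason is accepted or the timeout expires.

    fetch_policy is a hypothetical callable returning the policy object as a dict.
    """
    deadline = clock() + timeout
    while True:
        reason = ready_reason(fetch_policy())
        if reason in ACCEPTED_REASONS:
            return reason
        if clock() >= deadline:
            raise TimeoutError(f"Ready reason stayed at {reason!r}")
        sleep(interval)


def no_resizes(policy):
    """True when the policy reports zero resized workloads (assumed field path)."""
    return policy.get("status", {}).get("workloads", {}).get("resized") == 0
```

In a real test step, `fetch_policy` would read the live object from the cluster, for example by parsing the JSON output of `kubectl get`. Injecting it, along with `sleep` and `clock`, keeps the polling logic testable without a cluster.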

Closes #87

@github-actions (bot) added labels area/ci (CI/CD workflows), area/e2e (E2E and integration tests), and size/l (250-499 lines changed) on May 27, 2026
Comment thread .github/workflows/pr-size.yaml Fixed
@github-actions (bot) added size/l (250-499 lines changed) and removed size/l (250-499 lines changed) on May 27, 2026
SebTardif added 2 commits May 27, 2026 08:48
The verify-status step asserted Ready=False/InsufficientData, a transient
state the operator passes through before collecting Prometheus data. With
minimumDataPoints=1 and a 15s scrape interval, the operator can transition
to Ready=True/Monitoring within seconds, causing the static assert to
time out waiting for a condition that already passed.

Replace the static assert with a script-based poll that accepts either
InsufficientData or Monitoring as valid states. Add a verify-no-resizes
step to confirm Recommend mode does not apply changes.

Closes #87

Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca>
…labels

The pr-size workflow failed with HTTP 401 on gh pr edit --add-label
because label management uses the Issues API, which needs issues:write
permission (not just pull-requests:write). Also replaced 6 sequential
gh pr edit calls with a single REST API PUT that fetches current labels,
filters out size/* labels, and sets the new one in one call.

Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca>
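
The label computation described in the commit above (fetch current labels, drop every size/* label, set the new one) can be sketched as a pure function. The function name and sample labels are illustrative assumptions, and the workflow's actual script is not shown here. The resulting list is the kind of payload accepted by GitHub's "set labels for an issue" endpoint (`PUT /repos/{owner}/{repo}/issues/{issue_number}/labels`), which replaces the whole label set in one request.

```python
def replace_size_label(current_labels, new_size_label):
    """Drop every size/* label and append the new one, preserving the other labels' order."""
    if not new_size_label.startswith("size/"):
        raise ValueError(f"not a size label: {new_size_label!r}")
    kept = [name for name in current_labels if not name.startswith("size/")]
    return kept + [new_size_label]


# Example: the PR shrank from size/l to size/s after the force-push.
current = ["area/ci", "area/e2e", "size/l"]
payload = {"labels": replace_size_label(current, "size/s")}
print(payload)  # {'labels': ['area/ci', 'area/e2e', 'size/s']}
```

Because the PUT replaces all labels at once, stale size labels and the new one never coexist on the PR, which sequential add and remove calls cannot guarantee.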
@SebTardif force-pushed the fix/chainsaw-recommend-mode-flake branch from 91f4bed to 72fab4f on May 27, 2026 15:48
@github-actions (bot) added size/s (10-49 lines changed) and removed size/l (250-499 lines changed) on May 27, 2026
@SebTardif merged commit c50e2ac into main on May 27, 2026
27 checks passed
@SebTardif deleted the fix/chainsaw-recommend-mode-flake branch on May 27, 2026 15:49
SebTardif added a commit that referenced this pull request May 27, 2026
* docs: add Chainsaw stable-state assertion convention to AGENTS.md

Chainsaw tests that assert transient operator states (InsufficientData)
race with the reconcile loop when minimumDataPoints is low. Add testing
convention to prefer stable state assertions or script-based polls that
accept multiple valid states.

Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca>

* fix: stabilize Chainsaw tests and add govulncheck to CI gate

Replace static InsufficientData assertions in observe-mode and opt-out
Chainsaw tests with script-based polls that accept either InsufficientData
or Monitoring. This prevents the same transient-state race fixed in
recommend-mode (PR #92): with minimumDataPoints=1, the operator can
transition past InsufficientData before the assert evaluates.

Add govulncheck as a CI gate job so PRs are blocked when known
vulnerabilities exist in the dependency tree. The existing govulncheck
in security.yaml continues to run on the weekly schedule.

Closes #93
Closes #72

Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca>

---------

Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca>
Labels

area/ci (CI/CD workflows), area/e2e (E2E and integration tests), size/s (10-49 lines changed)

Development

Successfully merging this pull request may close these issues.

test: Chainsaw recommend-mode intermittent timeout on nightly

2 participants