Context
tests/stress/test-sustained-load.sh currently asserts error_pct <= 55% (down from a previous 30% that deterministically failed on v1.1.9-rc.1). The release-gate job runs with continue-on-error: true, so the assertion is informational, but the 55% number is a fixed ceiling chosen from a tiny sample (worst observed run 53% plus margin).
A real regression that produces a 40% error rate would slip past, because the threshold is set to absorb the worst observed flake.
Followup from PR #148
PR #148 (Fresh-Eyes review, Finding 1) called this out. The original review asked for either tightening the threshold or making it telemetry-only. We kept the assertion plus continue-on-error: true for now because the suite is genuinely flake-prone on shared ARC runners and a hard fail would block release-gate without giving signal.
Proposal
Once 5-10 successful baseline runs have accumulated on the current Helm CPU layout (post-#140 rebalance, post-#132 admin-exempt credential), record their error-rate distribution and switch the assertion to:
error_pct <= median(baseline_error_pct) + 15
If the tests/stress/baseline-sustained-error-pct.txt file is absent, the script should warn loudly and pass (instead of failing with "no baseline" as the original review suggested) so the release-gate stays unblocked while the baseline file is being populated.
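A minimal sketch of how the runtime check could look, assuming the baseline file holds one numeric error percentage per line. The function names baseline_median and check_error_pct, the BASELINE_FILE and DELTA variables, and the file format are illustrative assumptions, not what test-sustained-load.sh contains today.

```shell
#!/usr/bin/env bash
# Sketch only: the file format (one numeric error percentage per line) and
# the names below are assumptions, not the script's current contents.

BASELINE_FILE="${BASELINE_FILE:-tests/stress/baseline-sustained-error-pct.txt}"
DELTA=15  # percentage points added on top of the baseline median

# Print the median of the numeric lines in file $1.
# Non-numeric lines (blank lines, comments) are ignored.
# Fails with a non-zero status if the file has no numeric lines.
baseline_median() {
  grep -E '^[0-9]+(\.[0-9]+)?$' "$1" | sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      if (NR % 2) print v[(NR + 1) / 2]
      else print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

# Return 0 if error_pct ($1) is at most median + DELTA.
# Also return 0, with a loud warning, when no usable baseline exists,
# so the release-gate is not blocked while the file is being populated.
check_error_pct() {
  local error_pct="$1" median
  if [ ! -s "$BASELINE_FILE" ] || ! median="$(baseline_median "$BASELINE_FILE")"; then
    echo "WARNING: no usable baseline at $BASELINE_FILE; skipping error_pct assertion" >&2
    return 0
  fi
  awk -v e="$error_pct" -v m="$median" -v d="$DELTA" \
    'BEGIN { exit !((e + 0) <= (m + d)) }'
}
```

The comparison is delegated to awk because bash arithmetic is integer-only and the error percentages may be fractional. Treating an empty file the same as a missing one is a design choice of this sketch; the proposal itself only specifies the missing-file case.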
Acceptance
- Baseline file checked in with at least 5 successful run values
- Script computes median + delta at runtime, not a hard-coded number
- SUSTAINED_ERROR_PCT_THRESHOLD env override still works for dedicated-runner environments
- TODO comment in test-sustained-load.sh removed when this lands
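One way the env override could coexist with the computed threshold is sketched below. It assumes that an explicitly set SUSTAINED_ERROR_PCT_THRESHOLD takes precedence over the baseline-derived value; that precedence and the helper name resolve_threshold are assumptions, since the criteria above only require that the override keeps working.

```shell
#!/usr/bin/env bash
# Sketch only: resolve_threshold is a hypothetical helper name.
# $1 is the baseline-derived threshold (median + delta).
# Prints the threshold the assertion should use.
resolve_threshold() {
  local computed="$1"
  if [ -n "${SUSTAINED_ERROR_PCT_THRESHOLD:-}" ]; then
    # Explicit override, e.g. a tighter limit on dedicated runners.
    echo "$SUSTAINED_ERROR_PCT_THRESHOLD"
  else
    echo "$computed"
  fi
}
```

For example, running with SUSTAINED_ERROR_PCT_THRESHOLD=20 would make resolve_threshold print 20 regardless of the computed value, while leaving the variable unset falls back to the median-based threshold.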