Problem
`tests/stress/test-sustained-load.sh` fails deterministically on the v1.1.9-rc.1 release-gate runs:
| Run | Total reqs | Successful | Errors (4xx/5xx) | Timeouts | Error rate |
|---|---|---|---|---|---|
| 25265469440 | 10,174 | 4,705 | 5,469 | 0 | 53% (threshold 30%) |
| 25265748133 | 7,371 | 4,552 | 2,819 | 0 | 38% (threshold 30%) |
Both runs recorded zero timeouts, and the backend stays responsive (the recovery check passes in 10s after the load). The throughput variance (169 -> 122 req/s) suggests that ARC runner host load fluctuates between runs.
The "errors but no timeouts" pattern strongly indicates rate-limit hits (429) or DB-pool saturation (503), not capacity collapse.
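The figures in the table can be cross-checked with a few lines of arithmetic. The sketch below recomputes each run's error rate against the 30% threshold. The 60-second load window used for throughput is an assumption (it is not stated in this issue, but it reproduces the reported 169 and 122 req/s when truncated), and the function and variable names are illustrative only.

```python
# Recompute error rate and throughput from the release-gate run totals.
THRESHOLD = 0.30           # failure threshold quoted in the table
ASSUMED_DURATION_S = 60    # assumption: load window length, not stated in the issue

# run id -> (total requests, errors)
RUNS = {
    "25265469440": (10_174, 5_469),
    "25265748133": (7_371, 2_819),
}


def error_rate(total: int, errors: int) -> float:
    """Fraction of requests that returned a 4xx/5xx status."""
    return errors / total


def throughput(total: int, duration_s: int = ASSUMED_DURATION_S) -> float:
    """Average requests per second over the assumed load window."""
    return total / duration_s


for run_id, (total, errors) in RUNS.items():
    rate = error_rate(total, errors)
    verdict = "FAIL" if rate > THRESHOLD else "PASS"
    print(f"{run_id}: {rate:.1%} errors, ~{int(throughput(total))} req/s -> {verdict}")
```

Both runs exceed the threshold by a wide margin, so a small threshold adjustment alone would not turn either run green.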
Two candidate root causes
(A) Test isn't using the rate-limit-exempt admin
The backend now supports `RATE_LIMIT_EXEMPT_USERNAMES` (artifact-keeper#995). The default smoke admin user (`admin`) is configured as exempt in `helm/values-test.yaml`. Verify:
- Is `test-sustained-load.sh` authenticating as `admin` (or another exempt account)?
- If so, is the `X-RateLimit-Exempt: true` header present on its responses?
- If the test does not authenticate as an exempt account, it is a fair test of rate-limiter behavior under burst, but the 30% threshold is then unrealistic.
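The header check in the second bullet could be scripted along these lines. This is a minimal sketch: `is_rate_limit_exempt` is a hypothetical helper, and it assumes the response headers have already been collected into a plain dict. Only the `X-RateLimit-Exempt` header name comes from the checklist above.

```python
def is_rate_limit_exempt(headers: dict) -> bool:
    """Return True if the response carries `X-RateLimit-Exempt: true`.

    HTTP header names are case-insensitive, so names are lowercased
    before the lookup; the value is compared case-insensitively too.
    """
    normalized = {
        name.lower(): value.strip().lower() for name, value in headers.items()
    }
    return normalized.get("x-ratelimit-exempt") == "true"


# Hypothetical sample headers, for illustration only.
sample = {"Content-Type": "application/json", "X-RateLimit-Exempt": "true"}
print(is_rate_limit_exempt(sample))
```

If the header is absent on responses to the test's credential, candidate (A) becomes the more likely explanation.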
(B) Threshold is too aggressive for shared runner capacity
Even with rate-limit exemption, sustained 169 req/s × 5 workers across backend + postgres + meilisearch + scanners on a 4-CPU/8-Gi namespace quota is tight. ARC runner host CPU contention with other pods on rocky K8s adds variance.
Action items
- Inspect `tests/stress/test-sustained-load.sh` to confirm what credential it uses
- Capture a sample failed-response body (likely 429 with rate-limit headers)
- Decide: fix the test auth, raise the threshold, or both
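To make the decision in the last item less of a guess, failed responses can be tallied by status code. The following is a sketch only: it assumes the status codes have been extracted from the test's output into a list, and the helper name and sample data are hypothetical. The 429 and 503 interpretations follow the hypothesis stated under Problem.

```python
from collections import Counter
from typing import Optional, Tuple

# Interpretation of the two statuses hypothesized in this issue.
LIKELY_CAUSE = {
    429: "rate limiting",
    503: "DB-pool saturation",
}


def dominant_failure(status_codes: list) -> Optional[Tuple[int, str]]:
    """Return the most frequent 4xx/5xx status and its likely cause.

    Returns None when the sample contains no failed responses.
    """
    failures = Counter(code for code in status_codes if code >= 400)
    if not failures:
        return None
    code, _count = failures.most_common(1)[0]
    return code, LIKELY_CAUSE.get(code, "unclassified: inspect the response body")


# Hypothetical sample, for illustration only.
print(dominant_failure([200, 429, 429, 200, 503, 429]))
```

A sample dominated by 429 points toward fixing the test auth, while a sample dominated by 503 points toward capacity and the threshold.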
Workaround in place
The release-gate workflow already sets `continue-on-error: true` on `stress-tests`, so this does not block v1.1.9-rc.1 tagging. See artifact-keeper-test#... for the parallel security-tests workaround.
Related
- artifact-keeper#886 (v1.1.9 release coordination)
- artifact-keeper#991 (v1.1.x auth-path perf investigation)
- artifact-keeper#995 (RATE_LIMIT_EXEMPT_USERNAMES support)
- artifact-keeper#1001 (companion Grype DB-seeding issue)