Fix flaky failure e2e test: increase training duration for reliable polling#148
Conversation
…olling The failing-test-runtime training ran only 3 seconds (15 steps × 0.2s), which was too short for the controller's 2s poll interval to capture progress > 0 before the job crashed. Increased per-step sleep to 0.5s (~8s total) so the controller has 3-4 poll cycles to capture progress. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📝 WalkthroughWalkthroughA test configuration file was modified to adjust timing parameters for a failing workload test. The per-step sleep duration increased from 0.2 seconds to 0.5 seconds, and the associated Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes Security findingsNo security concerns identified in this diff. This is a test configuration update with timing adjustments only. 🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/rhai/resources/failing-test-runtime.yaml`:
- Around line 69-71: The test's timing assumes a 2s poll interval but the
controller clamps polling to 5s in pkg/rhai/progression/progression.go (the
polling clamp logic around lines 223-240), so update the
failing-test-runtime.yaml scenario to ensure the job runs long enough for
multiple 5s polls: increase the runtime window by raising fail_at_step (e.g.,
from 15 to a value that yields >15s total runtime) or otherwise lengthen the
step duration so the run lasts long enough for several 5s polls; change the
failing-test-runtime.yaml values that set fail_at_step and per-step duration to
reflect the clamped 5s poll interval rather than 2s.
- Line 79: The remaining-time calc overstates time because it always uses
total_steps; update the computation that assigns remaining so it uses the
failure-bound end (use fail_at_step if present/less than total_steps) as the
effective total: compute effective_total = min(total_steps, fail_at_step) (or
fall back to total_steps), then set remaining = max(0, int((effective_total -
step) * 0.5)) so estimatedRemainingSeconds reflects the intentional early exit;
change the line that defines remaining (and any uses of
estimatedRemainingSeconds) accordingly to avoid negative values.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 6003c4dc-959f-4019-a75e-c2ca1cc7fcbc
📒 Files selected for processing (1)
test/e2e/rhai/resources/failing-test-runtime.yaml
| # Training that will fail at 50% (~8 seconds total) | ||
| # Must run long enough for controller to poll progress > 0 | ||
| # with a 2s poll interval (at least 3-4 poll cycles). |
There was a problem hiding this comment.
Fix poll-interval assumption and extend runtime window to match controller behavior.
Line 71 assumes a 2s poll interval, but effective polling is clamped to 5s in pkg/rhai/progression/progression.go (Lines 223-240). With Line 77 at 0.5s and fail_at_step=15, the run is ~7.5s, which can still miss enough polls and remain flaky.
Proposed patch
- # Must run long enough for controller to poll progress > 0
- # with a 2s poll interval (at least 3-4 poll cycles).
+ # Must run long enough for controller to poll progress > 0.
+ # Metrics poll interval is clamped to a minimum of 5s,
+ # so keep runtime >=10s to allow at least 2 poll cycles.
...
- time.sleep(0.5) # 0.5s per step
+ time.sleep(0.8) # 0.8s per step (~12s total before failure)
...
- remaining = int((total_steps - step) * 0.5)
+ remaining = int((total_steps - step) * 0.8)As per coding guidelines, this falls under “Bug-prone patterns and error handling gaps” in the ** review priorities.
Also applies to: 77-77
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/rhai/resources/failing-test-runtime.yaml` around lines 69 - 71, The
test's timing assumes a 2s poll interval but the controller clamps polling to 5s
in pkg/rhai/progression/progression.go (the polling clamp logic around lines
223-240), so update the failing-test-runtime.yaml scenario to ensure the job
runs long enough for multiple 5s polls: increase the runtime window by raising
fail_at_step (e.g., from 15 to a value that yields >15s total runtime) or
otherwise lengthen the step duration so the run lasts long enough for several 5s
polls; change the failing-test-runtime.yaml values that set fail_at_step and
per-step duration to reflect the clamped 5s poll interval rather than 2s.
There was a problem hiding this comment.
@sutaakar This seems legit
Actually controller enforces minimum bound :
trainer/pkg/rhai/progression/progression.go
Line 247 in c2c1ebc
There was a problem hiding this comment.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
| time.sleep(0.5) # 0.5s per step | ||
| progress = int((step / total_steps) * 100) | ||
| remaining = int((total_steps - step) * 0.2) | ||
| remaining = int((total_steps - step) * 0.5) |
There was a problem hiding this comment.
Use failure-bound remaining time for this intentionally failing runtime.
Line 79 calculates estimatedRemainingSeconds against total_steps, but this job exits at fail_at_step (50%). The reported remaining time is overstated for this scenario and can mislead progression status checks.
Proposed patch
- remaining = int((total_steps - step) * 0.8)
+ remaining = int(max(fail_at_step - (step + 1), 0) * 0.8)As per coding guidelines, this falls under “Bug-prone patterns and error handling gaps” in the ** review priorities.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/rhai/resources/failing-test-runtime.yaml` at line 79, The
remaining-time calc overstates time because it always uses total_steps; update
the computation that assigns remaining so it uses the failure-bound end (use
fail_at_step if present/less than total_steps) as the effective total: compute
effective_total = min(total_steps, fail_at_step) (or fall back to total_steps),
then set remaining = max(0, int((effective_total - step) * 0.5)) so
estimatedRemainingSeconds reflects the intentional early exit; change the line
that defines remaining (and any uses of estimatedRemainingSeconds) accordingly
to avoid negative values.
|
@sutaakar: The following test has Succeeded: OCI Artifact Browser URLInspecting Test Artifacts ManuallyTo inspect your test artifacts manually, follow these steps:
mkdir -p oras-artifacts
cd oras-artifacts
oras pull quay.io/opendatahub/odh-ci-artifacts:odh-pr-test-trainer-gwl7c |
Summary
"should capture final status even when job fails"e2e test was flaky because the failing training runtime ran only ~3 seconds (15 steps × 0.2s), which was too short for the controller's 2s poll interval to captureprogressPercentage > 0before the job crashed.Test plan
failing-test-runtime.yaml🤖 Generated with Claude Code
Summary by CodeRabbit