
Fix flaky failure e2e test: increase training duration for reliable polling#148

Merged
abhijeet-dhumal merged 1 commit into opendatahub-io:main from sutaakar:progress-fix
Apr 13, 2026

Conversation

Collaborator

@sutaakar sutaakar commented Apr 13, 2026

Summary

  • The "should capture final status even when job fails" e2e test was flaky because the failing training runtime ran only ~3 seconds (15 steps × 0.2s), which was too short for the controller's 2s poll interval to capture progressPercentage > 0 before the job crashed.
  • Increased per-step sleep from 0.2s to 0.5s (~8s total training), giving the controller 3-4 poll cycles to reliably capture progress > 0.
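
The arithmetic behind the fix can be sketched as below. This is an illustrative calculation only, not code from the repository; the function name and parameters are invented for the example.

```python
def poll_cycles(total_steps: int, step_sleep_s: float, poll_interval_s: float) -> int:
    """How many controller poll ticks fit inside the training run."""
    run_duration_s = total_steps * step_sleep_s
    return int(run_duration_s // poll_interval_s)

# Old config: 15 steps x 0.2s = ~3s run; with a 2s poll interval at most
# one poll lands before the job crashes, so progress > 0 is often missed.
assert poll_cycles(15, 0.2, 2.0) == 1
# New config: 15 steps x 0.5s = ~7.5s run; three polls fit, so at least
# one reliably observes progressPercentage > 0.
assert poll_cycles(15, 0.5, 2.0) == 3
```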

Test plan

  • Verified the only change is the step duration in failing-test-runtime.yaml
  • RHAI progression e2e test suite passes with the updated runtime

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
    • Updated end-to-end test configuration to improve reliability of workload progress observation by adjusting timing parameters.

Fix flaky failure e2e test: increase training duration for reliable polling

The failing-test-runtime training ran only 3 seconds (15 steps × 0.2s),
which was too short for the controller's 2s poll interval to capture
progress > 0 before the job crashed. Increased per-step sleep to 0.5s
(~8s total) so the controller has 3-4 poll cycles to capture progress.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai Bot commented Apr 13, 2026

📝 Walkthrough

A test configuration file was modified to adjust timing parameters for a failing workload test. The per-step sleep duration increased from 0.2 seconds to 0.5 seconds, and the associated estimatedRemainingSeconds scale factor was updated from * 0.2 to * 0.5 to maintain consistency. Comments were updated to document the revised approximate total duration (~8 seconds) and clarify that the workload runs long enough for the controller to observe progress across multiple polling intervals. No logic changes, failure conditions, or endpoint behaviors were altered.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Security findings

No security concerns identified in this diff. This is a test configuration update with timing adjustments only.

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title accurately describes the main change: increasing training duration in a failing e2e test to fix flakiness caused by insufficient polling observation time.
  • Description check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/rhai/resources/failing-test-runtime.yaml`:
- Around line 69-71: The test's timing assumes a 2s poll interval but the
controller clamps polling to 5s in pkg/rhai/progression/progression.go (the
polling clamp logic around lines 223-240), so update the
failing-test-runtime.yaml scenario to ensure the job runs long enough for
multiple 5s polls: increase the runtime window by raising fail_at_step (e.g.,
from 15 to a value that yields >15s total runtime) or otherwise lengthen the
step duration so the run lasts long enough for several 5s polls; change the
failing-test-runtime.yaml values that set fail_at_step and per-step duration to
reflect the clamped 5s poll interval rather than 2s.
- Line 79: The remaining-time calc overstates time because it always uses
total_steps; update the computation that assigns remaining so it uses the
failure-bound end (use fail_at_step if present/less than total_steps) as the
effective total: compute effective_total = min(total_steps, fail_at_step) (or
fall back to total_steps), then set remaining = max(0, int((effective_total -
step) * 0.5)) so estimatedRemainingSeconds reflects the intentional early exit;
change the line that defines remaining (and any uses of
estimatedRemainingSeconds) accordingly to avoid negative values.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 6003c4dc-959f-4019-a75e-c2ca1cc7fcbc

📥 Commits

Reviewing files that changed from the base of the PR and between c2c1ebc and 5a956c3.

📒 Files selected for processing (1)
  • test/e2e/rhai/resources/failing-test-runtime.yaml

Comment on lines +69 to +71
# Training that will fail at 50% (~8 seconds total)
# Must run long enough for controller to poll progress > 0
# with a 2s poll interval (at least 3-4 poll cycles).

@coderabbitai coderabbitai Bot Apr 13, 2026


⚠️ Potential issue | 🟠 Major

Fix poll-interval assumption and extend runtime window to match controller behavior.

Line 71 assumes a 2s poll interval, but effective polling is clamped to 5s in pkg/rhai/progression/progression.go (Lines 223-240). With Line 77 at 0.5s and fail_at_step=15, the run is ~7.5s, which can still miss enough polls and remain flaky.

Proposed patch
-                          # Must run long enough for controller to poll progress > 0
-                          # with a 2s poll interval (at least 3-4 poll cycles).
+                          # Must run long enough for controller to poll progress > 0.
+                          # Metrics poll interval is clamped to a minimum of 5s,
+                          # so keep runtime >=10s to allow at least 2 poll cycles.
...
-                              time.sleep(0.5)  # 0.5s per step
+                              time.sleep(0.8)  # 0.8s per step (~12s total before failure)
...
-                              remaining = int((total_steps - step) * 0.5)
+                              remaining = int((total_steps - step) * 0.8)

As per coding guidelines, this falls under "Bug-prone patterns and error handling gaps" in the review priorities.

Also applies to: 77-77
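
The clamping behavior the reviewer describes can be sketched as follows. This is a minimal illustration of the min/max bound quoted later in the thread ("5s - 300s"); the actual logic lives in pkg/rhai/progression/progression.go and the function name here is invented.

```python
# Assumed bounds, per the quoted controller comment: "Enforce min/max bounds (5s - 300s)"
MIN_POLL_S, MAX_POLL_S = 5.0, 300.0

def clamp_poll_interval(requested_s: float) -> float:
    """Clamp a requested poll interval into the controller's allowed range."""
    return max(MIN_POLL_S, min(requested_s, MAX_POLL_S))

# A 2s request is raised to 5s, which is why a ~7.5s run may still
# see only one effective poll before the job fails.
assert clamp_poll_interval(2.0) == 5.0
assert clamp_poll_interval(600.0) == 300.0
```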


Member


@sutaakar This seems legit.
Actually, the controller enforces a minimum bound:

// Enforce min/max bounds (5s - 300s)


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

  time.sleep(0.5)  # 0.5s per step
  progress = int((step / total_steps) * 100)
- remaining = int((total_steps - step) * 0.2)
+ remaining = int((total_steps - step) * 0.5)

⚠️ Potential issue | 🟡 Minor

Use failure-bound remaining time for this intentionally failing runtime.

Line 79 calculates estimatedRemainingSeconds against total_steps, but this job exits at fail_at_step (50%). The reported remaining time is overstated for this scenario and can mislead progression status checks.

Proposed patch
-                              remaining = int((total_steps - step) * 0.8)
+                              remaining = int(max(fail_at_step - (step + 1), 0) * 0.8)

As per coding guidelines, this falls under "Bug-prone patterns and error handling gaps" in the review priorities.
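
The failure-bound calculation the reviewer proposes can be sketched like this. The function and parameter names are illustrative, not taken from the runtime YAML; the step counts below are example values.

```python
def remaining_seconds(step, total_steps, step_sleep_s, fail_at_step=None):
    """Estimate remaining training time, bounded by an intentional early exit."""
    # Use the failure step as the effective end when the job is built to crash early.
    effective_total = total_steps if fail_at_step is None else min(total_steps, fail_at_step)
    return max(0, int((effective_total - step) * step_sleep_s))

# Example: a job that exits at step 15 of 30, sampled at step 10.
assert int((30 - 10) * 0.5) == 10           # naive calc counts steps that never run
assert remaining_seconds(10, 30, 0.5, 15) == 2   # failure-bound: (15 - 10) * 0.5
assert remaining_seconds(20, 30, 0.5, 15) == 0   # max(0, ...) avoids negative values
```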


Member

@abhijeet-dhumal abhijeet-dhumal left a comment


Thanks Karel 🙌

@abhijeet-dhumal abhijeet-dhumal merged commit 916e0df into opendatahub-io:main Apr 13, 2026
7 checks passed
@sutaakar sutaakar deleted the progress-fix branch April 13, 2026 10:24
@rhods-ci-bot

@sutaakar: The following test has Succeeded:

OCI Artifact Browser URL


Inspecting Test Artifacts Manually

To inspect your test artifacts manually, follow these steps:

  1. Install ORAS (see the ORAS installation guide).
  2. Download artifacts with the following commands:
mkdir -p oras-artifacts
cd oras-artifacts
oras pull quay.io/opendatahub/odh-ci-artifacts:odh-pr-test-trainer-gwl7c



3 participants