
Fix flaky failure e2e test: increase training duration for reliable polling#148

Merged
abhijeet-dhumal merged 1 commit into opendatahub-io:main from sutaakar:progress-fix
Apr 13, 2026

Conversation

Collaborator

@sutaakar sutaakar commented Apr 13, 2026

Summary

  • The "should capture final status even when job fails" e2e test was flaky because the failing training runtime ran only ~3 seconds (15 steps × 0.2s), which was too short for the controller's 2s poll interval to capture progressPercentage > 0 before the job crashed.
  • Increased per-step sleep from 0.2s to 0.5s (~8s total training), giving the controller 3-4 poll cycles to reliably capture progress > 0.
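
The arithmetic behind the fix can be sketched as below. This is an illustrative calculation only, not code from the repository; the function name and parameters are invented for the example.

```python
def poll_cycles(total_steps: int, step_sleep_s: float, poll_interval_s: float) -> int:
    """How many controller poll ticks fit inside the training run."""
    run_duration_s = total_steps * step_sleep_s
    return int(run_duration_s // poll_interval_s)

# Old config: 15 steps x 0.2s = ~3s run; with a 2s poll interval at most
# one poll lands before the job crashes, so progress > 0 is often missed.
assert poll_cycles(15, 0.2, 2.0) == 1
# New config: 15 steps x 0.5s = ~7.5s run; three polls fit, so at least
# one reliably observes progressPercentage > 0.
assert poll_cycles(15, 0.5, 2.0) == 3
```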

Test plan

  • Verified the only change is the step duration in failing-test-runtime.yaml
  • RHAI progression e2e test suite passes with the updated runtime

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
    • Updated end-to-end test configuration to improve reliability of workload progress observation by adjusting timing parameters.

Fix flaky failure e2e test: increase training duration for reliable polling

The failing-test-runtime training ran only 3 seconds (15 steps × 0.2s),
which was too short for the controller's 2s poll interval to capture
progress > 0 before the job crashed. Increased per-step sleep to 0.5s
(~8s total) so the controller has 3-4 poll cycles to capture progress.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai Bot commented Apr 13, 2026

📝 Walkthrough

A test configuration file was modified to adjust timing parameters for a failing workload test. The per-step sleep duration increased from 0.2 seconds to 0.5 seconds, and the associated estimatedRemainingSeconds scale factor was updated from * 0.2 to * 0.5 to maintain consistency. Comments were updated to document the revised approximate total duration (~8 seconds) and clarify that the workload runs long enough for the controller to observe progress across multiple polling intervals. No logic changes, failure conditions, or endpoint behaviors were altered.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Security findings

No security concerns identified in this diff. This is a test configuration update with timing adjustments only.

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title accurately describes the main change: increasing training duration in a failing e2e test to fix flakiness caused by insufficient polling observation time.
  • Description check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/rhai/resources/failing-test-runtime.yaml`:
- Around line 69-71: The test's timing assumes a 2s poll interval but the
controller clamps polling to 5s in pkg/rhai/progression/progression.go (the
polling clamp logic around lines 223-240), so update the
failing-test-runtime.yaml scenario to ensure the job runs long enough for
multiple 5s polls: increase the runtime window by raising fail_at_step (e.g.,
from 15 to a value that yields >15s total runtime) or otherwise lengthen the
step duration so the run lasts long enough for several 5s polls; change the
failing-test-runtime.yaml values that set fail_at_step and per-step duration to
reflect the clamped 5s poll interval rather than 2s.
- Line 79: The remaining-time calc overstates time because it always uses
total_steps; update the computation that assigns remaining so it uses the
failure-bound end (use fail_at_step if present/less than total_steps) as the
effective total: compute effective_total = min(total_steps, fail_at_step) (or
fall back to total_steps), then set remaining = max(0, int((effective_total -
step) * 0.5)) so estimatedRemainingSeconds reflects the intentional early exit;
change the line that defines remaining (and any uses of
estimatedRemainingSeconds) accordingly to avoid negative values.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 6003c4dc-959f-4019-a75e-c2ca1cc7fcbc

📥 Commits

Reviewing files that changed from the base of the PR and between c2c1ebc and 5a956c3.

📒 Files selected for processing (1)
  • test/e2e/rhai/resources/failing-test-runtime.yaml

Comment on lines +69 to +71
# Training that will fail at 50% (~8 seconds total)
# Must run long enough for controller to poll progress > 0
# with a 2s poll interval (at least 3-4 poll cycles).

@coderabbitai coderabbitai Bot Apr 13, 2026


⚠️ Potential issue | 🟠 Major

Fix poll-interval assumption and extend runtime window to match controller behavior.

Line 71 assumes a 2s poll interval, but effective polling is clamped to 5s in pkg/rhai/progression/progression.go (Lines 223-240). With Line 77 at 0.5s and fail_at_step=15, the run is ~7.5s, which can still miss enough polls and remain flaky.

Proposed patch
-                          # Must run long enough for controller to poll progress > 0
-                          # with a 2s poll interval (at least 3-4 poll cycles).
+                          # Must run long enough for controller to poll progress > 0.
+                          # Metrics poll interval is clamped to a minimum of 5s,
+                          # so keep runtime >=10s to allow at least 2 poll cycles.
...
-                              time.sleep(0.5)  # 0.5s per step
+                              time.sleep(0.8)  # 0.8s per step (~12s total before failure)
...
-                              remaining = int((total_steps - step) * 0.5)
+                              remaining = int((total_steps - step) * 0.8)

As per coding guidelines, this falls under "Bug-prone patterns and error handling gaps" in the review priorities.

Also applies to: 77-77
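
The clamping behavior the reviewer describes can be sketched as follows. This is a minimal illustration of the min/max bound quoted later in the thread ("5s - 300s"); the actual logic lives in pkg/rhai/progression/progression.go and the function name here is invented.

```python
# Assumed bounds, per the quoted controller comment: "Enforce min/max bounds (5s - 300s)"
MIN_POLL_S, MAX_POLL_S = 5.0, 300.0

def clamp_poll_interval(requested_s: float) -> float:
    """Clamp a requested poll interval into the controller's allowed range."""
    return max(MIN_POLL_S, min(requested_s, MAX_POLL_S))

# A 2s request is raised to 5s, which is why a ~7.5s run may still
# see only one effective poll before the job fails.
assert clamp_poll_interval(2.0) == 5.0
assert clamp_poll_interval(600.0) == 300.0
```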


Member


@sutaakar This seems legit.
Actually, the controller enforces a minimum bound:

// Enforce min/max bounds (5s - 300s)


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

  time.sleep(0.5)  # 0.5s per step
  progress = int((step / total_steps) * 100)
- remaining = int((total_steps - step) * 0.2)
+ remaining = int((total_steps - step) * 0.5)

⚠️ Potential issue | 🟡 Minor

Use failure-bound remaining time for this intentionally failing runtime.

Line 79 calculates estimatedRemainingSeconds against total_steps, but this job exits at fail_at_step (50%). The reported remaining time is overstated for this scenario and can mislead progression status checks.

Proposed patch
-                              remaining = int((total_steps - step) * 0.8)
+                              remaining = int(max(fail_at_step - (step + 1), 0) * 0.8)

As per coding guidelines, this falls under "Bug-prone patterns and error handling gaps" in the review priorities.
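
The failure-bound calculation the reviewer proposes can be sketched like this. The function and parameter names are illustrative, not taken from the runtime YAML; the step counts below are example values.

```python
def remaining_seconds(step, total_steps, step_sleep_s, fail_at_step=None):
    """Estimate remaining training time, bounded by an intentional early exit."""
    # Use the failure step as the effective end when the job is built to crash early.
    effective_total = total_steps if fail_at_step is None else min(total_steps, fail_at_step)
    return max(0, int((effective_total - step) * step_sleep_s))

# Example: a job that exits at step 15 of 30, sampled at step 10.
assert int((30 - 10) * 0.5) == 10           # naive calc counts steps that never run
assert remaining_seconds(10, 30, 0.5, 15) == 2   # failure-bound: (15 - 10) * 0.5
assert remaining_seconds(20, 30, 0.5, 15) == 0   # max(0, ...) avoids negative values
```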


Member

@abhijeet-dhumal abhijeet-dhumal left a comment


Thanks Karel 🙌

@abhijeet-dhumal abhijeet-dhumal merged commit 916e0df into opendatahub-io:main Apr 13, 2026
7 checks passed
@sutaakar sutaakar deleted the progress-fix branch April 13, 2026 10:24
@rhods-ci-bot

@sutaakar: The following test has Succeeded:

OCI Artifact Browser URL


Inspecting Test Artifacts Manually

To inspect your test artifacts manually, follow these steps:

  1. Install ORAS (see the ORAS installation guide).
  2. Download artifacts with the following commands:
mkdir -p oras-artifacts
cd oras-artifacts
oras pull quay.io/opendatahub/odh-ci-artifacts:odh-pr-test-trainer-gwl7c



3 participants