Merged
8 changes: 5 additions & 3 deletions test/e2e/rhai/resources/failing-test-runtime.yaml
```diff
@@ -66,15 +66,17 @@ spec:
                           # Wait briefly for server to be ready
                           time.sleep(1)

-                          # Fast training that will fail at 50% (3 seconds total)
+                          # Training that will fail at 50% (~8 seconds total)
+                          # Must run long enough for controller to poll progress > 0
+                          # with a 2s poll interval (at least 3-4 poll cycles).
```
Comment on lines +69 to +71

@coderabbitai (Bot) Apr 13, 2026


⚠️ Potential issue | 🟠 Major

Fix poll-interval assumption and extend runtime window to match controller behavior.

Line 71 assumes a 2s poll interval, but effective polling is clamped to 5s in pkg/rhai/progression/progression.go (Lines 223-240). With Line 77 at 0.5s and fail_at_step=15, the run is ~7.5s, which can still miss enough polls and remain flaky.

Proposed patch

```diff
-                          # Must run long enough for controller to poll progress > 0
-                          # with a 2s poll interval (at least 3-4 poll cycles).
+                          # Must run long enough for controller to poll progress > 0.
+                          # Metrics poll interval is clamped to a minimum of 5s,
+                          # so keep runtime >=10s to allow at least 2 poll cycles.
...
-                              time.sleep(0.5)  # 0.5s per step
+                              time.sleep(0.8)  # 0.8s per step (~12s total before failure)
...
-                              remaining = int((total_steps - step) * 0.5)
+                              remaining = int((total_steps - step) * 0.8)
```

As per coding guidelines, this falls under “Bug-prone patterns and error handling gaps” in the review priorities.

Also applies to: 77-77

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/rhai/resources/failing-test-runtime.yaml` around lines 69-71: the
test's timing assumes a 2s poll interval but the controller clamps polling to 5s
in pkg/rhai/progression/progression.go (the polling clamp logic around lines
223-240), so update the failing-test-runtime.yaml scenario to ensure the job
runs long enough for multiple 5s polls: increase the runtime window by raising
fail_at_step (e.g., from 15 to a value that yields >15s total runtime) or
otherwise lengthen the step duration so the run lasts long enough for several 5s
polls; change the failing-test-runtime.yaml values that set fail_at_step and
per-step duration to reflect the clamped 5s poll interval rather than 2s.
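The timing claim in the comment above can be checked with quick arithmetic. This is a hypothetical back-of-envelope sketch, not project code; the step durations and `fail_at_step` value are taken from the diff, and the 5s floor is the clamp the comment attributes to the controller:

```python
# Effective poll interval after the controller's clamp (per the review comment).
POLL_INTERVAL = 5.0

def polls_before_failure(step_seconds, fail_at_step, poll_interval=POLL_INTERVAL):
    """Count how many full poll cycles fit into the runtime before the job fails."""
    runtime = step_seconds * fail_at_step
    return int(runtime // poll_interval)

# Current values:  0.5s/step * 15 steps = 7.5s  -> 1 poll cycle (flaky)
# Proposed values: 0.8s/step * 15 steps = 12.0s -> 2 poll cycles
print(polls_before_failure(0.5, 15))  # 1
print(polls_before_failure(0.8, 15))  # 2
```

This is why the comment recommends keeping the runtime at or above roughly two poll intervals rather than sizing it against the assumed 2s interval.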

Member


@sutaakar This seems legit.
Actually, the controller enforces a minimum bound:

// Enforce min/max bounds (5s - 300s)
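A minimal sketch of the clamping behavior quoted above. The actual controller code is Go (in pkg/rhai/progression/progression.go); this Python version is an illustration only, with the 5s/300s bounds taken from the quoted comment:

```python
POLL_MIN_SECONDS = 5    # lower bound from the quoted comment
POLL_MAX_SECONDS = 300  # upper bound from the quoted comment

def clamp_poll_interval(requested_seconds):
    """Clamp a requested poll interval into the [5s, 300s] window."""
    return max(POLL_MIN_SECONDS, min(POLL_MAX_SECONDS, requested_seconds))

# A 2s request is raised to 5s, which is why the test's 2s assumption is wrong.
print(clamp_poll_interval(2))  # 5
```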


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

```diff
                           print("Starting training that will fail...")
                           total_steps = 30
                           fail_at_step = 15  # Fail at 50%

                           for step in range(fail_at_step):
-                              time.sleep(0.2)  # 0.2s per step
+                              time.sleep(0.5)  # 0.5s per step
                               progress = int((step / total_steps) * 100)
-                              remaining = int((total_steps - step) * 0.2)
+                              remaining = int((total_steps - step) * 0.5)
```

⚠️ Potential issue | 🟡 Minor

Use failure-bound remaining time for this intentionally failing runtime.

Line 79 calculates estimatedRemainingSeconds against total_steps, but this job exits at fail_at_step (50%). The reported remaining time is overstated for this scenario and can mislead progression status checks.

Proposed patch

```diff
-                              remaining = int((total_steps - step) * 0.8)
+                              remaining = int(max(fail_at_step - (step + 1), 0) * 0.8)
```

As per coding guidelines, this falls under “Bug-prone patterns and error handling gaps” in the review priorities.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/rhai/resources/failing-test-runtime.yaml` at line 79: the
remaining-time calc overstates time because it always uses total_steps; update
the computation that assigns remaining so it uses the failure-bound end (use
fail_at_step if present/less than total_steps) as the effective total: compute
effective_total = min(total_steps, fail_at_step) (or fall back to total_steps),
then set remaining = max(0, int((effective_total - step) * 0.5)) so
estimatedRemainingSeconds reflects the intentional early exit; change the line
that defines remaining (and any uses of estimatedRemainingSeconds) accordingly
to avoid negative values.
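To illustrate the overstatement this comment describes, here is a hypothetical side-by-side of the two estimates. The constants come from the merged diff (30 total steps, failure at step 15, 0.5s per step); the bounded variant follows the patch proposed above:

```python
TOTAL_STEPS = 30
FAIL_AT_STEP = 15
STEP_SECONDS = 0.5

def remaining_naive(step):
    """Merged calculation: assumes the job runs all 30 steps."""
    return int((TOTAL_STEPS - step) * STEP_SECONDS)

def remaining_bounded(step):
    """Failure-bound calculation: the job actually exits at fail_at_step."""
    return int(max(FAIL_AT_STEP - (step + 1), 0) * STEP_SECONDS)

# At step 0 the naive estimate reports 15s remaining,
# but the job fails roughly 7.5s in.
print(remaining_naive(0))    # 15
print(remaining_bounded(0))  # 7
```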


MetricsHandler.progress_data = {
"progressPercentage": progress,