Fix flaky failure e2e test: increase training duration for reliable polling #148
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `@@ -66,15 +66,17 @@ spec:` |
| 66 | 66 | `# Wait briefly for server to be ready` |
| 67 | 67 | `time.sleep(1)` |
| 68 | 68 | |
| 69 | | `- # Fast training that will fail at 50% (3 seconds total)` |
| | 69 | `+ # Training that will fail at 50% (~8 seconds total)` |
| | 70 | `+ # Must run long enough for controller to poll progress > 0` |
| | 71 | `+ # with a 2s poll interval (at least 3-4 poll cycles).` |
| 70 | 72 | `print("Starting training that will fail...")` |
| 71 | 73 | `total_steps = 30` |
| 72 | 74 | `fail_at_step = 15  # Fail at 50%` |
| 73 | 75 | |
| 74 | 76 | `for step in range(fail_at_step):` |
| 75 | | `-     time.sleep(0.2)  # 0.2s per step` |
| | 77 | `+     time.sleep(0.5)  # 0.5s per step` |
| 76 | 78 | `    progress = int((step / total_steps) * 100)` |
| 77 | | `-     remaining = int((total_steps - step) * 0.2)` |
| | 79 | `+     remaining = int((total_steps - step) * 0.5)` |
Use failure-bound remaining time for this intentionally failing runtime. Line 79 calculates `remaining` from `total_steps`, although the loop stops at `fail_at_step`.

Proposed patch:

    - remaining = int((total_steps - step) * 0.8)
    + remaining = int(max(fail_at_step - (step + 1), 0) * 0.8)

As per coding guidelines, this falls under “Bug-prone patterns and error handling gaps” in the review priorities.
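To make the effect of the proposed patch concrete, here is a minimal sketch comparing the two estimates. The helper name `remaining_seconds` is hypothetical and not part of the test script; the values `total_steps = 30`, `fail_at_step = 15`, and `0.8` seconds per step are taken from the diff and the patch above.

```python
def remaining_seconds(step: int, stop_step: int, seconds_per_step: float) -> int:
    """Whole seconds left after 0-based `step` finishes, bounded at zero.

    `stop_step` is the step at which the run actually ends, which for an
    intentionally failing run is fail_at_step rather than total_steps.
    """
    steps_left = max(stop_step - (step + 1), 0)
    return int(steps_left * seconds_per_step)


total_steps = 30
fail_at_step = 15
seconds_per_step = 0.8  # value used in the proposed patch

last_step = fail_at_step - 1  # final iteration of range(fail_at_step)

# Estimate based on total_steps: still reports time left at the last step.
naive_last = int((total_steps - last_step) * seconds_per_step)

# Estimate based on fail_at_step: reaches zero at the last step.
bounded_last = remaining_seconds(last_step, fail_at_step, seconds_per_step)

print(naive_last, bounded_last)
```

With the `total_steps` formula the run fails while still reporting several seconds remaining, whereas the bounded form counts down to zero at the step where the failure is triggered.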
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 78 | 80 | |
| 79 | 81 | `    MetricsHandler.progress_data = {` |
| 80 | 82 | `        "progressPercentage": progress,` |
Fix poll-interval assumption and extend runtime window to match controller behavior.

Line 71 assumes a 2s poll interval, but effective polling is clamped to 5s in `pkg/rhai/progression/progression.go` (Lines 223-240). With Line 77 at `0.5s` and `fail_at_step=15`, the run is ~7.5s, which can still miss enough polls and remain flaky.

As per coding guidelines, this falls under “Bug-prone patterns and error handling gaps” in the review priorities.

Also applies to: 77-77
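The timing argument above can be checked with a rough model. This is a sketch under stated assumptions, not the controller's actual logic: it assumes polls arrive at a fixed interval with unknown phase, that nonzero progress is only visible from the first step whose percentage rounds above 0 until the loop ends at `fail_at_step`, and that the requested interval is clamped with a simple `max`. The function names are hypothetical; the 2s and 5s figures come from the review comment.

```python
import math


def effective_interval(requested_s: float, minimum_s: float) -> float:
    """Clamp a requested poll interval to a lower bound (assumed behavior)."""
    return max(requested_s, minimum_s)


def nonzero_window(total_steps: int, fail_at_step: int, seconds_per_step: float) -> float:
    """Seconds during which the script publishes progress > 0 before failing."""
    first = next(
        (s for s in range(fail_at_step) if int(s / total_steps * 100) > 0),
        None,
    )
    if first is None:
        return 0.0
    # Step `first` is published after (first + 1) sleeps; the loop ends
    # after fail_at_step sleeps.
    return (fail_at_step - (first + 1)) * seconds_per_step


def guaranteed_polls(window_s: float, interval_s: float) -> int:
    """Polls certain to land in the window regardless of poll phase."""
    return math.floor(window_s / interval_s)


window = nonzero_window(total_steps=30, fail_at_step=15, seconds_per_step=0.5)
for requested in (2.0, 5.0):
    interval = effective_interval(requested, minimum_s=5.0)
    print(requested, interval, guaranteed_polls(window, interval))
```

Under these assumptions, a 0.5s step gives a 6.5s window of nonzero progress, which guarantees three polls at a true 2s interval but only one at a 5s interval.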
@sutaakar This seems legit.
The controller actually enforces a minimum bound:
trainer/pkg/rhai/progression/progression.go
Line 247 in c2c1ebc