-
Notifications
You must be signed in to change notification settings - Fork 6.2k
[core] Deflake sigint cgraph test #52623
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: dayshah <[email protected]>
ray.get(signal_actor.wait.remote()) | ||
time.sleep(0.1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this still looks inherently flaky. do we actually need to have an explicit integration test for the timeout=0
case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the only test in regular core or cgraphs that we test for success with a 0 timeout. It is potentially behavior that could break, but I don't think we guarantee it anywhere. But it seems like something we should guarantee?
I do think this should never be flaky even without the 0.1 sleep because the time for line 325 to finish should never be more than the time it takes for:
- the remote signal actor function to schedule
- execute and flick the asyncio event
- the result to actually get to the ray.get at 335
- the 0.1 second sleep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also a valid test case we could add for regular core to make sure we don't break this behavior that may be depended on. Way simpler for regular core test though because we can reuse the same actor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm agree that testing to maintain the guarantee makes sense.
If I were to write this test for Ray Core, I'd write it as: ray.wait
for obj to be ready, then assert ray.get(timeout=0)
returns OK. That is likely the pattern that users would follow if they're using timeout=0
.
Given we don't support ray.wait
in cgraph (unless that has been added), I'm OK with leaving the test as-is. But please monitor and make sure it doesn't become flaky :)
while driver_proc.stdout.readline() != b"executing\n": | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should add a timeout to this and avoid busy spinning
you can use the wait_for_condition
utility that we use elsewhere for it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't really busy spin because readline is a blocking call. But I don't need the while in general, just changed to an assert bc we can guarantee the first read will be "executing", i wasn't sure of that when writing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sg
Signed-off-by: dayshah <[email protected]>
The sigint cgraph test was flaky. Also removed another 2 second sleep from the other test. --------- Signed-off-by: dayshah <[email protected]>
Why are these changes needed?
The sigint cgraph test was flaky. Also removed another 2 second sleep from the other test.
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.