[core] Deflake sigint cgraph test #52623

Merged
merged 2 commits into from
Apr 28, 2025

Conversation

dayshah
Contributor

@dayshah dayshah commented Apr 25, 2025

Why are these changes needed?

The sigint cgraph test was flaky. This PR also removes another 2-second sleep from the other test.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@dayshah dayshah added the "go" label (add ONLY when ready to merge, run all tests) Apr 25, 2025
Comment on lines +335 to +336
ray.get(signal_actor.wait.remote())
time.sleep(0.1)
Collaborator

this still looks inherently flaky. do we actually need to have an explicit integration test for the timeout=0 case?

Contributor Author

This is the only test, in regular core or cgraphs, where we check for success with a timeout of 0. It is behavior that could potentially break, and I don't think we guarantee it anywhere. But it seems like something we should guarantee?

I do think this should never be flaky, even without the 0.1 second sleep, because the time for line 325 to finish should never be more than the combined time it takes for:

  • the remote signal actor function to be scheduled
  • that function to execute and set the asyncio event
  • the result to actually reach the ray.get at line 335
  • the 0.1 second sleep

Contributor Author

@dayshah dayshah Apr 27, 2025

This is also a valid test case we could add for regular core, to make sure we don't break behavior that users may depend on. It would be much simpler as a regular core test, though, because we can reuse the same actor.

Collaborator

Hm agree that testing to maintain the guarantee makes sense.

If I were to write this test for Ray Core, I'd write it as: ray.wait for obj to be ready, then assert ray.get(timeout=0) returns OK. That is likely the pattern that users would follow if they're using timeout=0.

Given we don't support ray.wait in cgraph (unless that has been added), I'm OK with leaving the test as-is. But please monitor and make sure it doesn't become flaky :)
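The wait-then-get pattern described above can be sketched without a Ray cluster. The block below is only an analogy and does not use Ray: `concurrent.futures.wait` stands in for `ray.wait`, `Future.result(timeout=0)` stands in for `ray.get(..., timeout=0)`, and the names `slow_task` and `wait_then_get_with_zero_timeout` are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def slow_task():
    # Hypothetical stand-in for a remote task that takes a little while.
    time.sleep(0.05)
    return "done"


def wait_then_get_with_zero_timeout():
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_task)

        # Step 1: block until the result is ready (the role ray.wait would play).
        done, not_done = wait([future], timeout=10)
        assert future in done and not not_done

        # Step 2: the result is already available, so a zero-timeout fetch
        # should return immediately instead of raising a timeout error.
        return future.result(timeout=0)
```

Because readiness is established before the zero-timeout fetch, the check does not depend on any sleep duration, which is what makes this shape less prone to flakiness than a fixed delay.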

Comment on lines 879 to 880
while driver_proc.stdout.readline() != b"executing\n":
    pass
Collaborator

should add a timeout to this and avoid busy spinning

you can use the wait_for_condition utility that we use elsewhere for it
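For readers unfamiliar with this kind of utility, here is a rough sketch of a bounded polling helper. It is not Ray's wait_for_condition implementation; the name `wait_until` and its parameters are assumptions made for illustration only.

```python
import time


def wait_until(predicate, timeout_s=10.0, retry_interval_s=0.1):
    """Poll `predicate` until it returns True or `timeout_s` elapses.

    Hypothetical helper: sleeps between attempts so the loop does not
    busy-spin, and raises TimeoutError instead of hanging forever.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s} seconds")
        time.sleep(retry_interval_s)
```

The two properties the review comment asks for are both visible here: the deadline bounds the total wait, and the sleep between attempts keeps the loop from spinning.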

Contributor Author

This doesn't really busy-spin, because readline is a blocking call. But I don't need the while loop at all; I changed it to an assert because we can guarantee the first read will be "executing". I wasn't sure of that when I wrote it.
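A minimal, POSIX-only sketch of the blocking read followed by an assert, using a stand-in child process rather than the actual test driver. `CHILD_SOURCE` and `run_and_interrupt` are hypothetical names, and the child script is an assumption about what such a driver might look like, not the code from this PR.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for the driver script: announce readiness, then block
# until interrupted. The print sits inside the try block so that a SIGINT
# arriving right after the flush is still caught.
CHILD_SOURCE = """
import signal, time
# Make sure SIGINT raises KeyboardInterrupt even if the parent ignored it.
signal.signal(signal.SIGINT, signal.default_int_handler)
try:
    print("executing", flush=True)
    time.sleep(30)
except KeyboardInterrupt:
    print("interrupted", flush=True)
"""


def run_and_interrupt():
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE], stdout=subprocess.PIPE
    )
    try:
        # readline() blocks until a full line arrives, so no polling loop
        # is needed, and the first line is known to be "executing".
        first_line = proc.stdout.readline()
        assert first_line == b"executing\n", first_line

        # The child is now known to be running, so it is safe to interrupt it.
        proc.send_signal(signal.SIGINT)
        second_line = proc.stdout.readline()
        proc.wait(timeout=10)
    finally:
        if proc.poll() is None:
            proc.kill()
        proc.stdout.close()
    return first_line, second_line
```

Note that the blocking read itself has no timeout here; the sketch assumes an outer test-level timeout would catch a child that never prints.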

Collaborator

sg

Signed-off-by: dayshah <[email protected]>
@edoakes edoakes merged commit fba1566 into ray-project:master Apr 28, 2025
5 checks passed
ktyxx pushed a commit to ktyxx/ray that referenced this pull request Apr 29, 2025
The sigint cgraph test was flaky. Also removed another 2 second sleep
from the other test.

---------

Signed-off-by: dayshah <[email protected]>
Labels
go add ONLY when ready to merge, run all tests
3 participants