ProcessGroupBaby: support streams in futures #139

Merged: 1 commit, Mar 18, 2025

Conversation

d4l3k
Member

@d4l3k d4l3k commented Mar 17, 2025

The recent future event work has an issue that sometimes causes cudart to crash with a segmentation fault.

This change tracks the stream so that the future completes on the same stream it was launched in, which seems to avoid the cudart crash.
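
To make the idea concrete, here is a minimal, GPU-free sketch of the pattern described above. It is not the code from this PR. `current_stream`, `use_stream`, and `launch` are hypothetical stand-ins that model a per-thread "current stream" with a thread-local, assuming the handler thread would otherwise start on the default stream.

```python
import threading
from concurrent.futures import Future
from contextlib import contextmanager

# Hypothetical stand-in for a per-thread "current stream" (no GPU needed).
_tls = threading.local()


def current_stream() -> str:
    # Threads that never entered a stream see the default one.
    return getattr(_tls, "stream", "default")


@contextmanager
def use_stream(name: str):
    # Make `name` the current stream for this thread, then restore the old one.
    prev = current_stream()
    _tls.stream = name
    try:
        yield
    finally:
        _tls.stream = prev


def launch(value: int) -> Future:
    """Launch work and complete the returned future on a handler thread."""
    fut: Future = Future()
    launch_stream = current_stream()  # captured on the caller's thread

    def handler() -> None:
        # A fresh thread starts on the default stream, so re-enter the
        # stream the work was launched in before completing the future.
        with use_stream(launch_stream):
            fut.set_result((value, current_stream()))

    worker = threading.Thread(target=handler)
    worker.start()
    worker.join()
    return fut


with use_stream("side"):
    result = launch(42).result()
```

In real PyTorch code the stand-ins would presumably map to `torch.cuda.current_stream()` at launch time and a `with torch.cuda.stream(...)` block in the handler thread. How `_future_handler` actually does this is not shown in this conversation.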

Stack trace:

Fatal Python error: Segmentation fault

Current thread 0x00007f3752b5c640 (most recent call first):
  File "/home/tristanr/.conda/envs/torchft-3.10/lib/python3.10/site-packages/torch/cuda/streams.py", line 196 in wait
  File "/home/tristanr/torchft/torchft/process_group.py", line 1180 in _future_handler
  File "/home/tristanr/.conda/envs/torchft-3.10/lib/python3.10/threading.py", line 953 in run
  File "/home/tristanr/.conda/envs/torchft-3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/home/tristanr/.conda/envs/torchft-3.10/lib/python3.10/threading.py", line 973 in _bootstrap

Test plan:

pytest torchft/process_group_test.py -s -v -x

@d4l3k d4l3k requested a review from H-Huang March 17, 2025 23:40
@facebook-github-bot facebook-github-bot added the CLA Signed label Mar 17, 2025
@d4l3k d4l3k force-pushed the d4l3k/future_stream branch from eec2d6c to b79bf28 Compare March 17, 2025 23:41
@d4l3k d4l3k requested a review from fegin March 17, 2025 23:44
Contributor

@fegin fegin left a comment

LGTM

@d4l3k d4l3k merged commit 52d4b01 into main Mar 18, 2025
6 checks passed
@d4l3k d4l3k deleted the d4l3k/future_stream branch March 18, 2025 01:25
Labels
CLA Signed This label is managed by the Meta Open Source bot.
3 participants