Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TimeoutManager: delete cuda events on main thread #142

Merged
merged 1 commit into from
Mar 21, 2025
Merged

Conversation

d4l3k
Copy link
Member

@d4l3k d4l3k commented Mar 20, 2025

This updates the TimeoutManager to use a deletion queue so we delete the CUDA events on the main thread. If we delete these on the background thread we can end up deadlocking and not calling future timeout.

This is critical for NCCL abort as cudaEventDestroy can block in scenarios when a NCCL operation is stuck.

Test plan:

Tested in torchtitan

pytest torchft/futures_test.py

@d4l3k d4l3k requested review from fegin and H-Huang March 20, 2025 21:21
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 20, 2025
@d4l3k d4l3k force-pushed the d4l3k/del_queue branch from 75350d4 to 3dae3c8 Compare March 20, 2025 23:21
while True:
try:
# get and immediately discard item
self._del_queue.get_nowait()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to do a sanity check to ensure there are no other reference to the event?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion! I added a getrefcount assertion

Copy link
Contributor

@fegin fegin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@d4l3k d4l3k force-pushed the d4l3k/del_queue branch from 3dae3c8 to c1ab7d9 Compare March 20, 2025 23:54
@d4l3k d4l3k merged commit 538b219 into main Mar 21, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants