msaroufim (Member) commented Apr 17, 2025

Fixes #152

Usage

from datetime import timedelta
from torchft.process_group import ProcessGroupNCCL

# Create a process group with a 60 second collective timeout and a 5 second watchdog timeout
pg = ProcessGroupNCCL(
    timeout=timedelta(seconds=60.0),
    watchdog_timeout=timedelta(seconds=5.0)
)

# To disable the watchdog
pg = ProcessGroupNCCL(watchdog_timeout=None)

facebook-github-bot added the CLA Signed label Apr 17, 2025
msaroufim changed the title from "Watchdog" to "[WIP] Watchdog" Apr 17, 2025
d4l3k (Member) commented Apr 18, 2025

@msaroufim thanks for taking a stab at this! I am a bit concerned that this approach adds a significant amount of code/complexity to the ProcessGroupNCCL implementation and also makes the watchdog not very reusable.

What I had in mind for this was to add a watchdog to the _TimeoutManager so if we ever end up in a spot where the timeout handling hangs (i.e. ncclCommAbort hangs) we could crash the whole program. The timeout handlers in _TimeoutManager should complete promptly.

If we add a watchdog thread to the _TimeoutManager it should automatically give us watchdog behavior on top of ProcessGroupNCCL without adding any additional complexity to that code.
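A rough sketch of that idea, not the actual torchft code: a background thread records when a timeout handler starts and fires a callback if the handler has not returned within the watchdog timeout. The names `HandlerWatchdog`, `run_handler`, `on_stuck`, and `poll_interval` are all hypothetical and do not come from `torchft/futures.py`. A real version would presumably crash the process in `on_stuck`, but the sketch takes a callback so it can be exercised safely.

```python
import threading
import time
from typing import Callable, Optional


class HandlerWatchdog:
    """Hypothetical helper: flags a timeout handler that runs for too long."""

    def __init__(
        self,
        watchdog_timeout: float,
        on_stuck: Callable[[], None],
        poll_interval: float = 0.01,
    ) -> None:
        self._watchdog_timeout = watchdog_timeout
        self._on_stuck = on_stuck
        self._poll_interval = poll_interval
        self._lock = threading.Lock()
        # None means no handler is currently running.
        self._started_at: Optional[float] = None
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def run_handler(self, handler: Callable[[], None]) -> None:
        """Run a timeout handler while the watchdog tracks its start time."""
        with self._lock:
            self._started_at = time.monotonic()
        try:
            handler()
        finally:
            with self._lock:
                self._started_at = None

    def _run(self) -> None:
        # Poll until shutdown; fire on_stuck once if a handler overruns.
        while not self._stop.wait(self._poll_interval):
            with self._lock:
                started_at = self._started_at
            if started_at is None:
                continue
            if time.monotonic() - started_at > self._watchdog_timeout:
                self._on_stuck()
                return

    def shutdown(self) -> None:
        """Stop the watchdog thread (call from a thread other than the watchdog)."""
        self._stop.set()
        self._thread.join()
```

In a real deployment `on_stuck` might be something like `lambda: os._exit(1)`, so that a hung abort call takes down the whole program. Whether the exit should be that abrupt is a design choice this sketch leaves open.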

Code pointer: https://github.com/pytorch/torchft/blob/main/torchft/futures.py

Happy to chat more about this offline

msaroufim (Member, Author) commented

That does sound much simpler lol, lemme close this and redo

msaroufim closed this Apr 18, 2025
Development

Successfully merging this pull request may close these issues.

Add watchdog to _TimeoutManager+ProcessGroupNCCL to guarantee fast aborts