Conversation

@finbarrtimbers
Contributor

No description provided.

finbarrtimbers and others added 7 commits January 20, 2026 08:09
When multiple process groups exist (e.g., for vLLM weight sync in RLHF),
initializing torch.distributed with device_id can cause NCCL hangs.
This change lets callers pre-initialize torch.distributed without device_id
before calling init_distributed(); init_distributed() then skips the
process-group initialization but still sets up the environment variables
and the CUDA device.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
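For context, a minimal sketch of the intended call order, assuming a torchrun-style launch where LOCAL_RANK and the rendezvous environment variables are set; init_distributed() refers to this repo's helper, whose exact import path and signature are not shown here:

```python
import os

import torch
import torch.distributed as dist

# Pre-initialize the default process group WITHOUT device_id, so NCCL does not
# bind the group eagerly; with multiple process groups (e.g. a separate group
# for vLLM weight sync in RLHF), an eager bind can hang.
if not dist.is_initialized():
    dist.init_process_group(backend="nccl")  # note: no device_id argument

# Still pin each rank to its local GPU explicitly.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)

# The repo's init_distributed() would now detect the existing process group,
# skip re-initialization, and only set env vars and the CUDA device:
# init_distributed()
```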
When follow=False, make the launch command return immediately after
submitting the job to Beaker, instead of waiting for the job to start.

Changes:
- In beaker.py: Resolve follow from self.follow before using it, and
  pass timeout=0 to gantry when not following to skip waiting
- In experiment.py: Don't pass follow=True to launch(), and only pass
  step_soft_timeout when follow=True

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
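A minimal sketch of the follow/timeout resolution described above; submit_to_beaker() is a hypothetical stand-in for the actual gantry submission call, which this sketch does not reproduce:

```python
from typing import Optional


def submit_to_beaker(spec: dict, timeout: Optional[int]) -> str:
    """Hypothetical stand-in for the gantry submission call (not the real API).

    A timeout of 0 means "do not wait for the job to start": the call returns
    right after the job is submitted.
    """
    return "beaker-experiment-id"


def launch(spec: dict, follow: Optional[bool] = None, default_follow: bool = True) -> str:
    # Resolve follow from the configured default before using it.
    follow = default_follow if follow is None else follow
    # When not following, skip waiting by passing timeout=0.
    timeout = None if follow else 0
    return submit_to_beaker(spec, timeout=timeout)


# Returns immediately after submission instead of waiting for the job to start.
print(launch({"task": "train"}, follow=False))
```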