Conversation

@finbarrtimbers
Contributor

No description provided.

finbarrtimbers and others added 7 commits January 20, 2026 08:09
When multiple process groups exist (e.g., for vLLM weight sync in RLHF),
initializing torch.distributed with device_id can cause NCCL hangs.
This change lets callers pre-initialize torch.distributed without device_id
before calling init_distributed(); init_distributed() then skips the
process-group initialization but still sets up the environment variables
and the CUDA device.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
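For context, a minimal sketch of the intended call order, assuming a torchrun-style launch where LOCAL_RANK and the rendezvous environment variables are set; init_distributed() refers to this repo's helper, whose exact import path and signature are not shown here:

```python
import os

import torch
import torch.distributed as dist

# Pre-initialize the default process group WITHOUT device_id, so NCCL does not
# bind the group eagerly; with multiple process groups (e.g. a separate group
# for vLLM weight sync in RLHF), an eager bind can hang.
if not dist.is_initialized():
    dist.init_process_group(backend="nccl")  # note: no device_id argument

# Still pin each rank to its local GPU explicitly.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)

# The repo's init_distributed() would now detect the existing process group,
# skip re-initialization, and only set env vars and the CUDA device:
# init_distributed()
```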
When follow=False, make the launch command return immediately after
submitting the job to Beaker, instead of waiting for the job to start.

Changes:
- In beaker.py: Resolve follow from self.follow before using it, and
  pass timeout=0 to gantry when not following to skip waiting
- In experiment.py: Don't pass follow=True to launch(), and only pass
  step_soft_timeout when follow=True

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
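A minimal sketch of the follow/timeout resolution described above; submit_to_beaker() is a hypothetical stand-in for the actual gantry submission call, which this sketch does not reproduce:

```python
from typing import Optional


def submit_to_beaker(spec: dict, timeout: Optional[int]) -> str:
    """Hypothetical stand-in for the gantry submission call (not the real API).

    A timeout of 0 means "do not wait for the job to start": the call returns
    right after the job is submitted.
    """
    return "beaker-experiment-id"


def launch(spec: dict, follow: Optional[bool] = None, default_follow: bool = True) -> str:
    # Resolve follow from the configured default before using it.
    follow = default_follow if follow is None else follow
    # When not following, skip waiting by passing timeout=0.
    timeout = None if follow else 0
    return submit_to_beaker(spec, timeout=timeout)


# Returns immediately after submission instead of waiting for the job to start.
print(launch({"task": "train"}, follow=False))
```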