Commit d4f9a2c

Fix unit-test hang: use spawn (not fork) for multiprocess jobs

fork in the long-lived pytest process inherits locks held by background threads (OpenMP / torch intra-op pools), deadlocking the child (e.g. in dist.init_process_group) and hanging the job. Revert spawn_multiprocess_job to spawn; the world_size reduction remains the speedup.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

1 parent a694e89 commit d4f9a2c
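
The hazard described in the commit message can be reproduced with the standard library alone. The following is a minimal sketch, not the project's code: a plain `threading.Lock` stands in for a lock inside a native thread pool, and the names `hold_lock`, `child_try_lock`, and `lock_available_in_forked_child` are hypothetical. It assumes a POSIX platform where the `fork` start method exists. The child uses a timeout so the demonstration reports the problem instead of hanging.

```python
import multiprocessing as mp
import sys
import threading

# Stand-in for a lock owned by a background thread pool (assumption for illustration).
lock = threading.Lock()


def hold_lock(ready, release):
    # Background thread: take the lock and keep it until told to let go.
    with lock:
        ready.set()
        release.wait()


def child_try_lock():
    # In a forked child the lock is copied in its locked state, but the thread
    # that would release it was not copied, so the acquire can never succeed.
    acquired = lock.acquire(timeout=0.5)
    sys.exit(0 if acquired else 42)


def lock_available_in_forked_child():
    ready, release = threading.Event(), threading.Event()
    holder = threading.Thread(target=hold_lock, args=(ready, release))
    holder.start()
    ready.wait()  # fork only once the background thread holds the lock

    ctx = mp.get_context("fork")  # POSIX only
    child = ctx.Process(target=child_try_lock)
    child.start()
    child.join()

    release.set()
    holder.join()
    return child.exitcode == 0
```

Without the timeout, the child would block forever on `lock.acquire()`, which matches the hang the commit describes. A child started with `spawn` begins from a fresh interpreter and does not inherit the parent's lock state.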

1 file changed

Lines changed: 2 additions & 11 deletions


tests/_test_utils/torch/distributed/utils.py

@@ -53,18 +53,9 @@ def init_process(rank, size, job=None, backend="gloo", port=None):
     job(rank, size)


-def spawn_multiprocess_job(size, job, backend="gloo", start_method=None):
-    # ``fork`` lets child processes inherit the parent's already-imported torch/modelopt
-    # modules, avoiding a ~12s re-import per process. It is only safe without CUDA (a CUDA
-    # context cannot be forked safely), so default to ``fork`` for CPU/gloo jobs and fall
-    # back to ``spawn`` when a GPU is present or a non-gloo backend is used.
-    if start_method is None:
-        start_method = "fork" if backend == "gloo" and not torch.cuda.is_available() else "spawn"
+def spawn_multiprocess_job(size, job, backend="gloo"):
     port = get_free_port()
-
-    # Use an explicit context instead of ``set_start_method(force=True)`` so we don't mutate
-    # the global multiprocessing state shared with other tests.
-    ctx = mp.get_context(start_method)
+    ctx = mp.get_context("spawn")
     processes = []
     for rank in range(size):
         p = ctx.Process(target=init_process, args=(rank, size, job, backend, port))
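
The pattern the new code settles on, an explicit `spawn` context in place of the global start method, can be sketched without torch. This is an illustrative stand-in rather than the repository's helper: `run_ranks` and `_job` are hypothetical names, and the process-group setup done by `init_process` is left out.

```python
import multiprocessing as mp


def _job(rank, size):
    # Placeholder for per-rank work; it must be a module-level function so that
    # a spawned child can import it by name.
    assert 0 <= rank < size


def run_ranks(size, job, start_method="spawn"):
    # An explicit context leaves the process-wide start method untouched, so
    # other tests in the same pytest process are not affected.
    ctx = mp.get_context(start_method)
    processes = [ctx.Process(target=job, args=(rank, size)) for rank in range(size)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    return [p.exitcode for p in processes]


if __name__ == "__main__":
    # The guard matters with spawn: each child re-imports this module, and
    # unguarded top-level code would try to start processes again.
    print(run_ranks(2, _job))
```

Each spawned child pays the full import cost again, which is the trade-off the removed comment was trying to avoid. The commit accepts that cost in exchange for not inheriting thread and lock state from the parent.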
