Multi-GPU interactive mode hangs during model initialization on some machines

We are running Matrix-Game 3.0 in interactive mode with torchrun on 4x A100 SXM 80GB.
  The stock generate.py with --interactive works on some machines but silently hangs
  during model initialization on others.

  Setup:
  - torch 2.10.0, FlashAttention 2 (FA2), CUDA 12.8, Python 3.12
  - torchrun --nproc_per_node=4 with --ulysses_size 4, --use_int8, --t5_cpu
  - Docker image based on nvidia/cuda:12.8.0-devel-ubuntu22.04

  What happens:
  - NCCL init succeeds on all ranks
  - T5 encoder loads successfully on all ranks
  - All 4 ranks log "Initializing Model (DiT)..."
  - Then it hangs forever: 0% CPU, 0% GPU utilization, VRAM at 633MB (CUDA context only)
  - No error, no timeout, no crash
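
Since the ranks sit at 0% CPU with no error, one cheap check is the kernel's view of each worker process: state S (interruptible sleep, e.g. waiting on a lock, socket, or pipe) versus state D (uninterruptible sleep, commonly blocked on disk or network filesystem I/O). The following is a minimal sketch, assuming Linux with /proc mounted inside the container. `proc_state` and `proc_wchan` are hypothetical helper names, not part of Matrix-Game or PyTorch.

```python
import os
from pathlib import Path


def proc_state(pid: int) -> str:
    """Return the one-letter scheduler state from /proc/<pid>/status."""
    for line in Path(f"/proc/{pid}/status").read_text().splitlines():
        if line.startswith("State:"):
            # The line looks like "State:\tS (sleeping)"; keep only the letter.
            return line.split()[1]
    raise RuntimeError(f"no State line found for pid {pid}")


def proc_wchan(pid: int) -> str:
    """Return the kernel symbol the process is waiting in, if readable.

    The value is often "0" when the process is not blocked, and an empty
    string is returned here if the file cannot be read.
    """
    try:
        return Path(f"/proc/{pid}/wchan").read_text().strip()
    except OSError:
        return ""


def describe(pid: int) -> str:
    """One-line summary suitable for pasting into a bug report."""
    return f"pid={pid} state={proc_state(pid)} wchan={proc_wchan(pid) or '?'}"
```

Calling `describe(pid)` for each of the four worker PIDs on a hanging machine would show whether all ranks are parked the same way or one rank differs from the others.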

  What we've tried:
  - NCCL_P2P_DISABLE=1
  - NCCL_DEBUG=INFO (init shows COMPLETE, no errors)
  - device_id in init_process_group()
  - 2 GPUs instead of 4
  - Different volume sizes (75GB, 200GB)

  Key observation:
  - Works on some RunPod machines (e.g. vqnzttsdu44t, c250s631h6rc)
  - Hangs on others (e.g. cmg18urp1kg5, hleuzkw9vyyn)
  - Same Docker image, same code, same config
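
Because the image, code, and config are identical, whatever differs is presumably on the host side. A small fingerprint collected on one working and one hanging machine can be diffed directly. This is a sketch under assumptions: the fields chosen (kernel release, /dev/shm free space, memlock limit, GPU name and driver version) are guesses at what might differ, not a confirmed cause, and `host_fingerprint` is a hypothetical name.

```python
import json
import os
import platform
import resource
import shutil
import subprocess


def host_fingerprint() -> dict:
    """Collect host-level facts that can differ under an identical image."""
    info = {
        "kernel": platform.release(),
        "cpu_count": os.cpu_count(),
        # (soft, hard) limit on locked memory; -1 means unlimited.
        "memlock_limit": list(resource.getrlimit(resource.RLIMIT_MEMLOCK)),
    }
    try:
        info["shm_free_bytes"] = shutil.disk_usage("/dev/shm").free
    except OSError:
        info["shm_free_bytes"] = None  # /dev/shm missing or unreadable
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, timeout=30, check=True,
        )
        info["gpus"] = out.stdout.strip().splitlines()
    except (OSError, subprocess.SubprocessError):
        info["gpus"] = None  # nvidia-smi not available or failed
    return info


if __name__ == "__main__":
    print(json.dumps(host_fingerprint(), indent=2, sort_keys=True))
```

Saving the JSON output from each machine and running a plain text diff on the files is enough to spot a mismatched driver or a small /dev/shm.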

  The hang appears to be inside WanModel.from_pretrained() or the
  dist.barrier() that precedes it. Has anyone else seen this?
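
To tell the two candidates apart, each call can be wrapped in a per-rank stage marker that also arms a watchdog from Python's standard `faulthandler` module, which dumps every thread's stack if the stage runs too long. This is a minimal sketch, not the project's code: `stage` is a hypothetical helper, the 300-second default is arbitrary, and the only assumption about torchrun is that it sets the `RANK` environment variable for each worker.

```python
import faulthandler
import os
import sys
import time
from contextlib import contextmanager

RANK = int(os.environ.get("RANK", "0"))  # set by torchrun for each worker


@contextmanager
def stage(name: str, dump_after_s: float = 300.0, out=sys.stderr):
    """Log entry and exit of a stage; dump all thread stacks if it stalls.

    Stages must not be nested: faulthandler keeps a single watchdog timer.
    """
    print(f"[rank {RANK}] enter {name}", file=out, flush=True)
    start = time.monotonic()
    armed = False
    try:
        # Writes tracebacks to stderr after the timeout, without exiting.
        faulthandler.dump_traceback_later(dump_after_s, exit=False)
        armed = True
    except (RuntimeError, ValueError, OSError, AttributeError):
        # stderr has no usable file descriptor; keep the log lines only.
        pass
    try:
        yield
    finally:
        if armed:
            faulthandler.cancel_dump_traceback_later()
        elapsed = time.monotonic() - start
        print(f"[rank {RANK}] exit {name} after {elapsed:.1f}s",
              file=out, flush=True)
```

Wrapping `dist.barrier()` in `with stage("barrier"):` and the `WanModel.from_pretrained()` call in `with stage("from_pretrained"):` would show, per rank, which stage was entered and never exited, and the stack dump would show the exact frame each rank is stuck in.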

Multi-GPU interactive mode hangs during model initialization on some machines #72

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Multi-GPU interactive mode hangs during model initialization on some machines #72

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions