We are running Matrix-Game 3.0 in interactive mode with torchrun on 4x A100 SXM 80GB.
The stock generate.py with --interactive works on some machines but silently hangs
during model initialization on others.
Setup:
- torch 2.10.0, FlashAttention-2 (FA2), CUDA 12.8, Python 3.12
- torchrun --nproc_per_node=4 with --ulysses_size 4, --use_int8, --t5_cpu
- Docker image based on nvidia/cuda:12.8.0-devel-ubuntu22.04
What happens:
- NCCL init succeeds on all ranks
- T5 encoder loads successfully on all ranks
- All 4 ranks log "Initializing Model (DiT)..."
- Then it hangs forever: 0% CPU, and GPU VRAM stays near 0% (633MB, CUDA context only)
- No error, no timeout, no crash
What we've tried:
- NCCL_P2P_DISABLE=1
- NCCL_DEBUG=INFO (init shows COMPLETE, no errors)
- Passing device_id to init_process_group()
- 2 GPUs instead of 4
- Different volume sizes (75GB, 200GB)
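One further thing that could be tried, since NCCL_DEBUG=INFO alone shows nothing after init: turn on the more verbose NCCL and PyTorch distributed logging and write one NCCL log file per process. A minimal sketch, assuming the launch is wrapped from Python; `debug_env` and the `/tmp/nccl_logs` directory are hypothetical names chosen for illustration, while the environment variables themselves are standard NCCL / PyTorch settings.

```python
import os


def debug_env(base=None, log_dir="/tmp/nccl_logs"):
    """Return a copy of the environment with verbose distributed logging enabled.

    `log_dir` is an assumed location; it must exist before launching.
    """
    env = dict(os.environ if base is None else base)
    env.update({
        # Same setting already tried, kept so the function is self-contained.
        "NCCL_DEBUG": "INFO",
        # Also log collective calls, not only communicator initialization.
        "NCCL_DEBUG_SUBSYS": "INIT,COLL",
        # One file per host (%h) and process (%p), so ranks do not interleave.
        "NCCL_DEBUG_FILE": os.path.join(log_dir, "nccl.%h.%p.log"),
        # Extra consistency checks and logging in torch.distributed.
        "TORCH_DISTRIBUTED_DEBUG": "DETAIL",
        "TORCH_CPP_LOG_LEVEL": "INFO",
    })
    return env
```

The returned dict can be passed as `env=` to `subprocess.run([...])` around the existing torchrun command. If the last line in every per-rank NCCL log is a collective that never completes, that would point at the barrier rather than at weight loading.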
Key observation:
- Works on some RunPod machines (e.g. vqnzttsdu44t, c250s631h6rc)
- Hangs on others (e.g. cmg18urp1kg5, hleuzkw9vyyn)
- Same Docker image, same code, same config
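Because the image, code, and config are identical, the difference is presumably on the host side. A rough sketch for capturing a per-machine fingerprint that can be diffed between a working and a hanging pod is below. The choice of fields (driver version, GPU topology, kernel, /dev/shm size, NCCL_* variables) is only a guess at what might differ, and `host_fingerprint` is a hypothetical helper, not part of Matrix-Game.

```python
import json
import os
import platform
import shutil
import subprocess


def run(cmd):
    """Run a command and return its stdout, or None if it is unavailable or fails."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip()


def host_fingerprint():
    """Collect host-level details that a shared Docker image does not pin down."""
    return {
        "hostname": platform.node(),
        "kernel": platform.release(),
        "gpus_and_driver": run(
            ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"]
        ),
        "gpu_topology": run(["nvidia-smi", "topo", "-m"]),
        "dev_shm": run(["df", "-h", "/dev/shm"]),
        "nccl_env": {k: v for k, v in sorted(os.environ.items()) if k.startswith("NCCL_")},
    }


if __name__ == "__main__":
    print(json.dumps(host_fingerprint(), indent=2))
```

Running this inside the container on one good and one bad machine and diffing the two JSON outputs would at least show whether driver version or GPU interconnect topology correlates with the hang.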
The hang appears to be inside WanModel.from_pretrained() or the
dist.barrier() that precedes it. Has anyone else seen this?
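To tell those two candidates apart, one option is a small watchdog around each call that dumps every thread's Python stack if the call has not returned in time. This is a sketch using only the standard library `faulthandler` module; `hang_watchdog` is a hypothetical helper, and the 120-second default is an arbitrary assumption.

```python
import faulthandler
import os
import sys
import time
from contextlib import contextmanager


@contextmanager
def hang_watchdog(label, timeout_s=120.0, file=sys.stderr):
    """Dump all thread stacks to `file` every `timeout_s` seconds while the block runs."""
    rank = os.environ.get("RANK", "?")  # set by torchrun; "?" when run standalone
    print(f"[rank {rank}] enter {label}", file=file, flush=True)
    start = time.monotonic()
    # The dump is written by a watchdog thread, so it still fires if the main
    # thread is blocked inside a C extension call.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)
    try:
        yield
    finally:
        faulthandler.cancel_dump_traceback_later()
        elapsed = time.monotonic() - start
        print(f"[rank {rank}] exit {label} after {elapsed:.1f}s", file=file, flush=True)
```

Wrapping the barrier in `with hang_watchdog("barrier"):` and the model load in `with hang_watchdog("from_pretrained"):` should show, per rank, which "enter" line has no matching "exit" line, and the dumped traceback shows the Python frame where that rank is stuck.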