Multi-GPU interactive mode hangs during model initialization on some machines #72

@FVEFWFE

Running Matrix-Game 3.0 interactive mode with torchrun on 4x A100 SXM 80GB.
Stock generate.py with --interactive works on some machines but silently hangs
during model initialization on others.

Setup:

  • torch 2.10.0, FA2, CUDA 12.8, Python 3.12
  • torchrun --nproc_per_node=4 with --ulysses_size 4, --use_int8, --t5_cpu
  • Docker image based on nvidia/cuda:12.8.0-devel-ubuntu22.04

What happens:

  • NCCL init succeeds on all ranks
  • T5 encoder loads successfully on all ranks
  • All 4 ranks log "Initializing Model (DiT)..."
  • Then it hangs indefinitely: 0% CPU, and GPU memory stays at 633MB (CUDA context only, no weights loaded)
  • No error, no timeout, no crash

What we've tried (the hang persists in every case):

  • NCCL_P2P_DISABLE=1
  • NCCL_DEBUG=INFO (init shows COMPLETE, no errors)
  • Passing device_id to init_process_group()
  • 2 GPUs instead of 4
  • Different volume sizes (75GB, 200GB)

Key observation:

  • Works on some RunPod machines (e.g. vqnzttsdu44t, c250s631h6rc)
  • Hangs on others (e.g. cmg18urp1kg5, hleuzkw9vyyn)
  • Same Docker image, same code, same config
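
Since the image, code, and config are identical, the difference is presumably on the host side. A minimal sketch for capturing a comparable fingerprint on a working and a hanging machine follows. The choice of fields (kernel, GPU driver, GPU topology, /dev/shm size) is an assumption about what might differ, not a confirmed cause, and `host_fingerprint` is a hypothetical helper name.

```python
import json
import platform
import shutil
import subprocess


def run(cmd):
    """Return the output of cmd, or a short marker if the tool is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return "<not found>"
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (subprocess.TimeoutExpired, OSError) as exc:
        return f"<failed: {exc}>"
    return out.stdout.strip() or out.stderr.strip()


def host_fingerprint():
    """Collect host-level details that a shared Docker image does not pin down."""
    return {
        "kernel": platform.release(),
        "python": platform.python_version(),
        # Driver version comes from the host, not from the container image.
        "gpu_driver": run(
            ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"]
        ),
        # Interconnect matrix between GPUs (NVLink, PCIe, ...).
        "gpu_topology": run(["nvidia-smi", "topo", "-m"]),
        # Shared memory available inside the container.
        "shm": run(["df", "-h", "/dev/shm"]),
    }


if __name__ == "__main__":
    print(json.dumps(host_fingerprint(), indent=2))
```

Running this on one machine from each group and diffing the JSON output should show whether driver, topology, or shared-memory settings differ between them.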

The hang appears to be inside WanModel.from_pretrained() or the
dist.barrier() that precedes it. Has anyone else seen this?
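
To tell these two candidates apart, one option is to have each rank dump its Python stack when a step takes too long. The sketch below uses only the standard library `faulthandler` module. `hang_watchdog` is a hypothetical helper, and the 300-second default is an arbitrary assumption rather than a value taken from the Matrix-Game code.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(label, timeout_s=300.0, file=None):
    """Dump all thread stacks to `file` if the wrapped block runs longer than timeout_s.

    `file` must be a real file object with a file descriptor; defaults to stderr.
    """
    target = file if file is not None else sys.stderr
    print(f"[watchdog] entering {label}", file=sys.stderr, flush=True)
    # exit=False: only report the stacks, do not kill the process.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=target)
    try:
        yield
    finally:
        # Cancel the pending dump once the block finishes (or raises).
        faulthandler.cancel_dump_traceback_later()
        print(f"[watchdog] left {label}", file=sys.stderr, flush=True)
```

Wrapping the barrier and the model load separately, for example `with hang_watchdog("dist.barrier"): dist.barrier()` followed by a second block around `WanModel.from_pretrained(...)`, would show in each rank's log which call never returns and where inside it the rank is blocked.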
