NUM_NODES=$nnodes NODE_RANK=$rank MASTER_ADDR=$master MASTER_PORT=$port nemo llm pretrain --factory "llama3_8b_64k(num_nodes=$nnodes, name='my_64k_pretrain')" --yes #15019

Description

@risemeup1

Describe the bug
“I am running NeMo on multiple nodes and encountered an NCCL problem. How can I resolve it?”

The error is as follows:

Unexpected error: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2897, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.27.3

ncclInvalidUsage: This usually reflects invalid usage of NCCL library.

Last error:

Bootstrap : no socket interface found

[rank1]:[W1102 00:39:34.733273667 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=96, addr=[localhost]:10006, remote=[localhost]:10197): failed to recv, got 0 bytes

Exception raised from recvBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/Utils.hpp:678 (most recent call first):

frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x88 (0x7f3915ea9568 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
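The "Bootstrap : no socket interface found" message usually means NCCL could not find a usable network interface for its bootstrap step, and the TCPStore trace above shows rank 1 connecting to [localhost], which suggests MASTER_ADDR may be resolving to a loopback address on the worker nodes. A minimal diagnostic sketch, assuming eth0 is the interface that carries the inter-node IP (substitute whatever your nodes actually use):

  # NCCL_DEBUG and NCCL_SOCKET_IFNAME are standard NCCL environment variables;
  # eth0 below is only an example interface name, not taken from the report.
  export NCCL_DEBUG=WARN
  export NCCL_SOCKET_IFNAME=eth0
  NUM_NODES=$nnodes NODE_RANK=$rank MASTER_ADDR=$master MASTER_PORT=$port nemo llm pretrain --factory "llama3_8b_64k(num_nodes=$nnodes, name='my_64k_pretrain')" --yes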

Steps/Code to reproduce bug
NUM_NODES=$nnodes NODE_RANK=$rank MASTER_ADDR=$master MASTER_PORT=$port nemo llm pretrain --factory "llama3_8b_64k(num_nodes=$nnodes, name='my_64k_pretrain')" --yes
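Before launching, a quick per-node sanity check (a sketch using standard Linux tools; $master and $port are the same values passed to the command above) can help confirm that MASTER_ADDR resolves to a routable address and that the rendezvous port is reachable from every worker:

  # On every node:
  getent hosts "$master"     # should print a routable IP, not 127.0.0.1
  ip -o -4 addr show         # note which interface carries that subnet (eth0, ib0, ...)
  # From a worker node, once rank 0 is up and listening (requires netcat):
  nc -zv "$master" "$port"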

A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.

Expected behavior

The multi-node pretraining job should initialize NCCL across all nodes and start training without errors.

Environment overview (please complete the following information)

  • Environment location: [Bare-metal, Docker, Cloud (specify cloud provider - AWS, Azure, GCP, Colab)]
  • Method of NeMo install: [pip install or from source]. Please specify exact commands you used to install.
  • If method of install is [Docker], provide docker pull & docker run commands used

Environment details
NeMo 25.09 is used.
If an NVIDIA Docker image is used, you don't need to specify these.
Otherwise, please provide:

  • OS version
  • PyTorch version
  • Python version

Additional context

Add any other context about the problem here.
Example: GPU model
