NUM_NODES=$nnodes NODE_RANK=$rank MASTER_ADDR=$master MASTER_PORT=$port nemo llm pretrain --factory "llama3_8b_64k(num_nodes=$nnodes, name='my_64k_pretrain')" --yes

**Describe the bug**
I am running NeMo on multiple nodes and encountered an NCCL error. How can I resolve it?

The error output is as follows:

Unexpected error: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2897, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.27.3

ncclInvalidUsage: This usually reflects invalid usage of NCCL library.

Last error:

Bootstrap : no socket interface found

[rank1]:[W1102 00:39:34.733273667 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=96, addr=[localhost]:10006, remote=[localhost]:10197): failed to recv, got 0 bytes

Exception raised from recvBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/Utils.hpp:678 (most recent call first):

frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x88 (0x7f3915ea9568 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
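Two details in the log above stand out: NCCL reports `Bootstrap : no socket interface found`, and the TCPStore endpoints are both `[localhost]`. The following is a minimal pre-flight sketch, not a confirmed fix, that can be run on each node before launching. It assumes a Linux host or container where `/sys/class/net` is readable. The helper names `list_candidate_ifaces` and `is_loopback_addr` are hypothetical and not part of NeMo, PyTorch, or NCCL. Skipping `lo` and `docker*` mirrors NCCL's documented default interface preference. An empty interface list would be consistent with the bootstrap error.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the network interfaces visible to this process,
# skipping loopback and docker bridges.
list_candidate_ifaces() {
    local dir="${1:-/sys/class/net}"   # directory is overridable for testing
    local path name
    for path in "$dir"/*; do
        [ -e "$path" ] || continue     # unmatched glob or missing directory
        name="$(basename "$path")"
        case "$name" in
            lo|docker*) continue ;;
        esac
        echo "$name"
    done
}

# Hypothetical helper: succeed when the given address is a loopback name or address.
is_loopback_addr() {
    case "$1" in
        localhost|127.*|::1) return 0 ;;
        *) return 1 ;;
    esac
}

list_candidate_ifaces
if is_loopback_addr "${MASTER_ADDR:-}"; then
    echo "MASTER_ADDR=${MASTER_ADDR:-} is a loopback address; other nodes cannot reach it" >&2
fi
```

If the list is empty inside the container but not on the host, the container's network configuration is a likely place to look.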

**Steps/Code to reproduce bug**
`NUM_NODES=$nnodes NODE_RANK=$rank MASTER_ADDR=$master MASTER_PORT=$port nemo llm pretrain --factory "llama3_8b_64k(num_nodes=$nnodes, name='my_64k_pretrain')" --yes`
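The error message itself suggests rerunning with `NCCL_DEBUG=WARN`. A minimal sketch of the environment to export on every node before repeating the launch command above is shown here. `NCCL_DEBUG` and `NCCL_SOCKET_IFNAME` are standard NCCL environment variables. The value `eth0` is only a placeholder assumption and should be replaced with an interface that actually exists on each node.

```shell
# Print NCCL warnings, as the error message suggests.
export NCCL_DEBUG=WARN

# Assumption: eth0 is a placeholder interface name, not taken from this report.
# An already-exported value is kept as-is.
export NCCL_SOCKET_IFNAME="${NCCL_SOCKET_IFNAME:-eth0}"

echo "NCCL_DEBUG=$NCCL_DEBUG NCCL_SOCKET_IFNAME=$NCCL_SOCKET_IFNAME"
# Then rerun the launch command above unchanged.
```

If the additional output still reports no socket interface, the interface name given to NCCL probably does not exist inside the process's network namespace.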

A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports


**Expected behavior**

A clear and concise description of what you expected to happen.

**Environment overview (please complete the following information)**

 - Environment location: [Bare-metal, Docker, Cloud (specify cloud provider: AWS, Azure, GCP, Colab)]
 - Method of NeMo install: [pip install or from source]. Please specify exact commands you used to install.
 - If method of install is [Docker], provide `docker pull` & `docker run` commands used

**Environment details**
Using NeMo 25.09.
If NVIDIA docker image is used you don't need to specify these.
Otherwise, please provide:
- OS version
- PyTorch version
- Python version

**Additional context**

Add any other context about the problem here.
Example: GPU model


NUM_NODES=$nnodes NODE_RANK=$rank MASTER_ADDR=$master MASTER_PORT=$port nemo llm pretrain --factory "llama3_8b_64k(num_nodes=$nnodes, name='my_64k_pretrain')" --yes #15019