Description
Describe the bug
“I am running NeMo on multiple nodes and encountered an NCCL problem. How can I resolve it?”
The error is as follows:
Unexpected error: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2897, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.27.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Bootstrap : no socket interface found
[rank1]:[W1102 00:39:34.733273667 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=96, addr=[localhost]:10006, remote=[localhost]:10197): failed to recv, got 0 bytes
Exception raised from recvBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/Utils.hpp:678 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x88 (0x7f3915ea9568 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
Steps/Code to reproduce bug
NUM_NODES=$nnodes NODE_RANK=$rank MASTER_ADDR=$master MASTER_PORT=$port nemo llm pretrain --factory "llama3_8b_64k(num_nodes=$nnodes, name='my_64k_pretrain')" --yes
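The error message suggests rerunning with NCCL_DEBUG=WARN, and "no socket interface found" suggests checking which network interfaces each node exposes. The following is only a diagnostic sketch, not a confirmed fix: the helper name list_candidate_ifaces and the interface name eth0 are placeholders I made up for illustration, and it assumes a Linux node where /sys/class/net is readable. NCCL_DEBUG and NCCL_SOCKET_IFNAME are standard NCCL environment variables.

```shell
# Print every network interface except loopback (Linux-specific: reads /sys/class/net).
# list_candidate_ifaces is a hypothetical helper name, not part of NeMo or NCCL.
list_candidate_ifaces() {
    for path in /sys/class/net/*; do
        # Skip the unexpanded glob if the directory is missing or empty.
        if [ ! -e "$path" ]; then
            continue
        fi
        name=$(basename "$path")
        if [ "$name" != "lo" ]; then
            echo "$name"
        fi
    done
}

list_candidate_ifaces

# Possible rerun of the same launch command with NCCL logging enabled.
# eth0 is a placeholder: substitute an interface printed above that is
# reachable from the other nodes.
# NCCL_DEBUG=WARN NCCL_SOCKET_IFNAME=eth0 NUM_NODES=$nnodes NODE_RANK=$rank MASTER_ADDR=$master MASTER_PORT=$port nemo llm pretrain --factory "llama3_8b_64k(num_nodes=$nnodes, name='my_64k_pretrain')" --yes
```

If the helper prints nothing inside the container, that would be consistent with NCCL finding no usable socket interface, but I have not verified that this is the cause here.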
A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
Expected behavior
A clear and concise description of what you expected to happen.
Environment overview (please complete the following information)
- Environment location: [Bare-metal, Docker, Cloud (specify cloud provider: AWS, Azure, GCP, Colab)]
- Method of NeMo install: [pip install or from source]. Please specify exact commands you used to install.
- If method of install is [Docker], provide the docker pull & docker run commands used
Environment details
NeMo 25.09 is used.
If NVIDIA docker image is used you don't need to specify these.
Otherwise, please provide:
- OS version
- PyTorch version
- Python version
Additional context
Add any other context about the problem here.
Example: GPU model