Description
Setup: 4 nodes, each with 8x AMD MI250.
On Saturday, Aug 16, I hit the error ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. while initializing my model (i.e., while broadcasting the model weights to the different GPUs):
[rank0]: model_engine, ds_optimizer, _, ds_scheduler = deepspeed.initialize(
[rank0]: ~~~~~~~~~~~~~~~~~~~~^
[rank0]: model=model,
[rank0]: ^^^^^^^^^^^^
[rank0]: ...<3 lines>...
[rank0]: config=ds_config
[rank0]: ^^^^^^^^^^^^^^^^
[rank0]: )
[rank0]: ^
[rank0]: File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank0]: engine = DeepSpeedEngine(args=args,
[rank0]: model=model,
[rank0]: ...<8 lines>...
[rank0]: mesh_device=mesh_device,
[rank0]: config_class=config_class)
[rank0]: File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/deepspeed/runtime/engine.py", line 278, in __init__
[rank0]: self._configure_distributed_model(model)
[rank0]: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
[rank0]: File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/deepspeed/runtime/engine.py", line 1307, in _configure_distributed_model
[rank0]: self._broadcast_model()
[rank0]: ~~~~~~~~~~~~~~~~~~~~~^^
[rank0]: File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/deepspeed/runtime/engine.py", line 1225, in _broadcast_model
[rank0]: dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
[rank0]: ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
[rank0]: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank0]: ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/deepspeed/comm/torch.py", line 206, in broadcast
[rank0]: return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank0]: File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/torch/distributed/distributed_c10d.py", line 2715, in broadcast
[rank0]: work = group.broadcast([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:77, remote process exited or there was a network error, NCCL version 2.21.5
[rank0]: ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
[rank0]: Last error:
[rank0]: socketProgressOpt: Call to recv from 10.0.200.140<51534> failed : Connection reset by peer
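If it helps with debugging, here is a minimal, DeepSpeed-free broadcast check (just a sketch) that I can run across the same nodes to see whether the plain torch.distributed/NCCL (RCCL) path already hits the same connection reset. The rendezvous endpoint and port in the launch command are placeholders for whatever my job script normally uses.

```python
# broadcast_check.py -- minimal cross-node broadcast test, no DeepSpeed involved.
# Launch on every node with something like (endpoint/port are placeholders):
#   torchrun --nnodes=4 --nproc-per-node=8 --rdzv-backend=c10d \
#            --rdzv-endpoint=<head-node>:29500 broadcast_check.py
import os
import torch
import torch.distributed as dist


def main():
    # "nccl" maps to RCCL on ROCm builds of PyTorch.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Rank 0 broadcasts a tensor; every other rank should receive zeros.
    t = torch.full((1024, 1024), float(dist.get_rank()), device="cuda")
    dist.broadcast(t, src=0)
    torch.cuda.synchronize()

    print(f"rank {dist.get_rank()}: broadcast ok, t[0,0]={t[0,0].item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```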
This occurs on all of these combinations of nodes: k004-[004-007], k004-[005-008], k004-[006-009], k004-[002,004,005,006].
This error only started occurring today; previously (Friday, Aug 15) I was able to run the same code fine.
I am not sure whether there is any ongoing issue or maintenance on the cluster.
Also, an observation: I had seen this issue occasionally before and it was flaky, but the more nodes I requested, the more likely it was to occur. With 8 nodes (e.g., k004-[002-009]) it occurred every time. Why is this?
Sometimes the same failure (during the model broadcast at initialization) showed up instead as ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
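If more detail would help, I can rerun with NCCL/RCCL debug logging enabled. A sketch of what I would set before deepspeed.initialize (NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL knobs; the commented-out interface name is only a placeholder):

```python
import os

# Must be set before deepspeed.initialize / any process-group creation so NCCL picks it up.
os.environ.setdefault("NCCL_DEBUG", "INFO")             # per-rank NCCL/RCCL log lines
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")  # focus on bootstrap and socket setup

# If the nodes have more than one NIC, pinning the interface that carries the
# 10.0.200.x addresses might be worth a try (interface name below is a placeholder):
# os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")
```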
Thanks.