
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. #58

@hoaala

Description


Setup: 4 nodes, each with 8x AMD MI250 GPUs.

On Saturday, Aug 16, I hit "ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely." while initializing my model (i.e., while broadcasting model weights to the other GPUs):


[rank0]:     model_engine, ds_optimizer, _, ds_scheduler = deepspeed.initialize(
[rank0]:                                                   ~~~~~~~~~~~~~~~~~~~~^
[rank0]:         model=model,
[rank0]:         ^^^^^^^^^^^^
[rank0]:     ...<3 lines>...
[rank0]:         config=ds_config
[rank0]:         ^^^^^^^^^^^^^^^^
[rank0]:     )
[rank0]:     ^
[rank0]:   File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank0]:     engine = DeepSpeedEngine(args=args,
[rank0]:                              model=model,
[rank0]:     ...<8 lines>...
[rank0]:                              mesh_device=mesh_device,
[rank0]:                              config_class=config_class)
[rank0]:   File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/deepspeed/runtime/engine.py", line 278, in __init__
[rank0]:     self._configure_distributed_model(model)
[rank0]:     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
[rank0]:   File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/deepspeed/runtime/engine.py", line 1307, in _configure_distributed_model
[rank0]:     self._broadcast_model()
[rank0]:     ~~~~~~~~~~~~~~~~~~~~~^^
[rank0]:   File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/deepspeed/runtime/engine.py", line 1225, in _broadcast_model
[rank0]:     dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
[rank0]:     ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
[rank0]:     return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank0]:            ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/deepspeed/comm/torch.py", line 206, in broadcast
[rank0]:     return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank0]:   File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/torch/distributed/distributed_c10d.py", line 2715, in broadcast
[rank0]:     work = group.broadcast([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:77, remote process exited or there was a network error, NCCL version 2.21.5
[rank0]: ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
[rank0]: Last error:
[rank0]: socketProgressOpt: Call to recv from 10.0.200.140<51534> failed : Connection reset by peer

This occurs on all of these combinations of nodes: k004-[004-007], k004-[005-008], k004-[006-009], and k004-[002,004,005,006].
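
For what it's worth, the traceback bottoms out in a plain torch.distributed.broadcast, so a bare broadcast test on the same nodes should exercise the same path without DeepSpeed. A minimal sketch of what I mean (assuming a torchrun launch; the tensor shape is arbitrary):

```python
# Minimal broadcast repro without DeepSpeed (sketch; assumes a torchrun launch
# across the same nodes, with the usual MASTER_ADDR/RANK/LOCAL_RANK env vars).
import os

import torch
import torch.distributed as dist


def main():
    # "nccl" maps to RCCL on ROCm builds of PyTorch.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    rank = dist.get_rank()

    # Same pattern as DeepSpeed's _broadcast_model: broadcast a tensor from
    # rank 0 to every other rank.
    t = torch.full((1024, 1024), float(rank), device="cuda")
    dist.broadcast(t, src=0)
    torch.cuda.synchronize()
    print(f"rank {rank}: broadcast ok, t[0,0] = {t[0,0].item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Running that with something like torchrun --nnodes=4 --nproc-per-node=8 on one of the failing node sets should show whether the connection reset comes from the RCCL/network layer itself rather than anything DeepSpeed-specific.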

This error only started appearing today; previously (Friday, Aug 15) the same code ran fine.

I'm not sure whether there is an ongoing issue or maintenance on the cluster.

One more observation: I had seen this issue before, but it was flaky, and the more nodes I requested, the more likely it was to occur. With 8 nodes (e.g., k004-[002-009]) it occurred every time. Why is this?

Sometimes the failure at the same model-broadcast step during initialization instead surfaces as ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
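
If it helps with debugging on the cluster side, I can rerun with more verbose NCCL/RCCL logging enabled before the process group is created, e.g. (a sketch; the interface name is a placeholder I would need to confirm for this cluster):

```python
import os

# Verbose NCCL/RCCL logging; must be set before the process group is created.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"

# Optionally pin the network interface used for NCCL's socket connections
# (placeholder value; the correct interface depends on the cluster fabric).
# os.environ["NCCL_SOCKET_IFNAME"] = "..."
```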

Thanks.

cc @tom-papatheodore @koomie
