Description
Setup: 4 nodes, each with 8x AMD MI250.
On Saturday, Aug 16, I hit the error ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. while initializing my model (i.e., while broadcasting the model weights to the different GPUs):
[rank0]: model_engine, ds_optimizer, _, ds_scheduler = deepspeed.initialize(
[rank0]: ~~~~~~~~~~~~~~~~~~~~^
[rank0]: model=model,
[rank0]: ^^^^^^^^^^^^
[rank0]: ...<3 lines>...
[rank0]: config=ds_config
[rank0]: ^^^^^^^^^^^^^^^^
[rank0]: )
[rank0]: ^
[rank0]: File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank0]: engine = DeepSpeedEngine(args=args,
[rank0]: model=model,
[rank0]: ...<8 lines>...
[rank0]: mesh_device=mesh_device,
[rank0]: config_class=config_class)
[rank0]: File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/deepspeed/runtime/engine.py", line 278, in __init__
[rank0]: self._configure_distributed_model(model)
[rank0]: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
[rank0]: File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/deepspeed/runtime/engine.py", line 1307, in _configure_distributed_model
[rank0]: self._broadcast_model()
[rank0]: ~~~~~~~~~~~~~~~~~~~~~^^
[rank0]: File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/deepspeed/runtime/engine.py", line 1225, in _broadcast_model
[rank0]: dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
[rank0]: ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
[rank0]: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank0]: ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/deepspeed/comm/torch.py", line 206, in broadcast
[rank0]: return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank0]: File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/work1/mzhang/ag82/hoa/amdenv313/lib/python3.13/site-packages/torch/distributed/distributed_c10d.py", line 2715, in broadcast
[rank0]: work = group.broadcast([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:77, remote process exited or there was a network error, NCCL version 2.21.5
[rank0]: ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
[rank0]: Last error:
[rank0]: socketProgressOpt: Call to recv from 10.0.200.140<51534> failed : Connection reset by peer
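If it helps with debugging, here is a minimal, DeepSpeed-free broadcast check (just a sketch) that I can run across the same nodes to see whether the plain torch.distributed/NCCL (RCCL) path already hits the same connection reset. The rendezvous endpoint and port in the launch command are placeholders for whatever my job script normally uses.

```python
# broadcast_check.py -- minimal cross-node broadcast test, no DeepSpeed involved.
# Launch on every node with something like (endpoint/port are placeholders):
#   torchrun --nnodes=4 --nproc-per-node=8 --rdzv-backend=c10d \
#            --rdzv-endpoint=<head-node>:29500 broadcast_check.py
import os
import torch
import torch.distributed as dist


def main():
    # "nccl" maps to RCCL on ROCm builds of PyTorch.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Rank 0 broadcasts a tensor; every other rank should receive zeros.
    t = torch.full((1024, 1024), float(dist.get_rank()), device="cuda")
    dist.broadcast(t, src=0)
    torch.cuda.synchronize()

    print(f"rank {dist.get_rank()}: broadcast ok, t[0,0]={t[0,0].item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```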
This occurs on all of these combinations of nodes: k004-[004-007], k004-[005-008], k004-[006-009], k004-[002,004,005,006].
This error only started occurring today; previously (Friday, Aug 15) I was able to run the same code fine.
I am not sure whether there is any ongoing issue or maintenance on the cluster.
Also, an observation: I had seen this issue occasionally before and it was flaky, but the more nodes I requested, the more likely it was to occur. With 8 nodes (e.g., k004-[002-009]) it occurred every time. Why is this?
Sometimes the same failure (during the model broadcast at initialization) showed up instead as ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
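If more detail would help, I can rerun with NCCL/RCCL debug logging enabled. A sketch of what I would set before deepspeed.initialize (NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL knobs; the commented-out interface name is only a placeholder):

```python
import os

# Must be set before deepspeed.initialize / any process-group creation so NCCL picks it up.
os.environ.setdefault("NCCL_DEBUG", "INFO")             # per-rank NCCL/RCCL log lines
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")  # focus on bootstrap and socket setup

# If the nodes have more than one NIC, pinning the interface that carries the
# 10.0.200.x addresses might be worth a try (interface name below is a placeholder):
# os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")
```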
Thanks.