Description
Hello! I am following your work and trying to reproduce your results, but I ran into the errors below when running the command python -m torch.distributed.launch --nproc_per_node 8 run.py --model_name ddad --config configs/ddad.txt for distributed training on the DDAD dataset.
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1806986 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
After training for a while, the process is automatically shut down because of this timeout.
Are there any details or training settings that I might have overlooked? Or does the torch version matter?
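If the all-gather is simply slow on my machine rather than hung, would raising the NCCL timeout be an acceptable workaround? Below is a minimal sketch of what I would try, assuming run.py uses the standard torch.distributed.init_process_group call (I have not checked where the repo initializes the process group, and the two-hour value is just an example):

```python
# Minimal sketch (my assumption, not the repository's actual code):
# raise the NCCL collective timeout when initializing the process group,
# in case _ALLGATHER_BASE is merely slow rather than deadlocked.
import datetime

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),  # default is 30 minutes (1800000 ms)
)
```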
Thanks!