
A question about distributed training on DDAD dataset #11

@myc634


Hello! I am following your work and reproducing it, but I ran into the error below while using the command python -m torch.distributed.launch --nproc_per_node 8 run.py --model_name ddad --config configs/ddad.txt for distributed training on the DDAD dataset.

[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1806986 milliseconds before timing out.

[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

After training for a while, the process is automatically shut down once a collective runs past the timeout.
Are there any details or training settings that I have overlooked? Or does the torch version matter?
Thanks!
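
For reference, here is a minimal sketch of how I assume the default 30-minute NCCL timeout could be raised when the process group is initialized; the actual setup inside run.py may differ, and init_distributed is just a hypothetical helper name:

```python
# Minimal sketch (assumption): raising the NCCL collective timeout at
# process-group initialization. The real setup in run.py may be different.
import datetime
import os

import torch
import torch.distributed as dist


def init_distributed():
    # With torch.distributed.launch, the local rank is available via the
    # LOCAL_RANK environment variable (or the --local_rank argument,
    # depending on the torch version and the --use_env flag).
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # The default NCCL timeout is 30 minutes (Timeout(ms)=1800000 in the
    # error above); a longer timeout keeps slow collectives, e.g. caused by
    # uneven data loading across ranks, from tripping the watchdog.
    dist.init_process_group(
        backend="nccl",
        timeout=datetime.timedelta(hours=2),
    )
```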
