Closed
Description
When using PyTorch with the NCCL/RCCL backend on a system with eight GPUs per node, I get initialization failures of the following kind:
347: pthread_mutex_timedlock() returned 110
347: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
348: pthread_mutex_timedlock() returned 110
348: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
757: pthread_mutex_timedlock() returned 110
757: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
350: pthread_mutex_timedlock() returned 110
350: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
753: pthread_mutex_timedlock() returned 110
753: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
351: pthread_mutex_timedlock() returned 110
351: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
756: pthread_mutex_timedlock() returned 110
756: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
758: pthread_mutex_timedlock() returned 110
758: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
1050: pthread_mutex_timedlock() returned 110
1050: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
347: rsmi_init() failed
1052: pthread_mutex_timedlock() returned 110
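The cause is that rocm_smi_lib creates its mutex in /dev/shm under a name that does not depend on the process ID, so every process on a node contends for the same mutex. This creates a race condition during initialization: a process that cannot acquire the mutex within 5 seconds gives up with return code 110 (ETIMEDOUT on Linux), and rsmi_init() fails.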
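
To make the contention concrete, here is a minimal sketch of what happens when two callers use a lock with the same fixed name and a timeout. It is not rocm_smi_lib's code. It assumes a file lock taken with `fcntl.flock` as a stand-in for the pthread mutex in shared memory, and the name `demo_smi_mutex`, the helper `timed_lock`, and the short timeout are all hypothetical. One difference from the real case: a `flock` lock is released when its holder exits, whereas a mutex left locked in /dev/shm can outlive the process, which is why the error message suggests deleting the stale files.

```python
import errno
import fcntl
import os
import tempfile
import time


def timed_lock(path, timeout_s):
    """Try to take an exclusive lock on `path` within `timeout_s` seconds.

    Returns (fd, 0) on success or (None, errno.ETIMEDOUT) on timeout,
    loosely mirroring the return code of pthread_mutex_timedlock().
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # Non-blocking attempt, so we can enforce our own deadline.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd, 0
        except BlockingIOError:
            if time.monotonic() >= deadline:
                os.close(fd)
                return None, errno.ETIMEDOUT
            time.sleep(0.01)


def demo():
    with tempfile.TemporaryDirectory() as directory:
        # Hypothetical fixed name: identical for every caller, no PID in it.
        shared = os.path.join(directory, "demo_smi_mutex")

        # The first caller takes the lock and keeps holding it.
        holder_fd, rc_holder = timed_lock(shared, timeout_s=0.1)

        # The second caller opens the same name and times out while waiting.
        _, rc_waiter = timed_lock(shared, timeout_s=0.1)

        os.close(holder_fd)  # closing the descriptor releases the lock
        return rc_holder, rc_waiter


if __name__ == "__main__":
    rc_holder, rc_waiter = demo()
    print("holder:", rc_holder, "waiter:", errno.errorcode[rc_waiter])
```

With eight ranks per node starting at the same time, each rank plays the role of the second caller for as long as another rank holds the lock, so any rank that waits longer than the timeout reports the failure seen in the log.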
v-iashin