Replies: 2 comments 3 replies
-
While not a direct answer to your question, please consider using https://github.com/horovod/horovod/ to speed up multi-GPU training. You will want a separate process for each GPU, setting CUDA_VISIBLE_DEVICES so that each process only sees its own GPU, and then use Horovod to communicate across the processes/GPUs. This will be significantly faster than the MXNet kvstore, because running multiple Python processes reduces the Python overhead.
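A minimal sketch of that pattern, assuming Horovod's MXNet bindings (`horovod.mxnet`); the Dense layer, optimizer settings, and the use of `hvd.local_rank()` to pick the GPU (instead of setting CUDA_VISIBLE_DEVICES per process) are illustrative choices, not the only way to do it:

```python
import mxnet as mx
from mxnet import gluon
import horovod.mxnet as hvd

# One process per GPU, launched e.g. with: horovodrun -np 8 python train.py
hvd.init()
ctx = mx.gpu(hvd.local_rank())  # each process binds to its own GPU

# Hypothetical tiny model just to show the wiring.
net = gluon.nn.Dense(10)
net.initialize(mx.init.Xavier(), ctx=ctx)
params = net.collect_params()

# DistributedTrainer averages gradients across processes via allreduce,
# replacing the kvstore-based aggregation entirely.
trainer = hvd.DistributedTrainer(params, 'sgd', {'learning_rate': 0.01})

# Make sure all workers start from identical parameters.
hvd.broadcast_parameters(params, root_rank=0)
```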
-
The "device" kvstore attempts at instantiating the buffers for gradient aggregation evenly across all seen devices: https://github.com/apache/incubator-mxnet/blob/v1.x/src/kvstore/comm.h#L677-L712. The problem is that the Gluon trainer only initiates it on the first context: https://github.com/apache/incubator-mxnet/blob/v1.x/python/mxnet/gluon/trainer.py#L158. As a result, all gradient aggregation only happens on gpu(0). |
-
Hi MXNet community, I have spent a whole day googling around but could not find this information: when training on a multi-GPU machine with gluon.Trainer and kvstore set to "device", which GPU is used for the kvstore? Or, put differently, how do I specify which GPU to use?
Based on my tests it is not trivial: on an AWS p2.8x, gpu(0) is used for the kvstore, but on an AWS p2.16x, gpu(10) is used. I can't find any documentation about the logic behind this. Any ideas are highly appreciated. Thanks!