Replies: 2 comments 3 replies
-
While not a direct answer to your question, please consider using https://github.com/horovod/horovod/ to speed up multi-GPU training. You will want a separate process for each GPU, setting CUDA_VISIBLE_DEVICES so that each process only sees its own GPU, and then use Horovod to communicate across the processes/GPUs. This will be significantly faster than the MXNet kvstore, because running multiple Python processes reduces the Python overhead.
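A minimal sketch of that pattern, assuming Horovod's MXNet bindings (`horovod.mxnet`); the Dense layer, optimizer settings, and the use of `hvd.local_rank()` to pick the GPU (instead of setting CUDA_VISIBLE_DEVICES per process) are illustrative choices, not the only way to do it:

```python
import mxnet as mx
from mxnet import gluon
import horovod.mxnet as hvd

# One process per GPU, launched e.g. with: horovodrun -np 8 python train.py
hvd.init()
ctx = mx.gpu(hvd.local_rank())  # each process binds to its own GPU

# Hypothetical tiny model just to show the wiring.
net = gluon.nn.Dense(10)
net.initialize(mx.init.Xavier(), ctx=ctx)
params = net.collect_params()

# DistributedTrainer averages gradients across processes via allreduce,
# replacing the kvstore-based aggregation entirely.
trainer = hvd.DistributedTrainer(params, 'sgd', {'learning_rate': 0.01})

# Make sure all workers start from identical parameters.
hvd.broadcast_parameters(params, root_rank=0)
```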
-
The "device" kvstore attempts at instantiating the buffers for gradient aggregation evenly across all seen devices: https://github.com/apache/incubator-mxnet/blob/v1.x/src/kvstore/comm.h#L677-L712. The problem is that the Gluon trainer only initiates it on the first context: https://github.com/apache/incubator-mxnet/blob/v1.x/python/mxnet/gluon/trainer.py#L158. As a result, all gradient aggregation only happens on gpu(0). |
-
Hi MXNet community, I have spent a whole day googling around but could not find this information: when training on a multi-GPU machine with gluon.Trainer and kvstore set to "device", which GPU is used for the kvstore? Or, put differently, how do I specify which GPU to use?
Based on my tests it is not trivial: on an AWS p2.8x, gpu(0) is used for the kvstore, but on an AWS p2.16x, gpu(10) is used. I can't find any documentation about the logic behind this. Any ideas are highly appreciated. Thanks!