Description
We implemented a multi-threaded, multi-GPU version of the decoder, which can decode many utterances in parallel across several GPUs, for example 40 threads spread evenly over 4 GPUs.
In the main thread, we initialize the decoder by calling cudaSetDevice(gpu_id) for each GPU card and loading a copy of the AM model on each GPU (each copy is shared by many decoding threads):
// pthreadkey_gpuid is used to index the CuDevice; we modified CuDevice so that
// all threads on the same GPU share one CuDevice.
for (int i = 0; i < gpu_num_; i++) {
  pthread_setspecific(::pthreadkey_gpuid, &i);
#ifdef HAVE_CUDA
  CuDevice::Instantiate().SelectGpuId(gpu_option);  // selects the GPU given by pthreadkey_gpuid
  CuDevice::Instantiate().AllowMultithreading();
#endif
  am_nnet_[i] = new nnet3::AmNnetSimple;
  .................
  Input ki(fileName, &binary);
  ...............
  am_nnet_[i]->Read(ki.Stream(), binary);
}
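For completeness, the pthread key itself is created once at startup, before any of the above runs; roughly like this (the helper function name is just illustrative):

#include <pthread.h>

pthread_key_t pthreadkey_gpuid;   // the global key referenced as ::pthreadkey_gpuid above

// Called once in main(), before any decoder initialization (illustrative).
void InitGpuIdKey() {
  pthread_key_create(&pthreadkey_gpuid, NULL);  // no destructor: the values point at longer-lived ints
}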
In cu-device.h, we changed the Instantiate() function so that there is a different CuDevice for each GPU:
static inline CuDevice& Instantiate() {
  int index = *((int *)pthread_getspecific(::pthreadkey_gpuid));
  // KALDI_LOG << "now device index is: " << index;
  return global_device_[index];
}
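Correspondingly, the single global_device_ object becomes an array with one entry per visible GPU, roughly like this (MAX_GPUS is an illustrative name and bound, not the actual code):

// Sketch: one CuDevice per GPU instead of a single global singleton,
// indexed by the value stored under pthreadkey_gpuid.
#define MAX_GPUS 8
static CuDevice global_device_[MAX_GPUS];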
Then we create the decoding threads, allocated evenly across the GPUs, each using the corresponding AM copy:
for (int i = 0; i < decoders_num_; i++) {
  .................................
  int tm = i % gpu_num_;
  pthread_setspecific(::pthreadkey_gpuid, &tm);
#ifdef HAVE_CUDA
  CuDevice::Instantiate().SelectGpuId(config_->gpu_option_);
  CuDevice::Instantiate().AllowMultithreading();
#endif
  decoder_[i]->Init(config_, model_);
}
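This design also requires each decoding thread to set the same key for itself before its first CUDA call, otherwise pthread_getspecific() would return NULL in that thread. A minimal sketch of a worker-thread entry point (the struct and function names are illustrative):

struct ThreadArg {
  int gpu_id;   // GPU index this decoding thread is pinned to
  // ... pointer to the decoder, job queue, etc.
};

void *DecodeThreadMain(void *arg) {
  ThreadArg *ta = static_cast<ThreadArg *>(arg);
  // Bind this thread to its GPU index before touching CUDA/cuBLAS.
  pthread_setspecific(::pthreadkey_gpuid, &ta->gpu_id);
  // From here on, CuDevice::Instantiate() in this thread returns
  // global_device_[ta->gpu_id].
  // ... run the decoding loop ...
  return NULL;
}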
This works well with CUDA 9.1.
But after upgrading to CUDA 10.2 it no longer works; several problems arise.
If we use just one GPU card (export CUDA_VISIBLE_DEVICES=0), i.e. N threads on one GPU, it works fine.
But if we make 2 or more GPUs visible, we encounter some confusing errors.
When CUDA_VISIBLE_DEVICES=0,1 or 0,1,2, the program hangs somewhere.
When CUDA_VISIBLE_DEVICES=0,1,2,3 (all the GPUs we have on the machine), it fails with "cublasStatus_t 1 : "CUBLAS_STATUS_NOT_INITIALIZED" returned from 'cublas_gemv(GetCublasHandle(), (trans==kTrans? CUBLAS_OP_N:CUBLAS_OP_T), M.NumCols(), M.NumRows(), alpha, M.Data(), M.Stride(), v.Data(), 1, beta, data_, 1)".
Can anyone give me a hint on why this happens?
What is a reasonable way to run M decoding threads spread evenly over N GPU cards (M = k * N)? Currently we use one CuDevice per GPU, shared by all the threads on that GPU.
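To make the mapping we are asking about concrete, this is the scheme in its simplest form (a sketch; num_gpus corresponds to N above):

// Thread i out of M = k * N decoding threads runs on GPU (i % N); all k
// threads mapped to the same GPU currently share that GPU's single CuDevice.
int GpuForThread(int thread_index, int num_gpus) {
  return thread_index % num_gpus;
}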