Description
We implemented a multi-threaded, multi-GPU version of the decoder, which can decode many utterances in parallel across several GPUs, for example 40 threads spread evenly over 4 GPUs.
In the main thread, we initialize the decoder by calling cudaSetDevice(gpu_id) for each GPU card and loading a copy of the AM model on each GPU (each copy is shared by many decoding threads):
// pthreadkey_gpuid is used to index the CuDevice; we modified CuDevice so that
// all threads on the same GPU share one CuDevice.
for (int i = 0; i < gpu_num_; i++) {
  pthread_setspecific(::pthreadkey_gpuid, &i);
#ifdef HAVE_CUDA
  CuDevice::Instantiate().SelectGpuId(gpu_option);  // selects the GPU given by pthreadkey_gpuid
  CuDevice::Instantiate().AllowMultithreading();
#endif
  am_nnet_[i] = new nnet3::AmNnetSimple;
  .................
  Input ki(fileName, &binary);
  ...............
  am_nnet_[i]->Read(ki.Stream(), binary);
}
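For completeness, the pthread key itself is created once at startup, before any of the above runs; roughly like this (the helper function name is just illustrative):

#include <pthread.h>

pthread_key_t pthreadkey_gpuid;   // the global key referenced as ::pthreadkey_gpuid above

// Called once in main(), before any decoder initialization (illustrative).
void InitGpuIdKey() {
  pthread_key_create(&pthreadkey_gpuid, NULL);  // no destructor: the values point at longer-lived ints
}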
In cu-device.h, we changed the Instantiate() function so that there is a different CuDevice for each GPU:
static inline CuDevice& Instantiate() {
  int index = *((int *)pthread_getspecific(::pthreadkey_gpuid));
  // KALDI_LOG << "now device index is: " << index;
  return global_device_[index];
}
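Correspondingly, the single global_device_ object becomes an array with one entry per visible GPU, roughly like this (MAX_GPUS is an illustrative name and bound, not the actual code):

// Sketch: one CuDevice per GPU instead of a single global singleton,
// indexed by the value stored under pthreadkey_gpuid.
#define MAX_GPUS 8
static CuDevice global_device_[MAX_GPUS];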
Then we create the decoding threads, allocated evenly across the GPUs, each using the corresponding AM copy:
for (int i = 0; i < decoders_num_; i++) {
  .................................
  int tm = i % gpu_num_;
  pthread_setspecific(::pthreadkey_gpuid, &tm);
#ifdef HAVE_CUDA
  CuDevice::Instantiate().SelectGpuId(config_->gpu_option_);
  CuDevice::Instantiate().AllowMultithreading();
#endif
  decoder_[i]->Init(config_, model_);
}
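This design also requires each decoding thread to set the same key for itself before its first CUDA call, otherwise pthread_getspecific() would return NULL in that thread. A minimal sketch of a worker-thread entry point (the struct and function names are illustrative):

struct ThreadArg {
  int gpu_id;   // GPU index this decoding thread is pinned to
  // ... pointer to the decoder, job queue, etc.
};

void *DecodeThreadMain(void *arg) {
  ThreadArg *ta = static_cast<ThreadArg *>(arg);
  // Bind this thread to its GPU index before touching CUDA/cuBLAS.
  pthread_setspecific(::pthreadkey_gpuid, &ta->gpu_id);
  // From here on, CuDevice::Instantiate() in this thread returns
  // global_device_[ta->gpu_id].
  // ... run the decoding loop ...
  return NULL;
}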
This works well with CUDA 9.1.
But after upgrading to CUDA 10.2 it no longer works; several problems arise.
If we use just one GPU card (export CUDA_VISIBLE_DEVICES=0), i.e. N threads on one GPU, it works fine.
But if we make 2 or more GPUs visible, we encounter some confusing errors.
When CUDA_VISIBLE_DEVICES=0,1 or 0,1,2, the program hangs somewhere.
When CUDA_VISIBLE_DEVICES=0,1,2,3 (all the GPUs we have on the machine), it fails with "cublasStatus_t 1 : "CUBLAS_STATUS_NOT_INITIALIZED" returned from 'cublas_gemv(GetCublasHandle(), (trans==kTrans? CUBLAS_OP_N:CUBLAS_OP_T), M.NumCols(), M.NumRows(), alpha, M.Data(), M.Stride(), v.Data(), 1, beta, data_, 1)".
Can anyone give me a hint on why this happens?
What is a reasonable way to run M decoding threads spread evenly over N GPU cards (M = k * N)? Currently we use one CuDevice per GPU, shared by all the threads on that GPU.
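To make the mapping we are asking about concrete, this is the scheme in its simplest form (a sketch; num_gpus corresponds to N above):

// Thread i out of M = k * N decoding threads runs on GPU (i % N); all k
// threads mapped to the same GPU currently share that GPU's single CuDevice.
int GpuForThread(int thread_index, int num_gpus) {
  return thread_index % num_gpus;
}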