
decoder error use multiple GPUs and multiple threads, with cuda-10.2 #4084

Open

@housebaby

We built a multi-threaded, multi-GPU version of the decoder, which can decode many utterances in parallel across GPUs, for example 40 threads spread evenly over 4 GPUs.

In the main thread, we initialize the decoder by calling cudaSetDevice(gpu_id) for each GPU and loading a copy of the AM model onto each one (each copy is shared by many decoding threads):

```cpp
// pthreadkey_gpuid is used to index the CuDevice; we modified CuDevice so
// that all threads on the same GPU share one CuDevice instance.
for (int i = 0; i < gpu_num_; i++) {
  pthread_setspecific(::pthreadkey_gpuid, &i);
#ifdef HAVE_CUDA
  CuDevice::Instantiate().SelectGpuId(gpu_option);  // selects via pthreadkey_gpuid
  CuDevice::Instantiate().AllowMultithreading();
#endif
  am_nnet_[i] = new nnet3::AmNnetSimple;
  .................
  Input ki(fileName, &binary);
  ...............
  am_nnet_[i]->Read(ki.Stream(), binary);
}
```

In cu-device.h, we changed the Instantiate() function to support a different CuDevice per GPU:

```cpp
static inline CuDevice& Instantiate() {
  int index = *((int *)pthread_getspecific(::pthreadkey_gpuid));
  // KALDI_LOG << "now device index is: " << index;
  return global_device_[index];
}
```

Then we create the decoding threads, allocated evenly across the GPUs, each using the corresponding AM copy:

```cpp
for (int i = 0; i < decoders_num_; i++) {
  .................................
  int tm = i % gpu_num_;
  pthread_setspecific(::pthreadkey_gpuid, &tm);
#ifdef HAVE_CUDA
  CuDevice::Instantiate().SelectGpuId(config_->gpu_option_);
  CuDevice::Instantiate().AllowMultithreading();
#endif
  decoder_[i]->Init(config_, model_);
}
```

Our setup works well with CUDA 9.1, but after upgrading to CUDA 10.2 it breaks:

- With only one GPU visible (`export CUDA_VISIBLE_DEVICES=0`), i.e. N threads on one GPU, everything works fine.
- With `CUDA_VISIBLE_DEVICES=0,1` or `0,1,2`, the program hangs somewhere.
- With `CUDA_VISIBLE_DEVICES=0,1,2,3` (all four GPUs on the machine), it fails with:

```
cublasStatus_t 1 : "CUBLAS_STATUS_NOT_INITIALIZED" returned from 'cublas_gemv(GetCublasHandle(), (trans==kTrans? CUBLAS_OP_N:CUBLAS_OP_T), M.NumCols(), M.NumRows(), alpha, M.Data(), M.Stride(), v.Data(), 1, beta, data_, 1)'
```


Can anyone give me a hint about what causes this?
What is the correct way to run M threads spread evenly over N GPUs (M = k * N), given that we currently use one CuDevice per GPU, shared by many threads?

Labels: kaldi10-TODO, stale, waiting-for-feedback
