[TMVA] cuDNN LSTM backpropagation test fails on ubuntu2404 cuda 12.6.1 #16790

Open
@hageboeck

Check duplicate issues.

  • Checked for duplicates

Description

The test TMVA-DNN-LSTM-BackpropagationCudnn crashes on ubuntu2404 with CUDA 12.6.1 and cuDNN, producing the following stack trace:

   0x00007fda7f0b5540 in <unknown> from /usr/lib64/libcuda.so.1
   0x00007fda7ed1491e in <unknown> from /usr/lib64/libcuda.so.1
   0x00007fda7f08f040 in <unknown> from /usr/lib64/libcuda.so.1
   0x00007fda7ed0ef22 in <unknown> from /usr/lib64/libcuda.so.1
   0x00007fda7eed2bae in <unknown> from /usr/lib64/libcuda.so.1
   0x00007fdaaa248b01 in <unknown> from /usr/local/cuda-12.6/targets/x86_64-linux/lib/libcudart.so.12
   0x00007fdaaa218baa in <unknown> from /usr/local/cuda-12.6/targets/x86_64-linux/lib/libcudart.so.12
   0x00007fdaaa270721 in cudaMemcpy + 0x211 from /usr/local/cuda-12.6/targets/x86_64-linux/lib/libcudart.so.12
   0x000055d25af29e37 in bool testLSTMBackpropagation<TMVA::DNN::TCudnn<double> >(unsigned long, unsigned long, unsigned long, unsigned long, TMVA::DNN::TCudnn<double>::Scalar_t, std::vector<bool, std::allocator<bool> >, bool) + 0x4d37 from /github/home/ROOT-CI/build/tmva/tmva/test/DNN/LSTM/testLSTMBackpropagationCudnn

Specifically, it's the assignment in this loop:

   for (size_t i = 0; i < batchSize; ++i) {
      auto mat = XArch[i];
      for (size_t l = 0; l < (size_t) XArch[i].GetNrows(); ++l) {
         for (size_t m = 0; m < (size_t) XArch[i].GetNcols(); ++m) {
            mat(l, m) = gRandom->Uniform(-1,1);
            //XArch[i](0, 0) = 0.5;
            //XArch[i](1, 0) = 0.5;
         }
      }
   }
}

Each assignment triggers a cudaMemcpy to the GPU (see the cudaMemcpy frame in the stack trace). The crash happens somewhere inside the CUDA library. Other cuDNN tests pass, so the problem is not necessarily a broken installation.
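
To illustrate why a statement like mat(l, m) = value can end up in cudaMemcpy, here is a minimal CPU-only sketch of a proxy-reference pattern in which every element assignment issues one host-to-device copy. MockDevice, ElementRef, MockMatrix and CountCopiesElementWise are hypothetical names made up for this example, the mock copy only stands in for cudaMemcpy, and the row-major layout is an assumption. The actual TCudnn tensor code in TMVA may be organised differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for GPU memory: a host vector plus a counter that
// records how many "host-to-device" copies were issued.
struct MockDevice {
   std::vector<double> memory;
   std::size_t copyCount = 0;

   explicit MockDevice(std::size_t n) : memory(n, 0.0) {}

   // Plays the role of cudaMemcpy(dst, src, size, cudaMemcpyHostToDevice).
   void CopyToDevice(std::size_t offset, const double *src, std::size_t n)
   {
      if (n > 0) {
         std::memcpy(memory.data() + offset, src, n * sizeof(double));
      }
      ++copyCount;
   }
};

// Proxy object returned by MockMatrix::operator(). Assigning a value to it
// issues exactly one copy of a single element.
class ElementRef {
public:
   ElementRef(MockDevice &dev, std::size_t offset) : fDev(dev), fOffset(offset) {}

   ElementRef &operator=(double value)
   {
      fDev.CopyToDevice(fOffset, &value, 1);
      return *this;
   }

private:
   MockDevice &fDev;
   std::size_t fOffset;
};

// Hypothetical matrix view over the mock device buffer (row-major, assumed).
class MockMatrix {
public:
   MockMatrix(MockDevice &dev, std::size_t rows, std::size_t cols)
      : fDev(dev), fRows(rows), fCols(cols) {}

   ElementRef operator()(std::size_t i, std::size_t j)
   {
      assert(i < fRows && j < fCols);
      return ElementRef(fDev, i * fCols + j);
   }

private:
   MockDevice &fDev;
   std::size_t fRows;
   std::size_t fCols;
};

// Fills a rows x cols matrix element by element, mirroring the loop in the
// test, and returns how many copies that required.
std::size_t CountCopiesElementWise(std::size_t rows, std::size_t cols)
{
   MockDevice dev(rows * cols);
   MockMatrix mat(dev, rows, cols);
   for (std::size_t l = 0; l < rows; ++l) {
      for (std::size_t m = 0; m < cols; ++m) {
         mat(l, m) = 0.5; // one simulated host-to-device copy per element
      }
   }
   return dev.copyCount;
}
```

In this sketch, filling a 2 x 3 matrix results in six separate copies, one per element, which is why the failing frame in the trace sits directly under the test function rather than under a bulk transfer routine.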

Reproducer

cmake -Dtmva-gpu=On -Dtesting=On <src>
ctest -R TMVA-DNN-LSTM-BackpropagationCudnn

ROOT version

Master

Installation method

Source

Operating system

ubuntu24 docker container with cuda 12.6.1

Additional context

No response
