@cms-sw/ml-l2, we are trying to update TensorFlow for CMSSW. We have TF 2.16 in the CMSSW_14_2_TF_X_2024-09-24-1100 IB and TF 2.17 in this week's IB (will be available soon). We noticed that the GPU tests are failing [a] for these newer TF versions. Can you please look into these failures?
[a] run on lxplus-gpu node
> ssh lxplus-gpu
> cd /tmp/$(whoami)
> cmssw-el8 --nv
Singularity> scram p CMSSW_14_2_TF_X_2024-09-24-1100
Singularity> cmsenv
Singularity> git cms-addpkg PhysicsTools/TensorFlow
Singularity> scram build -j 8
Singularity> cmsenv
Singularity> ./test/el8_amd64_gcc12/testTFHelloWorldCUDA
Running .
46.0
CUDA service enabled: 1
Testing CUDA backend
E
##Failure Location unknown## : Error
Test name: testHelloWorldCUDA::test
uncaught exception of type std::exception (or derived).
- An exception of category 'UnavailableAccelerator' occurred while
[0] Calling tensorflow::setBackend()
Exception Message:
Cuda backend requested, NVIDIA GPU visible to cmssw, but not visible to TensorFlow in the job
Failures !!!
Run: 1 Failure total: 1 Failures: 0 Errors: 1
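
To help narrow this down, a quick check from Python might show whether TensorFlow itself reports any GPU inside the same container. This is only a sketch: it assumes the `tensorflow` Python module from the IB is importable after `cmsenv`, and the helper name `visible_gpus` is made up for illustration. `tf.config.list_physical_devices` is the standard TF 2 call for listing devices.

```python
import importlib


def visible_gpus():
    """Return the GPU device names TensorFlow reports.

    Returns None if TensorFlow cannot be imported in this environment.
    (Hypothetical helper, not part of CMSSW or TensorFlow.)
    """
    try:
        tf = importlib.import_module("tensorflow")
    except ImportError:
        return None
    # An empty list means TensorFlow loaded but found no usable GPU.
    return [dev.name for dev in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not importable in this environment")
    elif not gpus:
        print("TensorFlow imported, but it reports no GPU devices")
    else:
        print("TensorFlow sees:", ", ".join(gpus))
```

If this reports no GPU devices while `nvidia-smi` in the same shell lists the card, that would be consistent with the exception message above and would point at the TensorFlow build or its CUDA libraries rather than at the CMSSW CUDA service.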