[TF] Eigen unit tests on GPU failed #46333

Open
@smuzaffar

Description

Hi,

For the TensorFlow special IBs (TF_X), where we have TF 2.17 (CUDA build enabled) and a new Eigen https://github.com/cms-externals/eigen-git-mirror/tree/cms/master/c1d637433e3b3f9012b226c2c9125c494b470ae6 , a few unit tests that use Eigen are failing [a]. To reproduce this, one can do:

> ssh lxplus-gpu
> cd /tmp/$(whoami)
> cmssw-el8 --nv
> scram p CMSSW_14_2_TF_X_2024-10-08-1100
> cd CMSSW_14_2_TF_X_2024-10-08-1100
> cmsenv
> git cms-addpkg RecoTracker/PixelTrackFitting
> scram b -j 8
> scram b runtests_testEigenGPUNoFit_t

Note that we do apply the cms-externals/eigen-git-mirror@3cbe8e7 patch on top of Eigen, so maybe we are missing something else that needs to be patched?

@fwyzard, do you have any idea how to fix this?

[a]

Pass    0s ... RecoTracker/PixelTrackFitting/testFits
Pass    0s ... RecoTracker/PixelTrackFitting/testFitsDump
Pass    0s ... RecoTracker/PixelTrackFitting/testEigenJacobian
Pass    0s ... RecoTracker/PixelTrackFitting/testRecoPixelVertexingPixelTrackFittingRZLine
Fail    3s ... RecoTracker/PixelTrackFitting/testFitsGPU_t
Fail    3s ... RecoTracker/PixelTrackFitting/testBrokenLineFitGPU_t
Fail    3s ... RecoTracker/PixelTrackFitting/testEigenGPUNoFit_t
Pass  158s ... RecoTracker/PixelTrackFitting/PixelTrackFits
Pass  158s ... RecoTracker/PixelTrackFitting/PixelTrackFits_Debug
Pass  158s ... RecoTracker/PixelTrackFitting/PixelTrackBrokenLineFit
> cat unit_tests/testEigenGPUNoFit_t.log
===== Test "testEigenGPUNoFit_t" ====
TEST EIGENVALUES
TEST INVERSE 3x3
TEST INVERSE 4x4
TEST INVERSE 5x5
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02858/el8_amd64_gcc12/external/eigen/c1d637433e3b3f9012b226c2c9125c494b470ae6-42b72b714d1a11d439b86af5ed2418e1/include/eigen3/Eigen/src/Core/PermutationMatrix.h:184: Derived &Eigen::PermutationBase<Derived>::applyTranspositionOnTheRight(long, long) [with Derived = Eigen::PermutationMatrix<5, 5, int>]: block: [0,0,0], thread: [0,0,0] Assertion `i >= 0 && j >= 0 && i < size() && j < size()` failed.
terminate called after throwing an instance of 'std::runtime_error'
  what():  
src/RecoTracker/PixelTrackFitting/test/testEigenGPUNoFit.cu, line 173:
cudaCheck(cudaMemcpy(mCPUret, mGPUret, sizeof(Matrix5d), cudaMemcpyDeviceToHost));
cudaErrorAssert: device-side assert triggered

/bin/sh: line 1: 3864396 Aborted                 (core dumped) sh -c 'testEigenGPUNoFit_t '

---> test testEigenGPUNoFit_t had ERRORS
TestTime:3
^^^^ End Test testEigenGPUNoFit_t ^^^^
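
For context on the log above: the device-side assert is a bounds check on the two indices passed to `applyTranspositionOnTheRight` for a 5x5 permutation. The sketch below is a host-only toy, not Eigen's code; `ToyPermutation` and `rejectsTransposition` are hypothetical names introduced only for illustration. It applies the same condition quoted in the log (`i >= 0 && j >= 0 && i < size() && j < size()`) before swapping two entries, to show which index values would trip the check. It does not say why an out-of-range index reaches the permutation on the GPU.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for a fixed-size permutation (NOT Eigen's class).
template <std::size_t N>
struct ToyPermutation {
  std::array<int, N> indices{};

  // Start from the identity permutation: indices[k] == k.
  ToyPermutation() {
    for (std::size_t k = 0; k < N; ++k) indices[k] = static_cast<int>(k);
  }

  // Swap entries i and j, guarded by the same condition as the assertion
  // in the log: i >= 0 && j >= 0 && i < size() && j < size().
  // Here a failed check throws instead of triggering a device-side assert.
  ToyPermutation& applyTranspositionOnTheRight(long i, long j) {
    const long n = static_cast<long>(N);
    if (!(i >= 0 && j >= 0 && i < n && j < n))
      throw std::out_of_range("transposition index out of range");
    std::swap(indices[static_cast<std::size_t>(i)],
              indices[static_cast<std::size_t>(j)]);
    return *this;
  }
};

// Returns true when the (i, j) pair is rejected for a size-5 permutation.
inline bool rejectsTransposition(long i, long j) {
  ToyPermutation<5> p;
  try {
    p.applyTranspositionOnTheRight(i, j);
  } catch (const std::out_of_range&) {
    return true;
  }
  return false;
}
```

With size 5, any index outside 0..4 (for example 5 or -1) is rejected, which is the situation the log reports for `Eigen::PermutationMatrix<5, 5, int>` during the 5x5 inverse test.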
