@smuzaffar
Contributor

@smuzaffar smuzaffar commented Dec 17, 2025

This is a possible fix for the CUDA/GPU unit tests that fail with the newer Eigen, as seen in the PY312_X IBs.

  • simpleCholeskyTest.cu: Use different objects for src and dst when calling invertNN. ChatGPT identifies the following:
    What goes wrong with invertNN(m, m)
      - src(i,j) and dst(i,j) refer to the same object
      - Writes to dst happen before all reads from src are finished
      - On CPU, this happens to limp along
      - On GPU:
            Different instruction ordering
            Different register pressure
            Different memory model
            Guaranteed corruption
  • testEigenGPUNoFit.cu: Again, ChatGPT identifies the following and suggests using invertNN for the large matrix:
    CUDA cannot fully support all Eigen expressions on the device, especially:
        Matrix.inverse()
        SelfAdjointEigenSolver::computeDirect()
    When you call in->inverse() in a __global__ kernel (like in kernelInverse5x5), Eigen will internally use permutation matrices (for LU or LDLT) in GPU memory.

    Eigen's permutation matrices use bounds-checked indexing. When they run on the device, some of these bounds checks fail because device memory isn't tracked the same way as host memory.

    This triggers the applyTranspositionOnTheRight assertion failure.
  • testEigenGPU.cu: Eigen already knows the dimensions at compile time. This change matches the rest of the CMSSW code, where we use riemannFit::Map*
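
The src/dst aliasing problem described in the first bullet can be sketched in plain C++. This is only an illustration under stated assumptions: Mat2, invert2x2 and isInverseOf are hypothetical names, and the routine is a toy 2x2 inversion, not the CMSSW invertNN implementation. It shows why a routine that writes dst while it still needs to read src gives a wrong result when both arguments are the same object.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical 2x2 matrix type, for illustration only.
using Mat2 = std::array<std::array<double, 2>, 2>;

// Toy inversion that writes dst element by element (NOT the CMSSW invertNN).
// The last line reads src[0][0] after dst[0][0] has already been written,
// so the result is only correct when src and dst are distinct objects.
void invert2x2(const Mat2& src, Mat2& dst) {
  const double det = src[0][0] * src[1][1] - src[0][1] * src[1][0];
  dst[0][0] = src[1][1] / det;
  dst[0][1] = -src[0][1] / det;
  dst[1][0] = -src[1][0] / det;
  dst[1][1] = src[0][0] / det;  // stale read if src and dst alias
}

// Returns true if a * b equals the identity within a small tolerance.
bool isInverseOf(const Mat2& a, const Mat2& b) {
  for (int i = 0; i < 2; ++i) {
    for (int j = 0; j < 2; ++j) {
      const double prod = a[i][0] * b[0][j] + a[i][1] * b[1][j];
      const double expected = (i == j) ? 1.0 : 0.0;
      if (std::fabs(prod - expected) > 1e-12)
        return false;
    }
  }
  return true;
}
```

Calling invert2x2(m, inv) with a separate inv gives a valid inverse, while invert2x2(m, m) does not, because the last element is computed from an already overwritten value. Using distinct src and dst objects, as done in the test change, avoids relying on the routine being alias-safe.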

This is an attempt to fix these unit tests for the new Eigen, but I am not sure these are the correct fixes.
FYI @fwyzard

@smuzaffar
Contributor Author

enable gpu

@cmsbuild cmsbuild added this to the CMSSW_16_0_X milestone Dec 17, 2025
@smuzaffar
Contributor Author

please test

@cmsbuild
Contributor

cmsbuild commented Dec 17, 2025

cms-bot internal usage

@cmsbuild
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-49655/47199

Code checks have found code style and quality issues which could be resolved by applying the following patch(es)

@cmsbuild
Contributor

A new Pull Request was created by @smuzaffar for master.

It involves the following packages:

  • DataFormats/Math (reconstruction)
  • RecoTracker/PixelTrackFitting (reconstruction)

@Moanwar, @cmsbuild, @jfernan2, @mandrenguyen, @srimanob can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @VinInn, @VourMa, @dgulhan, @elusian, @fabiocos, @felicepantaleo, @gpetruc, @makortel, @missirol, @mmasciov, @mmusich, @mtosi, @rovere this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@smuzaffar
Contributor Author

please test

@fwyzard
Contributor

fwyzard commented Dec 17, 2025

assign heterogeneous

@cmsbuild
Contributor

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

Milestone for this pull request has been moved to CMSSW_16_1_X. Please open a backport if it should also go in to CMSSW_16_0_X.

@cmsbuild cmsbuild modified the milestones: CMSSW_16_0_X, CMSSW_16_1_X Dec 18, 2025
@smuzaffar
Contributor Author

please test for CMSSW_16_1_PY312_X

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a0cb7c/50361/summary.html
COMMIT: 08ec95f
CMSSW: CMSSW_16_1_PY312_X_2025-12-18-2300/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/49655/50361/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 221 lines to the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 2377 differences found in the comparisons
  • Reco comparison had 4 failed jobs
  • DQMHistoTests: Total files compared: 55
  • DQMHistoTests: Total histograms compared: 4513798
  • DQMHistoTests: Total failures: 7973
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4505805
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 54 files compared)
  • Checked 235 log files, 208 edm output root files, 55 DQM output files
  • TriggerResults: no differences found

@cmsbuild
Contributor

-1

Size: This PR adds an extra 32KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a0cb7c/50317/summary.html
COMMIT: 08ec95f
CMSSW: CMSSW_16_0_X_2025-12-16-2300/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/49655/50317/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially removed 2 lines from the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 9 differences found in the comparisons
  • Reco comparison had 4 failed jobs
  • DQMHistoTests: Total files compared: 55
  • DQMHistoTests: Total histograms compared: 4513634
  • DQMHistoTests: Total failures: 76
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4513538
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 54 files compared)
  • Checked 235 log files, 208 edm output root files, 55 DQM output files
  • TriggerResults: no differences found

AMD_W7900 Comparison Summary

Summary:

  • You potentially removed 9 lines from the logs
  • Reco comparison results: 249 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 149371
  • DQMHistoTests: Total failures: 30394
  • DQMHistoTests: Total nulls: 11
  • DQMHistoTests: Total successes: 118966
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

NVIDIA_H100 Comparison Summary

Summary:

  • You potentially removed 8 lines from the logs
  • Reco comparison results: 247 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 149371
  • DQMHistoTests: Total failures: 27861
  • DQMHistoTests: Total nulls: 14
  • DQMHistoTests: Total successes: 121496
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

NVIDIA_L40S Comparison Summary

Summary:

  • You potentially added 6 lines to the logs
  • Reco comparison results: 248 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 149371
  • DQMHistoTests: Total failures: 29040
  • DQMHistoTests: Total nulls: 7
  • DQMHistoTests: Total successes: 120324
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

NVIDIA_T4 Comparison Summary

Summary:

  • You potentially added 18 lines to the logs
  • Reco comparison results: 259 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 149371
  • DQMHistoTests: Total failures: 27857
  • DQMHistoTests: Total nulls: 8
  • DQMHistoTests: Total successes: 121506
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

Max Memory Comparisons exceeding threshold

@cms-sw/core-l2 , I found 1 workflow step(s) with memory usage exceeding the error threshold:

  • Error

@makortel
Contributor

-1

It is not clear from the message above what caused the -1. Following the link

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a0cb7c/50317/summary.html

indicates that runTheMatrix workflows failed on AMD_MI300X because they timed out, which is caused by #49570
