Fix for cuda/eigen unit test #49655

smuzaffar · 2025-12-17T14:18:38Z

This is a possible fix for the CUDA/GPU unit tests that fail with the newer Eigen, as seen in the PY312_X IBs.

simpleCholeskyTest.cu: Use different objects for src and dst when calling invertNN. ChatGPT identifies the following:

What goes wrong with invertNN(m, m)
  - src(i,j) and dst(i,j) refer to the same object
  - Writes to dst happen before all reads from src are finished
  - On CPU, this happens to limp along
  - On GPU:
      - Different instruction ordering
      - Different register pressure
      - Different memory model
      - Guaranteed corruption
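
To make the aliasing problem concrete, here is a minimal host-side sketch. It is not the CMSSW invertNN: `Mat2` and `invert2x2` are hypothetical stand-ins, written in plain C++ so the example does not depend on Eigen or CUDA. It only illustrates how a routine that writes dst while still reading src breaks when both refer to the same object.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical 2x2 matrix type, used only for this illustration.
using Mat2 = std::array<std::array<double, 2>, 2>;

// Hypothetical stand-in for an inversion routine (NOT the CMSSW invertNN).
// It stores each dst element as soon as it is computed, so it is only
// correct when src and dst are distinct objects.
inline void invert2x2(const Mat2& src, Mat2& dst) {
  const double det = src[0][0] * src[1][1] - src[0][1] * src[1][0];
  dst[0][0] = src[1][1] / det;   // if aliased, this overwrites src[0][0]
  dst[0][1] = -src[0][1] / det;
  dst[1][0] = -src[1][0] / det;
  dst[1][1] = src[0][0] / det;   // if aliased, this reads the overwritten value
}
```

For the diagonal matrix diag(4, 2), calling `invert2x2(a, out)` with a separate `out` gives diag(0.25, 0.5). Calling `invert2x2(m, m)` on a copy of the same matrix leaves 0.03125 in the last element, because `src[0][0]` was already replaced by 0.25 before it was read.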

testEigenGPUNoFit.cu: Again, ChatGPT identifies the following and suggests using invertNN for the large matrix:

CUDA cannot fully support all Eigen expressions on the device, especially:
  - Matrix.inverse()
  - SelfAdjointEigenSolver::computeDirect()

When you call in->inverse() in a __global__ kernel (like in kernelInverse5x5), Eigen will internally use permutation matrices (for LU or LDLT) on the GPU memory.

Eigen's permutation matrices use bounds-checked indexing. When they run on the device, some of these bounds checks fail because device memory isn't tracked the same way as host memory.

This triggers the applyTranspositionOnTheRight assertion failure.
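
As an illustration of an inversion path that needs no permutation matrices, here is a small sketch of a pivot-free Cholesky inverse for a symmetric positive-definite matrix. It is an assumption-laden example, not the CMSSW invertNN and not Eigen code: `Mat` and `choleskyInverse` are hypothetical names, and the code is plain host-side C++ rather than a __global__ kernel.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical fixed-size matrix type, used only for this illustration.
template <std::size_t N>
using Mat = std::array<std::array<double, N>, N>;

// Hypothetical pivot-free inverse of a symmetric positive-definite matrix.
// Factorises src = L * L^T, inverts L, then forms dst = L^-T * L^-1.
// Returns false if src is not positive definite.
template <std::size_t N>
bool choleskyInverse(const Mat<N>& src, Mat<N>& dst) {
  Mat<N> l{};  // lower-triangular Cholesky factor
  for (std::size_t j = 0; j < N; ++j) {
    double d = src[j][j];
    for (std::size_t k = 0; k < j; ++k) d -= l[j][k] * l[j][k];
    if (d <= 0.0) return false;  // no pivoting, so we can only give up
    l[j][j] = std::sqrt(d);
    for (std::size_t i = j + 1; i < N; ++i) {
      double s = src[i][j];
      for (std::size_t k = 0; k < j; ++k) s -= l[i][k] * l[j][k];
      l[i][j] = s / l[j][j];
    }
  }
  Mat<N> linv{};  // inverse of L, also lower triangular
  for (std::size_t j = 0; j < N; ++j) {
    linv[j][j] = 1.0 / l[j][j];
    for (std::size_t i = j + 1; i < N; ++i) {
      double s = 0.0;
      for (std::size_t k = j; k < i; ++k) s -= l[i][k] * linv[k][j];
      linv[i][j] = s / l[i][i];
    }
  }
  // dst = linv^T * linv; terms with k < max(i, j) are zero and skipped.
  for (std::size_t i = 0; i < N; ++i) {
    for (std::size_t j = 0; j < N; ++j) {
      double s = 0.0;
      for (std::size_t k = (i > j ? i : j); k < N; ++k) s += linv[k][i] * linv[k][j];
      dst[i][j] = s;
    }
  }
  return true;
}
```

For the matrix {{4, 2}, {2, 3}} this yields {{0.375, -0.25}, {-0.25, 0.5}}. The sketch keeps its intermediates in local copies and writes dst only in the last step, which is one way to avoid depending on src and dst being distinct.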

testEigenGPU.cu: Eigen already knows the dimensions at compile time. This change matches the rest of the CMSSW code, where we use riemannFit::Map*.

This is an attempt to fix these unit tests for the new Eigen, but I am not sure whether these are the correct fixes.
FYI @fwyzard

smuzaffar · 2025-12-17T14:18:56Z

enable gpu

smuzaffar · 2025-12-17T14:19:02Z

please test

cmsbuild · 2025-12-17T14:19:04Z

cms-bot internal usage

cmsbuild · 2025-12-17T14:19:52Z

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-49655/47199

Code check has found code style and quality issues which could be resolved by applying following patch(s)

code-format:
https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-49655/47199/code-format.patch
e.g. curl -k https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-49655/47199/code-format.patch | patch -p1
You can also run scram build code-format to apply code format directly

cmsbuild · 2025-12-17T14:26:15Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-49655/47200

cmsbuild · 2025-12-17T14:26:35Z

A new Pull Request was created by @smuzaffar for master.

It involves the following packages:

DataFormats/Math (reconstruction)
RecoTracker/PixelTrackFitting (reconstruction)

@Moanwar, @cmsbuild, @jfernan2, @mandrenguyen, @srimanob can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @VinInn, @VourMa, @dgulhan, @elusian, @fabiocos, @felicepantaleo, @gpetruc, @makortel, @missirol, @mmasciov, @mmusich, @mtosi, @rovere this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

smuzaffar · 2025-12-17T14:27:18Z

please test

fwyzard · 2025-12-17T19:39:44Z

assign heterogeneous

cmsbuild · 2025-12-17T19:40:04Z

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild · 2025-12-18T13:00:48Z

Milestone for this pull request has been moved to CMSSW_16_1_X. Please open a backport if it should also go in to CMSSW_16_0_X.

smuzaffar · 2025-12-19T08:57:49Z

please test for CMSSW_16_1_PY312_X

cmsbuild · 2025-12-19T16:58:11Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a0cb7c/50361/summary.html
COMMIT: 08ec95f
CMSSW: CMSSW_16_1_PY312_X_2025-12-18-2300/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/49655/50361/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

You potentially added 221 lines to the logs
ROOTFileChecks: Some differences in event products or their sizes found
Reco comparison results: 2377 differences found in the comparisons
Reco comparison had 4 failed jobs
DQMHistoTests: Total files compared: 55
DQMHistoTests: Total histograms compared: 4513798
DQMHistoTests: Total failures: 7973
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 4505805
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 54 files compared)
Checked 235 log files, 208 edm output root files, 55 DQM output files
TriggerResults: no differences found

cmsbuild · 2025-12-20T07:15:23Z

-1

Size: This PR adds an extra 32KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a0cb7c/50317/summary.html
COMMIT: 08ec95f
CMSSW: CMSSW_16_0_X_2025-12-16-2300/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/49655/50317/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

You potentially removed 2 lines from the logs
ROOTFileChecks: Some differences in event products or their sizes found
Reco comparison results: 9 differences found in the comparisons
Reco comparison had 4 failed jobs
DQMHistoTests: Total files compared: 55
DQMHistoTests: Total histograms compared: 4513634
DQMHistoTests: Total failures: 76
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 4513538
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 54 files compared)
Checked 235 log files, 208 edm output root files, 55 DQM output files
TriggerResults: no differences found

AMD_W7900 Comparison Summary

Summary:

You potentially removed 9 lines from the logs
Reco comparison results: 249 differences found in the comparisons
Reco comparison had 6 failed jobs
DQMHistoTests: Total files compared: 11
DQMHistoTests: Total histograms compared: 149371
DQMHistoTests: Total failures: 30394
DQMHistoTests: Total nulls: 11
DQMHistoTests: Total successes: 118966
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
Checked 42 log files, 45 edm output root files, 11 DQM output files
TriggerResults: no differences found

NVIDIA_H100 Comparison Summary

Summary:

You potentially removed 8 lines from the logs
Reco comparison results: 247 differences found in the comparisons
Reco comparison had 6 failed jobs
DQMHistoTests: Total files compared: 11
DQMHistoTests: Total histograms compared: 149371
DQMHistoTests: Total failures: 27861
DQMHistoTests: Total nulls: 14
DQMHistoTests: Total successes: 121496
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
Checked 42 log files, 45 edm output root files, 11 DQM output files
TriggerResults: no differences found

NVIDIA_L40S Comparison Summary

Summary:

You potentially added 6 lines to the logs
Reco comparison results: 248 differences found in the comparisons
Reco comparison had 6 failed jobs
DQMHistoTests: Total files compared: 11
DQMHistoTests: Total histograms compared: 149371
DQMHistoTests: Total failures: 29040
DQMHistoTests: Total nulls: 7
DQMHistoTests: Total successes: 120324
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
Checked 42 log files, 45 edm output root files, 11 DQM output files
TriggerResults: no differences found

NVIDIA_T4 Comparison Summary

Summary:

You potentially added 18 lines to the logs
Reco comparison results: 259 differences found in the comparisons
Reco comparison had 6 failed jobs
DQMHistoTests: Total files compared: 11
DQMHistoTests: Total histograms compared: 149371
DQMHistoTests: Total failures: 27857
DQMHistoTests: Total nulls: 8
DQMHistoTests: Total successes: 121506
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
Checked 42 log files, 45 edm output root files, 11 DQM output files
TriggerResults: no differences found

Max Memory Comparisons exceeding threshold

@cms-sw/core-l2 , I found 1 workflow step(s) with memory usage exceeding the error threshold:

Error

makortel · 2025-12-22T14:50:21Z

-1

It is not clear from the message above what caused the -1. Following the link

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a0cb7c/50317/summary.html

indicates runTheMatrix workflow failures on AMD_MI300X caused by timeouts due to #49570.

Fix for cuda/eigen unit test

0fa7587

cmsbuild added this to the CMSSW_16_0_X milestone Dec 17, 2025

cmsbuild added reconstruction-pending pending-signatures tests-pending orp-pending code-checks-pending tracking labels Dec 17, 2025

cmsbuild added tests-started and removed tests-pending labels Dec 17, 2025

cmsbuild added code-checks-rejected and removed code-checks-pending labels Dec 17, 2025

Update testEigenGPUNoFit.cu

08ec95f

cmsbuild added tests-pending code-checks-pending and removed tests-started code-checks-rejected labels Dec 17, 2025

cmsbuild added code-checks-approved and removed code-checks-pending labels Dec 17, 2025

cmsbuild added tests-started and removed tests-pending labels Dec 17, 2025

smuzaffar mentioned this pull request Dec 17, 2025

[PY312] Use official eigen commit without cms patch cms-sw/cmsdist#10240

Open

cmsbuild added the heterogeneous-pending label Dec 17, 2025

cmsbuild modified the milestones: CMSSW_16_0_X, CMSSW_16_1_X Dec 18, 2025

cmsbuild added tests-rejected and removed tests-started labels Dec 20, 2025

makortel mentioned this pull request Dec 22, 2025

MaxMemoryPreload shows variations of tens of MB in some cases #46966

Open
