LST CPU Optimizations - Trig Simplifications and Redundant Load Removal by GNiendorf · Pull Request #50606 · cms-sw/cmssw

GNiendorf · 2026-03-31T16:21:36Z

Speeds up the LST CPU backend, reducing per-event time for LST by roughly 40-50% (excluding pixel seed duplicate cleaning, which is disabled in the online version of LST). The changes fall into three categories:

- Algebraic trig simplifications replace expensive transcendental calls with equivalent expressions: sin(atan(x)) becomes x/sqrt(1+x^2); tan(asin(s))/asin(s) is computed without calling tan; and deltaPhi(x1,y1,x2,y2), which internally calls atan2 twice, is replaced with a single atan2(cross, dot) using the 2D cross and dot products directly.
- Loose pre-checks compare the cross product or dot product of hit coordinate vectors against thresholds, effectively cutting on sin(dPhi) or cos(dPhi) rather than computing the angle first. This allows early rejection of clearly failing candidates before the more expensive exact-angle computation. The pre-checks are applied in the MiniDoublet barrel and endcap creation kernels, the Segment counting kernel, and the Triplet betaIn check.
- SoA fields that are invariant across inner loop iterations (module geometry properties, pixel seed kinematics) are pre-loaded once into local structs (ModuleMDData, ModuleSegData, PixelSeedData) to avoid redundant memory reads.

GPU timing and physics performance have negligible changes.
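As an illustration of the trig simplifications described above, here is a minimal numerical sketch. It is written in Python only to check the identities and is not the alpaka C++ code from this PR. The function names are hypothetical, and the sign convention (angle measured from the first vector to the second) is an assumption.

```python
import math


def sin_atan(x):
    """sin(atan(x)) with no trig call: x / sqrt(1 + x^2)."""
    return x / math.sqrt(1.0 + x * x)


def tan_asin(s):
    """tan(asin(s)) with no tan call: s / sqrt(1 - s^2), valid for |s| < 1."""
    return s / math.sqrt(1.0 - s * s)


def delta_phi_two_atan2(x1, y1, x2, y2):
    """Reference form: two atan2 calls, then one wrap back into [-pi, pi]."""
    d = math.atan2(y2, x2) - math.atan2(y1, x1)
    if d > math.pi:
        d -= 2.0 * math.pi
    elif d < -math.pi:
        d += 2.0 * math.pi
    return d


def delta_phi_cross_dot(x1, y1, x2, y2):
    """Single-call form: atan2 of the 2D cross and dot products."""
    cross = x1 * y2 - y1 * x2  # proportional to sin(dPhi)
    dot = x1 * x2 + y1 * y2    # proportional to cos(dPhi)
    return math.atan2(cross, dot)
```

The cross and dot products share the same positive factor |v1||v2|, so atan2 recovers the angle without normalizing. For exactly antiparallel vectors the two forms can return opposite signs of pi, because the result then depends on the sign of a zero cross product.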

There is also a small decrease in the memory footprint from a restructuring of the Segment counting kernel, where we now use only the non-trig pre-checks and add an additional cut.
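A non-trig pre-check of the kind mentioned above can be sketched as follows. This is a hedged Python illustration, not the kernel code from this PR: the function names are hypothetical, the cut is assumed to lie between 0 and pi/2, and `sin_cut` is assumed to be precomputed once outside the loop.

```python
import math


def passes_exact(x1, y1, x2, y2, cut):
    """Exact check: compute the opening angle, then compare with the cut."""
    cross = x1 * y2 - y1 * x2
    dot = x1 * x2 + y1 * y2
    return abs(math.atan2(cross, dot)) < cut


def passes_precheck(x1, y1, x2, y2, sin_cut):
    """Trig-free check on sin(dPhi); assumes 0 < cut < pi/2, sin_cut = sin(cut)."""
    cross = x1 * y2 - y1 * x2
    dot = x1 * x2 + y1 * y2
    if dot <= 0.0:
        # cos(dPhi) <= 0 means |dPhi| >= pi/2, larger than any allowed cut.
        return False
    # |sin(dPhi)| = |cross| / (|v1| |v2|); compare squares to avoid a sqrt.
    norm2 = (x1 * x1 + y1 * y1) * (x2 * x2 + y2 * y2)
    return cross * cross < sin_cut * sin_cut * norm2
```

In a candidate loop, `passes_precheck` would run first with a slightly loosened threshold, and only the survivors would go through the exact-angle computation.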

cmsbuild · 2026-03-31T16:22:05Z

cms-bot internal usage

cmsbuild · 2026-03-31T16:23:41Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50606/48792

cmsbuild · 2026-03-31T16:24:07Z

A new Pull Request was created by @GNiendorf for master.

It involves the following packages:

HeterogeneousCore/AlpakaMath (heterogeneous)
RecoTracker/LSTCore (reconstruction)

@Moanwar, @cmsbuild, @fwyzard, @jfernan2, @makortel, @mandrenguyen, @srimanob can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @VinInn, @VourMa, @dgulhan, @elusian, @felicepantaleo, @gpetruc, @makortel, @mmasciov, @mmusich, @mtosi, @rovere this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

slava77 · 2026-03-31T22:15:04Z

test parameters:

enable = gpu, hlt_p2_integration, hlt_p2_timing
workflows = ph2_hlt

slava77 · 2026-03-31T22:15:36Z

@cmsbuild please test

cmsbuild · 2026-04-01T08:45:34Z

-1

Failed Tests: UnitTests HLTP2Timing amd_mi300xUnitTests amd_w7900UnitTests nvidia_h100UnitTests nvidia_l40sUnitTests
Size: This PR adds an extra 80KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-25ca41/52371/summary.html
COMMIT: 1d8b0bb
CMSSW: CMSSW_16_1_X_2026-03-31-1100/el8_amd64_gcc13
Additional Tests: GPU,HLT_P2_INTEGRATION,HLT_P2_TIMING,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/50606/52371/install.sh to create a dev area with all the needed externals and cmssw changes.

Failed Unit Tests

I found 1 errors in the following unit tests:

---> test alpakaTestDeltaPhiSerialSync had ERRORS

Comparison Summary

Summary:

You potentially removed 3 lines from the logs
ROOTFileChecks: Some differences in event products or their sizes found
Reco comparison results: 2 differences found in the comparisons
DQMHistoTests: Total files compared: 68
DQMHistoTests: Total histograms compared: 4802019
DQMHistoTests: Total failures: 376
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 4801623
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 67 files compared)
Checked 282 log files, 244 edm output root files, 68 DQM output files
TriggerResults: no differences found

AMD_MI300X Comparison Summary

Summary:

You potentially removed 40 lines from the logs
Reco comparison results: 377 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216539
DQMHistoTests: Total failures: 34561
DQMHistoTests: Total nulls: 33
DQMHistoTests: Total successes: 181945
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: found differences in 3 / 12 workflows

AMD_W7900 Comparison Summary

Summary:

You potentially removed 38 lines from the logs
Reco comparison results: 383 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216539
DQMHistoTests: Total failures: 32045
DQMHistoTests: Total nulls: 34
DQMHistoTests: Total successes: 184460
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: found differences in 2 / 12 workflows

NVIDIA_H100 Comparison Summary

Summary:

You potentially added 19 lines to the logs
Reco comparison results: 381 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216539
DQMHistoTests: Total failures: 31044
DQMHistoTests: Total nulls: 36
DQMHistoTests: Total successes: 185459
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: found differences in 1 / 12 workflows

NVIDIA_L40S Comparison Summary

Summary:

You potentially added 8 lines to the logs
Reco comparison results: 372 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216539
DQMHistoTests: Total failures: 31699
DQMHistoTests: Total nulls: 34
DQMHistoTests: Total successes: 184806
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: no differences found

Max Memory Comparisons exceeding threshold NVIDIA_L40S

@cms-sw/core-l2 , I found 1 workflow step(s) with memory usage exceeding the error threshold:

Error: Workflow 34634.7503_TTbar_14TeV+Run4D121PU_HLTHeterogeneousValid step2 max memory diff -119.8 exceeds +/- 90.0 MiB

GNiendorf · 2026-04-01T09:03:57Z

> Failed Unit Tests
>
> I found 1 errors in the following unit tests:
>
> ---> test alpakaTestDeltaPhiSerialSync had ERRORS

Seems like the unit test errors come from a few ambiguous test cases (e.g. answer is looking for -pi but new formula returns +pi, even though these correspond to the same angle)

fwyzard · 2026-04-01T09:31:38Z

Is

> GPU timing [...] have negligible changes.

the expected outcome for these optimisations ?

fwyzard · 2026-04-01T09:32:27Z

> Seems like the unit test errors come from a few ambiguous test cases (e.g. answer is looking for -pi but new formula returns +pi, even though these correspond to the same angle)

I see.

Can you fix the tests to use a better "matcher" ?
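One possible shape for such a matcher is to compare angles modulo 2*pi instead of comparing raw values. The sketch below is a Python illustration under that assumption. It is not the change made in testDeltaPhi.dev.cc, and the name `angles_match` and the default tolerance are hypothetical.

```python
import math


def angles_match(a, b, tol=1e-6):
    """True if a and b describe the same angle modulo 2*pi, within tol."""
    # math.remainder folds the difference into [-pi, pi], so -pi and +pi
    # (which differ by exactly 2*pi) give a remainder of zero.
    d = math.remainder(a - b, 2.0 * math.pi)
    return abs(d) <= tol
```

With this comparison, an expected value of -pi and a computed value of +pi count as equal, while angles that really differ still fail.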

cmsbuild · 2026-04-01T10:51:38Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50606/48812

Found files with invalid states:
- Traceback (most recent call last):
  - File "/data/cmsbld/jenkins/workspace/run-pr-code-checks/cms-bot/process-pull-request.py", line 81, in <module>
    - for c in get_pr_commits(prId, opts.repository):
  - File "/data/cmsbld/jenkins/workspace/run-pr-code-checks/cms-bot/github_utils.py", line 751, in get_pr_commits
    - return github_api(
  - File "/data/cmsbld/jenkins/workspace/run-pr-code-checks/cms-bot/github_utils.py", line 611, in github_api
    - response = urlopen(request)
  - File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02935/el8_amd64_gcc13/external/python3/3.9.14-e16d2924e9eb9db8fddd14e187cf7209/lib/python3.9/urllib/request.py", line 214, in urlopen
    - return opener.open(url, data, timeout)
  - File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02935/el8_amd64_gcc13/external/python3/3.9.14-e16d2924e9eb9db8fddd14e187cf7209/lib/python3.9/urllib/request.py", line 523, in open
    - response = meth(req, response)
  - File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02935/el8_amd64_gcc13/external/python3/3.9.14-e16d2924e9eb9db8fddd14e187cf7209/lib/python3.9/urllib/request.py", line 632, in http_response
    - response = self.parent.error(
  - File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02935/el8_amd64_gcc13/external/python3/3.9.14-e16d2924e9eb9db8fddd14e187cf7209/lib/python3.9/urllib/request.py", line 561, in error
    - return self._call_chain(*args)
  - File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02935/el8_amd64_gcc13/external/python3/3.9.14-e16d2924e9eb9db8fddd14e187cf7209/lib/python3.9/urllib/request.py", line 494, in _call_chain
    - result = func(*args)
  - File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02935/el8_amd64_gcc13/external/python3/3.9.14-e16d2924e9eb9db8fddd14e187cf7209/lib/python3.9/urllib/request.py", line 641, in http_error_default
    - raise HTTPError(req.full_url, code, msg, hdrs, fp)
- urllib.error.HTTPError: HTTP Error 503: Service Unavailable

cmsbuild · 2026-04-01T10:52:03Z

Pull request #50606 was updated. @Moanwar, @cmsbuild, @fwyzard, @jfernan2, @makortel, @mandrenguyen, @srimanob can you please check and sign again.

HeterogeneousCore/AlpakaMath/test/alpaka/testDeltaPhi.dev.cc

GNiendorf · 2026-04-01T11:55:53Z

> Is
>
> > GPU timing [...] have negligible changes.
>
> the expected outcome for these optimisations ?

Yes, these optimizations address CPU-side bottlenecks. The GPU backend has its own set of bottlenecks, e.g. early exits are not as helpful on GPU, trig calls seem to be not as expensive as on CPU, etc.

fwyzard · 2026-04-01T11:58:04Z

> Yes, these optimizations address CPU-side bottlenecks.

Ah, OK.

cmsbuild · 2026-04-01T12:05:46Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50606/48813

cmsbuild · 2026-04-01T12:06:10Z

Pull request #50606 was updated. @Moanwar, @cmsbuild, @fwyzard, @jfernan2, @makortel, @mandrenguyen, @srimanob can you please check and sign again.

slava77 · 2026-04-01T12:52:48Z

@cmsbuild please test

slava77 · 2026-04-01T14:21:44Z

> Failed Tests: UnitTests HLTP2Timing amd_mi300xUnitTests amd_w7900UnitTests nvidia_h100UnitTests nvidia_l40sUnitTests

It looks like the timing test failed: was it something already in the IB? I don't see any PR-related notes in the timing log.

cmsbuild · 2026-04-01T18:23:12Z

+1

Size: This PR adds an extra 80KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-25ca41/52396/summary.html
COMMIT: c3282cf
CMSSW: CMSSW_16_1_X_2026-03-31-2300/el8_amd64_gcc13
Additional Tests: GPU,HLT_P2_INTEGRATION,HLT_P2_TIMING,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/50606/52396/install.sh to create a dev area with all the needed externals and cmssw changes.

HLT P2 Timing: chart

Comparison Summary

Summary:

You potentially removed 4 lines from the logs
ROOTFileChecks: Some differences in event products or their sizes found
Reco comparison results: 7 differences found in the comparisons
DQMHistoTests: Total files compared: 68
DQMHistoTests: Total histograms compared: 4795858
DQMHistoTests: Total failures: 407
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 4795431
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 67 files compared)
Checked 282 log files, 243 edm output root files, 68 DQM output files
TriggerResults: no differences found

AMD_MI300X Comparison Summary

Summary:

You potentially added 12 lines to the logs
Reco comparison results: 373 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216539
DQMHistoTests: Total failures: 32818
DQMHistoTests: Total nulls: 34
DQMHistoTests: Total successes: 183687
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: found differences in 3 / 12 workflows

AMD_W7900 Comparison Summary

Summary:

You potentially added 23 lines to the logs
Reco comparison results: 391 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216539
DQMHistoTests: Total failures: 30182
DQMHistoTests: Total nulls: 28
DQMHistoTests: Total successes: 186329
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: found differences in 2 / 12 workflows

NVIDIA_H100 Comparison Summary

Summary:

You potentially removed 1 lines from the logs
Reco comparison results: 383 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216539
DQMHistoTests: Total failures: 32044
DQMHistoTests: Total nulls: 31
DQMHistoTests: Total successes: 184464
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: no differences found

NVIDIA_L40S Comparison Summary

Summary:

You potentially added 26 lines to the logs
Reco comparison results: 353 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216539
DQMHistoTests: Total failures: 33103
DQMHistoTests: Total nulls: 35
DQMHistoTests: Total successes: 183401
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: no differences found

fwyzard · 2026-04-02T08:09:58Z

+heterogeneous

mmusich · 2026-04-02T08:19:35Z

> reducing per-event time by roughly 40-50%

For future record, here are the measurements from the bot when running the timing HLT menu on CPU:

Vanilla CMSSW_16_1_X_2026-03-31-2300	CMSSW_16_1_X_2026-03-31-2300 + this PR

Total execution time per event goes from 8.762 s/ev to 8.4782 s/ev, i.e. a 3.2% speed-up
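For reference, the quoted figure is the relative reduction in time per event computed from the two timings above. It applies to the full timing menu, whereas the 40-50% in the PR description refers to LST alone:

$$\frac{8.762 - 8.4782}{8.762} = \frac{0.2838}{8.762} \approx 0.0324 \approx 3.2\%$$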
The savings are, as expected, concentrated in the Tracking category:

out of which everything is under LSTProducer@alpaka:

cmsbuild added this to the CMSSW_16_1_X milestone Mar 31, 2026

cmsbuild added reconstruction-pending pending-signatures tests-pending orp-pending code-checks-pending heterogeneous-pending tracking labels Mar 31, 2026

cmsbuild added code-checks-approved and removed code-checks-pending labels Mar 31, 2026

cmsbuild added tests-started and removed tests-pending labels Mar 31, 2026

cmsbuild added tests-rejected and removed tests-started labels Apr 1, 2026

GNiendorf force-pushed the cpu_speedups_hoist branch from 1d8b0bb to 63cb5db Compare April 1, 2026 10:50

cmsbuild added tests-pending code-checks-pending and removed tests-rejected code-checks-approved labels Apr 1, 2026

cmsbuild removed the code-checks-pending label Apr 1, 2026

cmsbuild added the code-checks-approved label Apr 1, 2026

fwyzard reviewed Apr 1, 2026

HeterogeneousCore/AlpakaMath/test/alpaka/testDeltaPhi.dev.cc (outdated)

LST CPU Optimizations - trig simplifications and redundant load removal

c3282cf

GNiendorf force-pushed the cpu_speedups_hoist branch from 63cb5db to c3282cf Compare April 1, 2026 12:03

cmsbuild added code-checks-pending and removed code-checks-approved labels Apr 1, 2026

cmsbuild added code-checks-approved and removed code-checks-pending labels Apr 1, 2026

cmsbuild added tests-started and removed tests-pending labels Apr 1, 2026

cmsbuild added tests-approved and removed tests-started labels Apr 1, 2026

cmsbuild added heterogeneous-approved and removed heterogeneous-pending labels Apr 2, 2026
