LST CPU Optimizations - Trig Simplifications and Redundant Load Removal #50606

Open
GNiendorf wants to merge 1 commit into cms-sw:master from SegmentLinking:cpu_speedups_hoist

@GNiendorf
Contributor

@GNiendorf GNiendorf commented Mar 31, 2026

Speeds up the LST CPU backend, reducing per-event time for LST by roughly 40-50% (excluding pixel seed duplicate cleaning, which is disabled in the online version of LST). The changes fall into three categories.

First, algebraic trig simplifications replace expensive transcendental calls with equivalent expressions: sin(atan(x)) becomes x/sqrt(1+x^2), tan(asin(s))/asin(s) is computed without calling tan, and deltaPhi(x1,y1,x2,y2), which internally calls atan2 twice, is replaced with a single atan2(cross, dot) using the 2D cross and dot products directly.

Second, loose pre-checks are added that compare the cross product or dot product of hit coordinate vectors against thresholds, effectively cutting on sin(dPhi) or cos(dPhi) rather than computing the angle first. This allows early rejection of clearly failing candidates before the more expensive exact-angle computation. The pre-checks are applied in the MiniDoublet barrel and endcap creation kernels, the Segment counting kernel, and the Triplet betaIn check.

Third, SoA fields that are invariant across inner loop iterations (module geometry properties, pixel seed kinematics) are pre-loaded once into local structs (ModuleMDData, ModuleSegData, PixelSeedData) to avoid redundant memory reads.

GPU timing and physics performance have negligible changes.
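For readers who want to see the identities behind the trig simplifications, here is a minimal C++ sketch. It is not the code from this PR: the function names (`sinAtan`, `tanAsinOverAsin`, `deltaPhiTwoAtan2`, `deltaPhiCrossDot`) are hypothetical, and the sign convention (angle of the second vector minus angle of the first, wrapped to [-pi, pi]) is an assumption.

```cpp
#include <cassert>
#include <cmath>

// sin(atan(x)) without any transcendental call: x / sqrt(1 + x^2).
inline float sinAtan(float x) { return x / std::sqrt(1.0f + x * x); }

// tan(asin(s)) = s / sqrt(1 - s^2) for 0 < |s| < 1, so the ratio
// tan(asin(s)) / asin(s) needs one asin and one sqrt, and no tan.
inline float tanAsinOverAsin(float s) {
  assert(s != 0.0f && std::fabs(s) < 1.0f);
  return s / (std::sqrt(1.0f - s * s) * std::asin(s));
}

// Reference form: two atan2 calls, then wrap the difference to [-pi, pi].
inline float deltaPhiTwoAtan2(float x1, float y1, float x2, float y2) {
  constexpr float kPi = 3.14159265358979f;
  float d = std::atan2(y2, x2) - std::atan2(y1, x1);
  if (d > kPi) d -= 2.0f * kPi;
  if (d < -kPi) d += 2.0f * kPi;
  return d;
}

// Single-atan2 form: cross = |v1||v2| sin(dPhi), dot = |v1||v2| cos(dPhi),
// so atan2(cross, dot) is the signed angle from vector 1 to vector 2.
inline float deltaPhiCrossDot(float x1, float y1, float x2, float y2) {
  const float cross = x1 * y2 - y1 * x2;
  const float dot = x1 * x2 + y1 * y2;
  return std::atan2(cross, dot);
}
```

The two deltaPhi forms agree away from the boundary, but for exactly anti-parallel vectors they can return opposite signs of pi, since the single-atan2 result then depends on the sign of a zero cross product.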

There is also a small decrease in the memory footprint from restructuring the Segment counting kernel, which now uses only the non-trig pre-checks and an additional cut.
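As an illustration of how a counting loop can combine a non-trig pre-check with values loaded once outside the inner loop, here is a hedged C++ sketch. Everything in it is hypothetical: `HitsSoA`, `ModuleCutData`, `passesCosDPhiPrecheck`, `countCandidates`, and the specific cos(dPhi) threshold are stand-ins and not the structs or cuts used in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical SoA-style hit container; field names are illustrative only.
struct HitsSoA {
  std::vector<float> x;
  std::vector<float> y;
};

// Hypothetical per-module constants, read once before the loops.
struct ModuleCutData {
  float cosDPhiMin;  // keep pairs with cos(dPhi) >= cosDPhiMin; must lie in [0, 1]
};

// Cut on cos(dPhi) without computing the angle or a square root:
// cos(dPhi) >= c  <=>  dot >= c |v1||v2|, which for c >= 0 is
// dot >= 0 and dot^2 >= c^2 |v1|^2 |v2|^2.
inline bool passesCosDPhiPrecheck(float x1, float y1, float x2, float y2, float cosMin) {
  const float dot = x1 * x2 + y1 * y2;
  if (dot < 0.0f)
    return false;
  const float norm2 = (x1 * x1 + y1 * y1) * (x2 * x2 + y2 * y2);
  return dot * dot >= cosMin * cosMin * norm2;
}

inline std::size_t countCandidates(const HitsSoA& inner, const HitsSoA& outer, const ModuleCutData& cuts) {
  assert(inner.x.size() == inner.y.size() && outer.x.size() == outer.y.size());
  assert(cuts.cosDPhiMin >= 0.0f && cuts.cosDPhiMin <= 1.0f);
  const float cosMin = cuts.cosDPhiMin;  // loop-invariant, loaded once
  std::size_t count = 0;
  for (std::size_t i = 0; i < inner.x.size(); ++i) {
    // Inner-hit coordinates do not change in the inner loop, so read them once.
    const float xi = inner.x[i];
    const float yi = inner.y[i];
    for (std::size_t j = 0; j < outer.x.size(); ++j) {
      // Early rejection: clearly failing pairs never reach any trig call.
      if (!passesCosDPhiPrecheck(xi, yi, outer.x[j], outer.y[j], cosMin))
        continue;
      ++count;
    }
  }
  return count;
}
```

Because the pre-check is loose by design, a count obtained this way is an upper bound on what an exact-angle cut would keep, which is the property a buffer-sizing pass needs.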

@cmsbuild
Contributor

cmsbuild commented Mar 31, 2026

cms-bot internal usage

@cmsbuild
Contributor

A new Pull Request was created by @GNiendorf for master.

It involves the following packages:

  • HeterogeneousCore/AlpakaMath (heterogeneous)
  • RecoTracker/LSTCore (reconstruction)

@Moanwar, @cmsbuild, @fwyzard, @jfernan2, @makortel, @mandrenguyen, @srimanob can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @VinInn, @VourMa, @dgulhan, @elusian, @felicepantaleo, @gpetruc, @makortel, @mmasciov, @mmusich, @mtosi, @rovere this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@slava77
Contributor

slava77 commented Mar 31, 2026

test parameters:

  • enable = gpu, hlt_p2_integration, hlt_p2_timing
  • workflows = ph2_hlt

@slava77
Contributor

slava77 commented Mar 31, 2026

@cmsbuild please test

@cmsbuild
Contributor

cmsbuild commented Apr 1, 2026

-1

Failed Tests: UnitTests HLTP2Timing amd_mi300xUnitTests amd_w7900UnitTests nvidia_h100UnitTests nvidia_l40sUnitTests
Size: This PR adds an extra 80KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-25ca41/52371/summary.html
COMMIT: 1d8b0bb
CMSSW: CMSSW_16_1_X_2026-03-31-1100/el8_amd64_gcc13
Additional Tests: GPU,HLT_P2_INTEGRATION,HLT_P2_TIMING,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/50606/52371/install.sh to create a dev area with all the needed externals and cmssw changes.

Failed Unit Tests

I found 1 errors in the following unit tests:

---> test alpakaTestDeltaPhiSerialSync had ERRORS

Comparison Summary

Summary:

  • You potentially removed 3 lines from the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 68
  • DQMHistoTests: Total histograms compared: 4802019
  • DQMHistoTests: Total failures: 376
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4801623
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 67 files compared)
  • Checked 282 log files, 244 edm output root files, 68 DQM output files
  • TriggerResults: no differences found

AMD_MI300X Comparison Summary

Summary:

AMD_W7900 Comparison Summary

Summary:

NVIDIA_H100 Comparison Summary

Summary:

NVIDIA_L40S Comparison Summary

Summary:

Max Memory Comparisons exceeding threshold NVIDIA_L40S

@cms-sw/core-l2 , I found 1 workflow step(s) with memory usage exceeding the error threshold:

  • Error: Workflow 34634.7503_TTbar_14TeV+Run4D121PU_HLTHeterogeneousValid step2 max memory diff -119.8 exceeds +/- 90.0 MiB

@GNiendorf
Contributor Author

Failed Unit Tests

I found 1 errors in the following unit tests:

---> test alpakaTestDeltaPhiSerialSync had ERRORS

The unit test errors seem to come from a few ambiguous test cases (e.g. the expected value is -pi but the new formula returns +pi, even though these correspond to the same angle).

@fwyzard
Contributor

fwyzard commented Apr 1, 2026

Is

GPU timing [...] have negligible changes.

the expected outcome for these optimisations ?

@fwyzard
Contributor

fwyzard commented Apr 1, 2026

The unit test errors seem to come from a few ambiguous test cases (e.g. the expected value is -pi but the new formula returns +pi, even though these correspond to the same angle).

I see.

Can you fix the tests to use a better "matcher" ?
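One way such a matcher could look, as a sketch only: compare angles by their wrapped difference, so that -pi and +pi count as equal. The names `wrappedDiff` and `anglesMatch` are hypothetical and this is not necessarily the matcher adopted in the test.

```cpp
#include <cassert>
#include <cmath>

constexpr float kPi = 3.14159265358979f;

// Signed difference a - b reduced to [-pi, pi] using std::remainder,
// which returns the IEEE remainder with the quotient rounded to nearest.
inline float wrappedDiff(float a, float b) { return std::remainder(a - b, 2.0f * kPi); }

// Two angles match if they differ by at most tol modulo 2*pi.
inline bool anglesMatch(float expected, float actual, float tol) {
  assert(tol >= 0.0f);
  return std::fabs(wrappedDiff(expected, actual)) <= tol;
}
```

With this comparison, an expected value of -pi and a computed value of +pi pass, while angles that are genuinely half a turn apart still fail.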

@cmsbuild
Contributor

cmsbuild commented Apr 1, 2026

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50606/48812

  • Found files with invalid states:
    • Traceback (most recent call last):
      • File "/data/cmsbld/jenkins/workspace/run-pr-code-checks/cms-bot/process-pull-request.py", line 81, in
        • for c in get_pr_commits(prId, opts.repository):
      • File "/data/cmsbld/jenkins/workspace/run-pr-code-checks/cms-bot/github_utils.py", line 751, in get_pr_commits
        • return github_api(
      • File "/data/cmsbld/jenkins/workspace/run-pr-code-checks/cms-bot/github_utils.py", line 611, in github_api
        • response = urlopen(request)
      • File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02935/el8_amd64_gcc13/external/python3/3.9.14-e16d2924e9eb9db8fddd14e187cf7209/lib/python3.9/urllib/request.py", line 214, in urlopen
        • return opener.open(url, data, timeout)
      • File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02935/el8_amd64_gcc13/external/python3/3.9.14-e16d2924e9eb9db8fddd14e187cf7209/lib/python3.9/urllib/request.py", line 523, in open
        • response = meth(req, response)
      • File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02935/el8_amd64_gcc13/external/python3/3.9.14-e16d2924e9eb9db8fddd14e187cf7209/lib/python3.9/urllib/request.py", line 632, in http_response
        • response = self.parent.error(
      • File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02935/el8_amd64_gcc13/external/python3/3.9.14-e16d2924e9eb9db8fddd14e187cf7209/lib/python3.9/urllib/request.py", line 561, in error
        • return self._call_chain(*args)
      • File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02935/el8_amd64_gcc13/external/python3/3.9.14-e16d2924e9eb9db8fddd14e187cf7209/lib/python3.9/urllib/request.py", line 494, in _call_chain
        • result = func(*args)
      • File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02935/el8_amd64_gcc13/external/python3/3.9.14-e16d2924e9eb9db8fddd14e187cf7209/lib/python3.9/urllib/request.py", line 641, in http_error_default
        • raise HTTPError(req.full_url, code, msg, hdrs, fp)
    • urllib.error.HTTPError: HTTP Error 503: Service Unavailable

@cmsbuild
Contributor

cmsbuild commented Apr 1, 2026

Pull request #50606 was updated. @Moanwar, @cmsbuild, @fwyzard, @jfernan2, @makortel, @mandrenguyen, @srimanob can you please check and sign again.

@GNiendorf
Contributor Author

Is

GPU timing [...] have negligible changes.

the expected outcome for these optimisations ?

Yes, these optimizations address CPU-side bottlenecks. The GPU backend has its own set of bottlenecks, e.g. early exits are not as helpful on GPU, trig calls seem to be not as expensive as on CPU, etc.

@fwyzard
Contributor

fwyzard commented Apr 1, 2026

Yes, these optimizations address CPU-side bottlenecks.

Ah, OK.

@cmsbuild
Contributor

cmsbuild commented Apr 1, 2026

Pull request #50606 was updated. @Moanwar, @cmsbuild, @fwyzard, @jfernan2, @makortel, @mandrenguyen, @srimanob can you please check and sign again.

@slava77
Contributor

slava77 commented Apr 1, 2026

@cmsbuild please test

@slava77
Contributor

slava77 commented Apr 1, 2026

Failed Tests: UnitTests HLTP2Timing amd_mi300xUnitTests amd_w7900UnitTests nvidia_h100UnitTests nvidia_l40sUnitTests

It looks like the timing test failed: was it something already in the IB? I can't find any notes related to this PR in the timing log.

@cmsbuild
Contributor

cmsbuild commented Apr 1, 2026

+1

Size: This PR adds an extra 80KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-25ca41/52396/summary.html
COMMIT: c3282cf
CMSSW: CMSSW_16_1_X_2026-03-31-2300/el8_amd64_gcc13
Additional Tests: GPU,HLT_P2_INTEGRATION,HLT_P2_TIMING,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/50606/52396/install.sh to create a dev area with all the needed externals and cmssw changes.

HLT P2 Timing: chart

Comparison Summary

Summary:

  • You potentially removed 4 lines from the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 7 differences found in the comparisons
  • DQMHistoTests: Total files compared: 68
  • DQMHistoTests: Total histograms compared: 4795858
  • DQMHistoTests: Total failures: 407
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4795431
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 67 files compared)
  • Checked 282 log files, 243 edm output root files, 68 DQM output files
  • TriggerResults: no differences found

AMD_MI300X Comparison Summary

Summary:

AMD_W7900 Comparison Summary

Summary:

NVIDIA_H100 Comparison Summary

Summary:

NVIDIA_L40S Comparison Summary

Summary:

@fwyzard
Contributor

fwyzard commented Apr 2, 2026

+heterogeneous

@mmusich
Contributor

mmusich commented Apr 2, 2026

reducing per-event time by roughly 40-50%

For future record, here are the measurements from the bot when running the timing HLT menu on CPU:

Vanilla CMSSW_16_1_X_2026-03-31-2300: [Screenshot from 2026-04-02 10-12-10]
CMSSW_16_1_X_2026-03-31-2300 + this PR: [Screenshot from 2026-04-02 10-12-17]

Total execution time goes from 8.762 s/ev to 8.4782 s/ev, i.e. a 3.2% speed-up.
The savings are, as expected, concentrated in the Tracking category:

plot(9)

all of which comes from LSTProducer@alpaka:

plot(10)
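As a check on the total-time figure quoted above, the relative change follows directly from the two per-event times:

```latex
\[
\frac{8.762 - 8.4782}{8.762} = \frac{0.2838}{8.762} \approx 0.032 = 3.2\%
\]
```

This is relative to the total execution time of the timing HLT menu, whereas the 40-50% in the PR description refers to the per-event time of LST alone.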
