LST CPU Optimizations - Trig Simplifications and Redundant Load Removal #50606
GNiendorf wants to merge 1 commit into cms-sw:master
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50606/48792 |
|
A new Pull Request was created by @GNiendorf for master. It involves the following packages:
@Moanwar, @cmsbuild, @fwyzard, @jfernan2, @makortel, @mandrenguyen, @srimanob can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
|
test parameters:
|
|
@cmsbuild please test |
|
-1
Failed Tests: UnitTests HLTP2Timing amd_mi300xUnitTests amd_w7900UnitTests nvidia_h100UnitTests nvidia_l40sUnitTests
Failed Unit Tests
I found 1 errors in the following unit tests:
---> test alpakaTestDeltaPhiSerialSync had ERRORS
Comparison Summary
AMD_MI300X Comparison Summary
AMD_W7900 Comparison Summary
NVIDIA_H100 Comparison Summary
NVIDIA_L40S Comparison Summary
Max Memory Comparisons exceeding threshold NVIDIA_L40S
@cms-sw/core-l2, I found 1 workflow step(s) with memory usage exceeding the error threshold.
|
Seems like the unit test errors come from a few ambiguous test cases (e.g. answer is looking for -pi but new formula returns +pi, even though these correspond to the same angle) |
|
Is
the expected outcome for these optimisations ? |
I see. Can you fix the tests to use a better "matcher" ? |
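One way to make such a test robust to the -pi / +pi ambiguity is to compare angles modulo 2*pi instead of comparing raw values. A minimal sketch, assuming a hypothetical helper named sameAngle (it is not part of this PR or of the existing alpakaTestDeltaPhi test), with an arbitrary default tolerance:

```cpp
#include <cmath>

// Hypothetical helper, for illustration only: returns true when two angles
// (in radians) describe the same direction up to a multiple of 2*pi.
inline bool sameAngle(double a, double b, double tol = 1e-6) {
  const double twoPi = 2.0 * std::acos(-1.0);
  // std::remainder folds the difference into [-pi, pi], so a difference of
  // exactly one full turn (e.g. -pi versus +pi) folds to approximately 0.
  return std::fabs(std::remainder(a - b, twoPi)) <= tol;
}
```

With a comparison like this, an expected value of -pi and a computed value of +pi are treated as equal, while genuinely different angles still fail.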
1d8b0bb to 63cb5db
|
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50606/48812
|
Yes, these optimizations address CPU-side bottlenecks. The GPU backend has its own set of bottlenecks, e.g. early exits are not as helpful on GPU, trig calls seem to be not as expensive as on CPU, etc. |
Ah, OK. |
63cb5db to c3282cf
|
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50606/48813 |
|
@cmsbuild please test |
It looks like the timing test failed: was it already failing in the IB? I can't find any notes related to this PR in the timing log.
|
+1
Size: This PR adds an extra 80KB to repository
HLT P2 Timing: chart
Comparison Summary
AMD_MI300X Comparison Summary
AMD_W7900 Comparison Summary
NVIDIA_H100 Comparison Summary
NVIDIA_L40S Comparison Summary
|
|
+heterogeneous |




Speeds up the LST CPU backend, reducing per-event time for LST by roughly 40-50% (excluding pixel seed duplicate cleaning, which is disabled in the online version of LST). The changes fall into three categories.

First, algebraic trig simplifications replace expensive transcendental calls with equivalent expressions: sin(atan(x)) becomes x/sqrt(1+x^2), tan(asin(s))/asin(s) is computed without calling tan, and deltaPhi(x1,y1,x2,y2), which internally calls atan2 twice, is replaced with a single atan2(cross, dot) using the 2D cross and dot products directly.

Second, loose pre-checks are added that compare the cross product or dot product of hit coordinate vectors against thresholds, effectively cutting on sin(dPhi) or cos(dPhi) rather than computing the angle first. This allows early rejection of clearly failing candidates before the more expensive exact-angle computation. These pre-checks are applied in the MiniDoublet barrel and endcap creation kernels, the Segment counting kernel, and the Triplet betaIn check.

Third, SoA fields that are invariant across inner loop iterations (module geometry properties, pixel seed kinematics) are pre-loaded once into local structs (ModuleMDData, ModuleSegData, PixelSeedData) to avoid redundant memory reads.

GPU timing and physics performance have negligible changes. There is also a small decrease in the memory footprint from a restructuring of the Segment counting kernel, where we use only the non-trig pre-checks and add an additional cut.
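
The identities behind the first two categories can be sketched in a few lines of standalone C++. This is only an illustration under stated assumptions: the function names are hypothetical and do not correspond to the functions in the PR, the sign convention of the angle (from the first vector to the second) is assumed, and the cut shown is the exact trig-free equivalent of |dPhi| <= cut for cut below pi/2, whereas the pre-checks in the PR are described as loose and may use different thresholds.

```cpp
#include <cmath>

// All names below are hypothetical; they only illustrate the identities.

// sin(atan(x)) without any trig call: x / sqrt(1 + x^2).
inline float sinAtan(float x) { return x / std::sqrt(1.0f + x * x); }

// tan(asin(s)) for |s| < 1 without calling tan: s / sqrt(1 - s^2).
inline float tanAsin(float s) { return s / std::sqrt(1.0f - s * s); }

// Signed angle from (x1, y1) to (x2, y2) using a single atan2 call.
inline float deltaPhiCrossDot(float x1, float y1, float x2, float y2) {
  const float cross = x1 * y2 - y1 * x2;  // |a||b| sin(dPhi)
  const float dot = x1 * x2 + y1 * y2;    // |a||b| cos(dPhi)
  return std::atan2(cross, dot);
}

// Trig-free test equivalent to |dPhi| <= cut for 0 <= cut < pi/2.
// sin2Cut is sin(cut)^2, assumed to be computed once outside the loop.
inline bool passesDPhiCut(float x1, float y1, float x2, float y2, float sin2Cut) {
  const float cross = x1 * y2 - y1 * x2;
  const float dot = x1 * x2 + y1 * y2;
  if (dot <= 0.0f)
    return false;  // |dPhi| >= pi/2, so it cannot satisfy a cut below pi/2
  const float norm2 = (x1 * x1 + y1 * y1) * (x2 * x2 + y2 * y2);
  // sin^2(dPhi) = cross^2 / (|a|^2 |b|^2), compared without sqrt or division.
  return cross * cross <= sin2Cut * norm2;
}
```

Because the cut compares squared quantities, it needs no square root, division, or trig call per candidate, which is where an early rejection of this kind would save time on CPU.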