A More Flexible And Lightweight CA #47611

Open
wants to merge 30 commits into
base: master

Conversation

AdrianoDee
Contributor

@AdrianoDee AdrianoDee commented Mar 17, 2025

This PR proposes a general restructuring of the Alpaka implementation of the CA-based pixel tracking. The idea is to make the CA:

  • a bit more flexible, relying less on the TrackerTraits and reading the geometry at runtime from the TrackerGeometry in a configurable way. This would greatly simplify the inclusion of new layers (e.g. strips);
  • a bit more lightweight (on memory), redesigning the containers to be average-sized rather than max-sized.

The three updates here have many overlaps and synergies but could be taken separately (if really needed).

Developments

CA Structures

In the CA, a few structures (defined in RecoTracker/PixelSeeding/plugins/alpaka/CAStructures.h) use fixed-size containers to hold the intermediate results. These are:

  1. OuterHitOfCellContainer: an array keeping the index of a cell (uint32_t). One per hit. Keeps track of the cells that have that hit as the outer one.
  2. CellNeighbors: an array keeping the index of a cell (uint32_t). One per cell (==maxNumberOfCells). Keeps track of cells connected through the outer hit.
  3. CellTracks: an array keeping the index of a tuple (uint32_t/uint16_t). One per cell. Keeps track of the tuples to which a cell belongs. Mostly used to remove duplicates.

They are sized on the maximum possible number of associations for each cell/track. The current numbers were estimated using TTbar PU samples before the start of Run3.

The idea here is to size all these structures on the average number of associations per element, using OneToManyAssoc{Sequential|RandomAccess} containers that are sizable at runtime (allocating the needed buffers for the storage and the offsets), so that the averages can be passed at config level:

  1. HitToCell -> device_hitToCell_; with nOnes = nHits and nMany = avgCellsPerHit * nHits
  2. CellToCell -> device_cellToNeighbors_; with nOnes = maxCells and nMany = avgCellsPerCell * maxCells
  3. CellToTrack -> device_cellToTracks_; with nOnes = maxCells and nMany = avgTracksPerCell * maxCells
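
As a rough illustration of why average-based sizing helps, the sketch below compares the number of stored entries in the two layouts. The function names and the example numbers (10000 hits, a maximum of 64 and an average of 6 cells per hit) are hypothetical and not taken from the PR; the only assumption about the one-to-many layout is that it needs one offset per element plus a terminating one.

```python
def fixed_size_entries(n_ones: int, max_per_one: int) -> int:
    """Entries stored when every element reserves room for the maximum."""
    return n_ones * max_per_one


def average_sized_entries(n_ones: int, avg_per_one: float) -> int:
    """Entries stored with a one-to-many layout sized on the average.

    nMany = avg_per_one * n_ones payload slots, plus n_ones + 1 offsets
    (assumption: one offset per element and a terminating one).
    """
    n_many = int(avg_per_one * n_ones)
    return n_many + n_ones + 1


# Hypothetical numbers, for illustration only.
n_hits = 10_000
print(fixed_size_entries(n_hits, max_per_one=64))       # 640000
print(average_sized_entries(n_hits, avg_per_one=6.0))   # 70001
```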

Compared with the fixed-size approach, we need a few extra things:

  • A count step to size the offsets, added in the already existing kernels replacing the various “push_back”.
  • A structure to hold the pairings between count and fill when it was not already there (e.g. for the Cells we can recycle what we have): a dummy CACoupleSoA.
  • A fill step in an extra kernel, generic for all the histograms: Kernel_fillGenericCouple.
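
The count / offsets / fill sequence listed above can be sketched on the host as follows. This is a minimal serial illustration in plain Python, not the Alpaka kernels of the PR: `build_one_to_many` and `associated` are hypothetical names, and the list of `(one, many)` pairs plays the role that the text assigns to the couple structure filled between the count and fill steps.

```python
from itertools import accumulate


def build_one_to_many(n_ones, couples):
    """Build offsets and content from (one, many) index pairs."""
    # Count step: how many "many" entries each "one" will hold.
    counts = [0] * n_ones
    for one, _ in couples:
        counts[one] += 1

    # Offsets: exclusive prefix scan of the counts (n_ones + 1 values).
    offsets = [0] + list(accumulate(counts))

    # Fill step: place each "many" index in the slot range of its "one".
    content = [None] * offsets[-1]
    cursor = offsets[:-1].copy()
    for one, many in couples:
        content[cursor[one]] = many
        cursor[one] += 1
    return offsets, content


def associated(offsets, content, one):
    """Return the entries associated with element `one`."""
    return content[offsets[one]:offsets[one + 1]]


# Example: hit 0 is the outer hit of cells 5 and 6, hit 2 of cells 7, 8, 9.
couples = [(0, 5), (2, 7), (0, 6), (2, 8), (2, 9)]
offsets, content = build_one_to_many(3, couples)
print(offsets)                          # [0, 2, 2, 5]
print(associated(offsets, content, 2))  # [7, 8, 9]
```

On a GPU the count and fill steps would use atomic increments and the scan would be a parallel prefix scan; the serial version only shows the data layout.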

Container Sizes

The plots for all the container sizes can be found at https://adriano.web.cern.ch/ca_geometry/containers/:

  • stats for single quantities: nHits, nOuterHits (hits excluding barrel 1), nCells, nTracks, nTrips (a cell attached to another cell);
  • averages for:
    • number of cells per outer hit: nCells_vs_nOuterHits;
    • number of trips for each cell (so the number of cell neighbors): nTrips_vs_nCells_avg. For the variant nTrips_vs_maxDoublets_avg the denominator is the fixed size for cells we get from the maxDoublets parameter;
    • number of tracks per cell (so the number of tracks using a cell): nCellTracks_vs_nCells_avg. For the variant nCellTracks_vs_maxDoublets the denominator is the same as above;
  • the trends for nCells and nTracks vs the number of hits, with a fit.

An example here. "wp" stands for the working point selected for the given scenario.

image

N.B. the phase2 quads and trips have the same trends for cells since the graph is the same.

Heuristic Sizes for Doublets and Tracks

The PR also proposes to allow defining the container sizes (maxNumberOfDoublets and maxNumberOfTuples) heuristically via a TFormula. At the moment I haven't run any proper test of the impact of this update on the memory usage for pp conditions. But the fact that the numbers of hits, cells and tracks span a wide range over consecutive events suggests that making maxNumberOfDoublets dependent on the number of hits may be beneficial for a production setup (in which we run on consecutive events in parallel).
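
To make the idea concrete, here is a minimal sketch of a per-event capacity computed from the number of hits. It uses a plain polynomial in Python instead of a ROOT TFormula, and the function name, the coefficients and the floor are hypothetical; only the 512*1024 ceiling is taken from the current limit quoted in this description.

```python
def heuristic_capacity(n_hits, coeffs, floor, ceiling):
    """Evaluate sum(c_i * n_hits**i) and clamp it to [floor, ceiling]."""
    value = sum(c * n_hits ** i for i, c in enumerate(coeffs))
    return int(min(max(value, floor), ceiling))


# Hypothetical coefficients: 1000 + 8 * nHits + 0.25 * nHits**2.
coeffs = (1000.0, 8.0, 0.25)
floor, ceiling = 2048, 512 * 1024

print(heuristic_capacity(0, coeffs, floor, ceiling))       # 2048 (floor)
print(heuristic_capacity(100, coeffs, floor, ceiling))     # 4300
print(heuristic_capacity(10_000, coeffs, floor, ceiling))  # 524288 (ceiling)
```

Clamping keeps the buffer usable for nearly empty events and bounds the allocation for very busy ones.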

For Run2024F EphemeralHLTPhysics data (Run=383631, LS=476):

image

The trends also show that we can exploit a functional dependency between nHits and {nTracks|nCells} (~quadratic, as one would expect) for any of the scenarios:

image

(Run3 trips on MC and HLT pp on data overlap nicely).

For example, running on pp data (from Run2024F) one can easily fit the number of cells as a function of the number of hits, leaving some safety margin.

image
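
One possible way to obtain such a curve is sketched below, assuming a purely quadratic model nCells ≈ a * nHits² (the text only says the trend is roughly quadratic). The function name, the default 20% margin and the toy points are hypothetical and are not the fit actually used for the plot.

```python
def fit_quadratic_envelope(n_hits, n_cells, margin=0.2):
    """Return a coefficient k such that k * nHits**2 lies above every point."""
    pairs = [(x, y) for x, y in zip(n_hits, n_cells) if x > 0]
    # One-parameter least squares for y = a * x**2.
    a = sum(x * x * y for x, y in pairs) / sum(x ** 4 for x, _ in pairs)
    if a == 0:
        return 0.0
    # Scale the fit up to the worst outlier, then add the safety margin.
    worst = max(y / (a * x * x) for x, y in pairs)
    return a * max(worst, 1.0) * (1.0 + margin)


# Toy points, made up for illustration (not measured data).
hits = [100, 200, 300, 400]
cells = [21_000, 78_000, 185_000, 310_000]
k = fit_quadratic_envelope(hits, cells)
print(all(y <= k * x * x for x, y in zip(hits, cells)))  # True
```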

Samples used:

  • HIon : data /store/hidata/HIRun2024B/HIEphemeralHLTPhysics/RAW/v1/000/388/305/00000/d8b13b7d-a94e-4b1f-9aae-bd86836a0459.root;

  • HIon Hyet: MC private HydjetQMinBias_5020GeV+2023HIRP;

  • HLT pp: data Run2024F, Run383631 and ls0476 EphemeralHLTPhysics data;

  • Run3 quads/trips: MC /store/relval/CMSSW_14_1_0_pre2/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/PU_140X_mcRun3_2024_realistic_v7_STD_2024_PU-v1/2580000/;

  • Phase2 quads/trips: MC /RelValTTbar_14TeV/CMSSW_15_0_0_pre1-PU_141X_mcRun4_realistic_v3_STD_Run4D110_PU-v1/GEN-SIM-DIGI-RAW.

  • hlt_hion data from densely populated HIRun2024B events (run 388305).

**N.B. for HLT pp on EphemeralHLTPhysics data for run 383631 the current limit (512*1024) was too low for ~0.32% of the events (out of the 10000 used). This does not imply any crash: we just stop pushing new doublets.**
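
The "stop pushing" behaviour mentioned in the note can be pictured with a capacity-checked container like the one below. This is a hypothetical Python stand-in, not the CMSSW container: the class name and the `dropped` counter are introduced here only for illustration.

```python
class BoundedBuffer:
    """Fixed-capacity buffer that drops new items once it is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.dropped = 0  # number of rejected pushes, kept for monitoring

    def push_back(self, item):
        """Append the item if there is room; return False otherwise."""
        if len(self.items) >= self.capacity:
            self.dropped += 1
            return False
        self.items.append(item)
        return True


doublets = BoundedBuffer(capacity=3)
for cell in range(5):
    doublets.push_back(cell)
print(doublets.items, doublets.dropped)  # [0, 1, 2] 2
```

Events that hit the limit lose the doublets beyond the capacity but are otherwise processed normally.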

CAGeometry

A new ESProduct (CAGeometry) is introduced that holds:

  • phiCuts, minZ, maxZ, maxR for doublets;
  • the graph (currently hardcoded in the SimplePixelTopology);
  • the module numberings, read directly from the TrackerGeometry;
  • the dcaCut and CAThetaCut, also expanded to be one per layer, useful for future tuning (especially if including the strip detector).

Pixel DataFormats SoA

The Track and TrackingRecHit SoAs were templated on the TrackerTraits, mostly to hold the helper histograms and for fixed-size arrays (for the number of hits per track or modules). This could be simplified by leveraging the Portable{Host|Device}MultiCollection. This would allow:

  • to remove the Pixel Reco DQM paths from the menu since we could write the SoA to ROOT (that would allow solving https://its.cern.ch/jira/browse/CMSHLT-3147);
  • a couple of tests that write and read the hits and tracks SoAs have been added;
  • to greatly simplify all the modules that consume the SoAs downstream and that are heavily templated just for the inputs (while doing exactly the same thing);
  • to integrate non-pixel hits (e.g. strip hits) in the CA chain more easily.

Miscs

I took the chance also to do some clean up here and there:

  • removing the AverageGeometry, which was never actually used, from the chain while leaving its definition in the code. It can be easily re-enabled if needed;
  • a fix to the SimplePixelTopology numbering for Phase2 modules that was affecting the cluster doublet cuts;
  • removed the idealConditions flag that was changing the cluster cut based on the pixel barrel side (only for Run3). It has never been used and no beneficial effect was found (studies were done in late 2021 and the efficiency was degraded);
  • avoiding carrying the CPE around in the chain just for a single call to the FrameSoA, which has been moved into the CAGeometry;
  • removing the limit on the number of vertices for HI conditions (set to 32 instead of the standard 256). This led to the crashes documented in "Relval wf 14949.402 fails at runtime in CMSSW_14_2_X" #46693. This would anyway need to be investigated since, also in master, there are HI events (with noPU) for which we reconstruct >100 vertices.

Performance and Physics Studies

No changes to physics performance observed (as expected) for:

Small fluctuations are visible for Phase2 and HI; I haven't found anything strange or a reason for them. They might be the "usual" irreproducibilities.

Posting here a couple of examples for the record.

image

image


pp HLT

Performance measured on devfu-c2b03-44-01 running /frozen/2024/2e34/v1.4/CMSSW_14_1_X/HLT (adapted to be compatible with 15_0_0_pre2 and the PR) on ~10k events from Run2024I, Run386593 and LS 94 EphemeralHLTPhysics data.


Throughput and timing

The throughput is essentially unchanged:

this PR:

Running 3 times over 10000 events with 8 jobs, each with 32 threads, 24 streams and 1 GPUs
   543.1 ±   0.1 ev/s (9700 events, 98.1% overlap)
   538.2 ±   0.1 ev/s (9700 events, 99.0% overlap)
   543.2 ±   0.1 ev/s (9700 events, 97.4% overlap)
 --------------------
   541.5 ±   2.8 ev/s

master:

Running 3 times over 10000 events with 8 jobs, each with 32 threads, 24 streams and 1 GPUs
   539.9 ±   0.1 ev/s (9700 events, 98.5% overlap)
   540.6 ±   0.1 ev/s (9700 events, 98.9% overlap)
   540.7 ±   0.1 ev/s (9700 events, 98.4% overlap)
 --------------------
   540.4 ±   0.5 ev/s

As for the event timings, find here all the pie charts. Posting two of them here (for real_time) for the record. The average timings measured are (for the 8 jobs x 3 times):

  • master: 488.48 ± 4.08 ms/ev
  • this PR: 487.72 ± 4.40 ms/ev

Memory

All the memory plots are under https://adriano.web.cern.ch/ca_geometry/memory/.

The memory usage for 8 jobs x 32 threads x 24 streams is reduced by ~47%.

hlt_pp_mem_32t24s8j

Phase2

Performance measured on a TTbar D110 PU Run4 RelVal sample (EDM input):

  • /RelValTTbar_14TeV/CMSSW_15_0_0_pre1-PU_141X_mcRun4_realistic_v3_STD_Run4D110_PU-v1/GEN-SIM-DIGI-RAW

Throughput and timing

The optimal setup I found for master is 2 jobs with 8 threads and 8 streams. With 3 jobs we run out of memory, and with 12 or 16 threads the throughput is the same while the memory increases. So I took this as the baseline for the comparisons. Here I'm running quadruplets only since for triplets the memory occupancy is very similar. This is a consequence of the fact that the doublet-related containers vastly dominate and that, at the moment, they are the same for quads and trips given the same cell graph.

this PR:

Running 3 times over 1300 events with 2 jobs, each with 8 threads, 8 streams and 1 GPUs
    77.3 ±   0.0 ev/s (1000 events, 99.6% overlap)
    77.0 ±   0.0 ev/s (1000 events, 99.6% overlap)
    77.2 ±   0.0 ev/s (1000 events, 99.5% overlap)
 --------------------
    77.2 ±   0.2 ev/s

master:

Running 3 times over 1300 events with 2 jobs, each with 8 threads, 8 streams and 1 GPUs
    70.1 ±   0.1 ev/s (1000 events, 99.7% overlap)
    69.4 ±   0.1 ev/s (1000 events, 99.4% overlap)
    69.2 ±   0.1 ev/s (1000 events, 98.7% overlap)
 --------------------
    69.6 ±   0.5 ev/s  

Memory

In terms of memory the effect is even larger than for Run3 HLT, with a reduction of ~70% in a configuration with 2 jobs, 8 threads and 8 streams (8t8s2j).

phase2_memory_throughput_quads_reference

The same holds for 12t12s2j, which is not really beneficial for the throughput and almost fills the two T4s available (for master). Also, having more memory available, configurations with more jobs may be tested, bringing almost a factor 2 to the max throughput.

phase2_memory_throughput_quads

(note: maybe there's some further room for improvement using the heuristic sizes)


HIon

(thanks to Soohwan for the information and the samples to set this up)

At the moment the HIon menu runs only the pixel local reconstruction on GPU since the pixel track reco is too heavy on the GPU memory. The performance here is measured:

  • on MinBias events from /store/hidata/HIRun2024B/HIEphemeralHLTPhysics/RAW/v1/000/388/305/00000/d8b13b7d-a94e-4b1f-9aae-bd86836a0459.root, converted to raw;
  • with /dev/CMSSW_14_2_0/HIon/V11 menu:
    • as is in master (RefCpu);
    • as is in master turning on the GPU pixel track reco in Alpaka (Ref);
    • modified for this PR with the pixel track reco running on GPU in Alpaka and the max number of cells fixed to the current threshold (Dev).

If I understood correctly and the configuration stayed the same, we currently run with 8 jobs, 8 threads and 8 streams (see https://its.cern.ch/jira/browse/CMSHLT-2951). For a full GPU menu (Ref) the best I could fit in two T4s is a setup with 8t8s2j. Running the same setup with this PR the memory usage is reduced by ~72% (with a +70% in throughput).

hlt_hion_memory_throughput_fixed

Given the lighter memory footprint we can push the full GPU HI menu a bit further, reaching the same throughput as the RefCpu setup (16j16s8j) with 12t12s4j. And we can go up to +240% with 16t16s8j (the maximum I could get).

hlt_hion_memory_throughput_fixed_best

The HI runs also seem a good candidate to test the heuristic sizes. For example, for HIRun2024B EphemeralHLTPhysics data (Run=388305, LS=123), if we plot the number of cells, hits or tracks from consecutive events, we see:

run388305_HIEphemeralHLTPhysics_stats

For cells we go from 1e4 to 1e6 (for non zero values). And we can fit the number of cells vs the number of hits:

hlt_hionVsHits

Using this cut for cells we can reach more than three times the current throughput with 16t16s16j (while keeping a good margin on the max memory available). Going above (with e.g. 16t16s20j) just increases the memory occupation while keeping the same throughput.

image

@cmsbuild
Contributor

cmsbuild commented Mar 17, 2025

cms-bot internal usage

@cmsbuild
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47611/44120

Code check has found code style and quality issues which could be resolved by applying following patch(s)

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47611/44121

@cmsbuild
Contributor

A new Pull Request was created by @AdrianoDee for master.

It involves the following packages:

  • DQM/SiPixelHeterogeneous (dqm)
  • DataFormats/TrackSoA (heterogeneous, reconstruction)
  • DataFormats/TrackingRecHitSoA (heterogeneous, reconstruction)
  • Geometry/CommonTopologies (geometry)
  • HLTrigger/Configuration (hlt)
  • HeterogeneousCore/AlpakaInterface (heterogeneous)
  • RecoLocalTracker/ClusterParameterEstimator (reconstruction)
  • RecoLocalTracker/SiPixelClusterizer (reconstruction)
  • RecoLocalTracker/SiPixelRecHits (reconstruction)
  • RecoTauTag/HLTProducers (hlt)
  • RecoTracker/Configuration (reconstruction)
  • RecoTracker/PixelSeeding (reconstruction)
  • RecoTracker/PixelTrackFitting (reconstruction)
  • RecoTracker/Record (reconstruction)
  • RecoVertex/Configuration (reconstruction)
  • RecoVertex/PixelVertexFinding (reconstruction)

@Dr15Jones, @Martin-Grunewald, @antoniovagnerini, @bsunanda, @civanch, @cmsbuild, @fwyzard, @jfernan2, @kpedro88, @makortel, @mandrenguyen, @mdhildreth, @mmusich, @rseidita can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @Martin-Grunewald, @VinInn, @VourMa, @alesaggio, @azotz, @bsunanda, @dgulhan, @dkotlins, @echabert, @fabiocos, @felicepantaleo, @ferencek, @fioriNTU, @gbenelli, @gpetruc, @idebruyn, @jandrea, @jlidrych, @makortel, @martinamalberti, @mbluj, @missirol, @mmusich, @mroguljic, @mtosi, @robervalwalsh, @rovere, @threus, @tsusa, @tvami, @yduhm this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@AdrianoDee
Contributor Author

enable gpu

@AdrianoDee
Contributor Author

please test

process.hltCAGeometry = cms.ESProducer('CAGeometryESProducer@alpaka',
    caDCACuts = cms.vdouble(
        0.15, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25
    ),
Contributor

Why are these (collections of) parameters explicitly listed here in the HLT customisation function, instead of using fillDescriptions for this?

Contributor

Since CAGeometryESProducer is a new plugin, fillDescriptions can take care of any/all parameters!


if not hasattr(prod, 'caGeometry'):
    setattr(prod, 'caGeometry', cms.string('hltCAGeometry'))

Contributor

Why are these (collections of) parameters explicitly listed here in the HLT customisation function, instead of using fillDescriptions for this?

Contributor

These plugins are already existing, so simply delete here in the customisation all old/removed parameters as well as all parameters with changed values, with fillDescriptions then taking care of new values and new parameters.

Contributor Author

True, let me take care of these (and those above).

@cmsbuild
Contributor

cmsbuild commented Apr 7, 2025

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47611/44387

Code check has found code style and quality issues which could be resolved by applying following patch(s)

@cmsbuild
Contributor

cmsbuild commented Apr 7, 2025

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47611/44388

Code check has found code style and quality issues which could be resolved by applying following patch(s)

@cmsbuild
Contributor

cmsbuild commented Apr 7, 2025

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47611/44389

@cmsbuild
Contributor

cmsbuild commented Apr 7, 2025

Pull request #47611 was updated. @Dr15Jones, @Martin-Grunewald, @antoniovagnerini, @bsunanda, @civanch, @cmsbuild, @fwyzard, @jfernan2, @kpedro88, @makortel, @mandrenguyen, @mdhildreth, @mmusich, @rseidita can you please check and sign again.

@AdrianoDee
Contributor Author

please test

@@ -9,3 +9,6 @@
<use name="CalibTracker/Records"/>
<use name="clhep"/>
<use name="boost"/>
<use name="alpaka"/>
<flags ALPAKA_BACKENDS="1"/>

Contributor

The RecoTracker/Record package seems to be intended for EventSetup Record classes. I'd suggest to keep it that way, i.e. to move the new SoA definitions elsewhere.

@cmsbuild
Contributor

cmsbuild commented Apr 7, 2025

-1

Failed Tests: UnitTests RelVals RelVals-CUDA RelVals-INPUT RelVals-ROCM AddOn cudaUnitTests rocmUnitTests
Size: This PR adds an extra 896KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-94c57a/45402/summary.html
COMMIT: e5ae12f
CMSSW: CMSSW_15_1_X_2025-04-07-1100/el8_amd64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/47611/45402/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 1 errors in the following unit tests:

---> test alpakaTestPrefixScanSerialSync had ERRORS

AddOn Tests

  • unknown
AddOnTest might have timed out: FAILED -  secs

CUDA Unit Tests

I found 3 errors in the following unit tests:

---> test deviceVertexFinderByDensity_tCudaAsync had ERRORS
---> test deviceVertexFinderDBSCAN_tCudaAsync had ERRORS
---> test deviceVertexFinderOneKernel_tCudaAsync had ERRORS

ROCm Unit Tests

I found 1 errors in the following unit tests:

---> test alpakaTestBufferROCmAsync had ERRORS

8 participants