Implement stream::FixedQueueEDProducer #50627

Open
fwyzard wants to merge 4 commits into cms-sw:master from fwyzard:FixedQueueEDProducer

Conversation

@fwyzard
Contributor

fwyzard commented Apr 1, 2026

PR description:

Implement a new kind of alpaka stream::EDProducer with a fixed association of device queues (e.g. CUDA streams) to framework streams.

This is useful for using external software that associates resources to the device queues, for example the PyTorch device memory caching allocator.

Migrating the PyTorch alpaka modules from stream::EDProducer to stream::FixedQueueEDProducer ensures that PyTorch sees only a limited number of device queues, reducing the overall device memory utilisation.
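As a rough illustration of why a fixed association helps, here is a toy model in plain C++. It is not the code from this PR and does not use alpaka: `ToyQueue`, `ToyPerQueueCache` and both counting functions are hypothetical names, and the round-robin pool is only an assumed stand-in for a scheme in which events are not tied to one queue per framework stream. The external library is modelled as something that keeps one cache per distinct queue it has seen.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-in for a device queue (e.g. a CUDA stream); not an alpaka type.
struct ToyQueue {
  int id;
};

// Toy model of an external library that keeps a separate resource cache
// for every distinct queue it has been called with.
class ToyPerQueueCache {
public:
  void use(ToyQueue const& queue) { seen_.insert(queue.id); }
  std::size_t queuesSeen() const { return seen_.size(); }

private:
  std::set<int> seen_;
};

inline std::vector<ToyQueue> makeQueues(std::size_t n) {
  std::vector<ToyQueue> queues;
  for (std::size_t i = 0; i < n; ++i)
    queues.push_back(ToyQueue{static_cast<int>(i)});
  return queues;
}

// Assumed baseline: each event takes the next queue from a shared pool,
// independently of the framework stream that processes it.
inline std::size_t queuesSeenWithSharedPool(std::size_t nEvents, std::size_t poolSize) {
  assert(poolSize > 0);
  auto const pool = makeQueues(poolSize);
  ToyPerQueueCache cache;
  for (std::size_t event = 0; event < nEvents; ++event)
    cache.use(pool[event % poolSize]);
  return cache.queuesSeen();
}

// Fixed association: framework stream i always uses queue i, so the
// library never sees more queues than there are framework streams.
inline std::size_t queuesSeenWithFixedQueues(std::size_t nEvents, std::size_t nStreams) {
  assert(nStreams > 0);
  auto const queues = makeQueues(nStreams);
  ToyPerQueueCache cache;
  for (std::size_t event = 0; event < nEvents; ++event) {
    std::size_t const streamId = event % nStreams;  // stream handling this event
    cache.use(queues[streamId]);
  }
  return cache.queuesSeen();
}
```

In this toy model, 12 events drawn from a pool of 8 queues make the cache see 8 queues, while 12 events on 2 framework streams with fixed queues make it see only 2.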

For more background information, see the presentation "ML inference on GPUs in CMSSW with PyTorch" by @EmanueleCoradin at the CMS developments with GPUs on March 30th, 2026.

PR validation:

All unit tests pass.

@cmsbuild
Contributor

cmsbuild commented Apr 1, 2026

cms-bot internal usage

@fwyzard
Contributor Author

fwyzard commented Apr 1, 2026

enable gpu

@fwyzard
Contributor Author

fwyzard commented Apr 1, 2026

please test

@cmsbuild
Contributor

cmsbuild commented Apr 1, 2026

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50627/48825

ERROR: Build errors found during clang-tidy run.

Suppressed 1322 warnings (1318 in non-user code, 4 NOLINT).
--
src/HeterogeneousCore/AlpakaCore/interface/alpaka/stream/FixedQueueEDProducer.h:32:11: error: 'maybe_unused' attribute cannot be applied to a statement [clang-diagnostic-error]
   32 |         [[maybe_unused]] ev.queue();
      |           ^              ~~
Suppressed 2968 warnings (2964 in non-user code, 4 NOLINT).
--
src/HeterogeneousCore/AlpakaCore/interface/alpaka/stream/FixedQueueEDProducer.h:32:11: error: 'maybe_unused' attribute cannot be applied to a statement [clang-diagnostic-error]
   32 |         [[maybe_unused]] ev.queue();
      |           ^              ~~
Suppressed 2966 warnings (2962 in non-user code, 4 NOLINT).
--
src/HeterogeneousCore/AlpakaCore/interface/alpaka/stream/FixedQueueEDProducer.h:32:11: error: 'maybe_unused' attribute cannot be applied to a statement [clang-diagnostic-error]
   32 |         [[maybe_unused]] ev.queue();
      |           ^              ~~
Suppressed 2974 warnings (2970 in non-user code, 4 NOLINT).
--
src/HeterogeneousCore/AlpakaCore/interface/alpaka/stream/FixedQueueEDProducer.h:32:11: error: 'maybe_unused' attribute cannot be applied to a statement [clang-diagnostic-error]
   32 |         [[maybe_unused]] ev.queue();
      |           ^              ~~
Suppressed 2966 warnings (2962 in non-user code, 4 NOLINT).
--
gmake: *** [config/SCRAM/GMake/Makefile.coderules:129: code-checks] Error 2
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 2
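
For context on the diagnostic above: `[[maybe_unused]]` can annotate a declaration (for example a variable), but not an expression statement such as `ev.queue();`. The sketch below shows two common ways to call a function and deliberately discard its result. It is not taken from the PR, and which fix was actually applied in the follow-up commits is not shown here. `ToyEvent` and `ToyQueue` are hypothetical stand-ins for the real types.

```cpp
#include <cassert>

// Hypothetical stand-ins; only the shape of queue() matters here.
struct ToyQueue {
  int id;
};

class ToyEvent {
public:
  ToyQueue const& queue() {
    ++calls_;  // count calls so the effect of each option can be observed
    return queue_;
  }
  int calls() const { return calls_; }

private:
  ToyQueue queue_{0};
  int calls_ = 0;
};

// Rejected by clang, as in the log above:
//   [[maybe_unused]] ev.queue();

// Option 1: cast the discarded result to void.
inline void touchQueueWithCast(ToyEvent& ev) { static_cast<void>(ev.queue()); }

// Option 2: bind the result to a named variable, which the attribute may annotate.
inline void touchQueueWithVariable(ToyEvent& ev) {
  [[maybe_unused]] auto const& queue = ev.queue();
}
```

Both options still invoke `queue()` exactly once, so any side effect of the call is preserved.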

fwyzard added 3 commits April 1, 2026 23:14
stream::FixedQueueEDProducer is a stream EDProducer with a fixed association of
device queues to framework streams.
This ensures that PyTorch sees only a limited number of device streams,
reducing the overall device memory utilisation.
@fwyzard
Contributor Author

fwyzard commented Apr 1, 2026

please test

@cmsbuild
Contributor

cmsbuild commented Apr 1, 2026

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50627/48826

@cmsbuild
Contributor

cmsbuild commented Apr 1, 2026

A new Pull Request was created by @fwyzard for master.

It involves the following packages:

  • HeterogeneousCore/AlpakaCore (heterogeneous)
  • HeterogeneousCore/AlpakaTest (heterogeneous)
  • PhysicsTools/PyTorchAlpakaTest (heterogeneous, ml)

@fwyzard, @hjkwon260, @makortel, @valsdav, @y19y19 can you please review it and eventually sign? Thanks.
@makortel, @rovere this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@cmsbuild
Contributor

cmsbuild commented Apr 2, 2026

+1

Size: This PR adds an extra 44KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-35ac0e/52411/summary.html
COMMIT: 8b5045e
CMSSW: CMSSW_16_1_X_2026-04-01-1100/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/50627/52411/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-35ac0e/52411/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-35ac0e/52411/git-merge-result

Comparison Summary

The workflows 2025.0010001, 2025.0000002, 2024.0070001, 2024.0060001, 2024.0050001, 2024.0040001, 2024.0030001, 2024.0020001, 2024.0010001, 2024.0000001, 2023.0020001 have different files in step1_dasquery.log than the ones found in the baseline. You may want to check and retrigger the tests if necessary. You can check it in the "files" directory in the results of the comparisons

Summary:

  • You potentially removed 299 lines from the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 41482 differences found in the comparisons
  • DQMHistoTests: Total files compared: 52
  • DQMHistoTests: Total histograms compared: 3449714
  • DQMHistoTests: Total failures: 162
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3449532
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 1.876 KiB( 40 files compared)
  • DQMHistoSizes: changed ( 18434.0,... ): 0.938 KiB HLT/ScoutingOffline
  • Checked 223 log files, 193 edm output root files, 52 DQM output files
  • TriggerResults: no differences found

AMD_MI300X Comparison Summary

Summary:

AMD_W7900 Comparison Summary

Summary:

NVIDIA_H100 Comparison Summary

Summary:

NVIDIA_L40S Comparison Summary

Summary:

Max Memory Comparisons exceeding threshold

@cms-sw/core-l2 , I found 6 workflow step(s) with memory usage exceeding the error threshold:

  • Error: Workflow 2023.0020001_RunJetMET02023D_10k step3 max memory diff 329.9 exceeds +/- 90.0 MiB
  • Error: Workflow 2024.0000001_RunZeroBias2024B_10k step3 max memory diff -96.5 exceeds +/- 90.0 MiB
  • Error: Workflow 2024.0010001_RunJetMET02024C_10k step3 max memory diff 111.9 exceeds +/- 90.0 MiB
  • Error: Workflow 2024.0030001_RunDisplacedJet2024E_10k step3 max memory diff 184.7 exceeds +/- 90.0 MiB
  • Error: Workflow 2024.0050001_RunBTagMu2024G_10k step3 max memory diff 110.3 exceeds +/- 90.0 MiB
  • Error: Workflow 2025.0010001_RunJetMET02025C_10k step3 max memory diff 179.0 exceeds +/- 90.0 MiB
