Implement `stream::FixedQueueEDProducer` by fwyzard · Pull Request #50627 · cms-sw/cmssw

fwyzard · 2026-04-01T21:05:24Z

PR description:

Implement a new kind of alpaka stream::EDProducer with a fixed association of device queues (e.g. CUDA streams) to framework streams.

This is useful for using external software that associates resources to the device queues, for example the PyTorch device memory caching allocator.

Migrating the PyTorch alpaka modules from stream::EDProducer to stream::FixedQueueEDProducer ensures that PyTorch sees only a limited number of device queues, reducing the overall device memory utilisation.
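To make the idea of a fixed association concrete, here is a minimal sketch and not the code from this PR. `ToyQueue` and `FixedQueueTable` are hypothetical names invented for illustration; they are not CMSSW or alpaka types. The sketch only shows the invariant described above: each framework stream index always maps to the same queue object, so an external library that keys its resources on the queue sees at most one queue per stream.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a device queue (e.g. a CUDA stream).
// This is NOT an alpaka or CMSSW type.
struct ToyQueue {
  int id;
};

// Hypothetical helper: owns exactly one queue per framework stream and
// always hands back the same queue for a given stream index.
class FixedQueueTable {
public:
  explicit FixedQueueTable(std::size_t nStreams) {
    queues_.reserve(nStreams);
    for (std::size_t i = 0; i < nStreams; ++i) {
      queues_.push_back(std::make_shared<ToyQueue>(ToyQueue{static_cast<int>(i)}));
    }
  }

  // Same stream index in, same queue out, for every event.
  // Throws std::out_of_range if the index is not a valid stream.
  std::shared_ptr<ToyQueue> queueFor(std::size_t streamIndex) const {
    return queues_.at(streamIndex);
  }

  // Upper bound on the number of distinct queues a client can ever observe.
  std::size_t size() const { return queues_.size(); }

private:
  std::vector<std::shared_ptr<ToyQueue>> queues_;
};
```

With a table built for four streams, repeated calls to `queueFor(2)` return the same object, and the number of distinct queues is bounded by the number of framework streams rather than by the number of events in flight.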

For more background information, see the presentation "ML inference on GPUs in CMSSW with PyTorch" by @EmanueleCoradin at the "CMS developments with GPUs" on March 30th, 2026.

PR validation:

All unit tests pass.

cmsbuild · 2026-04-01T21:05:50Z

cms-bot internal usage

fwyzard · 2026-04-01T21:08:09Z

enable gpu

fwyzard · 2026-04-01T21:08:12Z

please test

cmsbuild · 2026-04-01T21:08:16Z

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50627/48825

ERROR: Build errors found during clang-tidy run.

Suppressed 1322 warnings (1318 in non-user code, 4 NOLINT).
--
src/HeterogeneousCore/AlpakaCore/interface/alpaka/stream/FixedQueueEDProducer.h:32:11: error: 'maybe_unused' attribute cannot be applied to a statement [clang-diagnostic-error]
   32 |         [[maybe_unused]] ev.queue();
      |           ^              ~~
Suppressed 2968 warnings (2964 in non-user code, 4 NOLINT).
--
src/HeterogeneousCore/AlpakaCore/interface/alpaka/stream/FixedQueueEDProducer.h:32:11: error: 'maybe_unused' attribute cannot be applied to a statement [clang-diagnostic-error]
   32 |         [[maybe_unused]] ev.queue();
      |           ^              ~~
Suppressed 2966 warnings (2962 in non-user code, 4 NOLINT).
--
src/HeterogeneousCore/AlpakaCore/interface/alpaka/stream/FixedQueueEDProducer.h:32:11: error: 'maybe_unused' attribute cannot be applied to a statement [clang-diagnostic-error]
   32 |         [[maybe_unused]] ev.queue();
      |           ^              ~~
Suppressed 2974 warnings (2970 in non-user code, 4 NOLINT).
--
src/HeterogeneousCore/AlpakaCore/interface/alpaka/stream/FixedQueueEDProducer.h:32:11: error: 'maybe_unused' attribute cannot be applied to a statement [clang-diagnostic-error]
   32 |         [[maybe_unused]] ev.queue();
      |           ^              ~~
Suppressed 2966 warnings (2962 in non-user code, 4 NOLINT).
--
gmake: *** [config/SCRAM/GMake/Makefile.coderules:129: code-checks] Error 2
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 2
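The diagnostic above arises because `[[maybe_unused]]` appertains to declarations, not to expression statements. The log does not show how the line was changed before the next code-checks run, so the following is only a sketch of two common ways to call a function and deliberately discard its result. `ToyEvent` and `touchQueue` are hypothetical names, not the actual event type or code in `FixedQueueEDProducer.h`.

```cpp
// Hypothetical stand-in for an event type exposing a queue() accessor.
struct ToyEvent {
  int queueId = 7;
  int calls = 0;

  // Counts how often it is called so the effect of each call is observable.
  int queue() {
    ++calls;
    return queueId;
  }
};

inline void touchQueue(ToyEvent& ev) {
  // Option 1: an explicit cast to void discards the result of the call.
  static_cast<void>(ev.queue());

  // Option 2: bind the result to a declaration; a variable declaration
  // is a valid target for [[maybe_unused]].
  [[maybe_unused]] auto q = ev.queue();
}
```

Either form compiles cleanly, while `[[maybe_unused]] ev.queue();` is rejected because the attribute is attached to a statement.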

fwyzard · 2026-04-01T21:15:51Z

please test

cmsbuild · 2026-04-01T21:17:56Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50627/48826

There are other open Pull requests which might conflict with changes you have proposed:
- File HeterogeneousCore/AlpakaCore/README.md modified in PR(s): Various fixes for heterogeneous utilities #47605
- File HeterogeneousCore/AlpakaCore/interface/alpaka/EDMetadataSentry.h modified in PR(s): Add an MPISenderPortable and MPIReceiverPortable modules to send/receive arbitrary device collections #50503
- File HeterogeneousCore/AlpakaCore/src/alpaka/EDMetadataSentry.cc modified in PR(s): [hack] Force one GPU queue per framework stream #49547

cmsbuild · 2026-04-01T21:18:18Z

A new Pull Request was created by @fwyzard for master.

It involves the following packages:

HeterogeneousCore/AlpakaCore (heterogeneous)
HeterogeneousCore/AlpakaTest (heterogeneous)
PhysicsTools/PyTorchAlpakaTest (heterogeneous, ml)

@fwyzard, @hjkwon260, @makortel, @valsdav, @y19y19 can you please review it and eventually sign? Thanks.
@makortel, @rovere this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

cmsbuild · 2026-04-02T01:42:38Z

+1

Size: This PR adds an extra 44KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-35ac0e/52411/summary.html
COMMIT: 8b5045e
CMSSW: CMSSW_16_1_X_2026-04-01-1100/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/50627/52411/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-35ac0e/52411/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-35ac0e/52411/git-merge-result

Comparison Summary

The workflows 2025.0010001, 2025.0000002, 2024.0070001, 2024.0060001, 2024.0050001, 2024.0040001, 2024.0030001, 2024.0020001, 2024.0010001, 2024.0000001, 2023.0020001 have different files in step1_dasquery.log than the ones found in the baseline. You may want to check and retrigger the tests if necessary. You can check it in the "files" directory in the results of the comparisons

Summary:

You potentially removed 299 lines from the logs
ROOTFileChecks: Some differences in event products or their sizes found
Reco comparison results: 41482 differences found in the comparisons
DQMHistoTests: Total files compared: 52
DQMHistoTests: Total histograms compared: 3449714
DQMHistoTests: Total failures: 162
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3449532
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 1.876 KiB( 40 files compared)
DQMHistoSizes: changed ( 18434.0,... ): 0.938 KiB HLT/ScoutingOffline
Checked 223 log files, 193 edm output root files, 52 DQM output files
TriggerResults: no differences found

AMD_MI300X Comparison Summary

Summary:

You potentially removed 26 lines from the logs
Reco comparison results: 351 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216539
DQMHistoTests: Total failures: 39130
DQMHistoTests: Total nulls: 29
DQMHistoTests: Total successes: 177380
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: found differences in 3 / 12 workflows

AMD_W7900 Comparison Summary

Summary:

You potentially added 29 lines to the logs
Reco comparison results: 367 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216539
DQMHistoTests: Total failures: 30336
DQMHistoTests: Total nulls: 39
DQMHistoTests: Total successes: 186164
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: found differences in 1 / 12 workflows

NVIDIA_H100 Comparison Summary

Summary:

You potentially removed 17 lines from the logs
Reco comparison results: 367 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216539
DQMHistoTests: Total failures: 30937
DQMHistoTests: Total nulls: 35
DQMHistoTests: Total successes: 185567
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: no differences found

NVIDIA_L40S Comparison Summary

Summary:

You potentially added 20 lines to the logs
Reco comparison results: 366 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216539
DQMHistoTests: Total failures: 29805
DQMHistoTests: Total nulls: 28
DQMHistoTests: Total successes: 186706
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: no differences found

Max Memory Comparisons exceeding threshold

@cms-sw/core-l2 , I found 6 workflow step(s) with memory usage exceeding the error threshold:

Error: Workflow 2023.0020001_RunJetMET02023D_10k step3 max memory diff 329.9 exceeds +/- 90.0 MiB
Error: Workflow 2024.0000001_RunZeroBias2024B_10k step3 max memory diff -96.5 exceeds +/- 90.0 MiB
Error: Workflow 2024.0010001_RunJetMET02024C_10k step3 max memory diff 111.9 exceeds +/- 90.0 MiB
Error: Workflow 2024.0030001_RunDisplacedJet2024E_10k step3 max memory diff 184.7 exceeds +/- 90.0 MiB
Error: Workflow 2024.0050001_RunBTagMu2024G_10k step3 max memory diff 110.3 exceeds +/- 90.0 MiB
Error: Workflow 2025.0010001_RunJetMET02025C_10k step3 max memory diff 179.0 exceeds +/- 90.0 MiB

Add an EDMetadataSentry constructor with an explicit queue

9c1d5da

cmsbuild added this to the CMSSW_16_1_X milestone Apr 1, 2026

cmsbuild added pending-signatures tests-pending orp-pending code-checks-pending heterogeneous-pending ml-pending labels Apr 1, 2026

cmsbuild added tests-started code-checks-rejected and removed tests-pending code-checks-pending labels Apr 1, 2026

fwyzard added 3 commits April 1, 2026 23:14

Implement stream::FixedQueueEDProducer

db12e85

stream::FixedQueueEDProducer is a stream EDProducer with a fixed association of device queues to framework streams.

Add a unit test for stream::FixedQueueEDProducer

fc6daac

Migrate PyTorch alpaka modules to stream::FixedQueueEDProducer

8b5045e

This ensures that PyTorch sees only a limited number of device streams, reducing the overall device memory utilisation.

fwyzard force-pushed the FixedQueueEDProducer branch from 0dff13b to 8b5045e Compare April 1, 2026 21:14

cmsbuild added tests-pending code-checks-pending and removed tests-started code-checks-rejected labels Apr 1, 2026

cmsbuild added tests-started and removed tests-pending labels Apr 1, 2026

cmsbuild added code-checks-approved and removed code-checks-pending labels Apr 1, 2026

cmsbuild added tests-approved and removed tests-started labels Apr 2, 2026

fwyzard wants to merge 4 commits into cms-sw:master from fwyzard:FixedQueueEDProducer
