@lukaszmichalskii
Contributor

@lukaszmichalskii lukaszmichalskii commented Apr 30, 2025

PR description:

This PR enables seamless integration between PyTorch and the Alpaka-based heterogeneous computing backend, supporting inference workflows that use the PyTorch library with PortableCollection objects. It provides:

  • Compatibility with Alpaka device/queue abstractions.
  • Automatic conversion of optimized SoA layouts to torch tensors, reusing the underlying memory blob instead of copying it.
  • Just-in-time (JIT) model execution, plus a proof-of-concept ahead-of-time (AOT) solution.
  • Single-threading and CUDA stream management, handled by QueueGuard objects specialized for each supported backend.
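
To illustrate what "reusing the memory blob" can mean for an SoA layout, here is a minimal sketch using only the C++ standard library. `StridedView2D` and `makeView` are hypothetical names introduced for illustration, not types from this PR. The sketch assumes equally sized float columns stored back to back, each padded to `colStride` elements. It exposes them as a rows × cols matrix by choosing strides, without copying any data. In libtorch, `torch::from_blob` with explicit sizes and strides plays a comparable role, but whether the PR uses it in exactly this way is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a strided tensor view. It does not own memory.
struct StridedView2D {
  float* data;            // start of the first column inside the SoA blob
  std::size_t rows;       // number of SoA elements
  std::size_t cols;       // number of columns exposed as features
  std::size_t rowStride;  // elements between consecutive rows (1 for SoA)
  std::size_t colStride;  // elements between consecutive columns (incl. padding)

  // Element (r, c) is addressed purely through strides, so no copy is needed.
  float& at(std::size_t r, std::size_t c) const {
    assert(r < rows && c < cols);
    return data[r * rowStride + c * colStride];
  }
};

// Wrap an existing SoA blob. The column stride may exceed `rows` when the
// columns are padded for alignment (an assumption of this sketch).
inline StridedView2D makeView(float* blob, std::size_t rows, std::size_t cols,
                              std::size_t colStride) {
  assert(colStride >= rows);
  return StridedView2D{blob, rows, cols, 1, colStride};
}
```

Because the view only stores a pointer and strides, writes through `at()` land directly in the original buffer, which is the property that makes a conversion of this kind cheap.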

This implementation was presented and discussed at:

PR validation:

The PyTorchAlpakaTest package includes demonstration code showing the interoperability between SoA constructs, the PyTorch C++ API, and the CMSSW environment.

PyTorch Ahead-of-time compilation

This pull request also investigates an AOT compilation strategy, but that part is a proof of concept (beta) and not yet ready for production use.

GPU support

The CUDA backend is supported and tested; ROCm is not yet supported (see cms-sw/cmsdist#9786). To ensure that pipelines running on AMD nodes are not left without inference capability, a CPU fallback is implemented. This fallback transfers the inference data to the host (explicitly synchronizing with alpaka::wait()), executes inference on the CPU, and then copies the results back to the output buffer.
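
The order of operations in the fallback can be sketched as follows, using only the C++ standard library. `MockQueue`, `runModelOnHost`, and `inferWithCpuFallback` are hypothetical names made up for this illustration. The mock queue defers work until `wait()` is called, standing in for an asynchronous device queue, and the "model" simply doubles its input. The real implementation works on Alpaka buffers and a PyTorch model, so this is only a shape of the control flow under those assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for an asynchronous device queue: enqueued work is
// only guaranteed to have run once wait() returns.
class MockQueue {
public:
  void enqueue(std::function<void()> task) { pending_.push_back(std::move(task)); }
  void wait() {
    for (auto& task : pending_) task();
    pending_.clear();
  }

private:
  std::vector<std::function<void()>> pending_;
};

// Hypothetical host-side "model": doubles every input value.
inline void runModelOnHost(const std::vector<float>& in, std::vector<float>& out) {
  out.resize(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) out[i] = 2.0f * in[i];
}

// CPU fallback: device -> host, synchronize, infer on host, host -> device.
inline void inferWithCpuFallback(MockQueue& queue,
                                 const std::vector<float>& deviceInput,
                                 std::vector<float>& deviceOutput) {
  std::vector<float> hostInput, hostOutput;
  queue.enqueue([&] { hostInput = deviceInput; });    // asynchronous copy to host
  queue.wait();                                       // host data is valid only after this
  runModelOnHost(hostInput, hostOutput);              // inference on the CPU
  queue.enqueue([&] { deviceOutput = hostOutput; });  // asynchronous copy back
  queue.wait();  // in this sketch hostOutput is local, so the copy must finish here
}
```

The first synchronization is the important one: without it the host-side model could read the staging buffer before the copy has completed. The second wait is a choice of this sketch, needed because the staging buffers are local variables.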

FYI @valsdav @ericcano @felicepantaleo @chrisizeh @leobeltra

@cmsbuild
Contributor

cmsbuild commented Apr 30, 2025

cms-bot internal usage

@cmsbuild
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47984/44654

Code check has found code style and quality issues which could be resolved by applying following patch(s)

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47984/44655

@cmsbuild
Contributor

A new Pull Request was created by @lukaszmichalskii for master.

It involves the following packages:

  • DataFormats/PyTorchTest (****)
  • PhysicsTools/PyTorch (ml)

The following packages do not have a category, yet:

DataFormats/PyTorchTest
Please create a PR for https://github.com/cms-sw/cms-bot/blob/master/categories_map.py to assign category

@cmsbuild, @valsdav, @y19y19 can you please review it and eventually sign? Thanks.
@missirol, @mmusich, @rovere this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@lukaszmichalskii lukaszmichalskii changed the title from "Integrating PyTorch in Alpakac heterogeneous core" to "Integrating PyTorch in Alpaka heterogeneous core" Apr 30, 2025
@valsdav
Contributor

valsdav commented May 1, 2025

enable gpu

@valsdav
Contributor

valsdav commented May 7, 2025

please test

@cmsbuild
Contributor

cmsbuild commented May 7, 2025

-1

Failed Tests: Build ClangBuild
Size: This PR adds an extra 20KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c68a5d/45909/summary.html
COMMIT: c01d07f
CMSSW: CMSSW_15_1_X_2025-05-06-2300/el8_amd64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/47984/45909/install.sh to create a dev area with all the needed externals and cmssw changes.

Build

I found compilation error when building:

>> Compiling  src/PhysicsTools/PyTorch/test/testModel.cc
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/bin/c++ -c -DCMS_MICRO_ARCH='x86-64-v3' -DGNU_GCC -D_GNU_SOURCE -DCMSSW_GIT_HASH='CMSSW_15_1_X_2025-05-06-2300' -DPROJECT_NAME='CMSSW' -DPROJECT_VERSION='CMSSW_15_1_X_2025-05-06-2300' -Isrc -Ipoison -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-05-06-2300/src -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/pytorch/2.6.0-a6d0e4413a9e766b40a2b79f83b4b176/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/pytorch/2.6.0-a6d0e4413a9e766b40a2b79f83b4b176/include/torch/csrc/api/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/cppunit/1.15.x-25a760f1303b0fca73df75b14e1358bc/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/cuda/12.8.1-f1c01abd08373a07ceeffab8d5f1930a/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/protobuf/3.21.9-1126508a53768c90e66f6bf1821ac03a/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/zlib/1.2.13-d217cdbdd8d586e845e05946de2796be/include -O3 -pthread -pipe -Werror=main -Werror=pointer-arith -Werror=overlength-strings -Wno-vla -Werror=overflow -std=c++20 -ftree-vectorize -Werror=array-bounds -Werror=format-contains-nul -Werror=type-limits -fvisibility-inlines-hidden -fno-math-errno --param vect-max-version-for-alias-checks=50 -Xassembler --compress-debug-sections -Wno-error=array-bounds -Warray-bounds -fuse-ld=bfd -march=x86-64-v3 -felide-constructors -fmessage-length=0 -Wall -Wno-non-template-friend -Wno-long-long -Wreturn-type -Wextra -Wpessimizing-move -Wclass-memaccess -Wno-cast-function-type -Wno-unused-but-set-parameter -Wno-ignored-qualifiers -Wno-unused-parameter -Wunused -Wparentheses -Werror=return-type -Werror=missing-braces -Werror=unused-value -Werror=unused-label -Werror=address -Werror=format 
-Werror=sign-compare -Werror=write-strings -Werror=delete-non-virtual-dtor -Werror=strict-aliasing -Werror=narrowing -Werror=unused-but-set-variable -Werror=reorder -Werror=unused-variable -Werror=conversion-null -Werror=return-local-addr -Wnon-virtual-dtor -Werror=switch -fdiagnostics-show-option -Wno-unused-local-typedefs -Wno-attributes -Wno-psabi -DBOOST_DISABLE_ASSERTS -flto=auto -fipa-icf -flto-odr-type-merging -fno-fat-lto-objects -Wodr -fPIC -MMD -MF tmp/el8_amd64_gcc12/src/PhysicsTools/PyTorch/test/testModel/testModel.cc.d src/PhysicsTools/PyTorch/test/testModel.cc -o tmp/el8_amd64_gcc12/src/PhysicsTools/PyTorch/test/testModel/testModel.cc.o
>> Compiling  src/PhysicsTools/PyTorch/test/testRunner.cc
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/bin/c++ -c -DCMS_MICRO_ARCH='x86-64-v3' -DGNU_GCC -D_GNU_SOURCE -DCMSSW_GIT_HASH='CMSSW_15_1_X_2025-05-06-2300' -DPROJECT_NAME='CMSSW' -DPROJECT_VERSION='CMSSW_15_1_X_2025-05-06-2300' -Isrc -Ipoison -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-05-06-2300/src -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/pytorch/2.6.0-a6d0e4413a9e766b40a2b79f83b4b176/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/pytorch/2.6.0-a6d0e4413a9e766b40a2b79f83b4b176/include/torch/csrc/api/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/cppunit/1.15.x-25a760f1303b0fca73df75b14e1358bc/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/cuda/12.8.1-f1c01abd08373a07ceeffab8d5f1930a/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/protobuf/3.21.9-1126508a53768c90e66f6bf1821ac03a/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/zlib/1.2.13-d217cdbdd8d586e845e05946de2796be/include -O3 -pthread -pipe -Werror=main -Werror=pointer-arith -Werror=overlength-strings -Wno-vla -Werror=overflow -std=c++20 -ftree-vectorize -Werror=array-bounds -Werror=format-contains-nul -Werror=type-limits -fvisibility-inlines-hidden -fno-math-errno --param vect-max-version-for-alias-checks=50 -Xassembler --compress-debug-sections -Wno-error=array-bounds -Warray-bounds -fuse-ld=bfd -march=x86-64-v3 -felide-constructors -fmessage-length=0 -Wall -Wno-non-template-friend -Wno-long-long -Wreturn-type -Wextra -Wpessimizing-move -Wclass-memaccess -Wno-cast-function-type -Wno-unused-but-set-parameter -Wno-ignored-qualifiers -Wno-unused-parameter -Wunused -Wparentheses -Werror=return-type -Werror=missing-braces -Werror=unused-value -Werror=unused-label -Werror=address -Werror=format 
-Werror=sign-compare -Werror=write-strings -Werror=delete-non-virtual-dtor -Werror=strict-aliasing -Werror=narrowing -Werror=unused-but-set-variable -Werror=reorder -Werror=unused-variable -Werror=conversion-null -Werror=return-local-addr -Wnon-virtual-dtor -Werror=switch -fdiagnostics-show-option -Wno-unused-local-typedefs -Wno-attributes -Wno-psabi -DBOOST_DISABLE_ASSERTS -flto=auto -fipa-icf -flto-odr-type-merging -fno-fat-lto-objects -Wodr -fPIC -MMD -MF tmp/el8_amd64_gcc12/src/PhysicsTools/PyTorch/test/testModel/testRunner.cc.d src/PhysicsTools/PyTorch/test/testRunner.cc -o tmp/el8_amd64_gcc12/src/PhysicsTools/PyTorch/test/testModel/testRunner.cc.o
In file included from src/PhysicsTools/PyTorch/test/testModel.cc:5:
src/PhysicsTools/PyTorch/test/testUtilities.h:4:10: fatal error: boost/filesystem.hpp: No such file or directory
    4 | #include <boost/filesystem.hpp>
      |          ^~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
gmake: *** [tmp/el8_amd64_gcc12/src/PhysicsTools/PyTorch/test/testModel/testModel.cc.o] Error 1
>> Building binary testModel


Clang Build

I found compilation error while trying to compile with clang. Command used:

USER_CUDA_FLAGS='--expt-relaxed-constexpr' USER_CXXFLAGS='-Wno-register -fsyntax-only' /usr/bin/time -v scram build -k -j 32 COMPILER='llvm compile'

>> Local Products Rules ..... done
>> Creating project symlinks
>> Entering Package PhysicsTools/PyTorch
>> Entering Package DataFormats/PyTorchTest
>> Compile sequence completed for CMSSW CMSSW_15_1_X_2025-05-06-2300
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 1
Command exited with non-zero status 1
	Command being timed: "scram build -k -j 32 COMPILER=llvm compile BUILD_LOG=yes"
	User time (seconds): 907.23
	System time (seconds): 91.81
	Percent of CPU this job got: 655%


@valsdav
Contributor

valsdav commented Nov 21, 2025

+ml

@ftenchini

Does this still count as "new package pending"?

@makortel
Contributor

Does this still count as "new package pending"?

Yes, the

PhysicsTools/PyTorchAlpaka
PhysicsTools/PyTorchAlpakaTest

need to be added to the bot. Should these be only for ml, or also for heterogeneous?

@fwyzard
Contributor

fwyzard commented Nov 25, 2025

I guess also heterogeneous?

@makortel
Contributor

I opened cms-sw/cms-bot#2624 to add the two packages for heterogeneous and ml

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @mandrenguyen, @sextonkennedy, @ftenchini (and backports should be raised in the release meeting by the corresponding L2)
Notice This PR was tested with additional Pull Request(s), please also merge them if necessary: cms-data/PhysicsTools-PyTorchAlpaka#1, cms-data/PhysicsTools-PyTorchAlpakaTest#1, cms-data/PhysicsTools-PyTorch#1

@makortel
Contributor

Does this still count as "new package pending"?

The package assignment has now been completed.

@mandrenguyen
Contributor

@cms-sw/heterogeneous-l2 @cms-sw/ml-l2 It takes some scrolling, but it seems the requires-external flag is triggered because of cms-data/PhysicsTools-PyTorchAlpakaTest#1.
Would you mind signing that one as well, if you're ok with it?

@fwyzard
Contributor

fwyzard commented Nov 26, 2025

Sure - I didn't know we got the possibility of signing the externals :-)

@mandrenguyen
Contributor

+1

@cmsbuild cmsbuild merged commit 8945253 into cms-sw:master Nov 26, 2025
26 checks passed
@mandrenguyen
Contributor

I see a couple of add-on test failures from last night's IB related to PyTorchAlpaka:
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc13/CMSSW_16_0_X_2025-11-26-2300/unitTestLogs/PhysicsTools/PyTorchAlpaka#/

Presumably that's related to this PR. We are trying to build the next 16_0_0 pre-release this week. Is this something that can be fixed on a short time scale?

@valsdav
Contributor

valsdav commented Nov 27, 2025

Hi @mandrenguyen, I think the PR cms-data/PhysicsTools-PyTorchAlpaka#1 got missed, and it is needed for those tests.

@mandrenguyen
Contributor

Hi @mandrenguyen, I think the PR cms-data/PhysicsTools-PyTorchAlpaka#1 got missed, and it is needed for those tests.

Thanks @valsdav. It seems we got a bit confused by having both:
cms-data/PhysicsTools-PyTorchAlpaka#1
and
cms-data/PhysicsTools-PyTorchAlpakaTest#1

Is it really necessary to have both?

@valsdav
Contributor

valsdav commented Nov 27, 2025

@mandrenguyen I think they are both necessary, as the PyTorchAlpakaTest package contains a few example producers and configurations, which we wanted to keep separate from the core implementation.

@mandrenguyen
Contributor

mandrenguyen commented Nov 27, 2025

@mandrenguyen I think they are both necessary, as the PyTorchAlpakaTest package contains a few example producers and configurations, which we wanted to keep separate from the core implementation.

It's the first time I see the same external file having to be added to two packages in order to integrate a PR. It's not clear to me why one would like to keep things separate from the core implementation, but nonetheless in the main cmssw instead of some fork. Of course I don't know all the details, so maybe this is completely justified. But since this is the only case, it's perhaps worth thinking about whether it really needs to be done this way.

@fwyzard
Contributor

fwyzard commented Nov 27, 2025

@mandrenguyen I'm not sure of the content, but what I see is that

under PhysicsTools/PyTorchAlpakaTest we added

  • MaskedNet.pt
  • MultiHeadNet.pt
  • SimpleNet.pt
  • TinyResNet.pt

which are used by PhysicsTools/PyTorchAlpakaTest/test/...,

while

under PhysicsTools/PyTorchAlpaka we added

  • linear_dnn.pt

which is used by PhysicsTools/PyTorchAlpaka/test/alpaka/....

So, insofar as the two packages are separate, it makes sense to have two externals 🤷🏻‍♂️

@mandrenguyen
Contributor

Sure, by construction. Look, it was just a general comment to think about whether this can be streamlined, as it could lead to confusion in the future. If it's not the case, it's not the case.
