Integrating PyTorch in Alpaka heterogeneous core #47984
|
cms-bot internal usage |
|
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47984/44654
Code check has found code style and quality issues which could be resolved by applying following patch(s)
|
46072a6 to c01d07f
|
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47984/44655
|
|
A new Pull Request was created by @lukaszmichalskii for master. It involves the following packages:
The following packages do not have a category, yet: DataFormats/PyTorchTest @cmsbuild, @valsdav, @y19y19 can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
|
enable gpu |
|
please test |
|
-1
Failed Tests: Build ClangBuild
Build
I found compilation error when building:
>> Compiling src/PhysicsTools/PyTorch/test/testModel.cc
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/bin/c++ -c -DCMS_MICRO_ARCH='x86-64-v3' -DGNU_GCC -D_GNU_SOURCE -DCMSSW_GIT_HASH='CMSSW_15_1_X_2025-05-06-2300' -DPROJECT_NAME='CMSSW' -DPROJECT_VERSION='CMSSW_15_1_X_2025-05-06-2300' -Isrc -Ipoison -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-05-06-2300/src -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/pytorch/2.6.0-a6d0e4413a9e766b40a2b79f83b4b176/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/pytorch/2.6.0-a6d0e4413a9e766b40a2b79f83b4b176/include/torch/csrc/api/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/cppunit/1.15.x-25a760f1303b0fca73df75b14e1358bc/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/cuda/12.8.1-f1c01abd08373a07ceeffab8d5f1930a/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/protobuf/3.21.9-1126508a53768c90e66f6bf1821ac03a/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/zlib/1.2.13-d217cdbdd8d586e845e05946de2796be/include -O3 -pthread -pipe -Werror=main -Werror=pointer-arith -Werror=overlength-strings -Wno-vla -Werror=overflow -std=c++20 -ftree-vectorize -Werror=array-bounds -Werror=format-contains-nul -Werror=type-limits -fvisibility-inlines-hidden -fno-math-errno --param vect-max-version-for-alias-checks=50 -Xassembler --compress-debug-sections -Wno-error=array-bounds -Warray-bounds -fuse-ld=bfd -march=x86-64-v3 -felide-constructors -fmessage-length=0 -Wall -Wno-non-template-friend -Wno-long-long -Wreturn-type -Wextra -Wpessimizing-move -Wclass-memaccess -Wno-cast-function-type -Wno-unused-but-set-parameter -Wno-ignored-qualifiers -Wno-unused-parameter -Wunused -Wparentheses -Werror=return-type -Werror=missing-braces -Werror=unused-value -Werror=unused-label -Werror=address -Werror=format -Werror=sign-compare 
-Werror=write-strings -Werror=delete-non-virtual-dtor -Werror=strict-aliasing -Werror=narrowing -Werror=unused-but-set-variable -Werror=reorder -Werror=unused-variable -Werror=conversion-null -Werror=return-local-addr -Wnon-virtual-dtor -Werror=switch -fdiagnostics-show-option -Wno-unused-local-typedefs -Wno-attributes -Wno-psabi -DBOOST_DISABLE_ASSERTS -flto=auto -fipa-icf -flto-odr-type-merging -fno-fat-lto-objects -Wodr -fPIC -MMD -MF tmp/el8_amd64_gcc12/src/PhysicsTools/PyTorch/test/testModel/testModel.cc.d src/PhysicsTools/PyTorch/test/testModel.cc -o tmp/el8_amd64_gcc12/src/PhysicsTools/PyTorch/test/testModel/testModel.cc.o
>> Compiling src/PhysicsTools/PyTorch/test/testRunner.cc
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/bin/c++ -c -DCMS_MICRO_ARCH='x86-64-v3' -DGNU_GCC -D_GNU_SOURCE -DCMSSW_GIT_HASH='CMSSW_15_1_X_2025-05-06-2300' -DPROJECT_NAME='CMSSW' -DPROJECT_VERSION='CMSSW_15_1_X_2025-05-06-2300' -Isrc -Ipoison -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-05-06-2300/src -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/pytorch/2.6.0-a6d0e4413a9e766b40a2b79f83b4b176/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/pytorch/2.6.0-a6d0e4413a9e766b40a2b79f83b4b176/include/torch/csrc/api/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/cppunit/1.15.x-25a760f1303b0fca73df75b14e1358bc/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/cuda/12.8.1-f1c01abd08373a07ceeffab8d5f1930a/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/protobuf/3.21.9-1126508a53768c90e66f6bf1821ac03a/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/zlib/1.2.13-d217cdbdd8d586e845e05946de2796be/include -O3 -pthread -pipe -Werror=main -Werror=pointer-arith -Werror=overlength-strings -Wno-vla -Werror=overflow -std=c++20 -ftree-vectorize -Werror=array-bounds -Werror=format-contains-nul -Werror=type-limits -fvisibility-inlines-hidden -fno-math-errno --param vect-max-version-for-alias-checks=50 -Xassembler --compress-debug-sections -Wno-error=array-bounds -Warray-bounds -fuse-ld=bfd -march=x86-64-v3 -felide-constructors -fmessage-length=0 -Wall -Wno-non-template-friend -Wno-long-long -Wreturn-type -Wextra -Wpessimizing-move -Wclass-memaccess -Wno-cast-function-type -Wno-unused-but-set-parameter -Wno-ignored-qualifiers -Wno-unused-parameter -Wunused -Wparentheses -Werror=return-type -Werror=missing-braces -Werror=unused-value -Werror=unused-label -Werror=address -Werror=format -Werror=sign-compare 
-Werror=write-strings -Werror=delete-non-virtual-dtor -Werror=strict-aliasing -Werror=narrowing -Werror=unused-but-set-variable -Werror=reorder -Werror=unused-variable -Werror=conversion-null -Werror=return-local-addr -Wnon-virtual-dtor -Werror=switch -fdiagnostics-show-option -Wno-unused-local-typedefs -Wno-attributes -Wno-psabi -DBOOST_DISABLE_ASSERTS -flto=auto -fipa-icf -flto-odr-type-merging -fno-fat-lto-objects -Wodr -fPIC -MMD -MF tmp/el8_amd64_gcc12/src/PhysicsTools/PyTorch/test/testModel/testRunner.cc.d src/PhysicsTools/PyTorch/test/testRunner.cc -o tmp/el8_amd64_gcc12/src/PhysicsTools/PyTorch/test/testModel/testRunner.cc.o
In file included from src/PhysicsTools/PyTorch/test/testModel.cc:5:
src/PhysicsTools/PyTorch/test/testUtilities.h:4:10: fatal error: boost/filesystem.hpp: No such file or directory
4 | #include <boost/filesystem.hpp>
| ^~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
gmake: *** [tmp/el8_amd64_gcc12/src/PhysicsTools/PyTorch/test/testModel/testModel.cc.o] Error 1
>> Building binary testModel
Clang Build
I found compilation error while trying to compile with clang. Command used:
>> Local Products Rules ..... done
>> Creating project symlinks
>> Entering Package PhysicsTools/PyTorch
>> Entering Package DataFormats/PyTorchTest
>> Compile sequence completed for CMSSW CMSSW_15_1_X_2025-05-06-2300
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 1
Command exited with non-zero status 1
Command being timed: "scram build -k -j 32 COMPILER=llvm compile BUILD_LOG=yes"
User time (seconds): 907.23
System time (seconds): 91.81
Percent of CPU this job got: 655% |
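The fatal error above says that `testUtilities.h` includes `boost/filesystem.hpp`, which is not on the include path for this test. The log does not show what the header uses Boost for, so the following is only a sketch of one common way to avoid the dependency: switching to `std::filesystem`, which is available because the build already uses `-std=c++20`. The helper names `resolveTestFile` and `fileExists` are hypothetical and are not taken from the PR; declaring the Boost dependency in the package build file would be an equally valid fix.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: join a directory and a file name into a normalized
// path string, using '/' as the separator on every platform.
inline std::string resolveTestFile(const std::string& dir, const std::string& name) {
  assert(!name.empty() && "file name must not be empty");
  return (fs::path(dir) / name).lexically_normal().generic_string();
}

// Hypothetical helper: report whether a path exists on disk.
inline bool fileExists(const std::string& path) {
  return fs::exists(fs::path(path));
}
```

For the calls used here (`operator/`, `exists`), `std::filesystem` closely mirrors the Boost interface, so in many cases the change is limited to the include and the namespace alias.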
|
+ml |
|
Does this still count as "new package pending"? |
Yes, they need to be added to the bot. Should these be only for |
|
I guess also heterogeneous? |
|
I opened cms-sw/cms-bot#2624 to add the two packages for |
|
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @mandrenguyen, @sextonkennedy, @ftenchini (and backports should be raised in the release meeting by the corresponding L2) |
The package assignment has now been completed. |
|
@cms-sw/heterogeneous-l2 @cms-sw/ml-l2 It takes some scrolling but it seems the |
|
Sure - I didn't know we got the possibility of signing the externals :-) |
|
+1 |
|
I see a couple of add-on test failures from last night's IB related to Presumably that's related to this PR. We are trying to build the next 16_0_0 pre-release this week. Is this something that can be fixed on a short time scale? |
|
Hi @mandrenguyen I think this PR cms-data/PhysicsTools-PyTorchAlpaka#1 got missed and it's needed for those tests |
Thanks @valsdav It seems we got a bit confused having both: Is it really necessary to have both? |
|
@mandrenguyen I think they are both necessary, as the PyTorchAlpakaTest package contains a few example producers and configurations which we wanted to keep separate from the core implementation. |
It's the first time I see the same external file having to be added to two packages in order to integrate a PR. It's not clear to me why one would like to keep things separate from the core implementation, but nonetheless in the main cmssw instead of some fork. Of course I don't know all the details, so maybe this is completely justified. But since this is the only case, it's perhaps worth thinking about whether it really needs to be done this way. |
|
@mandrenguyen I'm not sure of the content, but what I see is that under
which are used by while under
which is used by So, insofar as the two packages are separate, it makes sense to have two externals 🤷🏻‍♂️ |
|
Sure, by construction. Look, it was just a general comment to think about whether this can be streamlined, as it could lead to confusion in the future. If it is not the case, it's not the case. |
PR description:
This PR enables seamless integration between PyTorch and the Alpaka-based heterogeneous computing backend, supporting inference workflows that use the pytorch library with PortableCollections objects. It provides:
QueueGuard objects specialized for each supported backend.
This implementation was presented and discussed at:
PR validation:
Demonstration code showing the interoperability of SoA constructs with the PyTorch C++ API and the CMSSW environment is included in the PyTorchAlpakaTest package.
PyTorch Ahead-of-time compilation
This pull request also investigates an AOT compilation strategy, but that part is a beta-stage proof of concept and is not yet ready for production use.
GPU support
The CUDA backend is supported and tested; ROCm is not yet supported (see cms-sw/cmsdist#9786). To ensure that pipelines running on AMD nodes are not left without inference capability, a CPU fallback is implemented. This fallback transfers the inference data to the host (explicitly synchronizing with alpaka::wait()), executes inference on the CPU, and then copies the results back to the output buffer.
FYI @valsdav @ericcano @felicepantaleo @chrisizeh @leobeltra
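The CPU fallback described under "GPU support" follows a copy, synchronize, infer, copy-back pattern. The sketch below illustrates only that control flow in standard C++; it is not the PR's implementation and uses neither Alpaka nor PyTorch. `FakeQueue`, `Buffer`, `HostModel` and `inferWithCpuFallback` are hypothetical stand-ins: `FakeQueue::wait()` plays the role of `alpaka::wait()`, and `HostModel` plays the role of a PyTorch model evaluated on the host.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for an asynchronous device queue: enqueued work is
// only guaranteed to have run once wait() returns.
class FakeQueue {
public:
  void enqueue(std::function<void()> task) { pending_.push_back(std::move(task)); }
  void wait() {
    for (auto& task : pending_) task();
    pending_.clear();
  }

private:
  std::vector<std::function<void()>> pending_;
};

using Buffer = std::vector<float>;                       // stand-in for device/host memory
using HostModel = std::function<Buffer(const Buffer&)>;  // stand-in for CPU inference

// Sketch of the fallback: stage the input on the host, run the model there,
// then copy the result back into the output buffer.
void inferWithCpuFallback(FakeQueue& queue, const Buffer& deviceInput,
                          Buffer& deviceOutput, const HostModel& model) {
  assert(model && "host model must be callable");

  // 1. Enqueue the device-to-host copy of the inference input.
  Buffer hostInput;
  queue.enqueue([&] { hostInput = deviceInput; });

  // 2. Explicit synchronization: the host must not read hostInput earlier.
  queue.wait();

  // 3. Run inference on the CPU.
  Buffer hostOutput = model(hostInput);

  // 4. Copy the results back. The second wait is needed in this sketch
  //    because hostOutput is a local that must outlive the copy.
  queue.enqueue([&] { deviceOutput = hostOutput; });
  queue.wait();
}
```

With a toy model that doubles every element, calling `inferWithCpuFallback(queue, in, out, model)` on `in = {1, 2, 3}` leaves `{2, 4, 6}` in `out`. The explicit synchronization in step 2 is the important part: it is what makes the host-side read safe, at the cost of stalling the queue for the duration of the fallback.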