Mpi local run #30


Open · wants to merge 72 commits into MPI_updates

Conversation

@Annnnya commented Mar 17, 2025

PR description:

PR validation:

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

Before submitting your pull requests, make sure you followed this checklist:

cmsbuild and others added 30 commits February 10, 2025 10:05
Fix event number overflow in ES packer - 150X
[15_0_X] Reduce minYsizeB1 and minYsizeB2 CA cuts for Phase1
[15_0_X] Remove size scalar from HcalRecHitSoALayout as unused
add back `@standardDQM+@miniAODDQM+@nanoAODDQM` sequences to wf 145.415
Implement a compile-time warp size constant [15.0]
Fix `SiPixelClusterizer` alpaka code when `GPU_DEBUG` is defined [15.0.x]
[15.0.X] add a test workflow for testing the new `@hltScouting` DQM sequence in `ScoutingPFMonitor` dataset
add dictionary for `std::pair<short,int>` [`15_0_X`]
Backport: Developing offline JetMET DQM for Scouting jets
[15.0.X] Updated HLT BTV validation paths from DeepCSV to PNet
[15.0.X] fix `throwOnMissing` logic in `ObjectSelectorBase`
Fix warning about implicitly-declared copy constructor [15.0.x]
add a filter sequence if it is present in the fragment [15.0.X]
Fixes to the DTH parser based on tests with real DTH output; also fix the case of an orbit containing only one event
By default, disable the checksum in the unit test, because it does not work for data at the moment
…can be parametrized through DaqDirector; in the future it will be detected from ramdisk metadata.

* added a fileDiscoveryMode that can be used live instead of the fileBroker;
lower performance is expected on NFS, due to the many file operations;
atomicity in grabbing files is ensured by renaming each file to a unique name (even over NFS)

* for the new mode, added an eventCounter function (to the models) which can do an early count of the events in a file,
if neither the (deprecated) JSON index file is provided nor the file comes with a file header providing the
event count and (optionally) the file size.

* Autodetection of the raw file header without the file broker is implemented.

* unit tests implemented for various scenarios

* the fakeBU now copies the new daqParameters JSON file
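
A minimal sketch of the atomic-rename claim described above, in Python; this illustrates the technique only, it is not the actual DAQ code, and the naming scheme for the claimed file is an assumption:

```python
import os

def try_claim(path, pid):
    """Try to claim a raw file by renaming it to a unique name.

    rename() is atomic even over NFS, so exactly one of the competing
    processes succeeds; the others get an OSError and try the next file.
    """
    claimed = "%s.claimed.%d" % (path, pid)  # hypothetical naming scheme
    try:
        os.rename(path, claimed)
        return claimed
    except OSError:
        return None  # another process grabbed the file first
```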
[15.0.X] fix `customizeHLTfor47047` to work also on menus already customized
Backport to produce 2024 and 2025 Tau Embedding samples
cmsbuild and others added 25 commits February 26, 2025 10:55
Implement additional framework integration tests [15.0.x]
[GEM][backport] turning on the applyMasking for 2025 Run 3 GEM data taking [15.0.x]
Implement `edm::ProductNamePattern` [15.0.x]
This EDProducer will clone all the event products declared by its
configuration, using their ROOT dictionaries.
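
As a rough illustration, such a cloner might be configured along the following lines; the module type and parameter name here are hypothetical placeholders, not the actual interface:

```python
import FWCore.ParameterSet.Config as cms

# Hypothetical module type and parameter name, for illustration only.
# The patterns select event products by branch name, and the selected
# products are cloned using their ROOT dictionaries.
eventProductCloner = cms.EDProducer("EventProductCloner",
    products = cms.vstring("*_siPixelClusters_*_*"),
)
```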
Let multiple CMSSW processes on the same or different machines coordinate
event processing and transfer data products over MPI.

The implementation is based on four CMSSW modules.
Two are responsible for setting up the communication channels and
coordinate the event processing:
  - the MPIController
  - the MPISource
and two are responsible for the transfer of data products:
  - the MPISender
  - the MPIReceiver

The MPIController is an EDProducer running in a regular CMSSW process.
After setting up the communication with an MPISource, it transmits to it
all EDM run, lumi and event transitions, and instructs the MPISource to
replicate them in the second process.

The MPISource is a Source controlling the execution of a second CMSSW
process. After setting up the communication with an MPIController, it
listens for EDM run, lumi and event transitions, and replicates them in
its process.

Both MPIController and MPISource produce an MPIToken, a special data
product that encapsulates the information about the MPI communication
channel.

The MPISender is an EDProducer that can read a collection of a predefined
type from the Event, serialise it using its ROOT dictionary, and send it
over the MPI communication channel.

The MPIReceiver is an EDProducer that can receive a collection of a
predefined type over the MPI communication channel, deserialise it using
its ROOT dictionary, and put it in the Event.

Both MPISender and MPIReceiver are templated on the type to be
transmitted and de/serialised.

Each MPISender and MPIReceiver is configured with an instance value
that is used to match one MPISender in one process to one MPIReceiver in
another process. Using different instance values allows the use of
multiple MPISenders/MPIReceivers in a process.

Both MPISender and MPIReceiver obtain the MPI communication channel
by reading an MPIToken from the event. They also produce a copy of the
MPIToken, so other modules can consume it to declare a dependency on
the previous modules.

An automated test is available in the test/ directory.
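
As a sketch of how these modules might be wired together on the controller side; the module labels, the templated module name, and the parameter names are assumptions based on the description above, not the actual interface:

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("CTRL")

# Sets up the MPI communication with the remote MPISource
# and produces an MPIToken describing the channel.
process.mpiController = cms.EDProducer("MPIController")

# Hypothetical instantiation of the templated MPISender for one collection type.
# 'instance' pairs this sender with exactly one MPIReceiver in the other process,
# and consuming the MPIToken declares the dependency on the channel setup.
process.sendClusters = cms.EDProducer("MPISenderSiPixelClusters",
    token = cms.InputTag("mpiController"),
    source = cms.InputTag("siPixelClusters"),
    instance = cms.uint32(1),
)
```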
Let MPISender and MPIReceiver consume, send/receive and produce
collections of arbitrary types, as long as they have a ROOT dictionary
and can be persisted.

Note that any transient information is lost during the transfer, and
needs to be recreated by the receiving side.

The documentation and tests are updated accordingly.

Warning: this approach is a work in progress!
TODO:
  - improve framework integration
  - add checks between send/receive types
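
The worker process driven by the MPISource might then be configured along these lines; again a sketch with hypothetical names, assuming a receiver instantiated for the same collection type as the sender above:

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("WORK")

# Replicates the run, lumi and event transitions driven by the remote
# MPIController, and produces an MPIToken for the downstream modules.
process.source = cms.Source("MPISource")

# Hypothetical instantiation of the templated MPIReceiver; 'instance' must
# match the value configured on the corresponding MPISender.
process.receiveClusters = cms.EDProducer("MPIReceiverSiPixelClusters",
    token = cms.InputTag("source"),
    instance = cms.uint32(1),
)
```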
int numProducts;
token.channel()->receiveProduct(instance_, numProducts);
edm::LogVerbatim("MPIReceiver") << "Received number of products: " << numProducts;
// int numProducts;
Owner commented:

Why are these commented out?

@fwyzard (Owner) commented Mar 18, 2025

I think "local run" may be misleading: both approaches (single mpirun command, or multiple mpirun commands with the ompi-server) support running all processes on the same node, or on different nodes.

Could you rename the option to "useMPINameServer"?

fwyzard pushed a commit that referenced this pull request May 16, 2025