Mpi local run #30


Open · wants to merge 13 commits into MPI_updates

Conversation

@Annnnya commented Mar 17, 2025

PR description:

PR validation:

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

Before submitting your pull requests, make sure you followed this checklist:

fwyzard and others added 13 commits March 4, 2025 22:24
Let multiple CMSSW processes on the same or different machines coordinate
event processing and transfer data products over MPI.

The implementation is based on four CMSSW modules.
Two are responsible for setting up the communication channels and
coordinating the event processing:
  - the MPIController
  - the MPISource
and two are responsible for the transfer of data products:
  - the MPISender
  - the MPIReceiver

The MPIController is an EDProducer running in a regular CMSSW process.
After setting up the communication with an MPISource, it transmits to it
all EDM run, lumi and event transitions, and instructs the MPISource to
replicate them in the second process.

The MPISource is a Source controlling the execution of a second CMSSW
process. After setting up the communication with an MPIController, it
listens for EDM run, lumi and event transitions, and replicates them in
its process.
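
To illustrate the hand-off described above, here is a minimal, self-contained sketch using plain MPI calls: the controller side forwards the identifiers of the current transition, and the source side waits for them and replicates the transition in its own process. The message layout, tag value and function names are hypothetical; the real MPIController/MPISource exchange richer framework state.

#include <mpi.h>
#include <array>
#include <cstdint>

constexpr int kEventTransitionTag = 42;  // hypothetical MPI tag for event transitions

// Controller side: forward the identifiers of the current event to the peer process.
void forwardEvent(MPI_Comm comm, int peerRank, uint64_t run, uint64_t lumi, uint64_t event) {
  std::array<uint64_t, 3> message{run, lumi, event};
  MPI_Send(message.data(), 3, MPI_UINT64_T, peerRank, kEventTransitionTag, comm);
}

// Source side: block until the next event transition arrives, then replicate it locally.
std::array<uint64_t, 3> waitForEvent(MPI_Comm comm, int peerRank) {
  std::array<uint64_t, 3> message;
  MPI_Recv(message.data(), 3, MPI_UINT64_T, peerRank, kEventTransitionTag, comm, MPI_STATUS_IGNORE);
  return message;  // {run, lumi, event} to be turned into an EDM event in this process
}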

Both MPIController and MPISource produce an MPIToken, a special data
product that encapsulates the information about the MPI communication
channel.

The MPISender is an EDProducer that can read a collection of a predefined
type from the Event, serialise it using its ROOT dictionary, and send it
over the MPI communication channel.

The MPIReceiver is an EDProducer that can receive a collection of a
predefined type over the MPI communication channel, deserialise it using
its ROOT dictionary, and put it in the Event.

Both MPISender and MPIReceiver are templated on the type to be
transmitted and de/serialised.
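
As a sketch of what "serialise it using its ROOT dictionary" could look like in practice, the following helpers write a collection into a ROOT TBufferFile and move the raw bytes with plain MPI calls. This is an assumption about the mechanism, not the PR's actual code; the helper names and tag handling are illustrative, and error checking is omitted.

#include <mpi.h>
#include <memory>
#include <vector>
#include "TBufferFile.h"
#include "TClass.h"

// Serialise a collection with its ROOT dictionary and ship the bytes over MPI.
template <typename T>
void sendSerialised(MPI_Comm comm, int peerRank, int tag, T const& product) {
  TBufferFile buffer(TBuffer::kWrite);
  buffer.WriteObjectAny(&product, TClass::GetClass<T>());
  MPI_Send(buffer.Buffer(), buffer.Length(), MPI_BYTE, peerRank, tag, comm);
}

// Receive the bytes, deserialise them with the ROOT dictionary, and hand back the object.
template <typename T>
std::unique_ptr<T> receiveSerialised(MPI_Comm comm, int peerRank, int tag) {
  MPI_Status status;
  MPI_Probe(peerRank, tag, comm, &status);          // learn the size of the incoming message
  int size = 0;
  MPI_Get_count(&status, MPI_BYTE, &size);
  std::vector<char> data(size);
  MPI_Recv(data.data(), size, MPI_BYTE, peerRank, tag, comm, MPI_STATUS_IGNORE);
  TBufferFile buffer(TBuffer::kRead, size, data.data(), false);  // false: do not adopt the memory
  auto* object = static_cast<T*>(buffer.ReadObjectAny(TClass::GetClass<T>()));
  return std::unique_ptr<T>(object);                // transient members must be recreated by the caller
}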

Each MPISender and MPIReceiver is configured with an instance value
that is used to match one MPISender in one process to one MPIReceiver in
another process. Using different instance values allows the use of
multiple MPISenders/MPIReceivers in a process.

Both MPISender and MPIReceiver obtain the MPI communication channel by
reading an MPIToken from the event. They also produce a copy of the
MPIToken, so other modules can consume it to declare a dependency on
the previous modules.
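
A rough sketch of how a module in the chain might read the MPIToken, use its channel for a given instance, and re-emit a copy so that downstream modules can declare a dependency on it. The class, parameter names and the MPIToken API usage are hypothetical (only the MPIToken type and the channel() accessor appear in this PR's description and review excerpt), and the MPIToken #include is omitted because the package layout is not shown here.

#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/EventSetup.h"
#include "FWCore/Framework/interface/stream/EDProducer.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/Utilities/interface/EDGetToken.h"
#include "FWCore/Utilities/interface/EDPutToken.h"
#include "FWCore/Utilities/interface/InputTag.h"
// #include for MPIToken omitted: it depends on this PR's package layout.

class ExampleTokenUser : public edm::stream::EDProducer<> {
public:
  explicit ExampleTokenUser(edm::ParameterSet const& config)
      : instance_{config.getParameter<int>("instance")},  // must match the peer module's instance value
        tokenIn_{consumes<MPIToken>(config.getParameter<edm::InputTag>("token"))},
        tokenOut_{produces<MPIToken>()} {}

  void produce(edm::Event& event, edm::EventSetup const&) override {
    MPIToken const& token = event.get(tokenIn_);  // channel set up by MPIController or MPISource
    // ... use token.channel() together with instance_ to send or receive data products ...
    event.emplace(tokenOut_, token);              // re-emit a copy for downstream consumers
  }

private:
  int instance_;
  edm::EDGetTokenT<MPIToken> tokenIn_;
  edm::EDPutTokenT<MPIToken> tokenOut_;
};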

An automated test is available in the test/ directory.
Let MPISender and MPIReceiver consume, send/receive and produce
collections of arbitrary types, as long as they have a ROOT dictionary
and can be persisted.

Note that any transient information is lost during the transfer, and
needs to be recreated by the receiving side.
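
Since the two modules are templated, making them available for a new payload type presumably amounts to instantiating and registering the plugins for that type, along the lines of the sketch below. The concrete collection type and plugin aliases are only an example, and the includes for MPISender/MPIReceiver are omitted because they depend on the PR's package layout.

#include "FWCore/Framework/interface/MakerMacros.h"
#include "DataFormats/TrackReco/interface/Track.h"
#include <vector>
// MPISender.h / MPIReceiver.h from this PR would be included here.

using MPISenderTracks = MPISender<std::vector<reco::Track>>;
using MPIReceiverTracks = MPIReceiver<std::vector<reco::Track>>;
DEFINE_FWK_MODULE(MPISenderTracks);
DEFINE_FWK_MODULE(MPIReceiverTracks);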

The documentation and tests are updated accordingly.

Warning: this approach is a work in progress!
TODO:
  - improve framework integration
  - add checks between send/receive types
int numProducts;
token.channel()->receiveProduct(instance_, numProducts);
edm::LogVerbatim("MPIReceiver") << "Received number of products: " << numProducts;
// int numProducts;
Owner

Why are these commented out?

@fwyzard (Owner) commented Mar 18, 2025

I think "local run" may be misleading: both approaches (a single mpirun command, or multiple mpirun commands with the ompi-server) support running all processes on the same node or on different nodes.

Could you rename the option to "useMPINameServer"?
