
Use SST_MPI_Comm_spawn_multiple in Ariel PIN backend #2638

Open

plavin wants to merge 6 commits into sstsimulator:master from plavin:ariel-fixup

Conversation

@plavin
Contributor

@plavin plavin commented Mar 4, 2026

This PR updates Ariel to use MPI_Comm_spawn_multiple for launching applications when the SST core is compiled with MPI support. This simplifies the build system, as the only parts of Ariel that now depend on an MPI compiler are the test applications. It also makes launching applications more robust, since MPI applications are not supposed to call fork. Fixes #2624.

This PR also adds new functionality to the Ariel API. Two new functions are added:

void ariel_output_stats_begin_region(const char *name);
void ariel_output_stats_end_region(const char *name);

These will each cause a message to be output to stdout, with the region name and the current simulation timestamp. This can be used to correlate stat dumps with locations in the source app.

Justification for changes to fesimple.cc

The file ariel/frontend/pin3/fesimple.cc required extensive changes. One tricky part of tracing MPI applications with PIN is that the MPI library will typically spawn its own threads when MPI_Init is called. This causes two issues: (1) the program has more threads than the user specified in their Ariel config, so the shared-memory tunnel is not big enough, and (2) if MPI_Init is called before all of the application threads are launched, the MPI threads will receive lower IDs than the application threads, which makes ignoring them harder.

The first solution, which this PR removes, was to place an OpenMP parallel region before MPI_Init so that the application threads would always be numbered 0..N-1. This only works for OpenMP programs, however, and meant that the Ariel API had to be compiled with -fopenmp.

The new approach is to detect MPI threads by checking whether libmpi.so appears in their call stack. This works, but we now have to maintain a map of the thread IDs (typically [0,3,4,...,N+1] -> [0,...,N-1]) and consult it on every write to the tunnel, since PIN won't let us change how it numbers threads. In a future update, I hope to move this functionality into a class that wraps the tunnel so that the map can be queried in a single location instead of all over fesimple.cc.

Known Issues

  • There is a non-deterministic bug that will sometimes cause messages sent across the tunnel to appear in the wrong order to the arielcore. This will lead to errors such as FATAL: ArielComponent[arielcore.cc:486:refillQueue] Error: Ariel did not understand command (128) provided during instruction queue refill. or it may cause the program to hang indefinitely. This error seems to mostly affect test_Ariel_test_ivb_pin.
  • Calling MPI_Comm_spawn_multiple will cause an error if there are not enough slots in the allocation to run all the SST ranks and the application ranks in their own slots. To remedy this, we set OMPI_MCA_rmaps_base_oversubscribe=1 in the MPI testsuite file.
  • The EPA backend needs to be updated to use SST_MPI_Comm_spawn_multiple instead of fork. Trying to launch an MPI app with the EPA backend will cause an error.
  • The remap functionality in fesimple.cc may break if more application threads are launched than the corecount parameter passed to the ariel.ariel component.
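The oversubscription workaround mentioned above amounts to one Open MPI environment setting; in the MPI testsuite file it would look like this (shown here standalone as a sketch):

```shell
# Allow Open MPI to oversubscribe slots so that the SST ranks plus the
# application ranks spawned via MPI_Comm_spawn_multiple can all be placed
# even when the allocation has fewer slots than total ranks.
export OMPI_MCA_rmaps_base_oversubscribe=1
```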

@gvoskuilen
Contributor

This PR needs to be retargeted at the devel branch



Development

Successfully merging this pull request may close these issues.

Error "arielapi.c: MPI_Init called in arielapi.c but this file was compiled without MPI." although Ariel was compiled with MPI
