Use SST_MPI_Comm_spawn_multiple in Ariel PIN backend #2638
Open
plavin wants to merge 6 commits into sstsimulator:master
This PR needs to be retargeted at the devel branch.
This PR updates Ariel to use `MPI_Comm_spawn_multiple` for launching applications when core is compiled with MPI support. This simplifies the build system, as the only parts of Ariel that now depend on an MPI compiler are the test applications. It also makes launching applications more robust, as MPI applications are not supposed to call `fork`. Fixes #2624.

This PR also adds new functionality to the Ariel API. Two new functions are added:
Each of these prints a message to stdout containing the region name and the current simulation timestamp. This can be used to correlate stat dumps with locations in the source app.
Justification for changes to `fesimple.cc`

The file `ariel/frontend/pin3/fesimple.cc` required extensive changes. One tricky part of tracing MPI applications with PIN is that the MPI library typically spawns its own threads when `MPI_Init` is called. This causes two issues: (1) the program has more threads than the user specified in their Ariel config, so the shared memory tunnel is not big enough, and (2) if `MPI_Init` is called before all of the application threads are launched, the MPI threads receive lower IDs than the application threads, which makes ignoring them harder.

The first solution, which this PR removes, was to place an OMP parallel region before `MPI_Init` so that the application threads would always be numbered 0..N-1. But this only works for OpenMP programs, and it meant that the Ariel API needed to be compiled with `-fopenmp`.

The new approach is to track which threads are MPI threads by checking whether `libmpi.so` is in their callstack. This works, but because PIN won't let us change how it numbers threads, we now have to maintain a map of the thread IDs (typically [0,3,4,...,N+1] -> [0,...,N-1]) and check it whenever we write to the tunnel. In a future update, I hope to move this functionality to a class that wraps the tunnel so that the map can be queried in a single location instead of all over `fesimple.cc`.

Known Issues
- `FATAL: ArielComponent[arielcore.cc:486:refillQueue] Error: Ariel did not understand command (128) provided during instruction queue refill.` or it may cause the program to hang indefinitely. This error seems to mostly affect `test_Ariel_test_ivb_pin`.
- `OMPI_MCA_rmaps_base_oversubscribe=1` in the MPI testsuite file.
- `fesimple.cc` may break if more application threads are launched than the `corecount` parameter passed to the `ariel.ariel` component.
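
For reviewers who want a concrete picture of the thread ID remapping described in the justification section, and of how the `corecount` limit in the last known issue comes into play, here is a minimal standalone sketch. It is not the code in this PR and does not use the PIN API: the names `isMpiThread`, `ThreadIdMap`, `registerThread`, and `lookup` are hypothetical, and the callstack is assumed to be available as a plain list of image path strings.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: treats a thread as an MPI-internal thread if any
// image path in its callstack mentions libmpi.so.
bool isMpiThread(const std::vector<std::string>& callstackImages) {
    for (const auto& image : callstackImages) {
        if (image.find("libmpi.so") != std::string::npos) {
            return true;
        }
    }
    return false;
}

// Hypothetical map from the thread IDs assigned by the tracer to dense
// application thread IDs 0..N-1. MPI threads get no entry.
class ThreadIdMap {
public:
    explicit ThreadIdMap(uint32_t coreCount) : coreCount_(coreCount) {
        assert(coreCount > 0 && "corecount must be positive");
    }

    // Call once when a thread starts. Returns false only when the thread is
    // an application thread and all coreCount slots are already taken.
    bool registerThread(uint32_t tracerTid,
                        const std::vector<std::string>& callstackImages) {
        if (isMpiThread(callstackImages)) {
            return true;  // ignored: consumes no application slot
        }
        if (next_ >= coreCount_) {
            return false;  // more application threads than corecount
        }
        map_[tracerTid] = next_++;
        return true;
    }

    // Returns the dense application ID, or nullopt for MPI or unknown threads.
    std::optional<uint32_t> lookup(uint32_t tracerTid) const {
        auto it = map_.find(tracerTid);
        if (it == map_.end()) {
            return std::nullopt;
        }
        return it->second;
    }

private:
    uint32_t coreCount_;
    uint32_t next_ = 0;
    std::unordered_map<uint32_t, uint32_t> map_;
};
```

In this sketch, every write to the tunnel would go through `lookup` first and skip threads that have no entry. Returning `false` from `registerThread` makes the over-subscription case explicit instead of letting an out-of-range ID reach the tunnel.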