topics_for_forum_wg_block_03_18_24
To be read later this week:
- groups: relax restriction on pset name provenance (second vote and no-no vote)
- https://github.com/mpi-forum/mpi-issues/issues/811
- https://github.com/mpi-forum/mpi-standard/pull/938
We left off at the question of whether to introduce a "global state" ability to specify an MPI reinit error handler.
- The most recent slide deck is here.
Non-fault-aware MPI Sessions example:

```c
// general high-level optimistic application
#include <mpi.h>

// provided by the application
int do_stuff_with_comm(MPI_Comm comm);
void panic(void);

int main(void) {
    MPI_Session session;
    MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_ARE_FATAL, &session);
    MPI_Group group;
    MPI_Group_from_session_pset(session, "mpi://world", &group);
    MPI_Comm comm;
    // the real signature also takes a string tag, an info object, and an
    // error handler; the tag value here is illustrative
    MPI_Comm_create_from_group(group, "example-tag", MPI_INFO_NULL,
                               MPI_ERRORS_ARE_FATAL, &comm);
    MPI_Group_free(&group);
    int ret = do_stuff_with_comm(comm);
    if (MPI_SUCCESS == ret) {
        MPI_Comm_disconnect(&comm);
        MPI_Session_finalize(&session);
    } else {
        panic();
    }
    return 0;
}
```

ULFM-style fault-tolerance-aware MPI Sessions example:
```c
// general high-level pragmatic application
#include <mpi.h>

// provided by the application
int do_stuff_with_comm(MPI_Comm comm);

int main(void) {
    // additional code
    while (1) {
        MPI_Session session;
        MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &session);
        MPI_Group world, failed, group;
        MPI_Group_from_session_pset(session, "mpi://world", &world);
        // additional code
        MPI_Session_get_proc_failed(session, &failed); // new API, seems easy to do
        MPI_Group_difference(world, failed, &group);
        MPI_Group_free(&world);
        MPI_Group_free(&failed);
        MPI_Comm comm;
        // <-- the detail-devils live here
        MPI_Comm_create_from_group(group, "example-tag", MPI_INFO_NULL,
                                   MPI_ERRORS_RETURN, &comm);
        MPI_Group_free(&group);
        int ret = do_stuff_with_comm(comm);
        MPI_Comm_disconnect(&comm);
        MPI_Session_finalize(&session);
        if (MPI_SUCCESS == ret) {
            break; // all done!
        } else if (MPI_ERR_PROC_FAILED == ret) {
            continue; // no more panic
        }
        // additional code
    } // end while
    return 0;
}
```
There has been discussion (emails) about challenges in implementing a process fail-stop fault-aware MPI_Comm_create_from_group.
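The shrink-and-retry pattern behind the ULFM-style example above can be modeled without any MPI at all. The sketch below is a single-process toy, not an implementation proposal: ranks are plain ints, a bitmask stands in for the failure detector, and `create_from_group` stands in for an MPI_Comm_create_from_group that reports failure cleanly instead of deadlocking. All names here are invented for illustration.

```c
#include <assert.h>

/* Toy model (no real MPI): a set bit in `failed_mask` marks a failed rank.
 * "Creation" succeeds only when the candidate group contains no failed
 * ranks, mimicking a create call that must return an error consistently
 * at every member rather than hang. */
#define TOY_SUCCESS 0
#define TOY_ERR_PROC_FAILED 1

static int create_from_group(const int *group, int n, unsigned failed_mask) {
    for (int i = 0; i < n; i++)
        if (failed_mask & (1u << group[i]))
            return TOY_ERR_PROC_FAILED;
    return TOY_SUCCESS;
}

/* Retry loop: drop known-failed ranks and try again.
 * Returns the surviving group size, or -1 if nobody is left. */
static int build_comm(int world_size, unsigned failed_mask, int *group) {
    int n = world_size;
    for (int i = 0; i < n; i++) group[i] = i;
    while (n > 0) {
        if (create_from_group(group, n, failed_mask) == TOY_SUCCESS)
            return n;
        /* "group difference" step: remove ranks reported as failed */
        int m = 0;
        for (int i = 0; i < n; i++)
            if (!(failed_mask & (1u << group[i])))
                group[m++] = group[i];
        n = m;
    }
    return -1;
}
```

The hard part the toy hides is exactly the "detail-devils" comment in the example: in a real implementation, every surviving process must reach a consistent view of which ranks failed before the retry can succeed.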
Some random notes Howard made while trying to drive. Some are captured in the pptx below. Note the meeting was not recorded.
There was some discussion about how/if to support nesting of error handlers if the app is using MPI_ERRORS_REINIT as the global error handler. Dan suggested use of MPI_Comm_call_errhandler as one approach.
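One way to picture the nesting idea is a comm-level handler that does its own cleanup and then explicitly chains to the global reinit handler, in the spirit of invoking the next handler via MPI_Comm_call_errhandler. The following is a plain-C toy (no MPI); all types and function names are invented to illustrate the chaining, not proposed API.

```c
#include <stddef.h>

/* Toy model of nested error handlers: a per-comm handler runs first,
 * then deliberately chains to the global "reinit" handler. */
typedef void (*errhandler_fn)(int err, void *state);

struct toy_comm {
    errhandler_fn handler;   /* comm-level handler, may be NULL */
    void *state;
};

static int global_reinit_count = 0;

static void global_reinit_handler(int err, void *state) {
    (void)err; (void)state;
    global_reinit_count++;   /* stand-in for "restart from init" */
}

static void comm_local_handler(int err, void *state) {
    int *local_cleanups = state;
    (*local_cleanups)++;                  /* app-specific cleanup first */
    global_reinit_handler(err, NULL);     /* then chain to the global handler */
}

/* Raise an error on a comm: run its handler if set, else the global one. */
static void toy_raise(struct toy_comm *comm, int err) {
    if (comm->handler)
        comm->handler(err, comm->state);
    else
        global_reinit_handler(err, NULL);
}
```

The open question from the discussion is who does the chaining: the application (as above) or the library, and in what order the handlers should fire.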
Joseph brought up that the example codes check for MPI_ERR_REINIT possibly inside a multi-threaded region. The need for app-specific thread synchronization was discussed.
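A sketch of the app-side synchronization this implies, using only pthreads (no MPI): several threads may concurrently observe a reinit-style error return, but only one of them should drive recovery while the rest block until it finishes. The names and the `recoveries_run` counter are illustrative.

```c
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t done_cv = PTHREAD_COND_INITIALIZER;
static int recovery_state = 0;   /* 0 = idle, 1 = in progress, 2 = done */
static int recoveries_run = 0;

/* Called by every thread that sees an MPI_ERR_REINIT-style error code. */
static void on_reinit_error(void) {
    pthread_mutex_lock(&lock);
    if (recovery_state == 0) {
        /* this thread wins the race and drives recovery */
        recovery_state = 1;
        pthread_mutex_unlock(&lock);
        /* ... tear down and re-create session/comm here ... */
        pthread_mutex_lock(&lock);
        recoveries_run++;        /* must end up 1, not one per thread */
        recovery_state = 2;
        pthread_cond_broadcast(&done_cv);
    } else {
        /* another thread is (or was) recovering: wait for it */
        while (recovery_state != 2)
            pthread_cond_wait(&done_cv, &lock);
    }
    pthread_mutex_unlock(&lock);
}

static void *worker(void *arg) {
    (void)arg;
    on_reinit_error();           /* every thread "sees" the error */
    return NULL;
}
```

This is exactly the coordination MPI cannot do for the app: only the application knows which of its threads hold resources that must be quiesced before reinit.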
There was also discussion of whether invocation of MPI_Test_failure should implicitly set the global MPI_ERRORS_REINIT error handler, or whether this should instead be done by resurrecting MPI_Reinit for a new purpose.
Links to a few slides from the WG discussion:
- https://drive.google.com/drive/folders/1NLAMZtH5B3bVSnWk-ZxM9FhCJesZ92rL
- https://miro.com/app/board/o9J_l_Rxe9Q=/
- www.martin-schreiber.info/pub/tmp/2022_01_11_dominik_huber_dynamic_sessions_interface_annotated.pdf
- https://github.com/mpiwg-sessions/sessions-issues/wiki/2016-12-12-webex/2016-12-12-webex.pptx
- https://github.com/mpiwg-sessions/sessions-issues/wiki/SessionsV2-ideas.pptx