The UCX PML doesn't work in singleton mode.
To reproduce, follow these steps:
- Build Open MPI main with UCX support.
- Build hello_c in the examples folder.
- Run hello_c in singleton mode, i.e., launch the executable directly rather than through mpirun (a minimal equivalent of the program is sketched after this list).

The process will hang in MPI_Finalize.
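For reference, examples/hello_c.c is just the standard MPI hello world; a minimal equivalent is sketched below (run it directly, without mpirun, to exercise singleton mode):

#include <stdio.h>
#include "mpi.h"

/* Minimal equivalent of examples/hello_c.c; launching it directly
 * (e.g. ./hello_c, without mpirun) runs it as a singleton. */
int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello, world, I am %d of %d\n", rank, size);
    MPI_Finalize();   /* hangs here in singleton mode */
    return 0;
}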
The backtrace shows:
#0 0x00007ffff6d8365a in opal_common_ucx_mca_pmix_fence (worker=0x87e310) at common_ucx.c:451
#1 0x00007ffff6d83a3c in opal_common_ucx_del_procs (procs=0x7fffec000e70, count=1, my_rank=0, max_disconnect=1, worker=0x87e310) at common_ucx.c:531
#2 0x00007ffff79bd10a in mca_pml_ucx_del_procs (procs=0x7fffec000ff0, nprocs=1) at pml_ucx.c:567
#3 0x00007ffff76e8d66 in ompi_mpi_instance_cleanup_pml () at instance/instance.c:189
#4 0x00007ffff6d0f566 in opal_finalize_cleanup_domain (domain=0x7ffff70480e0 <opal_init_domain>) at runtime/opal_finalize_core.c:129
#5 0x00007ffff6cfe757 in opal_finalize () at runtime/opal_finalize.c:56
#6 0x00007ffff76e5886 in ompi_rte_finalize () at runtime/ompi_rte.c:1045
#7 0x00007ffff76eb6e1 in ompi_mpi_instance_finalize_common () at instance/instance.c:951
#8 0x00007ffff76eb958 in ompi_mpi_instance_finalize (instance=0x7ffff7d93ac0 <ompi_mpi_instance_default>) at instance/instance.c:996
#9 0x00007ffff76dfd36 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:294
#10 0x00007ffff7731f1e in PMPI_Finalize () at finalize_generated.c:53
#11 0x00000000004008b8 in main (argc=1, argv=0x7ffffffefc08) at hello_c.c:23
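The mechanism is visible in frame #0: opal_common_ucx_mca_pmix_fence blocks until a fence-completion callback fires, but the singleton quick exit in PMIx_Fence_nb returns without ever invoking that callback. Below is a simplified sketch of that wait pattern; the names and structure are illustrative, not the verbatim Open MPI source:

#include <pmix.h>
#include <ucp/api/ucp.h>

static volatile int fence_active = 0;

static void fence_complete_cb(pmix_status_t status, void *cbdata)
{
    (void)status; (void)cbdata;
    fence_active = 0;   /* never reached in singleton mode */
}

static void wait_for_fence(ucp_worker_h worker)
{
    fence_active = 1;

    /* In singleton mode PMIx_Fence_nb returns PMIX_SUCCESS
     * immediately without invoking fence_complete_cb ... */
    (void) PMIx_Fence_nb(NULL, 0, NULL, 0, fence_complete_cb, NULL);

    /* ... so this progress loop spins forever. */
    while (fence_active) {
        ucp_worker_progress(worker);
    }
}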
This problem can be fixed by patching the UCX PML to handle the singleton case. Alternatively, the PMIx_Fence_nb implementation can be patched to invoke cbfunc (if non-NULL) in its quick-exit path for the singleton case:
 /* if we are a singleton, there is nothing to do */
 if (pmix_client_globals.singleton) {
-    return PMIX_SUCCESS;
+    rc = PMIX_SUCCESS;
+    if (NULL != cbfunc) {
+        cbfunc(rc, cbdata);
+    }
+    return rc;
 }
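With this patch the quick exit still honors the non-blocking contract: the caller's callback fires, so the progress loop in opal_common_ucx_mca_pmix_fence can terminate. A self-contained toy (no PMIx dependency; all names here are illustrative) demonstrating why invoking the callback matters:

#include <stdio.h>

typedef void (*op_cbfunc_t)(int status, void *cbdata);

static int done = 0;

static void complete_cb(int status, void *cbdata)
{
    (void)status; (void)cbdata;
    done = 1;
}

/* Mimics the patched singleton quick exit: return success AND
 * invoke the callback, as in the diff above. */
static int fence_nb(op_cbfunc_t cbfunc, void *cbdata)
{
    int rc = 0;   /* stand-in for PMIX_SUCCESS */
    if (NULL != cbfunc) {
        cbfunc(rc, cbdata);
    }
    return rc;
}

int main(void)
{
    fence_nb(complete_cb, NULL);
    while (!done) {
        /* without the callback invocation, this would spin forever */
    }
    printf("fence completed\n");
    return 0;
}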