With Intel MPI and scheduler=shell, the MPI processes in a step apparently cannot be launched concurrently. They fail with the following error:
f01.1629534PSM2 can't open hfi unit: 0 (err=23)
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(178)........:
MPID_Init(1532)..............:
MPIDI_OFI_mpi_init_hook(1552):
create_vni_context(2131).....: OFI endpoint open failed (ofi_init.c:2131:create_vni_context:Invalid argument)
Setting n-worker strictly less than the number of cores seems to work around the issue.
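One way to apply this workaround without hard-coding a value is to derive the worker count from the core count on the node. The helper below is a hypothetical sketch, not part of the scheduler's API. It assumes "number of cores" means what os.cpu_count() reports and returns one fewer than that. It never goes below 1, so on a single-core node the workaround cannot be applied.

```python
import os


def safe_worker_count(n_cores=None):
    """Hypothetical helper: one worker fewer than the core count, minimum 1.

    Mirrors the observed workaround (n-worker strictly below the number
    of cores). Not an API of any scheduler or of Intel MPI.
    """
    if n_cores is None:
        # os.cpu_count() may return None, so fall back to 1 in that case.
        n_cores = os.cpu_count() or 1
    return max(1, n_cores - 1)


if __name__ == "__main__":
    print(safe_worker_count())
```

The value it prints could then be passed as n-worker. This only restates the reported workaround and does not address the underlying PSM2/OFI endpoint failure.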