forked from E3SM-Project/E3SM
Open
Description
When running on any node count other than 8, I get the following error for the 1024x1024 and 2048x2048 meshes.
aborting job:
Fatal error in PMPI_Isend: Other MPI error, error stack:
PMPI_Isend(161)...........: MPI_Isend(buf=0x7fe8cb8eb680, count=120960, MPI_DOUBLE, dest=119, tag=0, comm=0x84000003, request=0x2a9a89c) failed
MPID_Isend(595)...........:
MPIDI_isend_unsafe(142)...:
MPIDI_OFI_send_normal(372): OFI tagged senddata failed (ofi_send.h:372:MPIDI_OFI_send_normal:Resource temporarily unavailable)
Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
MPICH ERROR [Rank 64] [job id 36981763.1] [Wed Mar 19 11:40:52 2025] [nid001528] - Abort(203556751) (rank 64 in comm 0): Fatal error in PMPI_Test: Other MPI error, error stack:
PMPI_Test(202).................: MPI_Test(request=0x23f8148, flag=0x23f8150, status=0x1) failed
MPIR_Test(75)..................:
MPIR_Test_impl(36).............:
MPIDI_Progress_test(97)........:
MPIDI_OFI_handle_cq_error(1075): OFI poll failed (ofi_events.c:1077:MPIDI_OFI_handle_cq_error:Transport endpoint is not connected - VNI_NOT_FOUND)
I am able to run on 1, 2, and 4 nodes for a 512x512 mesh, so this issue appears only for the larger mesh sizes.
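For triage, it may help to know how large the failing message is. The following is a minimal sketch that pulls `count`, the datatype, and `dest` out of the `MPI_Isend` line in the error stack above and converts them to a payload size in bytes. The helper name `isend_payload_bytes` and the size table are hypothetical, and the sketch assumes `MPI_DOUBLE` is 8 bytes.

```python
import re

# Assumed datatype sizes in bytes; only the type seen in the log is listed.
MPI_TYPE_BYTES = {"MPI_DOUBLE": 8}

# Matches the argument list that MPICH prints for a failed MPI_Isend.
ISEND_RE = re.compile(r"MPI_Isend\(buf=[^,]+, count=(\d+), (MPI_\w+), dest=(\d+)")


def isend_payload_bytes(line):
    """Return destination rank and payload size for an MPI_Isend log line.

    Returns None if the line does not match or the datatype is unknown.
    """
    match = ISEND_RE.search(line)
    if match is None:
        return None
    count, dtype, dest = int(match.group(1)), match.group(2), int(match.group(3))
    type_bytes = MPI_TYPE_BYTES.get(dtype)
    if type_bytes is None:
        return None
    return {"dest": dest, "bytes": count * type_bytes}


# The failing call, copied from the error stack in this issue.
log_line = (
    "PMPI_Isend(161)...........: MPI_Isend(buf=0x7fe8cb8eb680, count=120960, "
    "MPI_DOUBLE, dest=119, tag=0, comm=0x84000003, request=0x2a9a89c) failed"
)

info = isend_payload_bytes(log_line)
print(info)  # {'dest': 119, 'bytes': 967680}
```

Under that assumption, the failing send to rank 119 carries 967,680 bytes, a little under 1 MiB.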
The code is built using
cmake \
  -DOMEGA_BUILD_TYPE=Release \
  -DOMEGA_CIME_COMPILER=gnugpu \
  -DOMEGA_CIME_MACHINE=pm-gpu \
  -DOMEGA_PARMETIS_ROOT=${PARMETIS_ROOT} \
  -DOMEGA_BUILD_TEST=ON \
  -Wno-dev \
  -S /global/homes/k/kringel/omega/repos/omega/250312_timingwPhilPR/components/omega \
  -B .
This appears to be an unresolved vendor bug (https://docs.nersc.gov/systems/perlmutter/vendorbugs/#code-fails-with-mpidi_ofi_send_normalresource-temporarily-unavailable), but I am adding this issue for tracking.