Perlmutter GNU GPU cannot run on larger meshes (1024 and 2048) #214

@kieran-ringel

Description

When running on any node configuration other than 8 nodes, I get the following error for the 1024x1024 and 2048x2048 meshes.

    aborting job:
    Fatal error in PMPI_Isend: Other MPI error, error stack:
    PMPI_Isend(161)...........: MPI_Isend(buf=0x7fe8cb8eb680, count=120960, MPI_DOUBLE, dest=119, tag=0, comm=0x84000003, request=0x2a9a89c) failed
    MPID_Isend(595)...........:
    MPIDI_isend_unsafe(142)...:
    MPIDI_OFI_send_normal(372): OFI tagged senddata failed (ofi_send.h:372:MPIDI_OFI_send_normal:Resource temporarily unavailable)
    Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
    MPICH ERROR [Rank 64] [job id 36981763.1] [Wed Mar 19 11:40:52 2025] [nid001528] - Abort(203556751) (rank 64 in comm 0): Fatal error in PMPI_Test: Other MPI error, error stack:
    PMPI_Test(202).................: MPI_Test(request=0x23f8148, flag=0x23f8150, status=0x1) failed
    MPIR_Test(75)..................:
    MPIR_Test_impl(36).............:
    MPIDI_Progress_test(97)........:
    MPIDI_OFI_handle_cq_error(1075): OFI poll failed (ofi_events.c:1077:MPIDI_OFI_handle_cq_error:Transport endpoint is not connected - VNI_NOT_FOUND)

I am able to run on 1, 2, and 4 nodes with a 512x512 mesh, so the issue appears only at the larger mesh sizes.
The code is configured with:

    cmake \
      -DOMEGA_BUILD_TYPE=Release \
      -DOMEGA_CIME_COMPILER=gnugpu \
      -DOMEGA_CIME_MACHINE=pm-gpu \
      -DOMEGA_PARMETIS_ROOT=${PARMETIS_ROOT} \
      -DOMEGA_BUILD_TEST=ON \
      -Wno-dev \
      -S /global/homes/k/kringel/omega/repos/omega/250312_timingwPhilPR/components/omega \
      -B .

This appears to be an unresolved vendor bug (https://docs.nersc.gov/systems/perlmutter/vendorbugs/#code-fails-with-mpidi_ofi_send_normalresource-temporarily-unavailable), but I am adding this issue for tracking.
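
For anyone comparing failures across mesh sizes, the sketch below pulls the message size out of the failing `MPI_Isend` line in the log above. It is only a triage helper, not a diagnosis of the vendor bug. The function name `isend_payload` and the datatype size table are illustrative assumptions, and the regular expression is written against the single log line shown in this issue.

```python
import re

# Failing call copied from the error stack in this issue.
LOG_LINE = (
    "PMPI_Isend(161)...........: MPI_Isend(buf=0x7fe8cb8eb680, count=120960, "
    "MPI_DOUBLE, dest=119, tag=0, comm=0x84000003, request=0x2a9a89c) failed"
)

# Assumed element sizes in bytes; only MPI_DOUBLE appears in the log above.
DATATYPE_BYTES = {"MPI_DOUBLE": 8, "MPI_FLOAT": 4, "MPI_INT": 4}


def isend_payload(line):
    """Return destination rank, element count, and payload bytes, or None."""
    m = re.search(
        r"MPI_Isend\(buf=\w+, count=(\d+), (MPI_\w+), dest=(\d+)", line
    )
    if m is None:
        return None
    count, dtype, dest = int(m.group(1)), m.group(2), int(m.group(3))
    return {"dest": dest, "count": count, "bytes": count * DATATYPE_BYTES[dtype]}


info = isend_payload(LOG_LINE)
print(info)  # {'dest': 119, 'count': 120960, 'bytes': 967680}
```

For the line above this gives 120960 doubles, or 967680 bytes (about 0.92 MiB), sent to rank 119. Running the same extraction on logs from the 512x512 runs would show whether the per-message size differs between the working and failing cases.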
