GASNet errors with AMD GPUs on El Capitan (early access) #16

Open
@unpyoukz

I am running in-house Regent/Pygion/Legion codes on Tioga (an early-access system for El Capitan at LLNL) on AMD GPUs, using the control_replication branch of Legion (28db23b4) with various versions of GASNet. I am getting the following errors.

  1. With GASNet-2022.3.0:
*** FATAL ERROR (proc 0): in gasnetc_segment_register() at anguage/gasnet/GASNet-2022.3.0/ofi-conduit/gasnet_ofi.c:1246: fi_mr_reg for rdma failed: -22(Invalid argument)

A trace of the single-rank case is here: trace.dat.

This error shows up frequently with a single rank and almost always with multiple ranks. It did not happen before, so I wonder whether it is due to recent changes in the machine configuration or to some other compatibility issue.

A similar application worked last year with an older Legion commit (5a77dcbf) and GASNet-2022.3.0. The app also works fine on various machines with NVIDIA GPUs and Intel or IBM CPUs.
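For anyone reading the `fi_mr_reg` failure in case 1: the `-22` is, as far as I understand libfabric's convention, a negated errno-style code, so it can be decoded with the standard errno table. The snippet below is only a sketch under that assumption; the helper name `decode_fabric_error` is made up for illustration and is not part of GASNet or libfabric.

```python
import errno
import os
import re

# Matches the trailing "failed: -22(" part of the GASNet ofi-conduit message.
_FAILED_RE = re.compile(r"failed:\s*(-?\d+)\(")


def decode_fabric_error(message):
    """Return (errno name, OS description) for the code in the message, or None.

    Assumption: the reported value is a negated errno-style code.
    """
    match = _FAILED_RE.search(message)
    if match is None:
        return None
    code = abs(int(match.group(1)))
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


if __name__ == "__main__":
    line = "fi_mr_reg for rdma failed: -22(Invalid argument)"
    print(decode_fabric_error(line))
```

This only confirms that the provider rejected one of the arguments to the memory registration call (EINVAL); it does not say which argument was at fault.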

  2. With GASNet-2022.9.2:
test.exec: symbol lookup error: [Legion]/bindings/regent/libregent.so: undefined symbol: PMI_Allgather

Naturally, the application does not start at all, regardless of the number of ranks. The executable (.exec) does seem to run on a single rank when executed interactively without srun.
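To narrow down the `PMI_Allgather` lookup failure, one option is to check which shared libraries in the environment can actually resolve that symbol. Below is a minimal sketch using Python's `ctypes`; the helper name `resolves_symbol` and any library path passed to it are assumptions for illustration, not something taken from the Legion or GASNet build.

```python
import ctypes


def resolves_symbol(library_path, symbol):
    """Return True if `symbol` can be resolved through the given shared library.

    Passing None as library_path checks the current process instead.
    The lookup goes through dlsym, so a symbol provided by one of the
    library's dependencies also counts as resolvable.
    """
    try:
        lib = ctypes.CDLL(library_path)
    except OSError:
        # The library could not be loaded at all (missing file or
        # unresolved dependencies).
        return False
    return hasattr(lib, symbol)


if __name__ == "__main__":
    # Sanity check against the running process, which links the C library.
    print(resolves_symbol(None, "printf"))
    print(resolves_symbol(None, "PMI_Allgather"))
```

Calling `resolves_symbol` with the path of the PMI library supplied by the loaded cray-pmi module (path not shown here, since it depends on the installation) would indicate whether that library provides `PMI_Allgather`. If it does, the likely gap is that libregent.so was not linked against it or that it is not on the loader path at run time.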

I am using the following modules for these cases:

rocm/4.5.0   gcc/11.2.0   cray-pmi/6.1.3   cray-mpich/8.1.2
