Description
I am running in-house Regent/Pygion/Legion codes on Tioga (an early-access system for El Capitan at LLNL) with AMD GPUs, using the control_replication branch of Legion (28db23b4) and various versions of GASNet. I am getting the following errors.
- With GASNet-2022.3.0:
*** FATAL ERROR (proc 0): in gasnetc_segment_register() at anguage/gasnet/GASNet-2022.3.0/ofi-conduit/gasnet_ofi.c:1246: fi_mr_reg for rdma failed: -22(Invalid argument)
A trace of the single-rank case is here: trace.dat.
This error shows up frequently with a single rank and almost always with multiple ranks. It did not happen before, so I wonder whether it is due to recent changes in the machine configuration or to some other compatibility issue.
A similar application worked last year with an older Legion commit (5a77dcbf) and GASNet-2022.3.0. The app also works fine on various machines with NVIDIA GPUs and Intel or IBM CPUs.
- With GASNet-2022.9.2:
test.exec: symbol lookup error: [Legion]/bindings/regent/libregent.so: undefined symbol: PMI_Allgather
Naturally, the application does not even start, regardless of the number of ranks used. However, the app's executable (.exec) seems to run on a single rank when executed interactively without srun.
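For anyone trying to reproduce the symbol lookup error outside the application, here is a minimal sketch of a check that loads a shared library with immediate symbol binding, so that any unresolved symbol (such as PMI_Allgather above) is reported at load time rather than at first call. The helper name `try_load` is hypothetical, and the library path is passed on the command line; this is not part of Legion or GASNet, and I have not confirmed that it reproduces the failure on Tioga.

```python
import ctypes
import os
import sys


def try_load(path):
    """Try to dlopen a shared library with immediate symbol binding.

    Returns (True, "") if the library loads, otherwise
    (False, <error message from the dynamic loader>).
    """
    try:
        # RTLD_NOW asks the loader to resolve all undefined symbols now,
        # instead of lazily when a function is first called.
        ctypes.CDLL(path, mode=os.RTLD_NOW)
    except OSError as err:
        return False, str(err)
    return True, ""


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_lib.py <path to libregent.so>
    if len(sys.argv) > 1:
        ok, message = try_load(sys.argv[1])
        print("loaded OK" if ok else "load failed: " + message)
```

Running this once inside an srun allocation and once in the interactive shell, with the same modules loaded, may show whether the PMI library is visible to the loader in one environment but not the other.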
I am using the following modules for these cases:
rocm/4.5.0 gcc/11.2.0 cray-pmi/6.1.3 cray-mpich/8.1.2