Description
Host-side memory allocation of a WarpX job grows continuously when GPU-aware MPI is enabled.
This is observed on LUMI (AMD gfx90a GPUs) using the user's input. The reproducer runs on one GPU node (8 devices).
WarpX was built following the LUMI instructions in the documentation.
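For context, a minimal sketch of what such a build looks like; the module line and build paths here are assumptions (the authoritative steps are in the WarpX LUMI documentation), while `WarpX_COMPUTE=HIP` and `AMReX_AMD_ARCH` are standard WarpX/AMReX CMake options:

```bash
# Sketch of a HIP build of WarpX for LUMI (gfx90a); the module name is a
# placeholder -- follow the WarpX LUMI documentation for the real set.
module load rocm
cmake -S . -B build \
    -DWarpX_COMPUTE=HIP \
    -DAMReX_AMD_ARCH=gfx90a \
    -DWarpX_MPI=ON
cmake --build build -j 16
```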
Steps taken so far:
- Originally observed with WarpX release 24.01 (and the embedded AMReX 24.01)
- Recently tested with WarpX release 25.03 and standalone AMReX 25.03
- Tested with AMD ROCm (6.0.3 and 6.2.2) and Cray compilers
- Tested with manually installed libfabric 2.0.0 and recent MPICH (no GTL library)
- The only mitigation found so far is to disable MPI GPU support entirely at runtime (see the job-script sketch below): `MPICH_GPU_SUPPORT_ENABLED=0` for Cray MPICH, `MPICH_ENABLE_GPU=0` for vanilla MPICH
- Turning off GPU IPC alone does not prevent the memory growth
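For illustration, a hypothetical Slurm job-script fragment showing where the mitigation is applied at runtime; the binary and input names are assumptions, not taken from the report:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8            # 8 GCDs on a LUMI-G node

# Mitigation: disable GPU-aware MPI entirely at runtime.
export MPICH_GPU_SUPPORT_ENABLED=0   # Cray MPICH
# export MPICH_ENABLE_GPU=0          # equivalent for vanilla MPICH

srun -n 8 ./warpx.3d inputs          # hypothetical binary/input names
```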
The growth was observed via Slurm memory reports, pmap, and Valgrind.
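One simple way to watch the host-side growth is to sample the resident set size of a rank while the job runs; a sketch using standard tools (the process name and sampling interval are assumptions):

```bash
# Sample the host RSS of one WarpX rank every 10 s; the last line of
# `pmap -x` reports the process totals, including RSS in KiB.
pid=$(pgrep -n -f warpx)             # hypothetical process name
while kill -0 "$pid" 2>/dev/null; do
    pmap -x "$pid" | tail -n 1
    sleep 10
done
```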