Host memory allocation grows constantly with MPI GPU awareness on #5859

Open
@mszpindler

Description

Host-side memory allocation of a WarpX job grows continuously when MPI GPU awareness is enabled.

This is observed on LUMI (AMD GPUs, gfx90a) using a user's input deck. The reproducer runs on one GPU node (8 devices).

WarpX was built following the LUMI instructions from the documentation.
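
For reference, a minimal sketch of the build configuration, assuming the standard HIP options from the WarpX documentation; the exact module environment and profile script are the ones from the LUMI section of the docs:

```bash
# Minimal sketch of a WarpX HIP build for LUMI; the actual module loads and
# environment profile follow the WarpX documentation for LUMI.
cmake -S . -B build_lumi \
  -DWarpX_COMPUTE=HIP \
  -DWarpX_DIMS=3
# The GPU architecture (gfx90a) is normally picked up from the LUMI environment
# (craype-accel-amd-gfx90a module) rather than hard-coded here.
cmake --build build_lumi -j 16
```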

Steps taken so far:

  • Originally observed with WarpX release 24.01 (and the embedded AMReX 24.01)
  • Recently retested with WarpX 25.03 and standalone AMReX 25.03
  • Tested with AMD ROCm (6.0.3 and 6.2.2) and Cray compilers
  • Tested with a manually installed libfabric 2.0.0 and a recent MPICH (no GTL library)

The only mitigation found so far is to disable GPU support in MPI entirely at runtime (see the job-script sketch after this list):

  • MPICH_GPU_SUPPORT_ENABLED=0 for Cray MPICH
  • MPICH_ENABLE_GPU=0 for vanilla MPICH
  • Turning off GPU IPC alone does not prevent the memory growth
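
A minimal job-script sketch of the workaround, assuming a Slurm batch job on LUMI-G; the executable and input names are placeholders:

```bash
# Disable GPU-aware MPI at runtime to avoid the host memory growth
# (at the cost of staging GPU buffers through the host).
export MPICH_GPU_SUPPORT_ENABLED=0   # Cray MPICH
# export MPICH_ENABLE_GPU=0          # equivalent switch for vanilla MPICH

srun ./warpx.3d inputs               # placeholder executable and input file
```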

The growth was observed with Slurm memory reports, pmap, and valgrind.
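
For completeness, a rough sketch of how the growth can be sampled while the job runs; the process selection below is a placeholder, and any of pmap, Slurm's sstat, or valgrind can be used instead:

```bash
# Sample host RSS of one WarpX rank once per minute while the job runs.
WARPX_PID=$(pgrep -n -f warpx)                 # placeholder process selection
while sleep 60; do
  date
  pmap -x "$WARPX_PID" | tail -n 1             # total resident set size of that rank
  sstat --allsteps -j "$SLURM_JOB_ID" --format=JobID,MaxRSS
done
```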

Metadata

Labels

bug (Something isn't working)
