Tpetra: GPU-aware MPI toggling is broken on GFX942_APU #14681

@jjellio

Bug Report

Currently, we cannot disable GPU-aware MPI in Tpetra, because the implementation uses a DualView and picks the host or device view based on the runtime toggle. With the APU arch (Kokkos unified memory), that DualView silently aliases the host and device views to avoid copies. For device-aware MPI, this means that when you set GPU_AWARE=OFF (e.g. TPETRA_ASSUME_GPU_AWARE_MPI=0), you still pass a HipSpace buffer to MPI, which causes a segfault or hang.
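To make the failure mode concrete, here is a minimal toy model in plain C++. It is not Kokkos or Tpetra code: `ToyDualView`, `Allocator`, `selectMpiBuffer`, and `safeForMpi` are hypothetical names invented for illustration, and `std::vector` stands in for the real allocations. It only sketches the logic described above, assuming that the toggle selects between two views and that unified memory makes both views refer to the same HIP-allocated buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tag recording who allocated a buffer.
enum class Allocator { HostMalloc, Hip };

struct Buffer {
  std::vector<char> storage;  // stand-in for the real allocation
  Allocator allocator;
};

// Toy stand-in for a dual view: one device-side and one host-side buffer.
// With unified memory, the host view aliases the HIP allocation.
class ToyDualView {
public:
  ToyDualView(std::size_t bytes, bool unifiedMemory)
      : device_{std::vector<char>(bytes), Allocator::Hip},
        host_{std::vector<char>(unifiedMemory ? 0 : bytes),
              Allocator::HostMalloc},
        unified_(unifiedMemory) {}

  const Buffer& deviceView() const { return device_; }
  // In unified mode there is no separate host buffer to hand out.
  const Buffer& hostView() const { return unified_ ? device_ : host_; }

private:
  Buffer device_;
  Buffer host_;
  bool unified_;
};

// The runtime toggle decides which view is handed to MPI.
inline const Buffer& selectMpiBuffer(const ToyDualView& dv,
                                     bool assumeGpuAwareMpi) {
  return assumeGpuAwareMpi ? dv.deviceView() : dv.hostView();
}

// Assumption of this model: an MPI without GPU support can only safely
// take buffers that came from the host allocator.
inline bool safeForMpi(const Buffer& b, bool mpiGpuSupportEnabled) {
  return mpiGpuSupportEnabled || b.allocator == Allocator::HostMalloc;
}
```

In this model, `selectMpiBuffer(ToyDualView(64, false), false)` yields a host-allocated buffer, while `selectMpiBuffer(ToyDualView(64, true), false)` still yields the HIP-allocated one, so turning the toggle off changes nothing in the unified-memory case.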

Description

Using the GFX942_APU arch and running with
MPICH_GPU_SUPPORT_ENABLED=0 TPETRA_ASSUME_GPU_AWARE_MPI=0 flux run -x -N1 -n1 ./app
results in a crash at launch.

Swapping to GPU-aware ON runs fine:

MPICH_GPU_SUPPORT_ENABLED=1 TPETRA_ASSUME_GPU_AWARE_MPI=1 flux run -x -N1 -n1 ./app

If you build without the APU arch, both cases run fine.

And yes, if you feel that host and device memory are both addressable here: they are. But MPI implementations all decide how to treat a buffer based on who allocated the memory (AMD/Nvidia vs. malloc).

This currently prevents running without GPU-aware ON, which is functionality we need if we want to use ROCm 7.x (MPI does not currently support ROCm 7.x; major releases typically have a 4-5 month lag before MPI supports them).

@trilinos/tpetra
@rppawlo @cgcgcg

    Labels

    pkg: Tpetra
    type: bug (The primary issue is a bug in Trilinos code or tests)
