Tpetra: GPU-aware MPI toggling is broken on GFX942_APU

## Bug Report
Currently, we cannot disable GPU-aware MPI in Tpetra on this architecture. The implementation uses a DualView and picks the host or device view based on the runtime toggle. With the APU arch (Kokkos unified memory), that DualView silently combines the host and device views to avoid copies. As a result, when you set GPU_AWARE=OFF you still pass a HipSpace buffer to MPI, which causes a segfault or hang.
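To make the failure mode concrete, here is a minimal toy model of the behavior described above. It does not use Kokkos or Tpetra: `ToyView`, `ToyDualView`, `MemSpace`, and `selectSendBuffer` are hypothetical names invented for illustration, and the real selection logic in Tpetra may be structured differently. It only shows why choosing the "host" side of an aliased dual view still yields a buffer that came from the device allocator.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical tag recording which allocator produced a buffer.
enum class MemSpace { Host, Hip };

// Toy stand-in for a view: shared storage plus its allocator tag.
struct ToyView {
  std::shared_ptr<std::vector<double>> data;
  MemSpace space;
};

// Toy stand-in for a dual view. With unified memory the host side is
// just an alias of the device side, so no separate host buffer exists.
struct ToyDualView {
  ToyView d_view;
  ToyView h_view;

  ToyDualView(std::size_t n, bool unifiedMemory)
      : d_view{std::make_shared<std::vector<double>>(n), MemSpace::Hip},
        h_view(unifiedMemory
                   ? d_view  // alias: same storage, same allocator tag
                   : ToyView{std::make_shared<std::vector<double>>(n),
                             MemSpace::Host}) {}
};

// Assumed shape of the runtime toggle: pick the device view when
// GPU-aware MPI is assumed, otherwise pick the host view.
inline const ToyView& selectSendBuffer(const ToyDualView& dv,
                                       bool assumeGpuAwareMpi) {
  return assumeGpuAwareMpi ? dv.d_view : dv.h_view;
}
```

In this model, `selectSendBuffer(ToyDualView(4, false), false)` returns a buffer tagged `MemSpace::Host`, while the same call on `ToyDualView(4, true)` returns one tagged `MemSpace::Hip`, even though GPU-aware MPI was requested off. That mirrors the report: the toggle is honored, but it no longer changes which allocation reaches MPI.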

### Description
Using the GFX942_APU arch, running with
`MPICH_GPU_SUPPORT_ENABLED=0 TPETRA_ASSUME_GPU_AWARE_MPI=0 flux run -x -N1 -n1 ./app`
results in a crash at launch.

Switching GPU-aware MPI on runs fine:

`MPICH_GPU_SUPPORT_ENABLED=1 TPETRA_ASSUME_GPU_AWARE_MPI=1 flux run -x -N1 -n1 ./app`

If you build without the `APU` arch, both cases run fine.

And yes, if you feel the host and device memory are both addressable from either side, they are. But MPI implementations all decide how to handle a buffer based on who allocated it (the AMD/NVIDIA runtime vs. malloc).

This currently prevents running with GPU-aware MPI off, which is functionality we need if we want to use ROCm 7.x: MPI does not currently support ROCm 7.x, and MPI support typically lags a major ROCm release by 4-5 months.

@trilinos/tpetra 
@rppawlo @cgcgcg 

Tpetra: GPU-aware MPI toggling is broken on GFX942_APU #14681
