Bug Report
Currently, we cannot disable GPU-aware MPI in Tpetra, because the implementation uses a DualView and chooses the host or device view based on the runtime toggle. With an APU arch (Kokkos unified memory), the DualView silently aliases the host and device views to avoid copies. But in the case of device-aware MPI, this means that when you set GPU_AWARE=OFF, you still pass a HIPSpace buffer to MPI, which will cause a segfault or hang.
Description
Using the GFX942_APU arch and running with
MPICH_GPU_SUPPORT_ENABLED=0 TPETRA_ASSUME_GPU_AWARE_MPI=0 flux run -x -N1 -n1 ./app
will result in a crash at launch.
Swapping to GPU-aware ON runs fine:
MPICH_GPU_SUPPORT_ENABLED=1 TPETRA_ASSUME_GPU_AWARE_MPI=1 flux run -x -N1 -n1 ./app
If you build without the APU arch, both cases run fine.
And yes, if you feel the host and device memory should both be addressable: it is. But MPI implementations all decide based on who allocated the memory (the AMD/Nvidia runtime allocators vs. malloc).
This currently prohibits running without GPU-aware MPI, which is functionality we need if we want to use ROCm 7.x (MPI does not yet support ROCm 7.x; major ROCm releases typically have a 4-5 month lag before MPI supports them).