Enable OpenMP GPU-to-GPU MPI blocking transfers

**Is your feature request related to a problem? Please describe.**

We've started porting MOM6 to GPUs using OpenMP target offload. Currently, halo updates copy data back to the CPU so that it can be communicated via MPI. We want to update FMS (mainly the mpp parts) to perform the transfers directly between GPUs, avoiding the CPU-GPU round trips that are currently hurting our performance significantly.

**Describe the solution you'd like**

For now, porting blocking communications should be sufficient. This requires a few changes:
* An optional flag in `mpp_do_group_update` to select whether the transfer happens on the CPU or the GPU.
  * This flag would need to be passed down to the subroutines it calls, such as `mpp_send`/`mpp_recv` and, consequently, `mpp_transmit`.
  * The flag can be optional with a default, so existing code that uses `mpp_do_group_update` doesn't need to be updated.
* In the packing/unpacking step, OpenMP target directives need to be added to allocate `buffer` on the GPU and to pack it with GPU data. These directives would execute conditionally, based on the value of the flag above.
  * The packing/unpacking loops probably need to be modified to enable GPU parallelization.
* The MPI calls would need to be wrapped in OpenMP directives to ensure the GPU version of the variable is used (again conditional on the flag above).
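To make the first two points concrete, here is a minimal sketch of an optional flag with a default plus a GPU-parallelizable pack/unpack loop. It is not the FMS code: the module `halo_pack_sketch`, the routines `pack_halo`/`unpack_halo` and their argument lists are hypothetical, and only the flag name `omp_offload` is borrowed from my fork. The `map` clauses are there so the sketch runs standalone; in a real port I'd expect the data to already be resident on the device.

```fortran
module halo_pack_sketch
  use iso_fortran_env, only: real64
  implicit none
  private
  public :: pack_halo, unpack_halo

contains

  !> Copy field(is:ie, js:je) into a 1-D buffer, on the GPU if requested.
  subroutine pack_halo(field, buffer, is, ie, js, je, omp_offload)
    real(real64), intent(in)    :: field(:,:)
    real(real64), intent(inout) :: buffer(:)
    integer,      intent(in)    :: is, ie, js, je
    logical,      intent(in), optional :: omp_offload
    logical :: use_gpu
    integer :: i, j, ni

    ! Optional flag with a default: callers that omit it keep the CPU path.
    use_gpu = .false.
    if (present(omp_offload)) use_gpu = omp_offload
    ni = ie - is + 1

    ! The buffer index is computed from (i, j) instead of a running counter,
    ! so every iteration is independent and the loop nest can be collapsed.
    ! With use_gpu false, the if clause makes the region run on the host.
    !$omp target teams distribute parallel do collapse(2) if(use_gpu) &
    !$omp map(to: field) map(tofrom: buffer)
    do j = js, je
      do i = is, ie
        buffer((j - js)*ni + (i - is) + 1) = field(i, j)
      end do
    end do
  end subroutine pack_halo

  !> Inverse of pack_halo: copy the 1-D buffer back into field(is:ie, js:je).
  subroutine unpack_halo(buffer, field, is, ie, js, je, omp_offload)
    real(real64), intent(in)    :: buffer(:)
    real(real64), intent(inout) :: field(:,:)
    integer,      intent(in)    :: is, ie, js, je
    logical,      intent(in), optional :: omp_offload
    logical :: use_gpu
    integer :: i, j, ni

    use_gpu = .false.
    if (present(omp_offload)) use_gpu = omp_offload
    ni = ie - is + 1

    !$omp target teams distribute parallel do collapse(2) if(use_gpu) &
    !$omp map(to: buffer) map(tofrom: field)
    do j = js, je
      do i = is, ie
        field(i, j) = buffer((j - js)*ni + (i - is) + 1)
      end do
    end do
  end subroutine unpack_halo

end module halo_pack_sketch
```

Existing call sites such as `call pack_halo(f, buf, is, ie, js, je)` are unchanged, while a GPU caller adds `omp_offload=.true.`. Compiled without OpenMP, the directives are plain comments and both paths run serially on the CPU.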

**Describe alternatives you've considered**

So far, we've been shipping data from the GPU back to the CPU, which is terrible for performance.

**Additional context**

I've made a first attempt at these changes in [my fork of FMS](https://github.com/edoyango/FMS/compare/bd32c3b...2025.03-ompoffload), which hopefully illustrates the changes needed. In my changes, `mpp_do_group_update` has an `omp_offload` flag which, if set to `.true.`, does the packing and communication on the GPU (assuming the MPI library is GPU-aware). These changes have been tested with Open MPI 5.0.8 built with CUDA 12.9, compiled with nvfortran 24.9 using the flags `-mp=gpu -gpu=mem:separate -O1 -Mnofma`.
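For readers unfamiliar with the pattern, the sketch below shows one common way to hand a device buffer to a GPU-aware MPI from OpenMP: a `target data use_device_addr` region around the blocking call. It is an illustration under assumptions rather than the code in the fork: the module `gpu_aware_exchange_sketch`, the routine `exchange_buffers` and its arguments are hypothetical, the buffers are assumed to be mapped to the device by the caller already, and the GPU branch only works if the MPI library really is GPU-aware.

```fortran
module gpu_aware_exchange_sketch
  use mpi
  implicit none
  private
  public :: exchange_buffers

contains

  !> Blocking exchange of n values with rank `peer` (hypothetical helper).
  subroutine exchange_buffers(send_buf, recv_buf, n, peer, comm, omp_offload)
    integer,          intent(in)    :: n, peer, comm
    ! Explicit-shape dummies are contiguous, so no temporary copy is made
    ! when they are passed on to MPI.
    double precision, intent(in)    :: send_buf(n)
    double precision, intent(inout) :: recv_buf(n)
    logical,          intent(in), optional :: omp_offload
    integer, parameter :: tag = 1
    logical :: use_gpu
    integer :: ierr

    ! Default to the existing CPU behaviour when the flag is absent.
    use_gpu = .false.
    if (present(omp_offload)) use_gpu = omp_offload

    if (use_gpu) then
      ! Inside this region send_buf/recv_buf refer to their device addresses,
      ! so MPI reads and writes GPU memory directly. Assumes the caller has
      ! already mapped both buffers and that the MPI library is GPU-aware.
      !$omp target data use_device_addr(send_buf, recv_buf)
      call MPI_Sendrecv(send_buf, n, MPI_DOUBLE_PRECISION, peer, tag, &
                        recv_buf, n, MPI_DOUBLE_PRECISION, peer, tag, &
                        comm, MPI_STATUS_IGNORE, ierr)
      !$omp end target data
    else
      ! Host path: MPI sees the ordinary host addresses.
      call MPI_Sendrecv(send_buf, n, MPI_DOUBLE_PRECISION, peer, tag, &
                        recv_buf, n, MPI_DOUBLE_PRECISION, peer, tag, &
                        comm, MPI_STATUS_IGNORE, ierr)
    end if
  end subroutine exchange_buffers

end module gpu_aware_exchange_sketch
```

The two branches are written out explicitly rather than relying on an `if` clause on `target data`, so the host path never touches a device address.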

The set of changes in my fork is just enough to get our test case working, which only uses the real versions of the calls. More changes are needed to cover more cases, especially in the packing/unpacking step.

I'll be opening a pull request once I've ported the remaining packing loops.

Enable OpenMP GPU-to-GPU MPI blocking transfers #1771

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Enable OpenMP GPU-to-GPU MPI blocking transfers #1771

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions