
Enable OpenMP GPU-to-GPU MPI blocking transfers #1771

@edoyango

Is your feature request related to a problem? Please describe.

We've started porting MOM6 to GPUs using OpenMP target offload. Currently, halo updates copy data back to the CPU so that it can be communicated via MPI. We want to update FMS (mainly the mpp parts) to transfer data directly between GPUs, avoiding the round trips between CPU and GPU that are currently a significant drag on our performance.

Describe the solution you'd like

For now, porting blocking communications should be sufficient. This requires a few changes:

  • Add an optional flag to mpp_do_group_update that selects whether a CPU or GPU transfer should happen.
    • The flag would need to be passed down to the called subroutines, such as mpp_send/mpp_recv and, in turn, mpp_transmit.
    • The flag can be optional with a default, so existing code that uses mpp_do_group_update doesn't need to be updated.
  • In the packing/unpacking step, add OpenMP target directives that allocate the buffer on the GPU and pack it with GPU data; these would execute only when the flag above is set.
    • The packing/unpacking loops probably need to be restructured to enable GPU parallelization.
  • Wrap the MPI calls in OpenMP directives so that the GPU copy of the buffer is the one passed to MPI (again conditional on the flag above).
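
To make the three changes above more concrete, here is a minimal sketch of the pattern. It is written in C rather than Fortran to keep it short and standalone, and it is not the code in the fork: pack_east_halo, send_east_halo, the USE_MPI guard, and the row-major halo layout are all hypothetical names and assumptions chosen for illustration. The flag is called omp_offload only to match the description in this issue. In Fortran the same directives would be written with the !$omp sentinel, and use_device_addr would likely take the place of use_device_ptr for ordinary arrays.

```c
#ifdef USE_MPI            /* hypothetical guard so the sketch also builds without MPI */
#include <mpi.h>
#endif

/* Hypothetical helper, not an FMS routine.
 * Packs the easternmost interior columns of a row-major field
 * (ny rows by nx columns, halo width nh on each side) into buf.
 * Assumes nx >= 2*nh. buf must hold nh*ny values. */
void pack_east_halo(const double *field, double *buf,
                    int nx, int ny, int nh, int omp_offload)
{
    /* collapse(2) exposes both loop levels to the device.
     * if(target: ...) runs the same loop on the host when the flag is 0.
     * If field/buf are already present on the device, the map clauses
     * only bump reference counts and no copy is made. */
    #pragma omp target teams distribute parallel for collapse(2) \
        if(target: omp_offload) map(to: field[0:nx*ny]) map(from: buf[0:nh*ny])
    for (int j = 0; j < ny; ++j) {
        for (int i = 0; i < nh; ++i) {
            buf[j * nh + i] = field[j * nx + (nx - 2 * nh + i)];
        }
    }
}

#ifdef USE_MPI
/* Hypothetical blocking send of the packed halo.
 * Assumes a GPU-aware MPI library when omp_offload is nonzero. */
void send_east_halo(const double *field, double *buf,
                    int nx, int ny, int nh, int omp_offload,
                    int dest, int tag, MPI_Comm comm)
{
    int n = nh * ny;

    /* Allocate the buffer on the device so the packed data stays there. */
    #pragma omp target enter data map(alloc: buf[0:n]) if(omp_offload)

    pack_east_halo(field, buf, nx, ny, nh, omp_offload);

    /* Inside this region buf holds the device address, so MPI reads
     * GPU memory directly. With the flag off, buf stays a host pointer.
     * A target data construct with only use_device_ptr needs OpenMP 5.0. */
    #pragma omp target data use_device_ptr(buf) if(omp_offload)
    {
        MPI_Send(buf, n, MPI_DOUBLE, dest, tag, comm);
    }

    #pragma omp target exit data map(delete: buf[0:n]) if(omp_offload)
}
#endif
```

Compiled without OpenMP, the pragmas are ignored and the packing loop runs on the host, which makes it easy to check that the CPU and GPU paths pack identical values. The sketch does not show the default for the flag: C has no optional arguments, whereas the Fortran version could use an optional dummy argument tested with present().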

Describe alternatives you've considered

So far, we've been shipping data from the GPU back to the CPU, which is terrible for performance.

Additional context

I've made a first attempt at these changes in my fork of FMS, which hopefully illustrates what is needed. In my changes, mpp_do_group_update takes an omp_offload flag which, if set to .true., does the packing and communication on the GPU (assuming the MPI library is GPU-aware). These changes have been tested with Open MPI 5.0.8 built with CUDA 12.9, compiled with nvfortran 24.9 using the flags -mp=gpu -gpu=mem:separate -O1 -Mnofma.

The changes in my fork are just enough to get our test case working, which only uses the real versions of the calls. More changes are needed to cover other cases, especially in the packing/unpacking step.

I'll be opening a pull request once I've ported the remaining packing loops.

Labels: enhancement
