Conversation

@mwarusz mwarusz commented Oct 16, 2025

This PR adds wrappers for Kokkos hierarchical parallelism, together with tests. It is based on #295, which should be merged first. Since hierarchical parallelism enables many combinations of For/Reduce/Scan patterns, I did not attempt to enable and test every possibility; instead I focused on the patterns we are likely to use in Omega. This PR also doesn't include any of the advanced optimizations I showed before. I figured we should start with the simplest version and optimize after we start using the wrappers.
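
For readers unfamiliar with the pattern, the sketch below shows the kind of raw Kokkos hierarchical parallelism these wrappers cover: an outer team loop over cells with an inner team-level reduction over vertical levels. The array and function names are placeholders, and the wrapper API added by this PR is intentionally not reproduced here.

static void columnSum(const Kokkos::View<double **> &Data,
                      const Kokkos::View<double *> &Sum) {
   using TeamPolicy = Kokkos::TeamPolicy<>;
   using TeamMember = TeamPolicy::member_type;

   const int NCells  = Data.extent_int(0);
   const int NLevels = Data.extent_int(1);

   // one team per cell; the team's threads share the inner reduction over levels
   Kokkos::parallel_for(
       "columnSum", TeamPolicy(NCells, Kokkos::AUTO),
       KOKKOS_LAMBDA(const TeamMember &Member) {
          const int ICell = Member.league_rank();
          double LocalSum = 0;
          Kokkos::parallel_reduce(
              Kokkos::TeamThreadRange(Member, NLevels),
              [=](int K, double &Accum) { Accum += Data(ICell, K); },
              LocalSum);
          // one thread per team writes the result
          Kokkos::single(Kokkos::PerTeam(Member),
                         [=]() { Sum(ICell) = LocalSum; });
       });
}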

Two of the new tests that involve an outer parallel reduce failed on Aurora with the SYCL backend. Updating Kokkos to 4.7.1 made them pass. I decided to disable them with ifdef guards and just wait for E3SM to update its Kokkos version.
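
A minimal sketch of such a guard, using the standard Kokkos version macro (the exact condition and threshold used in the PR are assumptions):

#if defined(KOKKOS_VERSION) && KOKKOS_VERSION >= 40701
// outer parallel reduce tests, enabled only for Kokkos >= 4.7.1
#endif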

The parallel loops developer docs have been expanded and can be viewed here:
https://portal.nersc.gov/project/e3sm/mwarusz/kokkos-hipar/html/devGuide/ParallelLoops.html

Checklist

  • Documentation:
    • Developer's Guide has been updated
    • Documentation has been built locally and changes look as expected
  • Building
    • CMake build does not produce any new warnings from changes in this PR
  • Testing
    • CTest unit tests for new features have been added per the approved design.
    • Unit tests have passed. Please provide a relevant CDash build entry for verification.

@sbrus89 sbrus89 left a comment

@mwarusz, this looks great. The documentation is excellent; I just had a few minor comments.

Comment on lines +33 to +28
static KOKKOS_FUNCTION int f2(int J1, int J2, int N1, int N2) {
   // index-dependent value used to initialize and verify 2D test arrays
   return -(N1 * N2) / 4 + J1 + N1 * J2;
}

static KOKKOS_FUNCTION int f3(int J1, int J2, int J3, int N1, int N2, int N3) {
   // 3D analogue of f2
   return -(N1 * N2 * N3) / 4 + J1 + N1 * (J2 + N2 * J3);
}

Definitely not a big deal, but since these are the same functions used in the flat tests, should we put all the f* functions in a common place?

@mwarusz mwarusz Oct 20, 2025

I don't mind that these functions are repeated, because there is no particular reason for them to be the same as in the flat tests. I am more concerned that hostArraysEqual is defined twice and should probably work on GPU too. I want to make it more general and move it to OmegaKokkos.h, since comparing two arrays might be useful even outside of testing. In general, there is currently no good place to put common test utilities; there is OceanTestCommon.h, but these aren't ocean tests.

I agree this would be broadly useful. The Halo test also uses a similar comparison, and a function like this could be useful for debugging purposes. OmegaKokkos.h is probably the best place for it currently, maybe we should have something like an OmegaUtilities.h header? Some of the other functions in OceanTestCommon.h could be more generally useful too (like the sum and maxVal functions).

@mwarusz (Member Author)

I think adding OmegaUtilities.h is a great idea. A lot of stuff in OceanTestCommon.h needs to be redesigned to deal with vertical coordinates and hierarchical parallelism, which is a good opportunity to create a general-purpose utility header from parts of it. For now, I added arraysEqual to OmegaKokkos.h and updated the Kokkos wrappers and Halo tests to use it.
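
For reference, a device-capable comparison along these lines can be written as a single parallel_reduce that counts mismatches. This is only a 1D sketch of the idea; arraysEqualSketch is a hypothetical name, and the actual signature in OmegaKokkos.h may differ.

template <class ViewType>
bool arraysEqualSketch(const ViewType &A, const ViewType &B) {
   // arrays of different size are never equal
   if (A.extent(0) != B.extent(0))
      return false;
   // count element-wise mismatches on the device
   int NDiffs = 0;
   Kokkos::parallel_reduce(
       "arraysEqual", A.extent_int(0),
       KOKKOS_LAMBDA(int I, int &Count) {
          if (A(I) != B(I))
             ++Count;
       },
       NDiffs);
   return NDiffs == 0;
}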

@brian-oneill

Thanks @mwarusz, this is great. I agree it's probably best to focus on getting the wrappers in first and make sure things are working correctly; then we can focus on optimization. Just to see where things stand without optimization, I ran some performance tests: I replaced the flat parallelFor loops in AuxiliaryState and Tendencies with these HiPar wrappers, ran the driver with the mpaso.icoswisc30e3r2.20230901.nc mesh, and compared against the current dev branch, using one full node in each case.

For GPU: On Frontier, the HiPar loops and the flat loops are pretty comparable, with HiPar maybe a tad faster on average (<~1%). On pm-gpu, the HiPar loops are about 15% slower.

For CPU: On pm-cpu, the HiPar loops are about 5% faster, while on chrysalis the HiPar loops are actually about 40% faster!

sbrus89 commented Oct 17, 2025

Thanks for testing @brian-oneill, those results seem like a solid place to start before any optimization. I'm sure further optimization would be able to address the 15% slowdown on pm-gpu. We can tackle that down the road as you both suggested.

@brian-oneill

The ctests run successfully on Frontier CPU & GPU, pm-cpu, pm-gpu, and chrysalis. I was thinking OMEGA_TEAMSIZE should just be named TeamSize within the code, to be more consistent with our naming conventions, and we could make it configurable at build time like VecLength, with -DOMEGA_TEAMSIZE=xx, using default values if not specified. But that can wait for future updates.
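
For what it's worth, the VecLength-style wiring would be quite small; a hypothetical sketch (the names and the default value are assumptions, not code from this PR):

#ifndef OMEGA_TEAMSIZE
#define OMEGA_TEAMSIZE 64 // default used when -DOMEGA_TEAMSIZE=xx is not given
#endif

namespace OMEGA {
// compile-time constant usable when constructing team policies
inline constexpr int TeamSize = OMEGA_TEAMSIZE;
} // namespace OMEGA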

@sbrus89 sbrus89 left a comment

This looks great. Thanks for adding this @mwarusz.

sbrus89 commented Oct 29, 2025

@mwarusz, can you resolve these conflicts?

@sbrus89 sbrus89 self-assigned this Oct 29, 2025
@mwarusz mwarusz force-pushed the omega/hipar-kokkos-wrappers branch from 47613d6 to 37173f1 on October 29, 2025 15:33

mwarusz commented Oct 29, 2025

@sbrus89

@mwarusz, can you resolve these conflicts?

Done.

@sbrus89 sbrus89 merged commit 785230d into E3SM-Project:develop Oct 29, 2025
1 check passed