-
Notifications
You must be signed in to change notification settings - Fork 7
Add wrappers for Kokkos hierarchical parallelism #299
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add wrappers for Kokkos hierarchical parallelism #299
Conversation
sbrus89
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mwarusz, this looks great. The documentation is excellent; I just had a few minor comments.
| static KOKKOS_FUNCTION int f2(int J1, int J2, int N1, int N2) { | ||
| return -(N1 * N2) / 4 + J1 + N1 * J2; | ||
| } | ||
|
|
||
| static KOKKOS_FUNCTION int f3(int J1, int J2, int J3, int N1, int N2, int N3) { | ||
| return -(N1 * N2 * N3) / 4 + J1 + N1 * (J2 + N2 * J3); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely not a big deal, but since these are the same functions used in the flat tests, should we put all the f* functions in a common place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't particularly mind that these functions are repeated. This is because there is no particular reason for them to be the same as in the flat tests. I am more concerned that hostArraysEqual is defined twice, and should probably work on GPU too. I want to make it more general and move it to OmegaKokkos.h, since comparing two arrays might be useful even outside of testing. In general, there is currently no good place to put common test utilities. There is OceanTestCommon.h, but these aren't ocean tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this would be broadly useful. The Halo test also uses a similar comparison, and a function like this could be useful for debugging purposes. OmegaKokkos.h is probably the best place for it currently, maybe we should have something like an OmegaUtilities.h header? Some of the other functions in OceanTestCommon.h could be more generally useful too (like the sum and maxVal functions).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding OmegaUtilities.h is a great idea. A lot of stuff in OceanTestCommon.h needs to be redesigned to deal with vertical coordinates and hierarchical parallelism, which is a good opportunity to create a general-purpose utility header from parts of it. For now, I added arraysEqual to OmegaKokkos.h and updated the Kokkos wrappers and Halo tests to use it.
|
Thanks @mwarusz, this is great. I agree it's probably best to focus on getting the wrappers in first and make sure we have things working accurately, then we can focus on optimization. Just to see where things stand without optimization, I did some performance tests where I replaced the flat For GPU: On Frontier, the HiPar loops and the flat loops are pretty comparable, with HiPar maybe a tad faster on average (<~1%). On pm-gpu, the HiPar loops are about 15% slower. For CPU: On pm-cpu, the HiPar loops are about 5% faster, while on chrysalis the HiPar loops are actually about 40% faster! |
|
Thanks for testing @brian-oneill, those results seem like a solid place to start before any optimization. I'm sure further optimization would be able to address the 15% slowdown on pm-gpu. We can tackle that down the road as you both suggested. |
|
The ctests run successfully on frontier CPU & GPU, pm-cpu, pm-gpu, and chrysalis. I was thinking |
sbrus89
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great. Thanks for adding this @mwarusz.
|
@mwarusz, can you resolve these conflicts? |
47613d6 to
37173f1
Compare
This PR adds wrappers for Kokkos hierarchical parallelism together with tests. It is based on top of #295, which should be merged first. Since hierarchical parallelism enables many combinations of For/Reduce/Scan patterns I did not attempt to enable and test every possibility, but instead focused on the patterns that we are likely to use in Omega. This PR also doesn't include any advanced optimizations that I showed before. I figured we should start with the simplest version and optimize after we start using the wrappers.
Two of the new tests that involve outer parallel reduce failed on Aurora with the SYCL backed. Updating Kokkos to 4.7.1
made them pass. I decided to disable them with
ifdefguards and just wait for E3SM to update its Kokkos version.The parallel loops developer docs have been expanded and can be viewed here:
https://portal.nersc.gov/project/e3sm/mwarusz/kokkos-hipar/html/devGuide/ParallelLoops.html
Checklist