Skip to content

Conversation

@mjschmidt271
Copy link
Contributor

This PR generalizes the GPU-related cmake bits and bobs so that it can handle being configured for HIP/AMD GPUs.

Haero built and passed tests for me on one of caraway's MI200 node, using amdclang v17.0.0 and hipcc/rocm v6.0

Note

  • This introduces a new cmake configuration variable, HAERO_DEVICE_ARCH, that is set automatically if DEVICE_ARCH is set.
    • This is a little bit hacky, but is employed at ${haero_root}/CMakeLists.txt:176 and seemed necessary because the name of the kokkos device flag, KOKKOS_ARCH_$DEVICE_ARCH, is not known ahead of time.
    • There's probably a more elegant solution, but this gets things working.
  • This was my first time building for AMD, so I just added things until it worked and can't guarantee 100% ideal configuration.
  • I noticed in the kokkos docs that the AMD_GFX<N> architecture flags are preferred, so I added the current ones to the setup script.

@bartgol I haven't taken a close look at EAMxx's strategy for building Haero, but with any luck, adding the HAERO_DEVICE_ARCH should have you off to the races.

Closes #488

Copy link
Collaborator

@jeff-cohere jeff-cohere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. I left a few minor comments for you to consider.

@bartgol
Copy link
Contributor

bartgol commented May 5, 2025

Thanks for putting this together. One thing that was puzzling was that the current code worked for the 1st cmake run. If I trigger cmake to run again (but not explicitly, e.g., via a modification of a cmake-related file), then the error would appear. Anyhow, this looks good to me.

@mjschmidt271
Copy link
Contributor Author

@bartgol Thank you for the tips!

And that's very strange about the second-time cmake config failing... I feel like the last time I saw this it was a matter of a cmake configure-file not being removed or rebuilt on the re-configure of cmake. Let me know if you see it again, and I'll try to run it down

@mjschmidt271 mjschmidt271 merged commit a3680d1 into main May 5, 2025
5 checks passed
@mjschmidt271 mjschmidt271 deleted the mjs/generalize_4_HIP branch May 5, 2025 17:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Haero is assuming that GPU=CUDA

4 participants