-
Notifications
You must be signed in to change notification settings - Fork 446
EAMxx: Kokkos 4.5 update, fixes and workarounds for Kokkos update, Aurora support #6916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
this commit is running on sunspot |
|
i should have not committed to here turning off compose and mam, will remove once this PR is finalized. |
|
Looks good. A couple of points:
|
| if (Kokkos_ENABLE_SYCL) | ||
| #enable_language(SYCL) | ||
| set (EAMXX_ENABLE_GPU TRUE CACHE BOOL "" FORCE) | ||
| set (SYCL_BUILD TRUE CACHE BOOL "" FORCE) #needed for yakl if kokkos vars are not visible there? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could turn off COMPOSE here, I believe. (I'll fix the COMPOSE issues once I'm on the machine.)
| set(BUILD_HOMME_PRIM OFF CACHE BOOL "") | ||
| set(HOMME_ENABLE_COMPOSE ON CACHE BOOL "") | ||
| #set(HOMME_ENABLE_COMPOSE ON CACHE BOOL "") | ||
| set(HOMME_ENABLE_COMPOSE OFF CACHE BOOL "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noting here this is what needs to be removed from this PR.
|
|
||
| #<<<<<<< HEAD | ||
| #find_library(NETCDF_C netcdf HINTS $ENV{NETCDF_C_PATH}/lib) | ||
| #target_link_libraries(scream_rrtmgp_yakl ${NETCDF_C} rrtmgp scream_share) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain this change? The second line I'm guessing is due to Kokkos, but what about the NETCDF_C line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i need to ask Jim about this -- iirc i had to set io libs in general, but also some cmake variables for io for rrtmgp separately. so i left this file messy, because i still need to sort it out.
|
i id not understant EKAT comment -- i have an ekat branch, oksanaguba/spot, which i maintained only to point to a different kokkos. so i periodically merge master into it, but not too frequently, so it may be behind the ekat used in master, but its kokkos points not to e3sm kokkos, but to kokkos develop branch. this is another issue to sort out (but i would not know how). |
Re: EKAT, it appears there are missing commits on your branch w.r.t. the submodule point in E3SM: E3SM-Project/EKAT@oksanaguba/spot...4231383 @bartgol am I seeing that correctly? Also, is current EKAT master ready to be used in E3SM? If so, Oksana, I recommend rebasing your ekat branch on master, then pointing the ekat submodule to that cleaned-up branch. More generally, testing across machines will help to flush out issues, if there are any, including any with ekat. |
|
there are some December commits in the diff you posted, yes, at least those are expected. my ekat branch was not updated too frequently. will do what you suggested, thanks. |
|
As an example, here is one commit I'm either confused about or is truly missing from your branch: E3SM-Project/EKAT@5804bfa. This commit was merged in late Nov 2024, which is why I'm thinking there's a fair bit missing. Edit: I think EKAT PR 347 was the last merged into your branch. PRs 349-356 are missing. (357-359 are in EKAT master but not used yet in E3SM, and 348 was closed without merging.) Thus, what I wrote above is probably the correct solution: Rebase your branch on EKAT master. |
bartgol
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. I have a couple of questions/remarks, but nothing too important. I'm going to approve, modulo the fix of EKAT submodule that you are already working on.
| string(APPEND CMAKE_CXX_FLAGS_RELEASE " -O2") | ||
| string(APPEND CMAKE_Fortran_FLAGS_DEBUG " -O0 -g -fpe0") | ||
|
|
||
| #adding -g here leads to linker internal errors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And yet I see -g below. Is this comment outdated, and link is fine now? If so, we should remove it to avoid confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"just -g" is in debug flags only, and i haven't tried to build debug. so i only changed release flags.
i was able to compile and to run with "just -g" on sunspot, however, the folder size for e3sm build gets from say 4 gb to 140 gb with full -g. last i tried "just -g" on aurora, there were linking issues. so this is all WIP. i'd say keep this as is now and sort it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I meant is that the comment is confusing: it makes it sound like we should not use -g b/c of link errors, but we do use it across the board. Maybe the comment needs to be clarified. Something like "WARNING: the -g flag seems to have a wild behavior sometimes, with side effects ranging from VERY large binaries to link errors. If you experience errors, remove it".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, done
| #endif | ||
|
|
||
| #ifdef KOKKOS_ENABLE_SYCL | ||
| const int team_size = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am puzzled by this. Do we really want just 4 on sycl? It's staggeringly different from other GPU platforms' default...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the change is from Nov 2023. i looked at my notes and did not see why it was needed. i will try to remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for the P-D remapper (not the GllFvPhys one), so its relevance is somewhat limited (doesn't run at runtime, usually). So not a huge deal. But if things work without, no point in limiting team size to 4.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
used hip TS calculation and it worked at least for ne4 run.
| #find_library(NETCDF_C netcdf HINTS $ENV{NETCDF_C_PATH}/lib) | ||
| #target_link_libraries(scream_rrtmgp_yakl ${NETCDF_C} rrtmgp scream_share) | ||
| #======= | ||
| find_library(NETCDF_C netcdf HINTS ${NetCDF_C_PATH}/lib) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While at it, do you mind adding ${NetCDF_C_PATH}/lib64 to the hints? Since lib64 is quite common, we should be proactive and add it to the list of hints (e..g, my laptop and workstation use lib64)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jgfouca somewhere here maybe:
CMake Warning at eamxx/src/physics/rrtmgp/CMakeLists.txt:191 (target_link_libraries):
Target "eamxx_rrtmgp_interface" requests linking to directory
"/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002".
Targets may link only to libraries. CMake is dropping the item.
CMake Warning at eamxx/src/physics/rrtmgp/CMakeLists.txt:191 (target_link_libraries):
Target "scream_rrtmgp" requests linking to directory
"/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002".
Targets may link only to libraries. CMake is dropping the item.
CMake Warning at eamxx/src/physics/rrtmgp/CMakeLists.txt:191 (target_link_libraries):
Target "atm" requests linking to directory
"/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002".
Targets may link only to libraries. CMake is dropping the item.
CMake Warning at eamxx/src/physics/rrtmgp/CMakeLists.txt:191 (target_link_libraries):
Target "e3sm.exe" requests linking to directory
"/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002".
Targets may link only to libraries. CMake is dropping the item.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code looks wrong. If we are building as part of a E3SM/CIME case, there's no need to find netcdf, it should already be a target. We can check if the netcdf target already exists or check if SCREAM_CIME_BUILD is on and not do this find if either is true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note, this is issue is not related specifically to this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @amametjanov (machine poc) as I haven't seen this problem on other machines, so maybe something specific to aurora? Not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we on ly link to netcdf here because of the unit tests, right? we don't need this stuff at actual runtime. Maybe I am wrong, I don't know
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably, but the source won't build without netcdf. We only need to do this find if we are standalone eamxx.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jgfouca IMO, it's fine to call find_package (and the likes) regradless of the value of SCREAM_CIME_BUILD... When the pkg/lib/path has alreday been found, it should be a no-op.
That said, the warning reported by Naser is indeed wrong, as it's likning a path instead of a tgt/lib...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it fine to mix find_package and find_library?
|
Looks like you need to merge ekat master into your branch one more time to get the macro name fix. |
|
ran this commit with kokkos 4.5.01 tag |
ambrad
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great.
Note that SCREAM nightlies run on master, not next, so I always run e3sm_scream_v1_medres on Frontier, Chrysalis, and pm-gpu before merging a complex PR to avoid having to fix master later. I recommend it to others, as well.
|
I added Conrad to the reviewers since he's working on upgrading Kokkos for multiple reasons, including Aurora readiness. |
|
CI is failing with a checkout on the ekat submodule. Looking at the branch, https://github.com/E3SM-Project/EKAT/commits/oksanaguba/spot/, it's unclear where 1eae1a is coming from. @oksanaguba I suggest checking the submodule state in this PR w.r.t. to your ekat branch. |
|
@ambrad my ekat branch had to use kokkos main repo. if CI somehow does not know to add a remote (the default remote in ekat is poinitng to e3sm kokkos (whatever it is)), then it would not be able to fetch my commit. |
I'm sure I'm confused, but on your ekat branch I see this as the Kokkos submodule: https://github.com/E3SM-Project/kokkos/tree/e6e5c4598d16d756db62225ab7f937ee833bd660. This appears to be in E3SM-Project/kokkos. |
|
Re: commit 1eae1a, I don't see where it's coming from. I don't see it in your EKAT spot branch, but I could easily be mistaken. |
|
Re: Kokkos: Ok, I'm guessing what's happening is you have your local submodule pointed to the Kokkos repo with this commit, but, as you wrote, the submodule metadata points to E3SM-Project/Kokkos. Generally, we have to put some extra commits on top of a Kokkos version, which is why E3SM-Project/Kokkos exists. Thus, I suggest you start by creating a branch in E3SM-Project/Kokkos with the desired Kokkos state and then use this in the submodule. That will let testing proceed. |
|
i'll ask Conrad as he prob already has a branch |
|
i pointed the branch to an e3sm kokkos branch, but haven't had a chance to see if CI is happier. also, will do more testing tomorrow. |
|
Following up here, after Luca and I iterated, it looks like if I add |
…#6916) EAMxx: Kokkos 4.5 update, fixes and workarounds for Kokkos update, Aurora support * Update externals/ekat/extern/kokkos to 4.5. * Update EKAT itself to fix deprecated usage and add Aurora cmake files. * Fix and work around various issues on Frontier due to the Kokkos update. * Fix deprecated usage in Hommexx. [BFB]
|
With a test merge to next, this is the testing status:
Chrysalis test sample: |
|
Looks like you already merge to next? |
No, that's just a test run pushed to my fork: "added a commit to ambrad/E3SM". I used the integrator template; the ref number triggered the message above. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments about older cmake.
Also, please remove these older cmake files:
externals/ekat/cmake/machine-files/auroracpu.cmakeexternals/ekat/cmake/machine-files/sunspot-gen.cmakeexternals/ekat/cmake/machine-files/sunspotcpu.cmake
cime_config/machines/cmake_macros/oneapi-ifxgpu_sunspot-pvc.cmake
Outdated
Show resolved
Hide resolved
| <arg name="total_num_tasks">-np {{ total_tasks }} --label</arg> | ||
| <arg name="ranks_per_node">-ppn {{ tasks_per_node }}</arg> | ||
| <arg name="ranks_bind">-envall</arg> | ||
| <arg name="ranks_bind">--no-vni --cpu-bind $ENV{RANKS_BIND} -envall</arg> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think --no-vni is not needed anymore: can be removed.
Also, default is -genvall now: -envall can be removed.
|
|
||
| set(EKAT_MPIRUN_EXE "mpiexec" CACHE STRING "" FORCE) | ||
| set(EKAT_MPI_NP_FLAG "-np" CACHE STRING "" FORCE) | ||
| set(EKAT_MPI_EXTRA_ARGS "--label --cpu-bind depth -envall" CACHE STRING "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reproducibility, please replace --cpu-bind depth -envall with
--cpu-bind list:0-7,104-111:8-15,112-119:16-23,120-127:24-31,128-135:32-39,136-143:40-47,144-151:52-59,156-163:60-67,164-171:68-75,172-179:76-83,180-187:84-91,188-195:92-99,196-203
|
@oksanaguba see Az's comments above. No need to rush. Let me know when you're done. |
|
@amametjanov i addressed your comments except for the ekat variable that is actually not used in cime runs (I think). It can be sorted later. The rest is done, -evvall and -no-vni , and extra files are removed. Could you please re-review? |
|
Thanks for removing older cmake. |
tcclevenger
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All CI passing
🍏🍏🍏 |
|
Great work, everyone. I'll merge this to next once I get the go-ahead on the devops channel. |
EAMxx: Kokkos 4.5 update, fixes and workarounds for Kokkos update, Aurora support * Update externals/ekat/extern/kokkos to 4.5. * Update EKAT itself to fix deprecated usage and add Aurora cmake files. * Fix and work around various issues on Frontier due to the Kokkos update. * Fix deprecated usage in Hommexx. [BFB]
|
Reverted from next by means of resetting next. pm-cpu e3sm_developer_next_gnu and e3sm_integration_next_intel need investigation. Also gcp12 e3sm_integration_next_gnu. (Later: This was just a machine fluke.) https://my.cdash.org/index.php?project=E3SM&date=2025-03-19 I'll report back once I've taken a look at pm-cpu. Update: There's something peculiar in the pm-cpu GNU developer test suite. See the convo on infrastructure-devops. I'll return to this on Monday. Update: Another round of nightlies shows something mysterious is happening in the pm-cpu GNU developer test suite independently of this PR. I'll remerge this PR once the dashboard is OK again. |
|
Abhishek just posted that we need to start using in April |
EAMxx: Kokkos 4.5 update, fixes and workarounds for Kokkos update, Aurora support * Update externals/ekat/extern/kokkos to 4.5. * Update EKAT itself to fix deprecated usage and add Aurora cmake files. * Fix and work around various issues on Frontier due to the Kokkos update. * Fix deprecated usage in Hommexx. [BFB]
Uh oh!
There was an error while loading. Please reload this page.