Skip to content

Conversation

@oksanaguba
Copy link
Contributor

@oksanaguba oksanaguba commented Jan 19, 2025

  • Update externals/ekat/extern/kokkos to 4.5.
  • Update EKAT itself to fix deprecated usage and add Aurora cmake files.
  • Fix and work around various issues on Frontier due to the Kokkos update.
  • Fix deprecated usage in Hommexx.

@oksanaguba oksanaguba mentioned this pull request Jan 19, 2025
@oksanaguba
Copy link
Contributor Author

this commit is running on sunspot

~/ess/v1-jan19/run/atm.log.11365537.amn-0001.250119-204314 

@oksanaguba
Copy link
Contributor Author

oksanaguba commented Jan 19, 2025

i should have not committed to here turning off compose and mam, will remove once this PR is finalized.

@ambrad ambrad requested a review from bartgol January 19, 2025 21:29
@ambrad
Copy link
Member

ambrad commented Jan 19, 2025

Looks good. A couple of points:

  1. In addition to any usual pre-merge testing, I strongly recommend running e3sm_scream_v1_medres with this PR on Frontier, pm-gpu, and Chrysalis, as well as homme_integration on Chrysalis.
  2. I'm thinking the EKAT update isn't right. No, it's fine. I looked more closely. I'm seeing some commits from Luca that diff, but I suspect that's intended. Edit 2: Ok, maybe my initial analysis was right: there seems to be some unintended diffs in ekat.

if (Kokkos_ENABLE_SYCL)
#enable_language(SYCL)
set (EAMXX_ENABLE_GPU TRUE CACHE BOOL "" FORCE)
set (SYCL_BUILD TRUE CACHE BOOL "" FORCE) #needed for yakl if kokkos vars are not visible there?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could turn off COMPOSE here, I believe. (I'll fix the COMPOSE issues once I'm on the machine.)

set(BUILD_HOMME_PRIM OFF CACHE BOOL "")
set(HOMME_ENABLE_COMPOSE ON CACHE BOOL "")
#set(HOMME_ENABLE_COMPOSE ON CACHE BOOL "")
set(HOMME_ENABLE_COMPOSE OFF CACHE BOOL "")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noting here this is what needs to be removed from this PR.


#<<<<<<< HEAD
#find_library(NETCDF_C netcdf HINTS $ENV{NETCDF_C_PATH}/lib)
#target_link_libraries(scream_rrtmgp_yakl ${NETCDF_C} rrtmgp scream_share)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this change? The second line I'm guessing is due to Kokkos, but what about the NETCDF_C line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i need to ask Jim about this -- iirc i had to set io libs in general, but also some cmake variables for io for rrtmgp separately. so i left this file messy, because i still need to sort it out.

@oksanaguba
Copy link
Contributor Author

i id not understant EKAT comment -- i have an ekat branch, oksanaguba/spot, which i maintained only to point to a different kokkos. so i periodically merge master into it, but not too frequently, so it may be behind the ekat used in master, but its kokkos points not to e3sm kokkos, but to kokkos develop branch.

this is another issue to sort out (but i would not know how).

@ambrad
Copy link
Member

ambrad commented Jan 19, 2025

i id not understant EKAT comment

Re: EKAT, it appears there are missing commits on your branch w.r.t. the submodule point in E3SM: E3SM-Project/EKAT@oksanaguba/spot...4231383

@bartgol am I seeing that correctly? Also, is current EKAT master ready to be used in E3SM?

If so, Oksana, I recommend rebasing your ekat branch on master, then pointing the ekat submodule to that cleaned-up branch.

More generally, testing across machines will help to flush out issues, if there are any, including any with ekat.

@oksanaguba
Copy link
Contributor Author

there are some December commits in the diff you posted, yes, at least those are expected. my ekat branch was not updated too frequently. will do what you suggested, thanks.

@ambrad
Copy link
Member

ambrad commented Jan 19, 2025

As an example, here is one commit I'm either confused about or is truly missing from your branch: E3SM-Project/EKAT@5804bfa. This commit was merged in late Nov 2024, which is why I'm thinking there's a fair bit missing.

Edit: I think EKAT PR 347 was the last merged into your branch. PRs 349-356 are missing. (357-359 are in EKAT master but not used yet in E3SM, and 348 was closed without merging.)

Thus, what I wrote above is probably the correct solution: Rebase your branch on EKAT master.

Copy link
Contributor

@bartgol bartgol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I have a couple of questions/remarks, but nothing too important. I'm going to approve, modulo the fix of EKAT submodule that you are already working on.

string(APPEND CMAKE_CXX_FLAGS_RELEASE " -O2")
string(APPEND CMAKE_Fortran_FLAGS_DEBUG " -O0 -g -fpe0")

#adding -g here leads to linker internal errors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yet I see -g below. Is this comment outdated, and link is fine now? If so, we should remove it to avoid confusion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"just -g" is in debug flags only, and i haven't tried to build debug. so i only changed release flags.

i was able to compile and to run with "just -g" on sunspot, however, the folder size for e3sm build gets from say 4 gb to 140 gb with full -g. last i tried "just -g" on aurora, there were linking issues. so this is all WIP. i'd say keep this as is now and sort it later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant is that the comment is confusing: it makes it sound like we should not use -g b/c of link errors, but we do use it across the board. Maybe the comment needs to be clarified. Something like "WARNING: the -g flag seems to have a wild behavior sometimes, with side effects ranging from VERY large binaries to link errors. If you experience errors, remove it".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, done

#endif

#ifdef KOKKOS_ENABLE_SYCL
const int team_size = 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am puzzled by this. Do we really want just 4 on sycl? It's staggeringly different from other GPU platforms' default...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the change is from Nov 2023. i looked at my notes and did not see why it was needed. i will try to remove it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for the P-D remapper (not the GllFvPhys one), so its relevance is somewhat limited (doesn't run at runtime, usually). So not a huge deal. But if things work without, no point in limiting team size to 4.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

used hip TS calculation and it worked at least for ne4 run.

#find_library(NETCDF_C netcdf HINTS $ENV{NETCDF_C_PATH}/lib)
#target_link_libraries(scream_rrtmgp_yakl ${NETCDF_C} rrtmgp scream_share)
#=======
find_library(NETCDF_C netcdf HINTS ${NetCDF_C_PATH}/lib)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While at it, do you mind adding ${NetCDF_C_PATH}/lib64 to the hints? Since lib64 is quite common, we should be proactive and add it to the list of hints (e..g, my laptop and workstation use lib64)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jgfouca somewhere here maybe:

CMake Warning at eamxx/src/physics/rrtmgp/CMakeLists.txt:191 (target_link_libraries):
  Target "eamxx_rrtmgp_interface" requests linking to directory
  "/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002".
  Targets may link only to libraries.  CMake is dropping the item.


CMake Warning at eamxx/src/physics/rrtmgp/CMakeLists.txt:191 (target_link_libraries):
  Target "scream_rrtmgp" requests linking to directory
  "/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002".
  Targets may link only to libraries.  CMake is dropping the item.


CMake Warning at eamxx/src/physics/rrtmgp/CMakeLists.txt:191 (target_link_libraries):
  Target "atm" requests linking to directory
  "/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002".
  Targets may link only to libraries.  CMake is dropping the item.


CMake Warning at eamxx/src/physics/rrtmgp/CMakeLists.txt:191 (target_link_libraries):
  Target "e3sm.exe" requests linking to directory
  "/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002".
  Targets may link only to libraries.  CMake is dropping the item.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code looks wrong. If we are building as part of a E3SM/CIME case, there's no need to find netcdf, it should already be a target. We can check if the netcdf target already exists or check if SCREAM_CIME_BUILD is on and not do this find if either is true.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, this is issue is not related specifically to this PR.

Copy link
Contributor

@mahf708 mahf708 Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @amametjanov (machine poc) as I haven't seen this problem on other machines, so maybe something specific to aurora? Not sure.

Copy link
Contributor

@mahf708 mahf708 Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we on ly link to netcdf here because of the unit tests, right? we don't need this stuff at actual runtime. Maybe I am wrong, I don't know

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, but the source won't build without netcdf. We only need to do this find if we are standalone eamxx.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jgfouca IMO, it's fine to call find_package (and the likes) regradless of the value of SCREAM_CIME_BUILD... When the pkg/lib/path has alreday been found, it should be a no-op.

That said, the warning reported by Naser is indeed wrong, as it's likning a path instead of a tgt/lib...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it fine to mix find_package and find_library?

@ambrad
Copy link
Member

ambrad commented Jan 20, 2025

Looks like you need to merge ekat master into your branch one more time to get the macro name fix.

@oksanaguba
Copy link
Contributor Author

ran this commit with kokkos 4.5.01 tag tail ~/ess/v1-jan21/run/atm.log.11682723.amn-0001.250121-164543 on sunspot.

Copy link
Member

@ambrad ambrad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great.

Note that SCREAM nightlies run on master, not next, so I always run e3sm_scream_v1_medres on Frontier, Chrysalis, and pm-gpu before merging a complex PR to avoid having to fix master later. I recommend it to others, as well.

@ambrad ambrad requested a review from tcclevenger January 21, 2025 18:51
@ambrad
Copy link
Member

ambrad commented Jan 21, 2025

I added Conrad to the reviewers since he's working on upgrading Kokkos for multiple reasons, including Aurora readiness.

@ambrad
Copy link
Member

ambrad commented Jan 21, 2025

CI is failing with a checkout on the ekat submodule.

  Error: fatal: remote error: upload-pack: not our ref 1eae1ad5a01cdb510726624ac647e7f96dd883d0
  Error: fatal: Fetched in submodule path 'externals/ekat', but it did not contain 1eae1ad5a01cdb510726624ac647e7f96dd883d0. Direct fetching of that commit failed.

Looking at the branch, https://github.com/E3SM-Project/EKAT/commits/oksanaguba/spot/, it's unclear where 1eae1a is coming from. @oksanaguba I suggest checking the submodule state in this PR w.r.t. to your ekat branch.

@oksanaguba
Copy link
Contributor Author

@ambrad my ekat branch had to use kokkos main repo. if CI somehow does not know to add a remote (the default remote in ekat is poinitng to e3sm kokkos (whatever it is)), then it would not be able to fetch my commit.

@ambrad
Copy link
Member

ambrad commented Jan 21, 2025

my ekat branch had to use kokkos main repo.

I'm sure I'm confused, but on your ekat branch I see this as the Kokkos submodule: https://github.com/E3SM-Project/kokkos/tree/e6e5c4598d16d756db62225ab7f937ee833bd660. This appears to be in E3SM-Project/kokkos.

@ambrad
Copy link
Member

ambrad commented Jan 21, 2025

Re: commit 1eae1a, I don't see where it's coming from. I don't see it in your EKAT spot branch, but I could easily be mistaken.

@ambrad
Copy link
Member

ambrad commented Jan 21, 2025

Re: Kokkos: Ok, I'm guessing what's happening is you have your local submodule pointed to the Kokkos repo with this commit, but, as you wrote, the submodule metadata points to E3SM-Project/Kokkos. Generally, we have to put some extra commits on top of a Kokkos version, which is why E3SM-Project/Kokkos exists. Thus, I suggest you start by creating a branch in E3SM-Project/Kokkos with the desired Kokkos state and then use this in the submodule. That will let testing proceed.

@oksanaguba
Copy link
Contributor Author

i'll ask Conrad as he prob already has a branch

@oksanaguba
Copy link
Contributor Author

i pointed the branch to an e3sm kokkos branch, but haven't had a chance to see if CI is happier. also, will do more testing tomorrow.

@ndkeen
Copy link
Contributor

ndkeen commented Mar 13, 2025

Following up here, after Luca and I iterated, it looks like if I add -DKokkos_ARCH_AMD_GFX942_APU=OFF to the Kokkos options, I can get scream to build on tuolumne with this branch. So something must have changed between the version of kokkos we have now (4.2?) and 4.5, such that this was needed (and/or the code we use to interface with kokkos).

@rljacob rljacob marked this pull request as ready for review March 13, 2025 17:53
ambrad added a commit to ambrad/E3SM that referenced this pull request Mar 13, 2025
…#6916)

EAMxx: Kokkos 4.5 update, fixes and workarounds for Kokkos update, Aurora support

* Update externals/ekat/extern/kokkos to 4.5.
* Update EKAT itself to fix deprecated usage and add Aurora cmake files.
* Fix and work around various issues on Frontier due to the Kokkos update.
* Fix deprecated usage in Hommexx.

[BFB]
@ambrad
Copy link
Member

ambrad commented Mar 13, 2025

With a test merge to next, this is the testing status:

  • PM-GPU: e3sm_eamxx_v1_medres passes against baselines.
  • Frontier: e3sm_eamxx_v1_medres passes.
  • Chrysalis: A set of 5 tests, listed below, passes against baselines.
  • Update: Chrysalis e3sm_integration passes against baselines, except for the known MMF test failure.

Chrysalis test sample:

  ERS_Ld3.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.chrysalis_intel.allactive-nlmaps
  ERS_Ln22.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_intel.eamxx-L128--eamxx-output-preset-4
  SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.chrysalis_intel.allactive-wcprod_1850
  HOMMEBFB_P24.f19_g16_rx1.A.chrysalis_intel
  HOMME_P24.f19_g16_rx1.A.chrysalis_intel

@bartgol
Copy link
Contributor

bartgol commented Mar 13, 2025

Looks like you already merge to next?

@ambrad
Copy link
Member

ambrad commented Mar 13, 2025

Looks like you already merge to next?

No, that's just a test run pushed to my fork: "added a commit to ambrad/E3SM". I used the integrator template; the ref number triggered the message above.

Copy link
Member

@amametjanov amametjanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments about older cmake.
Also, please remove these older cmake files:

  • externals/ekat/cmake/machine-files/auroracpu.cmake
  • externals/ekat/cmake/machine-files/sunspot-gen.cmake
  • externals/ekat/cmake/machine-files/sunspotcpu.cmake

<arg name="total_num_tasks">-np {{ total_tasks }} --label</arg>
<arg name="ranks_per_node">-ppn {{ tasks_per_node }}</arg>
<arg name="ranks_bind">-envall</arg>
<arg name="ranks_bind">--no-vni --cpu-bind $ENV{RANKS_BIND} -envall</arg>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think --no-vni is not needed anymore: can be removed.
Also, default is -genvall now: -envall can be removed.


set(EKAT_MPIRUN_EXE "mpiexec" CACHE STRING "" FORCE)
set(EKAT_MPI_NP_FLAG "-np" CACHE STRING "" FORCE)
set(EKAT_MPI_EXTRA_ARGS "--label --cpu-bind depth -envall" CACHE STRING "")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reproducibility, please replace --cpu-bind depth -envall with

--cpu-bind list:0-7,104-111:8-15,112-119:16-23,120-127:24-31,128-135:32-39,136-143:40-47,144-151:52-59,156-163:60-67,164-171:68-75,172-179:76-83,180-187:84-91,188-195:92-99,196-203

@ambrad
Copy link
Member

ambrad commented Mar 14, 2025

@oksanaguba see Az's comments above. No need to rush. Let me know when you're done.

@oksanaguba
Copy link
Contributor Author

@amametjanov i addressed your comments except for the ekat variable that is actually not used in cime runs (I think). It can be sorted later. The rest is done, -evvall and -no-vni , and extra files are removed. Could you please re-review?

@amametjanov
Copy link
Member

Thanks for removing older cmake.
This is breaking CPU builds: PR can go in and I'll fix CPU build and run errors later.

tcclevenger added a commit that referenced this pull request Mar 18, 2025
… next (PR #7143)

This was a bug where I was passing around StridedLayout views to functions expecting
LayoutLeft. Only caught in the PM-GPU nightlies and by PR #6916, so I will ensure those
pass before merging.

[BFB]
tcclevenger added a commit that referenced this pull request Mar 18, 2025
…7143)

This was a bug where I was passing around StridedLayout views to functions expecting
LayoutLeft. Only caught in the PM-GPU nightlies and by PR #6916, so I will ensure those
pass before merging.

 [BFB]
@mahf708 mahf708 marked this pull request as draft March 18, 2025 19:04
@mahf708 mahf708 marked this pull request as ready for review March 18, 2025 19:04
Copy link
Contributor

@tcclevenger tcclevenger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All CI passing

@mahf708
Copy link
Contributor

mahf708 commented Mar 18, 2025

All CI passing

🍏🍏🍏

@ambrad
Copy link
Member

ambrad commented Mar 18, 2025

Great work, everyone. I'll merge this to next once I get the go-ahead on the devops channel.

ambrad added a commit that referenced this pull request Mar 18, 2025
EAMxx: Kokkos 4.5 update, fixes and workarounds for Kokkos update, Aurora support

* Update externals/ekat/extern/kokkos to 4.5.
* Update EKAT itself to fix deprecated usage and add Aurora cmake files.
* Fix and work around various issues on Frontier due to the Kokkos update.
* Fix deprecated usage in Hommexx.

[BFB]
@ambrad
Copy link
Member

ambrad commented Mar 19, 2025

Reverted from next by means of resetting next.

pm-cpu e3sm_developer_next_gnu and e3sm_integration_next_intel need investigation.

Also gcp12 e3sm_integration_next_gnu. (Later: This was just a machine fluke.)

https://my.cdash.org/index.php?project=E3SM&date=2025-03-19

I'll report back once I've taken a look at pm-cpu.

Update: There's something peculiar in the pm-cpu GNU developer test suite. See the convo on infrastructure-devops. I'll return to this on Monday.

Update: Another round of nightlies shows something mysterious is happening in the pm-cpu GNU developer test suite independently of this PR. I'll remerge this PR once the dashboard is OK again.

@oksanaguba
Copy link
Contributor Author

Abhishek just posted that we need to start using in April

--cpu-bind list:1-8,105-112:9-16,113-120:17-24,121-128:25-32,129-136:33-40,137-144:41-48,145-152:53-60,157-164:61-68,165-172:69-76,173-180:77-84,181-188:85-92,189-196:93-100,197-204

ambrad added a commit that referenced this pull request Mar 24, 2025
EAMxx: Kokkos 4.5 update, fixes and workarounds for Kokkos update, Aurora support

* Update externals/ekat/extern/kokkos to 4.5.
* Update EKAT itself to fix deprecated usage and add Aurora cmake files.
* Fix and work around various issues on Frontier due to the Kokkos update.
* Fix deprecated usage in Hommexx.

[BFB]
@ambrad ambrad merged commit d51ee75 into master Mar 26, 2025
37 of 49 checks passed
@ambrad ambrad deleted the oksanaguba/eamxx/spot-clean branch March 26, 2025 19:05
@amametjanov amametjanov added the Aurora ALCF machine Aurora label May 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Aurora ALCF machine Aurora CI: workflow change approved Allow gh action PR testing on ghci-snl-* machines for PRs that alter a worfklow file EAMxx C++ based E3SM atmosphere model (aka SCREAM) Machine Files

Projects

None yet

Development

Successfully merging this pull request may close these issues.