ZM Bridge - Enable Running on GPUs #7791

whannah1 · 2025-10-13T17:28:17Z

The motivation for these changes was to enable GPU tests with EAMxx running the bridged ZM. However, the major clean-up of the ZM subroutine interfaces that facilitated this also led to more sprawling changes. In particular, many changes to the ZM microphysics routines were helpful in decoupling this capability from the primary ZM routines.

Other notable changes

1. Remove support for the Hack convective adjustment scheme (not used since CAM3).
2. Fix SHOC set_grids() to prevent the variable phis from using packs.
3. A non-BFB change was introduced via the loop structure of zm_microphysics_history_convert() that corrects a previous issue in which two variable modifications could occur in the wrong order depending on conditions. This situation does not seem to occur in our normal testing (i.e. atm_developer), but a longer 1-month test on the ne4pg2 grid with monthly output was able to show an impact. The non-BFB change only affects a few history output variables associated with the ZM microphysics, so the simulation itself is still BFB. The change can be easily reverted by fusing the two k loops in the aforementioned subroutine.

[BFB] (sort of... see # 3 above)

jgfouca · 2025-10-14T16:13:47Z

@whannah1 , I'm not qualified to review ZM f90, so I will pass on reviewing unless there is something specific you want me to look at.

whannah1 · 2025-10-14T16:26:14Z

> @whannah1 , I'm not qualified to review ZM f90, so I will pass on reviewing unless there is something specific you want me to look at.

I still need you to review the C++ side changes - which aren't done yet. I'll ping you when that part is ready.

components/eamxx/src/physics/shoc/eamxx_shoc_process_interface.cpp

components/eamxx/src/physics/zm/eamxx_zm_process_interface.cpp

mahf708 · 2025-10-28T04:41:11Z

This is great! I'm approving based on a quick skim of the C++ code. I added both Conrad and Luca for review as well.

components/eamxx/src/physics/zm/eamxx_zm_process_interface.cpp

tcclevenger · 2025-10-28T15:16:29Z

components/eamxx/src/physics/zm/eamxx_zm_process_interface.cpp

+    Kokkos::parallel_for("zm_update_precip",KT::RangePolicy(0, m_ncol*nlev_mid_packs), KOKKOS_LAMBDA (const int idx) {
+      const int i = idx/nlev_mid_packs;
+      const int k = idx%nlev_mid_packs;
+      T_mid(i,k) += loc_zm_output_tend_t (i,k) * dt;
+      qv   (i,k) += loc_zm_output_tend_qv(i,k) * dt;
+      uwind(i,k) += loc_zm_output_tend_u (i,k) * dt;
+      vwind(i,k) += loc_zm_output_tend_v (i,k) * dt;


Why not nested team policy here? Not a concern either way since it's in initialization, just curious.

More than a team policy I would use an MDRange here.

Jim suggested the current approach.

I find MDRange much cleaner and easier to understand. I know Jim found some cases in rrtmgp where MDRange was slower than a "manual mdrange" (the stuff you are doing here), but the Kokkos devs claim this should not happen, so I think it must be some sort of peculiar rrtmgp case. Here, I would vote for clarity over the few microseconds we may gain in perf...

I agree with Luca: it's during initialization, so it's far from performance-critical enough to rule out that switch.

FWIW the block in question is in run_impl

Oh, well still it shouldn't be significant enough to become an issue.
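For readers following the thread, the difference between the "manual mdrange" index arithmetic and an explicit two-index loop can be sketched in plain C++ without Kokkos. This is only an illustration: `Field2D`, `apply_tend_flat`, and `apply_tend_nested` are hypothetical names, not code from this PR, and the serial loops stand in for the parallel dispatch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 2D view: row-major storage of ncol x nlev values.
struct Field2D {
  int ncol, nlev;
  std::vector<double> data;
  Field2D(int ncol_, int nlev_, double init = 0.0)
    : ncol(ncol_), nlev(nlev_),
      data(static_cast<std::size_t>(ncol_) * nlev_, init) {}
  double& operator()(int i, int k) {
    assert(i >= 0 && i < ncol && k >= 0 && k < nlev);
    return data[static_cast<std::size_t>(i) * nlev + k];
  }
  double operator()(int i, int k) const {
    assert(i >= 0 && i < ncol && k >= 0 && k < nlev);
    return data[static_cast<std::size_t>(i) * nlev + k];
  }
};

// "Manual mdrange": one flat loop over ncol*nlev, recovering (i,k)
// from the flat index by integer division and modulo.
void apply_tend_flat(Field2D& state, const Field2D& tend, double dt) {
  const int n = state.ncol * state.nlev;
  for (int idx = 0; idx < n; ++idx) {
    const int i = idx / state.nlev;
    const int k = idx % state.nlev;
    state(i, k) += tend(i, k) * dt;
  }
}

// Explicit two-index loop: the loop body receives (i,k) directly,
// which is the shape an MDRange-style policy presents to the lambda.
void apply_tend_nested(Field2D& state, const Field2D& tend, double dt) {
  for (int i = 0; i < state.ncol; ++i) {
    for (int k = 0; k < state.nlev; ++k) {
      state(i, k) += tend(i, k) * dt;
    }
  }
}
```

Both functions visit the same (i,k) pairs and produce identical results; the second simply keeps the division and modulo out of the loop body, which is the readability argument made above.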

components/eamxx/src/physics/zm/eamxx_zm_process_interface.cpp

bartgol

Personally, I would consider switching to MDRange where possible, as it hides the index arithmetic and makes the code easier to read. I would also convert some of the (somewhat) syntax-heavy scratch view initialization loops to range-based for loops over initializer lists. But it's not for correctness or speed, so it boils down to preference.
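As a rough illustration of the "range-based for over an initializer list" idea, here is a minimal C++ sketch. The `Scratch` struct and its members are hypothetical placeholders using `std::vector`, not the actual scratch views in the ZM interface.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <vector>

// Hypothetical scratch buffers standing in for the scratch views.
struct Scratch {
  std::vector<double> tend_t, tend_qv, tend_u, tend_v;
};

// Size and zero every buffer with one loop instead of one statement per buffer.
void allocate_and_zero(Scratch& s, std::size_t n) {
  // The braced list of pointers is deduced as
  // std::initializer_list<std::vector<double>*>.
  for (auto* buf : {&s.tend_t, &s.tend_qv, &s.tend_u, &s.tend_v}) {
    buf->assign(n, 0.0);
    assert(buf->size() == n);
  }
}
```

Adding a new buffer then means adding one name to the list rather than another copy of the initialization statement.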


* whannah/eam/zm-bridge-02: (27 commits)
  add constexpr to fix build error
  bug fix for run-time issue in EAM
  fixes to restor BFB for EAM tests e
  remove team_policy
  move call for zm_microphysics_history_convert
  bug fix
  updates from PR review
  unod packed type for phis in SHOC
  add temporary explicit transpose/copy method for ZM bridge
  major updates for GPU support
  interim update to facilitate rebase
  update ZM bridge to output temperature tendency
  remove GPU clause for building zm
  enable host mirroring of ZM variables
  zm bridge - fix ol_snow and output initialization
  remove pcols from ZM fortran bridge
  move MCSP output to zm_conv_mcsp_hist
  move aero/micro to end of arg list
  move mudpcu and lambdadpcu to microp_st
  move frz argument to microp_st
  ...

singhbalwinder · 2025-11-04T00:51:46Z

on next

whannah1 · 2025-11-04T15:21:55Z

A curious run-time failure cropped up after merging to next that was strangely not caught by my testing (which I felt was very thorough!). Part of the complication is that the failure mode was not one single error. Instead, there were a few different errors across MPI ranks. Luckily, the failures were robustly repeatable, and all of these failures turned out to be related to the same root cause.

The root issue was that I switched the declared size of some variables to be (ncol,pver) instead of (pcols,pver). These variables were allocated only for ZM, and since I had removed all module level variables using pcols it seemed wasteful to always be allocating based on pcols for the MPI ranks that would never need to utilize that much memory.

However, I failed to realize that when these variables were passed down to the main ZM microphysics routine (zm_mphy()) the declared size was still pcols rather than ncol. I had intentionally tried not to touch this routine, but this unintentional inconsistency explained the strange failure symptoms. I switched everything back to using pcols to solve it. Since pcols is now passed in to ZM as an arbitrary declared size, this was a simple fix. The alternative of making everything use ncol was doable, but would have required more extensive changes, so I opted for simplicity. EAMxx can still use the rank-local column number (i.e. ncol) instead of a max across all ranks because it will not be calling zm_mphy().
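The kind of mismatch described here can be sketched with a little offset arithmetic. The following C++ is only an illustration of Fortran-style column-major addressing, assuming the caller allocates (ncol,pver) while the callee declares (pcols,pver); the function names and the sample sizes are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Zero-based column-major (Fortran-style) offset of element (i,k) in an
// array whose declared leading dimension is lead_dim.
inline std::size_t offset(std::size_t i, std::size_t k, std::size_t lead_dim) {
  assert(lead_dim > 0 && i < lead_dim);
  return i + k * lead_dim;
}

// True when caller and callee agree on where element (i,k) lives in memory.
inline bool same_element(std::size_t i, std::size_t k,
                         std::size_t caller_dim, std::size_t callee_dim) {
  return offset(i, k, caller_dim) == offset(i, k, callee_dim);
}

// True when the callee's idea of (i,k) still falls inside the caller's
// allocation of caller_dim * pver elements.
inline bool in_bounds(std::size_t i, std::size_t k, std::size_t caller_dim,
                      std::size_t callee_dim, std::size_t pver) {
  return offset(i, k, callee_dim) < caller_dim * pver;
}
```

With illustrative sizes ncol = 3 and pcols = 4, the two sides agree only on the first level (k = 0): `same_element(1, 1, 3, 4)` is false because the caller places that element at offset 4 while the callee reads offset 5, and `in_bounds(2, 1, 3, 4, 2)` is false because the callee's offset 6 lies past the 6-element allocation. Which ranks hit which symptom would depend on their local ncol, which is consistent with the different errors seen across MPI ranks.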

Revert 7791 off of next. This reverts commit beaf411, reversing changes made to 10a4263.

whannah1 · 2025-11-06T17:21:20Z

The failing CI tests appear to just be due to the new namelist variables from PR #7797 - although I can't find an explanation of the NLFAIL in the CI logs.

If that's the case then I think we're good to merge this again!

I reran various tests on both NERSC and LCRC and everything passes as expected.

ambrad · 2025-11-07T06:46:24Z

> The failing CI tests appear to just be due to the new namelist variables from PR #7797 - although I can't find an explanation of the NLFAIL in the CI logs.

You can find the logs like this: click on the test, then "Upload log files" in the sequence of steps, then on the artifact download URL at the end of that section.

It looks like the NLFAIL was due to my PR. But I thought I had run the bless process correctly Wed night after Luca explained how here: #7807 (comment). @bartgol is there a way to tell if I did it incorrectly? I do see my bless run with a green check here: https://github.com/E3SM-Project/E3SM/actions (workflow https://github.com/E3SM-Project/E3SM/actions/runs/19124007410/workflow).

DJFJJA

zm_conv_types.F90
Line 75: Could you change
logical :: old_snow = .true. ! switch to revert snow production in zm_conv_evap (i.e. before zm_micro additions)
to
logical :: old_snow = .true. ! switch to calculate snow production in zm_conv_evap (i.e. using the old treatment before zm_micro was implemented)?

DJFJJA · 2025-10-31T15:21:58Z

components/eamxx/src/physics/zm/fortran_bridge/zm_eamxx_bridge_main.F90

Line 138: "md" should be "cloud downdraft mass flux".
Line 140: "eu" should be "entrainment in updraft".
Lines 201-214: Should we assign real values to real arrays?

@DJFJJA it's much more helpful if you make these comments directly on the lines instead of referencing them like this.

DJFJJA

zm_conv.F90
Line 891:
real(r8), intent(in ) :: prdprec(pcols,pver)! precipitation production (kg/ks/s)
Change “kg/ks/s” to “kg/kg/s”

whannah1 · 2025-11-07T14:58:08Z

@DJFJJA I made the changes you suggested - but for future code review please make these types of comments directly on the lines they refer to.

bartgol · 2025-11-07T15:00:30Z

> It looks like the NLFAIL was due to my PR. But I thought I had run the bless process correctly Wed night after Luca explained how here: #7807 (comment). @bartgol is there a way to tell if I did it incorrectly? I do see my bless run with a green check here: E3SM-Project/E3SM/actions (workflow E3SM-Project/E3SM/actions/runs/19124007410/workflow).

I think you ran it correctly. The log clearly shows the cime baseline generation command (the -g is there):

./cime/scripts/create_test ERS_Ln9.ne4_ne4.F2000-SCREAMv1-AQP1.ghci-snl-cpu_gnu.eamxx-output-preset-2--eamxx-L72 -o -g -b master --wait

Hopping on mappy, and checking bless_log in the baseline folder clearly shows the correct time (which I think is UTC):

$ cat bless_log
sha:0f29546abee521717d474244b17655d289d293c8 date:2025-11-06_03:54:38

So I think your bless went through.

whannah1 requested review from DJFJJA, crterai and jgfouca October 13, 2025 17:28

whannah1 assigned singhbalwinder Oct 13, 2025

whannah1 added the Atmosphere, BFB, EAMxx, EAM, and ZM labels Oct 13, 2025

whannah1 marked this pull request as draft October 13, 2025 17:28

whannah1 changed the title from "ZM Bridge - Enable GPU runs" to "ZM Bridge - Enable Running on GPUs" Oct 13, 2025

whannah1 force-pushed the whannah/eam/zm-bridge-02 branch from c69e4e5 to 7619ffc October 22, 2025 16:20

whannah1 mentioned this pull request Oct 23, 2025

Frontogenesis function pressure gradient correction #7797

Merged

whannah1 force-pushed the whannah/eam/zm-bridge-02 branch 2 times, most recently from 9bc145a to ac1549a October 27, 2025 22:07

mahf708 added the CI: approved label Oct 27, 2025

mahf708 marked this pull request as ready for review October 27, 2025 22:19

mahf708 requested review from bartgol and tcclevenger October 28, 2025 04:30

mahf708 reviewed Oct 28, 2025

components/eamxx/src/physics/shoc/eamxx_shoc_process_interface.cpp

mahf708 reviewed Oct 28, 2025

components/eamxx/src/physics/zm/eamxx_zm_process_interface.cpp

mahf708 reviewed Oct 28, 2025

components/eamxx/src/physics/zm/eamxx_zm_process_interface.cpp

mahf708 approved these changes Oct 28, 2025

tcclevenger reviewed Oct 28, 2025

bogensch approved these changes Oct 28, 2025

bartgol reviewed Oct 28, 2025

components/eamxx/src/physics/zm/eamxx_zm_process_interface.cpp (outdated)

bartgol approved these changes Oct 28, 2025

whannah1 added 7 commits November 3, 2025 17:04

updates from PR review

a91d55b

bug fix

f5be2bc

move call for zm_microphysics_history_convert

7e04eb8

remove team_policy

404f99a

fixes to restor BFB for EAM tests

33f3720

e

bug fix for run-time issue in EAM

ddc6db5

add constexpr to fix build error

0ac1476

whannah1 force-pushed the whannah/eam/zm-bridge-02 branch from b852361 to 0ac1476 November 3, 2025 23:47

mahf708 added and removed the CI: approved label Nov 3, 2025

mahf708 marked this pull request as draft November 3, 2025 23:49

mahf708 marked this pull request as ready for review November 3, 2025 23:49

ambrad mentioned this pull request Nov 4, 2025

Homme(xx)/SL: Finish C++/Kokkos for ETM; modify vertical discretization. #7807

Merged

rljacob added a commit that referenced this pull request Nov 5, 2025

Revert "Merge branch 'whannah/eam/zm-bridge-02' into next (PR #7791)"

218992b

Revert 7791 off of next. This reverts commit beaf411, reversing changes made to 10a4263.

bug fix to address diffs in ne30 tests

1dc8c98

mahf708 added and removed the CI: approved label Nov 5, 2025

DJFJJA reviewed Nov 7, 2025


whannah1 added 3 commits November 7, 2025 06:52

fix variable descriptions

f6527c1

fix variable descriptions

ef87994

fix units

a9f4f79

bartgol approved these changes Nov 7, 2025


11 participants
