@whannah1
Contributor

@whannah1 whannah1 commented Oct 13, 2025

The motivation for these changes was to enable GPU tests with EAMxx running the bridged ZM. However, the major clean-up of the ZM subroutine interfaces that facilitated this also led to more sprawling changes. In particular, many changes to the ZM microphysics routines were helpful in decoupling this capability from the primary ZM routines.

Other notable changes

  1. remove support for the Hack convective adjustment scheme (not used since CAM3)
  2. fix SHOC set_grids() to prevent the variable phis from using packs
  3. A non-BFB change was introduced via the loop structure of zm_microphysics_history_convert() that corrects a previous issue in which two variable modifications could occur in the wrong order depending on conditions. This situation does not seem to occur in our normal testing (i.e. atm_developer) - but a longer 1-month test on the ne4pg2 grid with monthly output was able to show an impact. The non-BFB change only affects a few history output variables associated with the ZM microphysics, so the simulation itself is still BFB. The change can be easily reverted by fusing the two k loops in the aforementioned subroutine.
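
The routine itself is not shown in this thread, so the following is only a hypothetical sketch of how loop structure alone can change the order in which two modifications are applied. "Step A" and "step B" are invented for illustration and are not the actual operations in zm_microphysics_history_convert().

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical step A: double the value at level k.
// Hypothetical step B: add the value from level k+1 to level k.

// Fused form: both steps run inside one k loop, so step B at level k
// reads level k+1 before step A has been applied there.
inline std::vector<double> convert_fused(std::vector<double> x) {
  const std::size_t n = x.size();
  for (std::size_t k = 0; k < n; ++k) {
    x[k] *= 2.0;                      // step A
    if (k + 1 < n) x[k] += x[k + 1];  // step B
  }
  return x;
}

// Split form: step A finishes for every level before step B starts.
inline std::vector<double> convert_split(std::vector<double> x) {
  const std::size_t n = x.size();
  for (std::size_t k = 0; k < n; ++k) x[k] *= 2.0;           // step A
  for (std::size_t k = 0; k + 1 < n; ++k) x[k] += x[k + 1];  // step B
  return x;
}
```

For the same input the two forms can return different values, which is the sense in which a loop restructuring can be non-BFB for the variables it touches while leaving everything else unchanged.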

[BFB] (sort of... see # 3 above)

@whannah1 whannah1 added Atmosphere BFB PR leaves answers BFB EAMxx C++ based E3SM atmosphere model (aka SCREAM) EAM Fortran-based E3SM Atmosphere Model ZM labels Oct 13, 2025
@whannah1 whannah1 marked this pull request as draft October 13, 2025 17:28
@whannah1 whannah1 changed the title ZM Bridge - Enable GPU runs ZM Bridge - Enable Running on GPUs Oct 13, 2025
@jgfouca
Member

jgfouca commented Oct 14, 2025

@whannah1 , I'm not qualified to review ZM f90, so I will pass on reviewing unless there is something specific you want me to look at.

@whannah1
Contributor Author

@whannah1 , I'm not qualified to review ZM f90, so I will pass on reviewing unless there is something specific you want me to look at.

I still need you to review the C++ side changes - which aren't done yet. I'll ping you when that part is ready.

@whannah1 whannah1 force-pushed the whannah/eam/zm-bridge-02 branch from c69e4e5 to 7619ffc Compare October 22, 2025 16:20
@whannah1 whannah1 force-pushed the whannah/eam/zm-bridge-02 branch 2 times, most recently from 9bc145a to ac1549a Compare October 27, 2025 22:07
@mahf708 mahf708 added the CI: approved Allow gh actions PR testing on ghci-snl-* machines label Oct 27, 2025
@mahf708 mahf708 marked this pull request as ready for review October 27, 2025 22:19
@mahf708
Contributor

mahf708 commented Oct 28, 2025

This is great! I'm approving based on a quick skim of the C++ code. I also added both Conrad and Luca for review.

Comment on lines 267 to 273
Kokkos::parallel_for("zm_update_precip",KT::RangePolicy(0, m_ncol*nlev_mid_packs), KOKKOS_LAMBDA (const int idx) {
const int i = idx/nlev_mid_packs;
const int k = idx%nlev_mid_packs;
T_mid(i,k) += loc_zm_output_tend_t (i,k) * dt;
qv (i,k) += loc_zm_output_tend_qv(i,k) * dt;
uwind(i,k) += loc_zm_output_tend_u (i,k) * dt;
vwind(i,k) += loc_zm_output_tend_v (i,k) * dt;
Contributor

Why not nested team policy here? Not a concern either way since it's in initialization, just curious.

Contributor

More than a team policy I would use an MDRange here.

Contributor Author

Jim suggested the current approach.

Contributor

I find MDRange much cleaner and easier to understand. I know Jim found some cases in rrtmgp where MDRange was slower than a "manual mdrange" (the stuff you are doing here), but the Kokkos devs claim this should not happen, so I think it must be some sort of peculiar rrtmgp case. Here, I would vote for clarity over the few microseconds we may gain in perf...

Contributor

I agree with Luca. It's during initialization, so it is far from performance-critical, and making that switch is fine.

Contributor Author

@whannah1 whannah1 Oct 29, 2025

FWIW the block in question is in run_impl

Contributor

Oh, well still it shouldn't be significant enough to become an issue.
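
For readers following the thread above, here is a minimal serial sketch of the two indexing styles being compared. It uses plain `std::vector` instead of Kokkos views, and the function names are hypothetical. In Kokkos, the nested form would typically be expressed with `Kokkos::MDRangePolicy<Kokkos::Rank<2>>` and a lambda taking `(i, k)`, while the flat form corresponds to the `RangePolicy` kernel quoted in the diff.

```cpp
#include <cassert>
#include <vector>

// "Manual" 2-D range: a single flat index is split into (i, k) by hand,
// mirroring the idx / nlev and idx % nlev arithmetic in the quoted kernel.
inline void add_tendency_flat(std::vector<double>& field,
                              const std::vector<double>& tend,
                              int ncol, int nlev, double dt) {
  for (int idx = 0; idx < ncol * nlev; ++idx) {
    const int i = idx / nlev;  // column index
    const int k = idx % nlev;  // level index
    field[i * nlev + k] += tend[i * nlev + k] * dt;
  }
}

// Nested 2-D range: the (i, k) pair is provided directly by the loop
// structure, so no index arithmetic appears in the body.
inline void add_tendency_nested(std::vector<double>& field,
                                const std::vector<double>& tend,
                                int ncol, int nlev, double dt) {
  for (int i = 0; i < ncol; ++i) {
    for (int k = 0; k < nlev; ++k) {
      field[i * nlev + k] += tend[i * nlev + k] * dt;
    }
  }
}
```

Both functions visit every (i, k) pair exactly once and produce the same result; the difference is only in how much index bookkeeping the reader has to follow.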

Contributor

@bartgol bartgol left a comment

Personally, I would consider switching to MDRange where possible, as it hides the index arithmetic and makes the code easier to read. I would also convert some of the (somewhat) syntax-heavy scratch view initialization loops to range-based for loops over initializer lists. But it's not for correctness or speed, so it boils down to preference.
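
As a rough illustration of the second suggestion, the sketch below loops over an initializer list of pointers to reset several buffers at once. `Scratch` and `fill_all` are hypothetical names, and `std::vector` stands in for the scratch views used in the actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <initializer_list>
#include <vector>

using Scratch = std::vector<double>;  // hypothetical stand-in for a scratch view

// Fill every listed scratch buffer with the same value in a single
// range-based for loop over the initializer list.
inline void fill_all(std::initializer_list<Scratch*> bufs, double value) {
  for (Scratch* b : bufs) {
    std::fill(b->begin(), b->end(), value);
  }
}
```

A call such as `fill_all({&tend_t, &tend_qv, &tend_u}, 0.0);` (with hypothetical buffer names) would then replace three separate initialization loops.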

@whannah1 whannah1 force-pushed the whannah/eam/zm-bridge-02 branch from b852361 to 0ac1476 Compare November 3, 2025 23:47
@mahf708 mahf708 added CI: approved Allow gh actions PR testing on ghci-snl-* machines and removed CI: approved Allow gh actions PR testing on ghci-snl-* machines labels Nov 3, 2025
@mahf708 mahf708 marked this pull request as draft November 3, 2025 23:49
@mahf708 mahf708 marked this pull request as ready for review November 3, 2025 23:49
singhbalwinder added a commit that referenced this pull request Nov 4, 2025
ZM Bridge - Enable Running on GPUs

The motivation for these changes was to enable GPU tests with EAMxx running the bridged ZM. However, the major clean-up of the ZM subroutine interfaces that facilitated this also led to more sprawling changes. In particular, many changes to the ZM microphysics routines were helpful in decoupling this capability from the primary ZM routines.

Other notable changes

  1. remove support for the Hack convective adjustment scheme (not used since CAM3)
  2. fix SHOC set_grids() to prevent the variable phis from using packs
  3. A non-BFB change was introduced via the loop structure of zm_microphysics_history_convert() that corrects a previous issue in which two variable modifications could occur in the wrong order depending on conditions. This situation does not seem to occur in our normal testing (i.e. atm_developer) - but a longer 1-month test on the ne4pg2 grid with monthly output was able to show an impact. The non-BFB change only affects a few history output variables associated with the ZM microphysics, so the simulation itself is still BFB. The change can be easily reverted by fusing the two k loops in the aforementioned subroutine.

[BFB] (sort of... see # 3 above)

* whannah/eam/zm-bridge-02: (27 commits)
  add constexpr to fix build error
  bug fix for run-time issue in EAM
  fixes to restor BFB for EAM tests e
  remove team_policy
  move call for zm_microphysics_history_convert
  bug fix
  updates from PR review
  unod packed type for phis in SHOC
  add temporary explicit transpose/copy method for ZM bridge
  major updates for GPU support
  interim update to facilitate rebase
  update ZM bridge to output temperature tendency
  remove GPU clause for building zm
  enable host mirroring of ZM variables
  zm bridge - fix ol_snow and output initialization
  remove pcols from ZM fortran bridge
  move MCSP output to zm_conv_mcsp_hist
  move aero/micro to end of arg list
  move mudpcu and lambdadpcu to microp_st
  move frz argument to microp_st
  ...
@singhbalwinder
Contributor

on next

@whannah1
Contributor Author

whannah1 commented Nov 4, 2025

A curious run-time failure cropped up after merging to next that was strangely not caught by my testing (which I felt was very thorough!). Part of the complication is that the failure mode was not one single error. Instead, there were a few different errors across MPI ranks. Luckily, the failures were robustly repeatable, and all of these failures turned out to be related to the same root cause.

The root issue was that I switched the declared size of some variables to be (ncol,pver) instead of (pcols,pver). These variables were allocated only for ZM, and since I had removed all module level variables using pcols it seemed wasteful to always be allocating based on pcols for the MPI ranks that would never need to utilize that much memory.

However, I failed to realize that when these variables were passed down to the main ZM microphysics routine (zm_mphy()) the declared size was still pcols rather than ncol. I had intentionally tried not to touch this routine, but this unintentional inconsistency explained the strange failure symptoms. I switched everything back to using pcols to solve it. Since pcols is now passed in to ZM as an arbitrary declared size, this was a simple fix. The alternative of making everything use ncol was doable, but would have required more extensive changes, so I opted for simplicity. EAMxx can still use the rank-local column number (i.e. ncol) instead of a max across all ranks because it will not be calling zm_mphy().
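
To make the failure mode concrete, here is a small sketch of the offset arithmetic behind a declared-size mismatch. It is written in C++ as a stand-in for Fortran's column-major storage; the interface of zm_mphy() is not shown in this thread, and the sizes and the helper `col_major_offset` are hypothetical. The underlying point is that a routine with explicit-shape dummy arguments computes element offsets from its own declared extents, not from the extents the caller used.

```cpp
#include <cassert>
#include <cstddef>

// Zero-based offset of element (i, k) in a column-major array whose
// declared leading dimension is ld (Fortran-style storage).
constexpr std::size_t col_major_offset(std::size_t i, std::size_t k,
                                       std::size_t ld) {
  return i + k * ld;
}

// Hypothetical sizes: this rank owns ncol columns, pcols is the declared maximum.
constexpr std::size_t ncol = 3, pcols = 4, pver = 2;

// The caller allocates the array as (ncol, pver).
constexpr std::size_t allocated = ncol * pver;

// The caller stores column 0, level 1 at offset 3, but a callee that
// declares the same argument as (pcols, pver) looks for it at offset 4.
static_assert(col_major_offset(0, 1, ncol) == 3, "caller's view");
static_assert(col_major_offset(0, 1, pcols) == 4, "callee's view");

// The last active element as seen by the callee lies past the allocation.
static_assert(col_major_offset(ncol - 1, pver - 1, pcols) >= allocated,
              "callee indexes out of bounds");
```

Under these assumptions the callee both reads the wrong elements and can run off the end of the allocation, which is consistent with seeing several different errors across MPI ranks that have different ncol values.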

rljacob added a commit that referenced this pull request Nov 5, 2025
Revert 7791 off of next.

This reverts commit beaf411, reversing
changes made to 10a4263.
@mahf708 mahf708 added CI: approved Allow gh actions PR testing on ghci-snl-* machines and removed CI: approved Allow gh actions PR testing on ghci-snl-* machines labels Nov 5, 2025
@whannah1
Contributor Author

whannah1 commented Nov 6, 2025

The failing CI tests appear to just be due to the new namelist variables from PR #7797 - although I can't find an explanation of the NLFAIL in the CI logs.

If that's the case then I think we're good to merge this again!

I reran various tests on both NERSC and LCRC and everything passes as expected.

@ambrad
Member

ambrad commented Nov 7, 2025

The failing CI tests appear to just be due to the new namelist variables from PR #7797 - although I can't find an explanation of the NLFAIL in the CI logs.

You can find the logs like this: click on the test, then "Upload log files" in the sequence of steps, then on the artifact download URL at the end of that section.

It looks like the NLFAIL was due to my PR. But I thought I had run the bless process correctly Wed night after Luca explained how here: #7807 (comment). @bartgol is there a way to tell if I did it incorrectly? I do see my bless run with a green check here: https://github.com/E3SM-Project/E3SM/actions (workflow https://github.com/E3SM-Project/E3SM/actions/runs/19124007410/workflow).

@DJFJJA DJFJJA left a comment

zm_conv_types.F90
Line 75: Could you change
logical :: old_snow = .true. ! switch to revert snow production in zm_conv_evap (i.e. before zm_micro additions)
to
logical :: old_snow = .true. ! switch to calculate snow production in zm_conv_evap (i.e. using the old treatment before zm_micro was implemented)?

Line 138: "md" should be "cloud downdraft mass flux".
Line 140: "eu" should be "entrainment in updraft".
Lines 201-214: Should we assign real values to real arrays?

Contributor Author

@DJFJJA it's much more helpful if you make these comments directly on the lines instead of referencing them like this.

@DJFJJA DJFJJA left a comment

zm_conv.F90
Line 891:
real(r8), intent(in ) :: prdprec(pcols,pver)! precipitation production (kg/ks/s)
Change “kg/ks/s” to “kg/kg/s”

@whannah1
Contributor Author

whannah1 commented Nov 7, 2025

@DJFJJA I made the changes you suggested - but for future code review please make these types of comments directly on the lines they refer to.

@bartgol
Contributor

bartgol commented Nov 7, 2025

It looks like the NLFAIL was due to my PR. But I thought I had run the bless process correctly Wed night after Luca explained how here: #7807 (comment). @bartgol is there a way to tell if I did it incorrectly? I do see my bless run with a green check here: E3SM-Project/E3SM/actions (workflow E3SM-Project/E3SM/actions/runs/19124007410/workflow).

I think you ran it correctly. The log clearly shows the cime baseline generation command (the -g is there):

./cime/scripts/create_test ERS_Ln9.ne4_ne4.F2000-SCREAMv1-AQP1.ghci-snl-cpu_gnu.eamxx-output-preset-2--eamxx-L72 -o -g -b master --wait

Hopping on mappy, and checking bless_log in the baseline folder clearly shows the correct time (which I think is UTC):

$ cat bless_log
sha:0f29546abee521717d474244b17655d289d293c8 date:2025-11-06_03:54:38

So I think your bless went through.

Labels

Atmosphere BFB PR leaves answers BFB CI: approved Allow gh actions PR testing on ghci-snl-* machines EAM Fortran-based E3SM Atmosphere Model EAMxx C++ based E3SM atmosphere model (aka SCREAM) ZM
