ZM Bridge - Enable Running on GPUs #7791
Conversation
@whannah1, I'm not qualified to review ZM f90, so I will pass on reviewing unless there is something specific you want me to look at.

I still need you to review the C++ side changes - which aren't done yet. I'll ping you when that part is ready.
force-pushed c69e4e5 to 7619ffc
force-pushed 9bc145a to ac1549a
This is great! I'm approving based on a quick skim of the C++ code. I added both Conrad and Luca for review as well.
Kokkos::parallel_for("zm_update_precip", KT::RangePolicy(0, m_ncol*nlev_mid_packs), KOKKOS_LAMBDA (const int idx) {
  const int i = idx/nlev_mid_packs;
  const int k = idx%nlev_mid_packs;
  T_mid(i,k) += loc_zm_output_tend_t (i,k) * dt;
  qv   (i,k) += loc_zm_output_tend_qv(i,k) * dt;
  uwind(i,k) += loc_zm_output_tend_u (i,k) * dt;
  vwind(i,k) += loc_zm_output_tend_v (i,k) * dt;
Why not nested team policy here? Not a concern either way since it's in initialization, just curious.
More than a team policy, I would use an MDRange here.
Jim suggested the current approach.
I find MDRange much cleaner and easier to understand. I know Jim found some cases in rrtmgp where MDRange was slower than a "manual MDRange" (the stuff you are doing here), but the Kokkos devs claim this should not happen, so I think it must be some sort of peculiar rrtmgp case. Here, I would vote for clarity over the few microseconds we may gain in perf...
I agree with Luca; it's during initialization, so we're far from the point where performance would be a concern in making that switch.
FWIW the block in question is in run_impl
Oh, well, it still shouldn't be significant enough to become an issue.
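For reference, a minimal sketch (not the actual EAMxx code) of the MDRange form of the loop quoted above. The view and variable names mirror that snippet, but the plain double views, the free-function wrapper, and the default execution space are assumptions made purely for illustration; the real views appear to use packed level types (nlev_mid_packs), which this glosses over.

```cpp
#include <Kokkos_Core.hpp>

using view_2d       = Kokkos::View<double**>;
using const_view_2d = Kokkos::View<const double**>;

void zm_update_state_mdrange (const view_2d T_mid, const view_2d qv,
                              const view_2d uwind, const view_2d vwind,
                              const const_view_2d tend_t, const const_view_2d tend_qv,
                              const const_view_2d tend_u, const const_view_2d tend_v,
                              const double dt)
{
  const int ncol = T_mid.extent_int(0);
  const int nlev = T_mid.extent_int(1);

  // Rank-2 MDRange over (column, level): Kokkos generates the (i,k) pairs,
  // so the manual idx/nlev and idx%nlev arithmetic goes away.
  Kokkos::parallel_for("zm_update_state",
                       Kokkos::MDRangePolicy<Kokkos::Rank<2>>({0, 0}, {ncol, nlev}),
                       KOKKOS_LAMBDA (const int i, const int k) {
    T_mid(i,k) += tend_t (i,k) * dt;
    qv   (i,k) += tend_qv(i,k) * dt;
    uwind(i,k) += tend_u (i,k) * dt;
    vwind(i,k) += tend_v (i,k) * dt;
  });
}
```

The body is unchanged; the only difference is that the policy supplies the two indices and chooses the tiling, instead of the kernel recovering them by division and modulo.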
bartgol left a comment
Personally, I would consider switching to MDRange where possible, as it hides the index arithmetic and makes the code easier to read. I would also convert some of the (somewhat) syntax-heavy scratch-view initialization loops to range-for's over initializer lists. But it's not for correctness or speed, so it boils down to preference.
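For what it's worth, a minimal sketch of the "range-for over an initializer list" idea mentioned above, assuming the scratch views only need to be filled with a constant; the view names and the zero fill value are placeholders rather than the actual EAMxx scratch views.

```cpp
#include <Kokkos_Core.hpp>
#include <initializer_list>

using view_2d = Kokkos::View<double**>;

// Initialize several scratch views with one loop over a brace-enclosed list,
// instead of repeating the same deep_copy (or init kernel) once per view.
void init_scratch_views (const view_2d& scratch_a,
                         const view_2d& scratch_b,
                         const view_2d& scratch_c)
{
  for (const auto& v : {scratch_a, scratch_b, scratch_c}) {
    Kokkos::deep_copy(v, 0.0);
  }
}
```

Views that share the same initial value can be grouped into one such list, which keeps the initialization to a line or two per group.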
force-pushed b852361 to 0ac1476
ZM Bridge - Enable Running on GPUs

The motivation for these changes was to enable GPU tests with EAMxx running the bridged ZM. However, the major clean-up of the ZM subroutine interfaces that facilitated this also led to more sprawling changes. In particular, many changes to the ZM microphysics routines were helpful in decoupling this capability from the primary ZM routines.

Other notable changes:
1. remove support for the Hack convective adjustment scheme (not used since CAM3)
2. fix SHOC set_grids() to prevent the variable phis from using packs
3. A non-BFB change was introduced via the loop structure of zm_microphysics_history_convert() that corrects a previous issue in which two variable modifications could occur in the wrong order depending on conditions. This situation does not seem to occur in our normal testing (i.e. atm_developer), but a longer 1-month test on the ne4pg2 grid with monthly output was able to show an impact. The non-BFB change only affects a few history output variables associated with the ZM microphysics, so the simulation itself is still BFB. The change can be easily reverted by fusing the two k loops in the aforementioned subroutine. (A purely illustrative sketch of this ordering effect follows the commit list below.)

[BFB] (sort of... see # 3 above)

* whannah/eam/zm-bridge-02: (27 commits)
  add constexpr to fix build error
  bug fix for run-time issue in EAM
  fixes to restor BFB for EAM tests
  remove team_policy
  move call for zm_microphysics_history_convert
  bug fix
  updates from PR review
  unod packed type for phis in SHOC
  add temporary explicit transpose/copy method for ZM bridge
  major updates for GPU support
  interim update to facilitate rebase
  update ZM bridge to output temperature tendency
  remove GPU clause for building zm
  enable host mirroring of ZM variables
  zm bridge - fix ol_snow and output initialization
  remove pcols from ZM fortran bridge
  move MCSP output to zm_conv_mcsp_hist
  move aero/micro to end of arg list
  move mudpcu and lambdadpcu to microp_st
  move frz argument to microp_st
  ...
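Purely as an illustration of item 3 above (and not the actual Fortran in zm_microphysics_history_convert), a hypothetical sketch of why fusing or splitting two k loops can change the order of two modifications: if the second update conditionally reads the first field at an adjacent level, the fused and split forms see different values there. The fields, the condition, and the adjacent-level read are all invented for this example.

```cpp
#include <vector>

// Fused form: both modifications happen level by level, so the conditional
// update of b at level k reads a(k+1) before it has been modified.
void fused_k_loops (std::vector<double>& a, std::vector<double>& b)
{
  for (std::size_t k = 0; k + 1 < a.size(); ++k) {
    a[k] *= 0.5;                      // modification 1 at level k
    if (b[k] > 0.0) b[k] += a[k+1];   // modification 2 reads the not-yet-modified a(k+1)
  }
}

// Split form: all of modification 1 is applied first, so the same conditional
// update of b now reads an already-modified a(k+1) and produces different values.
void split_k_loops (std::vector<double>& a, std::vector<double>& b)
{
  for (std::size_t k = 0; k < a.size(); ++k) {
    a[k] *= 0.5;                      // modification 1 for every level
  }
  for (std::size_t k = 0; k + 1 < a.size(); ++k) {
    if (b[k] > 0.0) b[k] += a[k+1];   // modification 2 applied afterwards
  }
}
```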
on next
A curious run-time failure cropped up after merging to next that was strangely not caught by my testing (which I felt was very thorough!). Part of the complication is that the failure mode was not one single error. Instead, there were a few different errors across MPI ranks. Luckily, the failures were robustly repeatable, and all of these failures turned out to be related to the same root cause. The root issue was that I switched the declared size of some variables to be [...]. However, I failed to realize that when these variables were passed down to the main ZM microphysics routine ([...]
The failing CI tests appear to just be due to the new namelist variables from PR #7797 - although I can't find an explanation of the NLFAIL in the CI logs. If that's the case then I think we're good to merge this again! I reran various tests on both NERSC and LCRC and everything passes as expected.
You can find the logs like this: click on the test, then "Upload log files" in the sequence of steps, then on the artifact download URL at the end of that section. It looks like the NLFAIL was due to my PR. But I thought I had run the bless process correctly Wed night after Luca explained how here: #7807 (comment). @bartgol is there a way to tell if I did it incorrectly? I do see my bless run with a green check here: https://github.com/E3SM-Project/E3SM/actions (workflow https://github.com/E3SM-Project/E3SM/actions/runs/19124007410/workflow).
zm_conv_types.F90
Line 75: Could you change
logical :: old_snow = .true. ! switch to revert snow production in zm_conv_evap (i.e. before zm_micro additions)
to
logical :: old_snow = .true. ! switch to calculate snow production in zm_conv_evap (i.e. using the old treatment before zm_micro was implemented)?
Line 138: "md" should be "cloud downdraft mass flux".
Line 140: "eu" should be "entrainment in updraft".
Lines 201-214: Should we assign real values to real arrays?
@DJFJJA it's much more helpful if you make these comments directly on the lines instead of referencing them like this.
DJFJJA left a comment
zm_conv.F90
Line 891:
real(r8), intent(in ) :: prdprec(pcols,pver)! precipitation production (kg/ks/s)
Change “kg/ks/s” to “kg/kg/s”
@DJFJJA I made the changes you suggested - but for future code review please make these types of comments directly on the lines they refer to.
I think you ran it correctly. The log clearly shows the cime baseline generation command (the [...]).
Hopping on mappy, and checking [...]
So I think your bless went through.