cam6_4_131: Performance improvements for CSLAM (from John Dennis, CISL, and Lauritzen) #1365

PeterHjortLauritzen · 2025-08-18T14:17:53Z

PeterHjortLauritzen · 2025-08-18T14:41:57Z

I am not getting B4B in this test:

ERP_D_Ln9.ne30pg3_ne30pg3_mg17.FHISTC_LTso.derecho_intel.cam-outfrq9s

 ./create_test --output-root /glade/derecho/scratch/pel/ --project P93300042 ERP_D_Ln9.ne30pg3_ne30pg3_mg17.FHISTC_LTso.derecho_intel.cam-outfrq9s

(low top FHISTC with CSLAM)

@johnmauff: Could these changes be round-off (order-of-operations changes)?

johnmauff · 2025-08-18T16:12:01Z

src/dynamics/se/dycore/fvm_reconstruction_mod.F90

      if (present(fotherpanel)) then
        fotherpanel (1-nht:nc+nht,1-nht:0 ,1)=fcube(1-nht:nc+nht,1-nht:0 )
        do halo=1,nhr
+          ftmp(:) = fcube(:,halo)


I just realized that the copy to ftmp is not needed here.

johnmauff · 2025-08-18T16:14:38Z

src/dynamics/se/dycore/fvm_reconstruction_mod.F90

        ! fill in "n" on Figure above
        !
        do halo=1,nhr
+          ftmp(:) = fcube(:,halo)


The ftmp(:) is not used.

johnmauff · 2025-08-18T16:22:59Z

Peter, I would guess that they are round-off changes. I have not checked with a low-top model. I only tested this with a high-top model, looking at the nstep output. John


sjsprecious · 2025-08-18T17:11:44Z

src/dynamics/se/dycore/fvm_reconstruction_mod.F90

+    if(ns==3) then
+      dotproduct = DotProduct_3(w,f)
+    else
+      dotproduct = DotProduct_gen(w,f,ns)
+    endif


@johnmauff I guess the NBFB answers come from this function? When ns = 3, the dot product is hard-coded for performance, but its round-off error will differ from that of the general version, since the terms are now computed in a single expression.

I tried the general algorithm, but the differences from the baseline remained (still not BFB).
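To illustrate why a hand-unrolled dot product can differ from a general loop in the last bits, here is a minimal Python sketch (not the Fortran in this PR; `dot_forward` and `dot_reverse` are hypothetical names). Floating-point addition is not associative, so accumulating the same three products in a different order can change the result by about one unit in the last place. A Fortran compiler may likewise order or fuse the operations differently for the single-expression and loop forms.

```python
def dot_forward(w, f):
    """Accumulate w[i]*f[i] from the first element to the last."""
    total = 0.0
    for wi, fi in zip(w, f):
        total += wi * fi
    return total


def dot_reverse(w, f):
    """Accumulate the same products from the last element to the first."""
    total = 0.0
    for wi, fi in zip(reversed(w), reversed(f)):
        total += wi * fi
    return total


w = [1.0, 1.0, 1.0]   # unit weights keep each product exact
f = [0.1, 0.2, 0.3]   # illustrative values only

a = dot_forward(w, f)   # (0.1 + 0.2) + 0.3
b = dot_reverse(w, f)   # (0.3 + 0.2) + 0.1
print(a == b, abs(a - b))
```

The two sums agree to roughly 16 significant digits but are not bit-for-bit identical, which is the kind of difference a BFB test flags.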

@PeterHjortLauritzen I was able to produce BFB results with the baseline after doing the following:

Fixed the typo (confirmed by @johnmauff) at line 1797 (https://github.com/ESCOMP/CAM/pull/1365/files#diff-8beef36006cafdecbf26406cf4357fc0797eb6c7316d9c6aa0a2486fade88f25R1797).

Merged the latest cam_development branch into the dennis_perf_cslam1 branch (currently the dennis_perf_cslam1 branch is based on the tag cam6_4_089).

PeterHjortLauritzen · 2025-08-20T12:38:35Z

I ran the baroclinic wave test case (FKESSLER) and compared the optimized version against the baseline/trunk. Below is PS at day 10:

For comparison, here is a pertlim test (in this case perturbing PS by 1E-14):

The pertlim test produces errors about 100× smaller. It’s unclear whether it matters that the optimized code introduces round-off errors at every time step, whereas the pertlim test only introduces them at initialization. I'll keep looking/thinking ...

UPDATE: all differences were due to a code bug ... all tests (that I am running) are now BFB

sjsprecious · 2025-08-20T16:57:29Z

@PeterHjortLauritzen I am curious: I thought threading was not supported for the SE dycore (#941). So why does the ERP test still work here?

sjsprecious · 2025-08-20T17:11:53Z

@PeterHjortLauritzen I ran the ERP_D_Ln9.ne30pg3_ne30pg3_mg17.FHISTC_LTso.derecho_intel.cam-outfrq9s test on Derecho and found that the second run halved the node count but did not change the thread count (still 1). So is this actually an ERS test?

sjsprecious · 2025-08-20T18:50:25Z

src/dynamics/se/dycore/fvm_reconstruction_mod.F90

         !
-         w = halo_interp_weight(:,:,:,2)
         do halo=1,nhr
+           ! ftmp(:) = fcube(nc+1-halo,:)   ! copy to a temporary


@PeterHjortLauritzen @johnmauff I think this line should not be commented out?

Good catch. I missed that.

Thanks @sjsprecious ! Now the results make sense ... see plots above (in a couple of minutes)

PeterHjortLauritzen · 2025-08-21T12:22:17Z

> @PeterHjortLauritzen I ran the ERP_D_Ln9.ne30pg3_ne30pg3_mg17.FHISTC_LTso.derecho_intel.cam-outfrq9s test on Derecho and found that the second run halved the node count but did not change the thread count (still 1). So is this actually an ERS test?

Don't know; I just took the test from the CAM test list without thinking much about the actual test (and correct: threading is currently broken in the SE dycore). @nusbaume Do you know the answer to @sjsprecious's question?

PeterHjortLauritzen · 2025-08-21T13:25:33Z

All the tests I am running are BFB now! Thanks @johnmauff and @sjsprecious ... @cacraigucar: This PR is ready to go ...

nusbaume · 2025-08-21T14:25:34Z

Hi @PeterHjortLauritzen @sjsprecious, an ERP test takes the default thread and task count from the first case and divides it by 2 for the second case if the number is greater than one. You can see that logic in the CIME code here:

https://github.com/ESMCI/cime/blob/master/CIME/SystemTests/erp.py#L32

Given that the default configuration for an SE dycore run is one thread but multiple MPI tasks, it ends up adjusting the tasks but not the threads (which is what allows the test to run with this CAM configuration in the first place).

Also, my understanding is that the difference between ERS and ERP is that an ERS test won't change the task layout at all and simply checks that the restart run is bit-for-bit, while the ERP test will halve the processor count for the restarted run before checking if the results are bit-for-bit.

Anyways, I hope that helps, and thanks again for getting these improvements into CAM!
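The halving rule described in this comment can be sketched in a few lines of Python. This is an illustration only, not the CIME implementation; `halve_if_parallel`, `erp_second_case`, and the example task count are made up for the sketch.

```python
def halve_if_parallel(count):
    """Return count // 2 when count > 1, otherwise leave it unchanged."""
    return count // 2 if count > 1 else count


def erp_second_case(ntasks, nthreads):
    """Layout for the restart case, given the first case's layout."""
    return halve_if_parallel(ntasks), halve_if_parallel(nthreads)


# Hypothetical SE-dycore layout: many MPI tasks, one thread per task.
print(erp_second_case(ntasks=512, nthreads=1))
```

With one thread per task, only the task count changes between the two cases, which matches the behavior reported for this test.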

sjsprecious · 2025-08-21T15:03:33Z

Thanks @nusbaume for your clarification. Clearly I misunderstood the ERS and ERP tests before, but now it is clear to me.

johnmauff · 2025-08-25T19:09:46Z

src/dynamics/se/dycore/fvm_reconstruction_mod.F90

@PeterHjortLauritzen could you look at line 487 of this file? The denominator is in a form that is inconsistent with the other polynomial evaluations. Is it missing a pair of parentheses? I.e., should it be invtmp = 1.0_r8 / (recons(6,i,j) + spherecentroid(1,i,j))?
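As a small illustration of the precedence question raised here (a Python sketch with made-up values, not the code at line 487): division binds more tightly than addition in both Fortran and Python, so the two forms below differ unless the parentheses are written explicitly.

```python
r6 = 4.0    # stand-in for recons(6,i,j); the value is hypothetical
scx = 1.0   # stand-in for spherecentroid(1,i,j); the value is hypothetical

without_parens = 1.0 / r6 + scx      # parsed as (1.0 / r6) + scx
with_parens = 1.0 / (r6 + scx)       # reciprocal of the sum

print(without_parens, with_parens)
```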

@johnmauff thanks for noting this!

I re-did the math for computing the min/max values and I think you are right. The if (abs(recons(6,i,j)) > threshold) code block is buggy and actually redundant.

Details, if interested, below:

We wish to find the min/max values of a polynomial of the form

First we find the interior critical points:

Compute the gradient and set it to zero

If

is non-zero

then the extrema are

These formulas are correct in the code.
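For reference, a sketch of the interior critical-point calculation, assuming the quadratic form implied by the ex1/ex2 expressions quoted in the review comments further down this thread. Here $r_k$ stands for recons(k,i,j) and $(x_c, y_c)$ for the centroid (spherecentroid); this notation is an assumption and is not taken from the original comment.

```latex
\begin{align}
f(x,y) &= r_1 + r_2 X + r_3 Y + r_4 X^2 + r_5 Y^2 + r_6 X Y,
  \qquad X = x - x_c,\quad Y = y - y_c, \\
\frac{\partial f}{\partial x} &= r_2 + 2 r_4 X + r_6 Y = 0, \\
\frac{\partial f}{\partial y} &= r_3 + r_6 X + 2 r_5 Y = 0, \\
D &= 4 r_4 r_5 - r_6^2, \\
x^* &= \frac{r_6 r_3 - 2 r_5 r_2}{D} + x_c, \qquad
y^* = \frac{r_6 r_2 - 2 r_4 r_3}{D} + y_c \qquad (D \neq 0).
\end{align}
```

The point $(x^*, y^*)$ is only a candidate for the min/max if it lies inside the cell.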

Evaluate on the Boundaries

The boundary consists of four line segments. On each, $f$ restricts to a 1D quadratic function, whose max/min occur at its critical point (if inside the segment) or at endpoints (corners).

All these formulas are correct in the code.
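The full search (interior critical point, edge critical points, corners) can be sketched in Python as follows. This is a minimal illustration under the same assumed quadratic form, with coordinates taken relative to the centroid; `quad`, `extrema_on_rectangle`, and the tolerance are hypothetical and are not the CSLAM Fortran.

```python
def quad(r, X, Y):
    """Evaluate r1 + r2*X + r3*Y + r4*X^2 + r5*Y^2 + r6*X*Y (r is 0-based)."""
    return r[0] + r[1]*X + r[2]*Y + r[3]*X*X + r[4]*Y*Y + r[5]*X*Y


def extrema_on_rectangle(r, x0, x1, y0, y1, tol=1.0e-14):
    """Return (min, max) of the quadratic over [x0, x1] x [y0, y1]."""
    # Corners are always candidates.
    pts = [(x0, y0), (x0, y1), (x1, y0), (x1, y1)]

    # Interior critical point, if the 2x2 gradient system is non-singular.
    disc = 4.0*r[3]*r[4] - r[5]*r[5]
    if abs(disc) > tol:
        pts.append(((r[5]*r[2] - 2.0*r[4]*r[1]) / disc,
                    (r[5]*r[1] - 2.0*r[3]*r[2]) / disc))

    # Edges y = const: 1D quadratic in X, critical point where df/dx = 0.
    if abs(r[3]) > tol:
        for Y in (y0, y1):
            pts.append((-(r[1] + r[5]*Y) / (2.0*r[3]), Y))

    # Edges x = const: 1D quadratic in Y, critical point where df/dy = 0.
    if abs(r[4]) > tol:
        for X in (x0, x1):
            pts.append((X, -(r[2] + r[5]*X) / (2.0*r[4])))

    # Keep only candidates that lie inside the rectangle.
    vals = [quad(r, X, Y) for X, Y in pts
            if x0 <= X <= x1 and y0 <= Y <= y1]
    return min(vals), max(vals)


# f = X^2 + Y^2 on the square [-1, 1] x [-1, 1]
print(extrema_on_rectangle([0.0, 0.0, 0.0, 1.0, 1.0, 0.0], -1.0, 1.0, -1.0, 1.0))
```

Because the corners are always included, the candidate list is never empty, and no separate special case is needed when a coefficient vanishes.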

However, the code block with if (abs(recons(6,i,j)) > threshold) is buggy (as you noticed) and, I think, redundant, as those cases are already covered by the search above.

I have run some FKESSLER cases to verify what happens without the buggy if (abs(recons(6,i,j)) > threshold) section:

the non-linear correlation between cl1 and cl2 is still maintained: cly=constant!

the min(upper plot)/max(lower plot) values for the slotted cylinders are pretty much the same.

(initial value: 0.100000000000000E+00; after 10 days: 0.999999999993388E-01 )

(initial value: 0.333333333333333E+00; after 10 days: 0.333333333210990E+00 )

So no new over- or undershoots are introduced; hence the if (abs(recons(6,i,j)) > threshold) code block can be removed!

The solutions are bit4bit until time-step 114:

< nstep, te 114 0.26393936945915651E+10 0.26393936951915169E+10 0.32751870252974244E-07 0.99999998855001060E+05 0.20548077300190920E+03

nstep, te 114 0.26393936945915661E+10 0.26393936951915169E+10 0.32751818191091291E-07 0.99999998855001075E+05 0.20548077300190920E+03

i.e. bit4bit until day 2.375
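A small Python sketch of this kind of check, using the two step-114 lines quoted above. It compares the printed digits as text (a proxy for a bit-for-bit comparison) and converts the step number to days. The helper names are hypothetical, and the 1800 s time step is an assumption inferred from 114 steps corresponding to day 2.375; it is not stated in the thread.

```python
def parse_te_line(line):
    """Split an 'nstep, te' log line into (nstep, list of value strings)."""
    tokens = line.split()
    start = tokens.index("te") + 1   # skips a leading diff marker such as '<'
    return int(tokens[start]), tokens[start + 1:]


def differing_columns(line_a, line_b):
    """Return the 0-based indices of value columns whose text differs."""
    step_a, vals_a = parse_te_line(line_a)
    step_b, vals_b = parse_te_line(line_b)
    if step_a != step_b:
        raise ValueError("lines must belong to the same time step")
    return [i for i, (a, b) in enumerate(zip(vals_a, vals_b)) if a != b]


line_a = ("< nstep, te 114 0.26393936945915651E+10 0.26393936951915169E+10 "
          "0.32751870252974244E-07 0.99999998855001060E+05 0.20548077300190920E+03")
line_b = ("nstep, te 114 0.26393936945915661E+10 0.26393936951915169E+10 "
          "0.32751818191091291E-07 0.99999998855001075E+05 0.20548077300190920E+03")

DT_SECONDS = 1800.0   # assumed time step, see the note above
nstep, _ = parse_te_line(line_a)
days = nstep * DT_SECONDS / 86400.0
print(differing_columns(line_a, line_b), days)
```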

johnmauff · 2025-09-29T13:00:31Z

@pel I just noticed that commit 6067381 appears to delete all of my optimizations. Is there a reason for this?

PeterHjortLauritzen · 2025-09-29T13:11:10Z

> @pel I just noticed that commit 6067381 appears to delete all of my optimizations. Is there a reason for this?

apologies ... mistake ... reverted!

johnmauff · 2025-09-29T13:38:16Z

Peter, Thanks for reverting the commit. It looks like Rory's changes are only a couple of lines. John


cacraigucar · 2025-10-13T19:01:43Z

I heard at a meeting the optimization work may be completed, with no deliverable for CAM (other than something which Peter will include in a future PR). Is this correct, and should this PR be closed?

cacraigucar · 2025-10-14T15:34:12Z

never mind - @PeterHjortLauritzen is working on this

PeterHjortLauritzen · 2025-10-23T10:20:26Z

FYI: performance results on this Wiki

https://github.com/PeterHjortLauritzen/CAM/wiki/Performance-notes-(2025)

nusbaume

Thanks all! I found a few lines of commented-out code that could potentially be removed, but otherwise everything looks good to me.

nusbaume · 2025-11-22T23:01:20Z

src/dynamics/se/dycore/fvm_reconstruction_mod.F90

+!not b4b                      ex1 = (r6*r3 - 2.0_r8*r5*r2) / disc + scx
+!not b4b                      ex2 = (r6*r2 - 2.0_r8*r4*r3) / disc + scy


Remove commented-out code?

nusbaume · 2025-11-22T23:02:55Z

src/dynamics/se/dycore/fvm_reconstruction_mod.F90

+                   ! Top/bottom edge, y=const., du/dx=0
+                   !
+                   if (abs(r4) > threshold) then
+                      invtmp = 1.0_r8 / (2.0_r8 * r4)! + spherecentroid(1,i,j)


Remove commented-out code here?

Suggested change

invtmp = 1.0_r8 / (2.0_r8 * r4)! + spherecentroid(1,i,j)

invtmp = 1.0_r8 / (2.0_r8 * r4)

nusbaume · 2025-11-22T23:03:44Z

src/dynamics/se/dycore/fvm_reconstruction_mod.F90

+                   ! Top/bottom edge, y=const., du/dx=0
+                   !
+                   if (abs(r5) > threshold) then
+                      invtmp = 1.0_r8 / (2.0_r8 * r5)! + spherecentroid(1,i,j)


Remove commented-out code here?

Suggested change

invtmp = 1.0_r8 / (2.0_r8 * r5)! + spherecentroid(1,i,j)

invtmp = 1.0_r8 / (2.0_r8 * r5)

PeterHjortLauritzen · 2025-11-24T13:53:01Z

Thanks @nusbaume for the review. I have redone the math for finding extrema in CSLAM and found buggy code, which I have fixed. The changes are not B4B, but it takes over 2 days for differences to show up in FKESSLER. I did a science validation with FKESSLER (looking at tracer properties) and things look good!

PeterHjortLauritzen added 2 commits August 18, 2025 08:06

performance improvement from John Dennis (CISL)

6c243ff

update ChangeLog

948926f

PeterHjortLauritzen mentioned this pull request Aug 18, 2025

Excessive data movement in extend_panel_interpolate #1360

Open

johnmauff reviewed Aug 18, 2025

sjsprecious reviewed Aug 18, 2025

cacraigucar marked this pull request as draft August 18, 2025 17:16

cacraigucar mentioned this pull request Aug 18, 2025

Optimization from John Dennis for SE dycore #1363

Closed

cacraigucar added this to CAM Development Aug 18, 2025

remove unused variables

e6d5b83

sjsprecious reviewed Aug 20, 2025

fix bug

9e7393b

johnmauff reviewed Aug 25, 2025

PeterHjortLauritzen force-pushed the dennis_perf_cslam1 branch from 6067381 to 9e7393b Compare September 29, 2025 13:09

PeterHjortLauritzen added 2 commits October 23, 2025 04:03

better vectorization for large NTRAC, ifdef for timers

8c08ece

better performance for large tracer counts

6aefddb

PeterHjortLauritzen changed the title ~~Performance improvements for CSLAM (from John Dennis, CISL)~~ Performance improvements for CSLAM (from John Dennis, CISL, and Lauritzen) Oct 23, 2025

nusbaume marked this pull request as ready for review November 4, 2025 20:58

nusbaume self-requested a review November 4, 2025 20:58

nusbaume approved these changes Nov 22, 2025

remove buggy code (answer changes!)

a5eab47

PeterHjortLauritzen added a commit to PeterHjortLauritzen/CAM-SIMA that referenced this pull request Nov 24, 2025

performance enhancements from ESCOMP/CAM#1365

0330613

PeterHjortLauritzen mentioned this pull request Nov 24, 2025

Cam sima dycore update ESCOMP/CAM-SIMA#344

Draft

nusbaume and others added 3 commits November 25, 2025 09:36

Remove dead config code originally used by Eulerian dycore.

7b690fe

Merge tag 'cam6_4_130' into dennis_perf_cslam1

032a03c

Update emission files for CMIP7

Update ChangeLog for cam6_4_131

85a3126

cacraigucar merged commit 11d0035 into ESCOMP:cam_development Nov 26, 2025
2 checks passed

github-project-automation bot moved this to Tag in CAM Development Nov 26, 2025

cacraigucar changed the title ~~Performance improvements for CSLAM (from John Dennis, CISL, and Lauritzen)~~ cam6_4_131: Performance improvements for CSLAM (from John Dennis, CISL, and Lauritzen) Dec 1, 2025
