Conversation


@edoyango edoyango commented Jul 9, 2025

Hi @claireyung,

This pull request tracks some config improvements that you could leverage (on top of the improvements to the exe done here). I'll populate this description as results come in.

| Change | Result (SUs) | Comment |
| --- | --- | --- |
| n/a | 2937.88 (best of 1) | Base case; the only change is using the fastest binary from my PR. |
| Choose core count for MOM6 for a better IO_LAYOUT | 2807.54 (best of 1) | Minghang pointed out that IO was being serialized onto a single core. This required guessing an appropriate core count. NB: this could be faster with PARALLEL_RESTART = True, but that doesn't work with payu; Minghang's looking at it. This change is answer-changing. |
| Use Minghang's updated nuopc.runseq | 2576.43 (best of 1) | Huge improvement! |
| Optimise CICE block sizes | not yet | Doubling the block size ([x, y] = [60, 54]) seemed to increase SUs a tiny bit, but the CICE walltime reported in ice.log went down by 5% to 478.34 s. For scenarios with more ice, larger block sizes seemed to improve performance significantly; see the Zulip comment for more details. |
| Try CICE sectrobin distribution | 2603.24 (best of 1) | The CICE docs say distribution_type = sectrobin is a bit better for the PE layout in terms of neighbours/communication, but a bit worse for load balancing. This reduced CICE walltime (per ice.log) to 449.34 s. |
| Optimise CICE/nuopc cores | TODO | Waiting on Minghang's tool to see how much time each OM3 component spends waiting for the others. |
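For orientation, the settings discussed in the table live in the MOM6 parameter file (MOM_input) and the CICE namelist (ice_in). A minimal sketch is below; the block sizes and distribution type are the values tried above, while the IO_LAYOUT value is purely illustrative (the actual layout/core counts used here are in the config diff, not this snippet):

```
! MOM_input: spread restart/diagnostic IO across a small grid of cores
! instead of serializing it onto one. The 2,2 layout is illustrative only.
IO_LAYOUT = 2, 2

&domain_nml                          ! ice_in: CICE decomposition settings
    block_size_x = 60                ! doubled block size tried in the table
    block_size_y = 54
    distribution_type = 'sectrobin'  ! better neighbour/communication pattern,
                                     ! slightly worse load balancing
/
```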

@edoyango edoyango force-pushed the cpu-ratio-1428ocn branch from 9a0f58d to a023cb1 Compare July 9, 2025 07:44
@claireyung
Owner

Thanks so much @edoyango this is awesome!

Since I needed to modify my run a little after 8 years due to a salt-restoring file bug I found, I decided to swap to Sapphire Rapids (SR) with your current optimisations for the last 2 years of my spinup. I merged the commits from this PR into my spinup config (which I'd previously run on Cascade Lake) in this branch: https://github.com/claireyung/access-om3-configs/tree/8km_jra_ryf_obc2-sapphirerapid-Charrassin-newparams-rerun-Wright-spinup-accessom2IC-yr9

The cost before (Cascade Lake, Helen's executable, DT = 600) was 10600 SU/month, and the new simulation (Sapphire Rapids, pr113-27, DT = 600, Ed's improvements) is 8700 SU/month^, which is a great improvement! Thank you so much!!

Is this the kind of speed-up that was expected? Note: 2603 SU / 10 days × 31 days ≈ 8100 SU.*
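Spelling out that back-of-envelope scaling (pure arithmetic, using the best-of-1 number from the table above):

```python
# Linear extrapolation of the 10-day benchmark cost to a 31-day January.
su_10_days = 2603.24        # best-of-1 cost of the optimised 10-day run
projected_january = su_10_days / 10 * 31
print(f"{projected_january:.0f} SU")   # ~8070 SU, quoted above as ~8100

actual_january = 8700                  # observed SU/month after the swap
print(f"overhead vs raw scaling: {actual_january - projected_january:.0f} SU")
```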

Naively, I'd guess the netCDF files are bigger at the end of a one-month run than a 10-day run, which maybe slows down the final steps of the model and makes the cost slightly higher than raw scaling would predict?

*This is not really a fair comparison, because the config I gave you uses DT = 450, while I bumped mine up to DT = 600 and added 3 CICE dynamic timesteps per MOM timestep (ndtd = 3). So I'd actually expect my config, with fewer ocean steps, to be faster, unless CICE is now the one being waited on... (I did get a significant speed-up on Cascade Lake from DT 600 to DT 450 + ndtd = 3, but those runs had a different PE_LAYOUT.) I guess Minghang's tool will reveal all?
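The ndtd setting mentioned in the footnote is CICE's dynamics-subcycling control; as a sketch, it sits in the ice_in namelist, while the DT values are set on the ocean/coupler side (e.g. MOM_input). The values below are the ones quoted in the footnote, not a recommendation:

```
&setup_nml     ! ice_in: CICE subcycling discussed in the footnote
    ndtd = 3   ! 3 dynamics/advection subcycles per thermodynamic (coupling) step
/

! MOM_input (ocean side): the timestep being traded off against ndtd
DT = 600.0     ! vs DT = 450.0 with ndtd = 1 in the original config
```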

^All quoted numbers compare Januarys, but I haven't looked at the range.

@edoyango
Author

> Is this the kind of speed-up that was expected? Note: 2603 SU / 10 days × 31 days ≈ 8100 SU.*
>
> Naively, I'd guess the netCDF files are bigger at the end of a one-month run than a 10-day run, which maybe slows down the final steps of the model and makes the cost slightly higher than raw scaling would predict?

Yes, I think this is roughly what I would expect. Some of the changes, especially the IO_LAYOUT improvements, primarily speed up the final dumping of restart files, which takes a lot of time. Since your runs are much longer, the restart write is a smaller fraction of the total runtime, so you won't see as much benefit from those changes.

> *This is not really a fair comparison, because the config I gave you uses DT = 450, while I bumped mine up to DT = 600 and added 3 CICE dynamic timesteps per MOM timestep (ndtd = 3). So I'd actually expect my config, with fewer ocean steps, to be faster, unless CICE is now the one being waited on... (I did get a significant speed-up on Cascade Lake from DT 600 to DT 450 + ndtd = 3, but those runs had a different PE_LAYOUT.) I guess Minghang's tool will reveal all?

This is an interesting change to the config! It will probably affect who's waiting for whom. I'll do more testing: the previous PE_LAYOUT assumed MOM was the bottleneck, but now that CICE takes more dynamic steps, giving more cores to CICE might be beneficial?

@edoyango
Author

edoyango commented Jul 23, 2025

Hi @claireyung, just following up on this:

> This is an interesting change to the config! It will probably affect who's waiting for whom. I'll do more testing: the previous PE_LAYOUT assumed MOM was the bottleneck, but now that CICE takes more dynamic steps, giving more cores to CICE might be beneficial?

Just to say that I couldn't get a conclusive answer here: increasing or decreasing the cores assigned to MOM didn't seem to change much.

@claireyung
Owner

Hey @edoyango, thanks for looking into it and for the update! I guess this means we are approaching an optimal model cost...
