@jedwards4b
Contributor

@jedwards4b jedwards4b commented Jul 25, 2025

Description of changes

I submitted this before and it was reverted without explanation. I am now submitting it again to address #3351.

Specific notes

Contributors other than yourself, if any:

CTSM Issues Fixed:

Are answers expected to change (and if so in what way)? No

Any User Interface Changes (namelist or namelist defaults changes)? No

Does this create a need to change or add documentation? Did you do so? No

Testing performed, if any: ERR.f09_t232.I1850Clm60BgcCrop.derecho_intel.drv-interim_restart

@samsrabin samsrabin added the bug (something is working incorrectly), testing (additions or changes to tests), and bfb (bit-for-bit) labels Jul 25, 2025
@samsrabin
Member

samsrabin commented Jul 25, 2025

Thanks, Jim!

  • We should add the ERR.f09_t232.I1850Clm60BgcCrop.derecho_intel.drv-interim_restart test that Jim tested with (or at least some ERR test) to aux_clm as part of this PR.

↑ Moved to Issue #3359.

@samsrabin samsrabin added the next label (this should get some attention in the next week or two; normally each Thursday SE meeting) Jul 25, 2025
@github-project-automation github-project-automation bot moved this to Ready to start (or start again) in CTSM: Upcoming tags Jul 25, 2025
@samsrabin samsrabin moved this from Ready to start (or start again) to In progress - master in CTSM: Upcoming tags Jul 25, 2025
@jedwards4b
Contributor Author

To add that test, the cime and cmeps PRs will need to be merged and the ctsm externals updated to include them.

@samsrabin
Member

Ah, thanks for the heads up. That's ESCOMP/CMEPS#576 and ESMCI/cime#4827, right? We're updating our submodules as we speak with PR #3353, so hopefully it will be simple to merge those new additional tags if they're made soon. Do you have a sense of when you or someone else will be able to get those done?

I'm also assuming it requires ESCOMP/MOSART#115, right? Or would that not be affected by the test?

@jedwards4b
Contributor Author

That's correct, it also needs the mosart PR.

slevis-lmwg previously approved these changes Jul 25, 2025
Contributor

@slevis-lmwg slevis-lmwg left a comment

I'm approving, thanks @jedwards4b!

@slevis-lmwg
Contributor

I will rebase to b4b-dev. If that's wrong, we can rebase back to master.

@slevis-lmwg slevis-lmwg changed the base branch from master to b4b-dev July 25, 2025 16:45
@samsrabin samsrabin changed the base branch from b4b-dev to master July 25, 2025 17:30
@samsrabin samsrabin dismissed slevis-lmwg’s stale review July 25, 2025 17:34

Task list incomplete.

@jedwards4b
Contributor Author

The cime PR has been merged in cime6.1.110, and the cmeps PR is in cmeps1.1.5.

@samsrabin samsrabin changed the base branch from master to interim-restart-fix-202507 July 28, 2025 17:25
@samsrabin
Member

@jedwards4b wrote: "That's correct, it also needs the mosart PR."

I tested ERR.f10_f10_mg37.I1850Clm60BgcCrop.derecho_intel.drv-interim_restart at 8a9c25d and it worked, even though I haven't brought in the MOSART update and it's still at mosart1.1.10. (I'm still waiting on the f09_t232 version of the test that you used, but I expect it to work too.) Do you know why that might be? Or better yet, do you know how I could set up a test that will fail due to MOSART not having the fix yet?

@jedwards4b
Contributor Author

Can you point me to the test that you ran?

@samsrabin
Member

Sure, it's at /glade/derecho/scratch/samrabin/tests_0728-133515de/ERR.f10_f10_mg37.I1850Clm60BgcCrop.derecho_intel.drv-interim_restart.0728-133515de/

@jedwards4b
Contributor Author

It looks like it passed fortuitously. If you look at the directory
/glade/derecho/scratch/samrabin/archive/ERR.f10_f10_mg37.I1850Clm60BgcCrop.derecho_intel.drv-interim_restart.0728-133515de/rest, you will see that the mosart rpointer files are all in the 0001-01-03-00000 directory and are incorrect at the other times. I think you could make this test fail by changing either REST_N to 1 or STOP_N to 9. Either change should cause the restart to fail unless the mosart fix is applied.
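The layout described above (all mosart rpointer files sitting in one dated restart directory) can be checked quickly with a small script. This is a minimal sketch, not part of cime: the function name `rpointer_files_by_restart_dir` is hypothetical, and it assumes only that the archive `rest` directory holds one subdirectory per restart date and that the files are named `rpointer.rof*`. It reports where the files are, not whether their contents are correct.

```python
from pathlib import Path


def rpointer_files_by_restart_dir(rest_root, component="rof"):
    """Map each restart subdirectory name to the rpointer files it holds.

    Hypothetical helper for illustration only. `component` is the suffix
    used in the file name, e.g. "rof" for rpointer.rof*.
    """
    report = {}
    for sub in sorted(Path(rest_root).iterdir()):
        if sub.is_dir():
            # Collect both dated and undated rpointer files for the component.
            report[sub.name] = sorted(
                p.name for p in sub.glob(f"rpointer.{component}*")
            )
    return report
```

Calling it on the `rest` directory and printing the result shows at a glance which dated subdirectories have no rpointer file for the component.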

@samsrabin
Member

Hmm. I can't get it to break with either of those modifications:

  • REST_N 1: /glade/derecho/scratch/samrabin/tests_0728-162542de/ERR.f10_f10_mg37.I1850Clm60BgcCrop.derecho_intel.drv-interim_restart.0728-162542de/
  • STOP_N 9: /glade/derecho/scratch/samrabin/tests_0728-165139de/ERR_Ld9.f10_f10_mg37.I1850Clm60BgcCrop.derecho_intel.drv-interim_restart.0728-165139de/

Looks like case2 just always uses the first restart, which is always okay. I can't figure out how that's getting set.

@jedwards4b
Contributor Author

I found it - it's done in err.py, and yes it is always choosing the first one. I am trying with a modification in err.py and will make that change if it works (or rather fails).

@jedwards4b
Contributor Author

jedwards4b commented Jul 29, 2025

@samsrabin I have made a change so that ERR_Ld9.f10_f10_mg37.I1850Clm60BgcCrop.derecho_intel.drv-interim_restart restarts on 0001-01-05-00000 instead of 0001-01-03-00000. However, it turns out that the rpointer.rof (with no date suffix) in that directory has the correct contents, so the test still passes. This fallback to the original file name was intentional, and the case is doing the right thing. I will create a new PR in cime to change the restart time on the ERR test, but I'm not sure it's worth pursuing getting this test to fail.
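The fallback described above can be illustrated roughly as follows. This is only a sketch of the lookup order (dated file first, then the original undated name), not the actual component code; the function name `resolve_rpointer` and its arguments are hypothetical, and the file-name pattern is taken from the names quoted in this thread.

```python
from pathlib import Path


def resolve_rpointer(rundir, component, timestamp):
    """Pick the rpointer file to read for one component.

    Hypothetical illustration: prefer the date-suffixed file
    (rpointer.<component>.<timestamp>) and fall back to the undated
    name (rpointer.<component>). Returns None if neither exists.
    """
    rundir = Path(rundir)
    dated = rundir / f"rpointer.{component}.{timestamp}"
    undated = rundir / f"rpointer.{component}"
    if dated.is_file():
        return dated
    if undated.is_file():
        return undated
    return None
```

With this lookup order, a run directory that has a correct undated `rpointer.rof` still restarts even when the dated file is missing, which is why the modified test keeps passing.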

@samsrabin
Member

Makes sense, thanks!

@samsrabin
Member

@jedwards4b Is it possible there's a race condition in the IRT SystemTest? I had one passing at 8a9c25d but then failing when I tried later. This prompted me to try five more replicates, of which only one passed. Here are the directories and their results:

  • /glade/derecho/scratch/samrabin/tests_0728-125516de/IRT_Ld11.f10_f10_mg37.IHistClm60BgcCrop.derecho_intel.clm-default.0728-125516de_int: PASS
  • /glade/derecho/scratch/samrabin/tests_0729-100052de/IRT_Ld11.f10_f10_mg37.IHistClm60BgcCrop.derecho_intel.clm-default.0729-100052de_int: FAIL
  • /glade/derecho/scratch/samrabin/tests_0729-102142de/IRT_Ld11.f10_f10_mg37.IHistClm60BgcCrop.derecho_intel.clm-default.0729-102142de: FAIL
  • /glade/derecho/scratch/samrabin/tests_0729-102202de/IRT_Ld11.f10_f10_mg37.IHistClm60BgcCrop.derecho_intel.clm-default.0729-102202de: FAIL
  • /glade/derecho/scratch/samrabin/tests_0729-102220de/IRT_Ld11.f10_f10_mg37.IHistClm60BgcCrop.derecho_intel.clm-default.0729-102220de: FAIL
  • /glade/derecho/scratch/samrabin/tests_0729-102240de/IRT_Ld11.f10_f10_mg37.IHistClm60BgcCrop.derecho_intel.clm-default.0729-102240de: FAIL
  • /glade/derecho/scratch/samrabin/tests_0729-102301de/IRT_Ld11.f10_f10_mg37.IHistClm60BgcCrop.derecho_intel.clm-default.0729-102301de: PASS

The failures all have this in run/case2run/drv.log*:

  read rpointer file = rpointer.cpl.1850-01-05-00000
 (esm_time_mod.F90:esm_time_clockInit) ERROR rpointer file rpointer.cpl.1850-01-
 05-00000 not found

@jedwards4b
Contributor Author

It's not a race condition exactly; it's a faulty method of sorting the restart directories by mtime. I am experimenting with an alternate method of sorting and will let you know if it's more consistent. By the way, I noticed that the ERR test was doing this too, but wrote it off as human error until you said something. Thanks.
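One possible alternative to mtime ordering is to sort on the date embedded in each directory name. The sketch below is an assumption about what such a method could look like, not the change that went into cime: the helper names `stamp_key` and `sorted_restart_dirs` are hypothetical, and the name pattern YYYY-MM-DD-SSSSS is inferred from the directories quoted in this thread (for example 0001-01-03-00000).

```python
import re
from pathlib import Path

# Assumed naming pattern for restart directories: YYYY-MM-DD-SSSSS.
_STAMP = re.compile(r"^(\d{4,})-(\d{2})-(\d{2})-(\d{5})$")


def stamp_key(name):
    """Return (year, month, day, seconds) parsed from a directory name,
    or None if the name does not match the assumed pattern."""
    match = _STAMP.match(name)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def sorted_restart_dirs(rest_root):
    """Return restart subdirectories ordered by model date, oldest first.

    Ordering depends only on the directory names, so it does not change
    if the directories are touched or written in a different order.
    """
    dated = []
    for entry in Path(rest_root).iterdir():
        key = stamp_key(entry.name)
        if entry.is_dir() and key is not None:
            dated.append((key, entry))
    return [entry for _, entry in sorted(dated, key=lambda pair: pair[0])]
```

Because the key is derived from the name, repeated runs of the same test would pick the same restart directory regardless of filesystem timestamps.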

@samsrabin
Member

Awesome, thanks!

@samsrabin
Member

I'm going to merge this PR for now and will open a new one once there's a CIME tag with a fix.

@samsrabin samsrabin merged commit 26d0c16 into ESCOMP:interim-restart-fix-202507 Jul 29, 2025
4 checks passed
@ekluzek ekluzek removed the next label (this should get some attention in the next week or two; normally each Thursday SE meeting) Jul 31, 2025

Labels

  • bfb: bit-for-bit
  • bug: something is working incorrectly
  • testing: additions or changes to tests

Development

Successfully merging this pull request may close these issues.

  • config_archive.xml needs fix for st_archive
  • Unrevert rpointer changes in #3067
  • st_archive issues in ctsm5.3.041 with our testing

4 participants