Forecast jobs sometimes failing on reading ocean_geometry file #4490
Description
What is wrong?
Laura Slivinski, @JohnSteffen-NOAA, and I have noticed that forecast jobs sometimes fail with:
204: FATAL from PE 0: Permission denied: netcdf_file_open:MOM6_OUTPUT/ocean_geometry.nc
Here is a specific example from one of my runs. The experiment YAML is based on dev/ci/cases/gfsv17/C384mx025_hybAOWCDA.yaml.
- First attempt: job enkfgdas_fcst_mem028 fails because it runs out of walltime.
- Second attempt: the job fails with the Permission denied: netcdf_file_open:MOM6_OUTPUT/ocean_geometry.nc message.
- Manual reboot of the job: the job fails with the same Permission denied: netcdf_file_open:MOM6_OUTPUT/ocean_geometry.nc message.
Laura reported that removing the offending RUNDIR and rebooting the job after that ocean_geometry failure solved the issue for her.
I confirmed that the RUNDIR does contain an unreadable ocean_geometry.nc file at RUNDIRS/<expname>/enkfgdas.2025101000/enkfgdasefcs028.2025101000/output/MOM6_OUTPUT/ocean_geometry.nc, and I removed it manually. The job succeeded when I rebooted it after that.
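For anyone who wants to check other members for the same leftover file, here is a minimal Python sketch. It is not part of global-workflow; `find_unreadable` is a hypothetical helper, and it assumes that "unreadable" means the owner-read permission bit is unset on the file, which may not be the only way this failure shows up.

```python
import stat
from pathlib import Path


def find_unreadable(rundir, name="ocean_geometry.nc"):
    """Return files called `name` under `rundir` whose owner-read bit is unset.

    Assumption: the "Permission denied" error comes from a file left behind
    without read permission for its owner. Only the mode bits are inspected,
    so the result does not depend on which user runs the check.
    """
    hits = []
    for path in Path(rundir).rglob(name):
        mode = path.lstat().st_mode
        # Regular files only; skip directories and symlinks with that name.
        if stat.S_ISREG(mode) and not (mode & stat.S_IRUSR):
            hits.append(path)
    return sorted(hits)
```

Calling `find_unreadable("RUNDIRS/<expname>")` with the experiment's run directory substituted in would list every member directory that still holds such a file.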
What should have happened?
I expect the job to rerun successfully without my having to remove files from run directories by hand.
What machines are impacted?
All or N/A
What global-workflow hash are you using?
Steps to reproduce
See above.
Additional information
No response
Do you have a proposed solution?
Perhaps some cleanup needs to happen when the forecast executable fails, or before it is rerun; I am not sure.
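To make the idea concrete, here is a rough Python sketch of what such a cleanup step could look like if it ran before the forecast executable is relaunched in an existing run directory. This is only an illustration under my own assumptions: `remove_stale_outputs` and `STALE_FILES` are hypothetical names, not existing global-workflow code, and the relative path is taken from the failing run described above.

```python
from pathlib import Path

# Assumption: files the model recreates on startup, relative to the member's
# run directory. Only the file from this report is listed.
STALE_FILES = ("output/MOM6_OUTPUT/ocean_geometry.nc",)


def remove_stale_outputs(rundir, relpaths=STALE_FILES):
    """Delete leftover files from a previous failed attempt; return what was removed."""
    removed = []
    for rel in relpaths:
        target = Path(rundir) / rel
        if target.is_file() or target.is_symlink():
            # Unlinking needs write permission on the parent directory only,
            # so it works even when the file itself has no read permission.
            target.unlink()
            removed.append(target)
    return removed
```

The function is safe to call on a fresh run directory, where it removes nothing and returns an empty list.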