Description
What happened?
Related to #289 (comment)
Per Tony's email "RE: Weird srun behavior -- Resolved?" on 4/25/25:
It seems that the weird e3sm_to_cmip srun "hangs" (and also the "completed" runs that silently fail to write output) were due to the e2c attempt to delete temp files immediately after cmor.close().
Supposedly, once cmor.write() returns and cmor.close() is called, the latter should NOT return until the output file(s) have been successfully written. In practice, however, the immediate removal of temp files after cmor.close() often leads to failed writes and likely, on occasion, causes the job to hang waiting for some resource.
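For reference, here is a simplified sketch of the sequence as I understand it. This is not e2c's actual code: the table, metadata file, variable, and the "cmor_temp_dir" cleanup path are all placeholders, chosen only to show where the cleanup races with the write.

```python
# Simplified illustration of the suspect sequence (placeholder names, not e2c code).
import shutil

import cmor
import numpy as np

cmor.setup(inpath="Tables", netcdf_file_action=cmor.CMOR_REPLACE)
cmor.dataset_json("CMOR_input.json")   # placeholder dataset metadata
cmor.load_table("CMIP6_Omon.json")

lat = cmor.axis(table_entry="latitude", units="degrees_north",
                coord_vals=np.arange(-89.5, 90, 1.0),
                cell_bounds=np.arange(-90.0, 91, 1.0))
lon = cmor.axis(table_entry="longitude", units="degrees_east",
                coord_vals=np.arange(0.5, 360, 1.0),
                cell_bounds=np.arange(0.0, 361, 1.0))
tim = cmor.axis(table_entry="time", units="days since 1850-01-01",
                coord_vals=np.array([15.5]),
                cell_bounds=np.array([[0.0, 31.0]]))

var = cmor.variable(table_entry="tos", units="degC", axis_ids=[tim, lat, lon])
cmor.write(var, np.zeros((1, 180, 360)))

# cmor.close() is supposed to block until the output file is on disk ...
out_path = cmor.close(var, file_name=True)

# ... yet deleting temp files right here is what appears to race with the
# write, producing silent missing output and occasional hangs.
shutil.rmtree("cmor_temp_dir")
```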
In one run, where 17 parallel jobs processed 17 decades independently (1850s, 1860s, …, 2010s), almost ALL failed to write output, and several jobs hung for hours and had to be scancel'd.
When I forced a 2-minute pause (sleep(120)) before removing the temp files, only 3 of the 17 jobs failed to write output.
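The pause, continuing the sketch above, looked roughly like this (the cleanup path is still a placeholder for e2c's actual temp-file handling):

```python
import time

out_path = cmor.close(var, file_name=True)
time.sleep(120)                 # empirical 2-minute grace period
shutil.rmtree("cmor_temp_dir")  # placeholder for e2c's temp-file cleanup
```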
In subsequent tests where we forgo deleting the temp files altogether, ALL decade jobs wrote their output files successfully, and no hangs were seen. These behaviors have been reproduced multiple times.
In my view, this indicates that cmor.write()/cmor.close() is not behaving correctly. We may need to implement an independent e2c function, confirm_cmor_write(), that verifies the output before the cmor temp files are deleted. For now, we (Tom Vo) have simply elided the cleanup_temp_files() call in e2c. Tom will likely issue a new e2c release version.
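One possible shape for such a check is sketched below. Only the function name comes from this proposal; the timeout, polling interval, and the netCDF4 open-test are assumptions on my part, not an agreed design.

```python
import os
import time

import netCDF4


def confirm_cmor_write(path, timeout=300, interval=5):
    """Poll until `path` exists, is non-empty, and opens cleanly as netCDF.

    Returns True on success, False if the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            try:
                with netCDF4.Dataset(path):
                    pass  # header parsed; file is at least structurally complete
                return True
            except OSError:
                pass  # file may still be flushing; retry
        time.sleep(interval)
    return False
```

The idea would be to call this with the path returned by cmor.close(var_id, file_name=True) and skip cleanup_temp_files() unless it returns True.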
With that change in place, the dsm_generate_CMIP6.py "parallelize on decades" mode (year-segments, for ocean processing) works like a charm.
What did you expect to happen? Are there any possible answers you came across?
No response
Minimal Complete Verifiable Example (MVCE)
Relevant log output
Anything else we need to know?
We can escalate to the cmor devs if needed.
Environment
Latest e3sm_to_cmip