[Bug]: Deleting tmp dir at the end of an e3sm_to_cmip run results in some missing cmor outputs #295

@tomvothecoder

Description

What happened?

Related to #289 (comment)

Per Tony's email "RE: Weird srun behavior -- Resolved?" on 4/25/25:

It seems that the weird e3sm_to_cmip srun “hangs” (and also runs that “complete” but silently fail to write output) were due to e2c's attempt to delete “temp files” after cmor.close().

Supposedly, after “cmor.write()” returns and “cmor.close()” is called, the latter should NOT return until the output file(s) are successfully written. However, the immediate removal of temp files that follows often leads to failed writes, and likely (on occasion) causes the job to hang, waiting for some resource.

In one run, where 17 parallel jobs processed 17 decades independently (1850s, 1860s, … 2010s), almost ALL failed to write output, and several jobs hung for hours and had to be scancel’d.

When I forced a 2-minute pause (sleep(120)) before removing temp files, only 3 jobs failed to write output.

In subsequent tests where we forego deleting the temp files altogether, ALL decade jobs wrote their output files successfully, and no hangs were seen. These behaviors have been tested multiple times.

In my view, this indicates that cmor.write()/cmor.close() is not behaving correctly. We may need to implement an independent e2c function “confirm_cmor_write()” before deleting the cmor temp files. Presently, we (Tom Vo) have simply elided the “cleanup_temp_files()” call in e2c. Tom will likely issue a new e2c release version.
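
A minimal sketch of what such a check could look like, assuming that "written" can be approximated by the output NetCDF files existing, being non-empty, and no longer changing size between two polls. The function names, the `*.nc` pattern, and the timeout and poll values are all hypothetical and are not part of e3sm_to_cmip or CMOR; a stable file size also does not prove the file is a valid NetCDF file.

```python
import shutil
import time
from pathlib import Path


def confirm_cmor_write(output_dir, timeout=300.0, poll=5.0, pattern="*.nc"):
    """Hypothetical helper: return True once output files exist and have stopped growing.

    Returns False if no stable, non-empty output is seen before `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    previous = None
    while True:
        # Snapshot the size of every matching file under the output directory.
        sizes = {str(p): p.stat().st_size for p in Path(output_dir).rglob(pattern)}
        # Accept only if files exist, none are empty, and nothing changed since the last poll.
        if sizes and all(s > 0 for s in sizes.values()) and sizes == previous:
            return True
        if time.monotonic() >= deadline:
            return False
        previous = sizes
        time.sleep(poll)


def cleanup_if_written(output_dir, temp_dir, **kwargs):
    """Hypothetical wrapper: remove temp_dir only after the output is confirmed."""
    if confirm_cmor_write(output_dir, **kwargs):
        shutil.rmtree(temp_dir, ignore_errors=True)
        return True
    # Leave the temp files in place so a failed or slow write can be inspected.
    return False
```

If the check times out, the temp directory is kept, which matches the behavior that worked in the tests above (no deletion, no missing output).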

Subsequently, the dsm_generate_CMIP6.py “parallelize on decades” approach (year-segments, for ocean processing) works like a charm.
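
For illustration only, splitting a year range into decade-aligned segments might look like the sketch below. This is not the code in dsm_generate_CMIP6.py; the function name is hypothetical, and the 1850 to 2014 range in the usage line is an assumption (the email only states that 17 decades, from the 1850s onward, were processed).

```python
def decade_segments(start_year, end_year):
    """Hypothetical helper: split [start_year, end_year] into decade-aligned segments."""
    segments = []
    year = start_year
    while year <= end_year:
        # A segment ends at the last year of its decade, or at end_year if that comes first.
        seg_end = min((year // 10) * 10 + 9, end_year)
        segments.append((year, seg_end))
        year = seg_end + 1
    return segments


# Assumed range: yields 17 segments, (1850, 1859) through (2010, 2014).
segments = decade_segments(1850, 2014)
```

Each segment could then be handed to its own job, so that one failed decade does not block the others.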

What did you expect to happen? Are there any possible answers you came across?

No response

Minimal Complete Verifiable Example (MVCE)

Relevant log output

Anything else we need to know?

Can escalate to cmor devs if needed.

Environment

Latest e3sm_to_cmip

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working), logging (Related to logging output, either in the console or log file.)

    Type

    No type

    Projects

    Status

    To do

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests