Prevent dumping diags too close to the wall time limit #1128

@nicolasaunai

Description

On supercomputers, a job is killed when it reaches the wall time limit, usually 24 hours.
With bad luck, the run is killed while it is dumping diags, in which case the whole h5/vtkhdf file can be corrupted.

A way to prevent this from happening could be:

  • measure how long the last diag dump took and keep that duration (measured over all diags to be safer, but it could be per file if we want to be picky)
  • compare that duration with the time left before the wall time limit given by the user
  • if the time left before the wall time limit < last dump duration, then do not dump (and tell the user)
  • we might consider an option for a preemptive kill, for users who do not want to keep running a job that will be killed before it can dump anything else (it should be optional, as users may still want the logs or other output)
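
The steps above could be sketched roughly as follows. This is only an illustration of the logic, not code from the project: `DumpGuard`, `write_diags`, `wall_time_limit_s` and `safety_factor` are hypothetical names, and the wall time limit is assumed to be supplied by the user in seconds.

```python
import time


class DumpGuard:
    """Hypothetical helper: skip a diag dump that may not finish in time."""

    def __init__(self, wall_time_limit_s, clock=time.monotonic, safety_factor=1.0):
        self._clock = clock                # injectable clock, eases testing
        self._start = clock()              # assumed to be the job start time
        self._limit = wall_time_limit_s    # user-provided wall time limit
        self._safety = safety_factor       # optional margin on the last timing
        self.last_dump_duration = 0.0      # nothing measured before first dump

    def remaining(self):
        """Seconds left before the wall time limit."""
        return self._limit - (self._clock() - self._start)

    def can_dump(self):
        """True if the last measured dump duration still fits in the time left."""
        return self.remaining() > self._safety * self.last_dump_duration

    def dump(self, write_diags):
        """Run write_diags() and time it, or skip it and tell the user."""
        if not self.can_dump():
            print(
                f"skipping diag dump: {self.remaining():.1f}s left, "
                f"last dump took {self.last_dump_duration:.1f}s"
            )
            return False
        t0 = self._clock()
        write_diags()  # dumps all diags, so the timing covers every file
        self.last_dump_duration = self._clock() - t0
        return True
```

In this sketch the optional preemptive kill would amount to stopping the run the first time `can_dump()` returns `False`. Note that the first dump is never skipped, since no duration has been measured yet.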
