HPC: Monitoring Progress & Aborting #5584

In unsupervised runs such as batched HPC execution (or: _ideally they should be unsupervised, but I am checking the status every other minute_), the operator should not have to monitor a running job synchronously.

Under real-world conditions on HPC systems (changing loads on networks and filesystems, OoM scenarios, changing software, etc.), it is not uncommon that a batched job starts outside of regular working hours and, in some cases, hangs until the walltime limit is reached. This can be costly.

We should establish a mechanism (in job scripts) that programmatically monitors the progress / health of a simulation and, if a configurable timeout is reached, aborts the simulation: first with `SIGTERM` (for backtrace generation) and then with `SIGKILL`.
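As a rough illustration of the two-stage abort, here is a minimal Python sketch. The helper name `abort_gracefully` and the 30 s grace period are assumptions made for this example, not part of any existing job script; it also assumes the watchdog itself launched the simulation via `subprocess.Popen` on a POSIX system.

```python
import subprocess


def abort_gracefully(proc, grace_seconds=30.0):
    """Stop `proc` with SIGTERM first, then SIGKILL if it does not exit.

    `proc` is a subprocess.Popen handle (hypothetical: the simulation
    started by the watchdog). Returns the exit code of the process.
    """
    if proc.poll() is not None:
        # Process already finished; nothing to abort.
        return proc.returncode

    # On POSIX, terminate() sends SIGTERM, which gives the simulation a
    # chance to print a backtrace and shut down cleanly.
    proc.terminate()
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        # Still alive after the grace period: send SIGKILL.
        proc.kill()
        return proc.wait()
```

On POSIX, a negative return value `-N` means the process was ended by signal `N`, so the caller can tell whether the graceful stage was sufficient. Under `mpirun`/`srun`, signal forwarding to the individual ranks depends on the launcher and would need to be checked separately.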

## Possible Implementations

A very simple implementation could write some kind of status (e.g., the current time) into a file (e.g., from the I/O processor) at every time step. In the batch job, a single polling process could then check the difference between that time stamp and the current time.
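The file-based idea could look roughly like the following Python sketch. All names (`write_heartbeat`, `heartbeat_age`, `is_stalled`) and the file format (one Unix time stamp per file) are assumptions for illustration; in practice the writer side would live in the simulation code and only the polling side in the job script.

```python
import os
import time


def write_heartbeat(path, now=None):
    """Writer side: store the current time stamp in `path`.

    Writes to a temporary file and renames it, so that the poller
    never reads a half-written time stamp.
    """
    stamp = time.time() if now is None else now
    tmp = f"{path}.tmp"
    with open(tmp, "w") as fh:
        fh.write(f"{stamp:.3f}\n")
    os.replace(tmp, path)


def heartbeat_age(path, now=None):
    """Poller side: seconds since the last heartbeat, or None if absent."""
    now = time.time() if now is None else now
    try:
        with open(path) as fh:
            last = float(fh.read().strip())
    except (FileNotFoundError, ValueError):
        return None
    return now - last


def is_stalled(path, timeout_seconds, now=None):
    """True if a heartbeat exists and is older than the timeout."""
    age = heartbeat_age(path, now)
    return age is not None and age > timeout_seconds
```

A polling loop in the batch job would call `is_stalled(...)` every few seconds and trigger the abort once it returns `True`. Note that this sketch treats a missing file as "not stalled", so a hang during startup would need its own, separate deadline.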

File-based I/O is of course far from ideal, e.g., due to sync, load, short time steps, or I/O-free runs (e.g., optimization). A better option might be to keep a port open for health queries (which could later be reused to query things like memory usage per MPI process, load, etc.) or to react to a POSIX signal and print something on a specific channel (e.g., stderr), like `dd` does.
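The signal-based variant might be sketched as follows, modeled on how GNU `dd` prints its statistics to stderr when it receives `SIGUSR1`. The class `ProgressReporter`, the choice of `SIGUSR1`, and the status line format are all assumptions for this example; a real simulation code would do the equivalent in its own language.

```python
import os
import signal
import sys


class ProgressReporter:
    """Prints a one-line status to a stream when a signal arrives."""

    def __init__(self, stream=None):
        self.step = 0  # updated by the main loop after every time step
        self.stream = sys.stderr if stream is None else stream

    def install(self, signum=signal.SIGUSR1):
        # Must be called from the main thread of the process.
        signal.signal(signum, self._on_signal)

    def _on_signal(self, signum, frame):
        # Python runs this handler in the main thread between bytecode
        # instructions, so ordinary I/O is acceptable here.
        print(
            f"status: step={self.step} pid={os.getpid()}",
            file=self.stream,
            flush=True,
        )
```

A watchdog would then send `kill -USR1 <pid>` and wait for a fresh status line on stderr; if none arrives within the timeout, the job is considered hung. In C or C++, the handler itself should only set a flag that the main loop checks, since most I/O functions are not async-signal-safe.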
