Kill jobs after a configurable number of heartbeats are skipped #292

Adaptive Scheduler has a "kill manager" that periodically checks the stdout/stderr of the worker jobs, and kills a job if it sees a special string in any line of the output.

This works well when there is a small, known set of error conditions to watch out for.

Sometimes, however, jobs may hang indefinitely _without_ printing anything to stdout/stderr (e.g. when stuck in a system call waiting for a lock, or because of filesystem slowness). Such failure modes cannot be detected by the existing kill manager implementation.

On the other hand, we know that the main process in a job should be logging its progress to a JSON log file every N (default 300) seconds. The appearance of a log message every N seconds is a kind of "heartbeat" indicating that the main process is OK.

I would like to propose that the kill manager monitor the job log files, and kill any job that has missed a (configurable, maybe non-integer) number M of heartbeats, i.e. any job whose last log message is older than M times the logging period.

This mechanism would catch any hangs that occur in the main worker process, but would not catch hangs that occur in child processes (e.g. those launched by a SubprocessExecutor).
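
To make the proposal concrete, here is a minimal sketch of the staleness check. It is not Adaptive Scheduler code: the names `missed_heartbeats`, `jobs_to_kill`, `log_paths`, `log_interval` and `max_missed` are hypothetical, and it assumes the log file's modification time is a good enough proxy for the timestamp of the last log message.

```python
import os
import time


def missed_heartbeats(log_path, log_interval=300.0, now=None):
    """Return how many logging periods have passed since the log file last changed.

    Assumption: the file's mtime stands in for the time of the last log message.
    """
    if now is None:
        now = time.time()
    last_beat = os.path.getmtime(log_path)
    return max(0.0, now - last_beat) / log_interval


def jobs_to_kill(log_paths, max_missed=2.5, log_interval=300.0, now=None):
    """Return the jobs whose last heartbeat is older than max_missed * log_interval.

    `log_paths` is a hypothetical mapping of job name -> log file path.
    Jobs whose log file does not exist yet are skipped rather than killed.
    """
    return [
        job
        for job, path in log_paths.items()
        if os.path.exists(path)
        and missed_heartbeats(path, log_interval, now) > max_missed
    ]
```

Because `max_missed` is a float, a threshold such as 2.5 periods works without special handling. Skipping jobs that have no log file yet is a design choice in this sketch; a real implementation would probably also want a startup grace period.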
