
Kill jobs after a configurable number of heartbeats are skipped #292

@jbweston


Adaptive Scheduler has a "kill manager" that periodically checks the stdout/stderr of the worker jobs and kills a job if it sees a special string in any line of the output.

This works well when there is a small, known set of error conditions to watch out for.

Sometimes, however, jobs may hang indefinitely without printing anything to stdout/stderr (e.g. stuck in a system call waiting for a lock, filesystem slowness, etc.). Such failure modes cannot be detected by the existing kill manager implementation.

On the other hand, we know that the main process in a job should be logging its progress to a JSON log file every N (default 300) seconds. The appearance of a log message every N seconds is a kind of "heartbeat" indicating that the main process is OK.

I would like to propose that the kill manager monitor the job log files and kill any jobs that have missed a configurable (possibly non-integer) number of heartbeats, i.e. jobs whose last log message is older than M times the logging period.
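To make the proposal concrete, here is a rough sketch of the check, not a patch against the existing kill manager. All names (`missed_heartbeats`, `stale_jobs`, `max_missed`, `log_interval`) are hypothetical, and it assumes the log file's modification time is a good enough proxy for the timestamp of the last log message.

```python
import os
import time


def missed_heartbeats(log_path, log_interval=300.0, now=None):
    """Number of logging periods elapsed since the log file was last written.

    Assumption: the file's mtime approximates the time of the last log message.
    """
    now = time.time() if now is None else now
    age = max(0.0, now - os.path.getmtime(log_path))
    return age / log_interval


def stale_jobs(log_files, max_missed=2.5, log_interval=300.0, now=None):
    """Return the job ids whose logs are older than max_missed * log_interval.

    log_files is a hypothetical mapping of job id -> path of its JSON log file.
    """
    stale = []
    for job_id, path in log_files.items():
        try:
            missed = missed_heartbeats(path, log_interval, now)
        except FileNotFoundError:
            # The job may not have written its first log message yet.
            continue
        if missed > max_missed:
            stale.append(job_id)
    return stale
```

With the default N = 300 and, say, M = 2.5, a job would be flagged once its log file has not been touched for more than 750 seconds. Jobs without a log file are skipped here; whether they should instead get a startup grace period is an open question.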

This mechanism would catch any hangs that occur in the main worker process, but would not catch hangs that occur in child processes (e.g. those launched by a SubprocessExecutor).
