Description
Adaptive Scheduler has a "kill manager" that periodically checks the stdout/stderr of the worker jobs, and kills a job if it sees a special string in any line of the output.
This works well when there is a small, known, set of error conditions to watch out for.
Sometimes, however, jobs may hang indefinitely without printing anything to stdout/stderr (e.g. when stuck in a system call waiting for a lock, or when the filesystem is slow). Such failure modes cannot be detected by the existing kill manager implementation.
On the other hand, we know that the main process in a job should be logging its progress to a JSON log file every N (default 300) seconds. The appearance of a log message every N seconds is a kind of "heartbeat" indicating that the main process is OK.
I would like to propose that the kill manager monitor the job log files and kill any job that has missed a configurable, possibly non-integer, number M of heartbeats, i.e. any job whose last log message is older than M times the logging period.
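To make the proposal concrete, here is a rough sketch of what the staleness check could look like. It is not based on the current kill manager code. The function names, the `timestamp` key, and the assumption that each log line is a JSON object with an ISO 8601 timestamp are all hypothetical and would need to be matched to the actual log format.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path


def last_heartbeat(log_path, timestamp_key="timestamp"):
    """Return the time of the last parseable JSON log line, or None.

    Assumes (hypothetically) one JSON object per line, each carrying an
    ISO 8601 timestamp under `timestamp_key`.
    """
    try:
        lines = Path(log_path).read_text().splitlines()
    except FileNotFoundError:
        return None
    # Walk backwards so a partially written last line is skipped.
    for line in reversed(lines):
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(entry, dict) and timestamp_key in entry:
            ts = datetime.fromisoformat(entry[timestamp_key])
            if ts.tzinfo is None:
                # Treat naive timestamps as UTC (an assumption).
                ts = ts.replace(tzinfo=timezone.utc)
            return ts
    return None


def missed_heartbeats(log_path, log_interval=300.0, max_missed=2.5, now=None):
    """True if the last log message is older than max_missed * log_interval.

    `log_interval` is N (seconds) and `max_missed` is M from the proposal;
    M may be non-integer.
    """
    now = now or datetime.now(timezone.utc)
    last = last_heartbeat(log_path)
    if last is None:
        # No heartbeat seen yet; this sketch does not kill in that case.
        # A real implementation might fall back to the job start time.
        return False
    return (now - last).total_seconds() > max_missed * log_interval
```

The kill manager could call something like `missed_heartbeats(path)` for each running job on its existing polling loop and cancel the jobs for which it returns `True`. Choosing M somewhat above 1 leaves slack for a log write that is merely late.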
This mechanism would catch any hangs that occur in the main worker process, but would not catch hangs that occur in child processes (e.g. those launched by a SubprocessExecutor).