Killable job queue

Hello,

I previously worked on a cluster with a `killable` queue and was wondering whether you would consider adding something similar on this cluster. The general idea was that once a job passed its walltime, it could keep running as long as no other job was waiting for the compute resources.

This would be very useful for long training runs, where jobs are frequently killed at the walltime limit and the nodes then sit idle. Checkpointing and relaunching is of course possible, but going through initialization again can be quite time-consuming (easily 30 min with optimized code, or hours with default trainers).

Here is the documentation of the `killable` queue of the other cluster for reference:

> `killable` Queue Policy
> 
> The killable queue is a preemptable queue that allows jobs in bins 4 and 5 to request walltimes up to 24 hours. Jobs submitted to the killable queue will be preemptable once the job reaches the guaranteed runtime limit as shown in the table below. For example, a job in bin 5 submitted to the killable queue can request a walltime of 24 hours. The job will be preemptable after two hours of run time. Similarly, a job in bin 4 will be preemptable after six hours of run time. Once a job is preempted, the job will be resubmitted by default with the original limits as requested in the job script and will have the same JOBID.
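
As I understand it, the policy boils down to something like the Python sketch below. The guaranteed runtimes (6 hours for bin 4, 2 hours for bin 5) and the 24-hour maximum come from the text quoted above. The function names are hypothetical, and the rule that preemption only happens when another job is waiting is my reading of how the queue behaved, not something the quoted text states.

```python
# Guaranteed runtime in hours per bin, taken from the quoted policy text.
GUARANTEED_HOURS = {4: 6.0, 5: 2.0}
# Maximum walltime a killable job may request, per the quoted policy text.
MAX_WALLTIME_HOURS = 24.0


def is_preemptable(job_bin: int, elapsed_hours: float) -> bool:
    """Return True once the job has used up its guaranteed runtime."""
    if job_bin not in GUARANTEED_HOURS:
        raise ValueError(f"bin {job_bin} cannot use the killable queue")
    return elapsed_hours >= GUARANTEED_HOURS[job_bin]


def should_preempt(job_bin: int, elapsed_hours: float, other_jobs_waiting: bool) -> bool:
    """Assumed rule: preempt only if the job is preemptable AND the
    resources are needed by another waiting job."""
    return other_jobs_waiting and is_preemptable(job_bin, elapsed_hours)


if __name__ == "__main__":
    # A bin 5 job at 3 hours, with another job waiting, would be preempted.
    print(should_preempt(5, 3.0, other_jobs_waiting=True))
    # The same job on an otherwise idle node would keep running.
    print(should_preempt(5, 3.0, other_jobs_waiting=False))
```

The point is the second case: a job past its guaranteed runtime keeps the node as long as nobody else needs it, which avoids the idle time and the costly re-initialization.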


Thank you for considering this!

Alexis

Killable job queue #59

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Killable job queue #59

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions