Hello,
Previously I worked on a cluster with a killable queue, and I was wondering whether you would consider adding something similar on this cluster. The general idea was that once a job passed its walltime, it could keep running as long as no other job was waiting for the compute resources.
This would be very useful for long training runs, where jobs are frequently killed and the nodes remain idle afterwards. Checkpointing and relaunching is of course possible, but going through initialization again can be quite time-consuming (easily 30 min with optimized code, or hours with default trainers).
Here is the documentation of the killable queue of the other cluster for reference:
Killable Queue Policy
The killable queue is a preemptable queue that allows jobs in bins 4 and 5 to request walltimes up to 24 hours. Jobs submitted to the killable queue will be preemptable once the job reaches the guaranteed runtime limit as shown in the table below. For example, a job in bin 5 submitted to the killable queue can request a walltime of 24 hours. The job will be preemptable after two hours of run time. Similarly, a job in bin 4 will be preemptable after six hours of run time. Once a job is preempted, the job will be resubmitted by default with the original limits as requested in the job script and will have the same JOBID.
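As a rough illustration, the rule quoted above could be modeled as in the Python sketch below. It is not tied to any real scheduler: the names `Job`, `GUARANTEED_HOURS`, `can_submit`, `is_preemptable`, and `should_preempt` are hypothetical, the guaranteed runtimes cover only the two bins mentioned in the quoted text, and the assumption that preemption only happens when another job is waiting comes from the general idea described earlier rather than from the quoted policy itself.

```python
from dataclasses import dataclass

# Guaranteed runtime in hours per bin, taken from the two examples in the
# quoted policy. Other bins are not described there, so they are omitted.
GUARANTEED_HOURS = {4: 6.0, 5: 2.0}

# Maximum walltime a killable-queue job may request, per the quoted policy.
MAX_WALLTIME_HOURS = 24.0


@dataclass
class Job:
    """Hypothetical job record used only for this illustration."""
    bin: int
    requested_walltime_hours: float
    elapsed_hours: float = 0.0


def can_submit(job: Job) -> bool:
    """A job is accepted if it is in bin 4 or 5 and requests at most 24 hours."""
    return (
        job.bin in GUARANTEED_HOURS
        and 0 < job.requested_walltime_hours <= MAX_WALLTIME_HOURS
    )


def is_preemptable(job: Job) -> bool:
    """A job becomes preemptable once it reaches its guaranteed runtime."""
    return job.elapsed_hours >= GUARANTEED_HOURS[job.bin]


def should_preempt(job: Job, other_jobs_waiting: bool) -> bool:
    """Assumption: a preemptable job is only preempted if another job is waiting."""
    return is_preemptable(job) and other_jobs_waiting
```

For example, under these assumptions `should_preempt(Job(bin=5, requested_walltime_hours=24, elapsed_hours=3), other_jobs_waiting=False)` is false, so the job would keep its node while the queue is empty.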
Thank you for considering this!
Alexis