RFC: Dynamic Priority Queue for Batch Jobs #27927

@mismithhisler

Proposal

We are looking for community feedback on the usefulness of implementing dynamic priorities for batch jobs. This issue will serve as an explanation of what this feature will be, and hopefully a consolidated place for feedback as development progresses.

What are dynamic priorities for batch jobs?

All jobs today can be assigned a priority. Jobs with a higher priority get scheduled before lower priority jobs. If more batch jobs are created than the cluster has capacity for, the jobs are blocked (put into a pending state), and jobs of the highest priority are picked "at random" to be scheduled next.

But what happens if Department A kicks off a large number of long-running batch jobs, and some time later Department B kicks off a batch job that should finish fairly quickly? Dept. B's job may get scheduled soon, or it may not. The ordering among jobs of equal priority is not guaranteed, because the underlying implementation is Go's container/heap package. (Side note: this could partially be solved by putting Dept. A and B in separate namespaces with quotas, but that could leave compute idle, which is something else we are looking into!)
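To make the ordering point concrete, here is a minimal sketch of a priority-only max-heap built on container/heap. The `Job` type, `JobHeap`, and `DrainOrder` are hypothetical names for illustration and are not Nomad's actual queue code. When every job has the same priority, the comparison function carries no information about submission time, so nothing ties the pop order to arrival order.

```go
package main

import (
	"container/heap"
	"fmt"
)

// Job is a hypothetical stand-in for a pending batch job.
type Job struct {
	ID       string
	Priority int
}

// JobHeap is a max-heap on Priority only; ties carry no ordering information.
type JobHeap []Job

func (h JobHeap) Len() int           { return len(h) }
func (h JobHeap) Less(i, j int) bool { return h[i].Priority > h[j].Priority }
func (h JobHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *JobHeap) Push(x any)        { *h = append(*h, x.(Job)) }
func (h *JobHeap) Pop() any {
	old := *h
	n := len(old)
	job := old[n-1]
	*h = old[:n-1]
	return job
}

// DrainOrder pushes jobs in submission order and returns their IDs in pop order.
func DrainOrder(jobs []Job) []string {
	h := &JobHeap{}
	for _, j := range jobs {
		heap.Push(h, j)
	}
	order := make([]string, 0, len(jobs))
	for h.Len() > 0 {
		order = append(order, heap.Pop(h).(Job).ID)
	}
	return order
}

func main() {
	// Three Dept. A jobs are submitted before one Dept. B job, all at priority 50.
	submitted := []Job{{"a-1", 50}, {"a-2", 50}, {"a-3", 50}, {"b-1", 50}}
	fmt.Println(DrainOrder(submitted))
}
```

Running this shows the equal-priority jobs coming out in an order that differs from submission order. Higher priorities still always pop first; only the order within a priority level is unspecified.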

Instead, we would dynamically adjust the priority of a job based on some configurable parameters. In our first pass we are looking at various "weights" that can be applied to the base priority of a job while it is in the "pending" state:

  1. Time in the queue: The longer a job has been in the queue, the more we add to (or subtract from) its priority.
  2. Job Size: A rough estimate of job size (CPU MHz + memory MB) that adjusts the base priority. If I want smaller jobs to have higher priority in the queue, I can configure that, and vice versa.
  3. Cluster Utilization: A configurable way to say "who has used cluster resources". This could be configured to use namespaces, or a field in a job's metadata. In the previous example, this could be used to assign a higher weighting to Department B's job, because that department has used fewer cluster resources than Department A. The next time cluster resources become available, Department B's job would run first, all else being equal.
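As a rough illustration of the "who has used cluster resources" idea in item 3, the sketch below groups usage by a configurable key, either the namespace or a metadata field. Everything here is hypothetical: the `Alloc` type, the `TenantKey` function type, and the choice to measure usage as the CPU MHz + memory MB of currently running allocations are assumptions for illustration, not a proposed Nomad API.

```go
package main

import "fmt"

// Alloc is a hypothetical view of a running allocation.
type Alloc struct {
	Namespace string
	Meta      map[string]string
	CPU       int // MHz
	MemoryMB  int
}

// TenantKey maps an allocation to the tenant it is charged to.
type TenantKey func(Alloc) string

// ByNamespace charges usage to the allocation's namespace.
func ByNamespace(a Alloc) string { return a.Namespace }

// ByMeta charges usage to the value of a job metadata field.
func ByMeta(field string) TenantKey {
	return func(a Alloc) string { return a.Meta[field] }
}

// UsageByTenant sums the rough size (CPU MHz + memory MB) per tenant
// and returns the per-tenant totals along with the cluster-wide total.
func UsageByTenant(allocs []Alloc, key TenantKey) (map[string]int, int) {
	usage := make(map[string]int)
	total := 0
	for _, a := range allocs {
		size := a.CPU + a.MemoryMB
		usage[key(a)] += size
		total += size
	}
	return usage, total
}

func main() {
	allocs := []Alloc{
		{Namespace: "default", Meta: map[string]string{"dept": "a"}, CPU: 1000, MemoryMB: 1000},
		{Namespace: "default", Meta: map[string]string{"dept": "a"}, CPU: 1000, MemoryMB: 1000},
	}
	usage, total := UsageByTenant(allocs, ByMeta("dept"))
	// A tenant with nothing running simply reads as zero usage.
	fmt.Println(usage["a"], usage["b"], total)
}
```

Passing the key as a function keeps the aggregation the same whether tenants are identified by namespace or by a metadata field.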

The various weights would all be combined into a single final adjusted priority. The calculation will look something like:

    AdjustedPriority = BasePriority +
                       (WorkloadAge / MaxAge * AgeWeight) +
                       ((1 - TenantUsage / TotalUsage) * UsageWeight) +
                       ((1 - JobSize / MaxSize) * SizeWeight)

Example:

Going back to the previous Dept. A and B scenario, let's say the cluster is at capacity, every weight is configured as 10, and both departments submit a job of priority 10 at the same time. However, Dept. A has been running batch jobs all day on the cluster, so it accounts for 100% of the utilization. Dept. A's job would not see any increased priority from the utilization factor, (1 - 100 / 100) * 10 = 0, while Dept. B's job would see (1 - 0 / 100) * 10 = +10 priority. However, let's say that Dept. A also has a job in the queue that has been there for the MaxAge of 24 hours. Then even though Dept. A has used a lot of the cluster, that job still gets 24 / 24 * 10 = +10 added to its priority.
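The formula and the example above can be sketched in Go as follows. This is only an illustration under stated assumptions: the `Inputs` and `Weights` types and the `AdjustedPriority` function are hypothetical names, ratios are clamped to the range 0 to 1, and a zero denominator is treated as a ratio of 0. None of this is specified by the RFC. The example does not mention job size, so the sketch sets JobSize equal to MaxSize to zero out the size term.

```go
package main

import "fmt"

// Weights mirrors AgeWeight, UsageWeight and SizeWeight from the formula.
type Weights struct {
	Age, Usage, Size float64
}

// Inputs holds the per-job values the formula reads. Units are assumptions:
// ages in hours, usage and size in any consistent unit.
type Inputs struct {
	BasePriority            float64
	WorkloadAge, MaxAge     float64
	TenantUsage, TotalUsage float64
	JobSize, MaxSize        float64
}

// ratio returns n/d clamped to [0, 1]; a non-positive denominator yields 0.
// Clamping is an assumption of this sketch, not something the RFC specifies.
func ratio(n, d float64) float64 {
	if d <= 0 {
		return 0
	}
	r := n / d
	if r < 0 {
		return 0
	}
	if r > 1 {
		return 1
	}
	return r
}

// AdjustedPriority combines the base priority with the three weighted terms.
func AdjustedPriority(in Inputs, w Weights) float64 {
	age := ratio(in.WorkloadAge, in.MaxAge) * w.Age
	usage := (1 - ratio(in.TenantUsage, in.TotalUsage)) * w.Usage
	size := (1 - ratio(in.JobSize, in.MaxSize)) * w.Size
	return in.BasePriority + age + usage + size
}

func main() {
	w := Weights{Age: 10, Usage: 10, Size: 10}
	// JobSize == MaxSize zeroes the size term so only age and usage differ.
	deptANew := Inputs{BasePriority: 10, MaxAge: 24, TenantUsage: 100, TotalUsage: 100, JobSize: 1, MaxSize: 1}
	deptBNew := Inputs{BasePriority: 10, MaxAge: 24, TenantUsage: 0, TotalUsage: 100, JobSize: 1, MaxSize: 1}
	deptAOld := Inputs{BasePriority: 10, WorkloadAge: 24, MaxAge: 24, TenantUsage: 100, TotalUsage: 100, JobSize: 1, MaxSize: 1}

	fmt.Println(AdjustedPriority(deptANew, w)) // 10
	fmt.Println(AdjustedPriority(deptBNew, w)) // 20
	fmt.Println(AdjustedPriority(deptAOld, w)) // 20
}
```

With these numbers Dept. B's new job and Dept. A's 24-hour-old job both land on 20, so how ties are broken between equal adjusted priorities is still an open detail.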

Current caveats/limitations

The dynamic priority queue serializes the scheduling of batch jobs that are configured to use it. It only affects batch job registrations; other cluster events are still processed concurrently. Because of the serial scheduling, all the jobs in the queue should have the same, or very similar, constraints. Otherwise Nomad may wait for cluster capacity for the job at the head of the queue while a job behind it, with different constraints, could have been scheduled much earlier.
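One way to read this caveat is as head-of-line blocking. The sketch below is a toy model, not Nomad's scheduler: the `Job` type, the `NodeClass` constraint, and `scheduleSerially` are hypothetical, and capacity is reduced to a single CPU number per node class. It walks the queue in priority order and stops at the first job that does not fit.

```go
package main

import "fmt"

// Job is a hypothetical queued batch job with a single placement constraint.
type Job struct {
	ID        string
	NodeClass string // hypothetical constraint, e.g. "gpu" or "cpu"
	CPU       int    // MHz requested
}

// scheduleSerially places jobs in queue order and stops at the first job that
// does not fit, leaving it and everything behind it blocked.
// It deducts placed jobs from the free map it is given.
func scheduleSerially(queue []Job, free map[string]int) (placed, blocked []string) {
	for i, j := range queue {
		if free[j.NodeClass] < j.CPU {
			for _, rest := range queue[i:] {
				blocked = append(blocked, rest.ID)
			}
			return placed, blocked
		}
		free[j.NodeClass] -= j.CPU
		placed = append(placed, j.ID)
	}
	return placed, blocked
}

func main() {
	// The highest-priority job needs GPU capacity that is not available yet.
	queue := []Job{
		{ID: "gpu-train", NodeClass: "gpu", CPU: 4000},
		{ID: "cpu-report", NodeClass: "cpu", CPU: 500},
	}
	free := map[string]int{"gpu": 0, "cpu": 2000}

	placed, blocked := scheduleSerially(queue, free)
	fmt.Println("placed:", placed)
	fmt.Println("blocked:", blocked)
}
```

In this run `cpu-report` stays blocked behind `gpu-train` even though CPU-class capacity is free, which is why mixing very different constraints in one queue is discouraged.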

Use-cases

We think this could be useful for Nomad users who regularly launch more batch jobs than their cluster is capable of running, and who would like more advanced controls over which of those pending jobs is scheduled next.
