spread configuration block at the cluster level #27879

@lukepalmer

Description

Proposal

Today, Nomad jobs can specify a spread block so that placement of allocations within a job is influenced by a spread score that is calculated in a configurable way.

I propose that there should additionally be a way to specify spread at the cluster level. When specified, this cluster spread would provide an additional score that is combined with others when determining placement. The format and logic of the job spec spread block could be reused exactly.
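For reference, the existing job-spec spread block that I'd like to reuse looks roughly like this (this is the documented job-spec syntax, shown here with the east/west example from below):

```hcl
job "example" {
  datacenters = ["east", "west"]

  # Existing job-level spread: score placements so that this job's
  # allocations are split evenly across the two datacenters.
  spread {
    attribute = "${node.datacenter}"
    weight    = 100

    target "east" {
      percent = 50
    }

    target "west" {
      percent = 50
    }
  }

  # ... groups, tasks, etc.
}
```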

I am happy to submit a PR for this if maintainers like the idea.

Use-cases

I will describe a problem case with datacenter spread, but this could apply to any criteria representable in spread.

Suppose I have a Nomad cluster with datacenters 'east' and 'west' that are about the same size and begin empty. We schedule some things:

  1. Jobs are scheduled that specify datacenters = [east]. They are of course scheduled only on east.
  2. Jobs are scheduled that specify datacenters = [east, west]. Bin packing draws these jobs into east to the extent that space is available on partially packed hosts there. Beyond that, placement is random-ish on available nodes in east and west.

This leaves the utilization of east and west unbalanced. Suppose I now have more jobs that specify datacenters = [east]: the imbalance will get even worse.

Suppose instead that I can use a cluster-level spread block to specify that I would like the utilization of east and west to each be about 50%. In the second step in my scenario, when jobs specifying datacenters = [east, west] are placed, the scheduler incorporates a spread score that initially prefers 'west' in order to balance the datacenters in the way I've specified.
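The cluster-level spread for this scenario could be a block with exactly the job-spec format, just lifted out of any particular job (a sketch; where this block would actually live is discussed under Unknowns):

```hcl
# Hypothetical cluster-level spread block. The syntax is the existing
# job-spec spread format; only its location (cluster config, not a job)
# is new. Targets east and west at ~50% utilization each.
spread {
  attribute = "${node.datacenter}"
  weight    = 100

  target "east" {
    percent = 50
  }

  target "west" {
    percent = 50
  }
}
```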

Attempted Solutions

Affinities

I can do this badly with affinities. I can set node metadata 'node_priority' of 0, 1, 2, or 3 on all of my hosts, with those priorities evenly distributed within each of my datacenters, and then add an affinity to every job that looks like:

        "Affinities": [
          {
            "LTarget": "${meta.node_priority}",
            "Operand": ">=",
            "RTarget": "1",
            "Weight": 40
          },
          {
            "LTarget": "${meta.node_priority}",
            "Operand": ">=",
            "RTarget": "2",
            "Weight": 40
          },
          {
            "LTarget": "${meta.node_priority}",
            "Operand": ">=",
            "RTarget": "3",
            "Weight": 40
          }
        ]

This is better than nothing but does not work very well. Problems include:

  • The buckets are coarse. Here I have 4 buckets so this doesn't do anything at all until imbalance exceeds 25%. I could make more buckets but...
  • Affinity scores get diluted by the sum of affinity weights. Suppose I now want to add an unrelated affinity for a good reason: it has to compete with the affinities I am using as a balancing hack, and this gets worse if I add more balancing buckets.
  • It's weird to insert this into every job when I am trying to express a cluster-level concept.

Just-in-time job spec

I could make this work by monitoring the state of my cluster and modifying job specs on their way into Nomad. I could look at metrics to detect an imbalance, and then (in my scenario) rewrite a job spec that would run in either datacenter to say datacenters = [west] in order to resolve the imbalance.

Things that aren't good:

  • This is open-loop control. You can make it do bad things by, for example, submitting jobs faster than the metrics can keep up.
  • I'm writing logic outside the scheduler that seems to belong in the scheduler.

Note that I haven't actually tried this because I really don't like it.

Unknowns

Where could configuration for such a thing live? It seems like it could be part of scheduler-level configuration. This makes cluster-level spread logically owned by the cluster administrator rather than by owners of jobs.
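If it lived in scheduler configuration, one hypothetical shape would be nesting it in the agent's server block (a sketch: default_scheduler_config is real agent syntax today, but the nested spread block is not and is purely illustrative):

```hcl
server {
  enabled = true

  # default_scheduler_config exists in Nomad's server block today;
  # a nested spread block does not. This only sketches where a
  # cluster-level spread could live.
  default_scheduler_config {
    spread {
      attribute = "${node.datacenter}"
      weight    = 100

      target "east" {
        percent = 50
      }

      target "west" {
        percent = 50
      }
    }
  }
}
```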

What happens if there is a spread block in the job spec and also a cluster-level spread? Do we treat those as separate scores that get averaged together with the rest of the scores, or does the presence of a job-spec spread disable the cluster spread entirely? The former seems preferable, but it's not clear.
