
REP: Topology Aware Scheduling #66

Open: aaronscalene wants to merge 12 commits into ray-project:main from aaronscalene:main



@aaronscalene


This REP proposes topology-aware scheduling for placement groups. This enables functionality such as rack-aware placement group scheduling and rack-aware fault tolerance. It is motivated by new GB300 racks, which are connected through the NVLink domain and need topology-aware scheduling to take advantage of the faster memory bandwidth.

Requested reviewers: @edoakes @MengjinYan @Sparks0219

Comment thread reps/2026-06-18-topology-strategy/2026-05-18-topology-strategy.md Outdated
Suppose one node within this placement fails. Because the placement group is not completely unplaced, as it was in the previous example, we will still try to reschedule the unplaced bundle, to no avail, and the placement group is marked as infeasible.
![](image5.png)

### Spread Across Availability Zone and Within Rack (Possible Future Steps)

@andrewsykim May 22, 2026

Member

For AZ topology scheduling, I think the two main scenarios are something like:

  • For training: STRICT_PACK in a single AZ to avoid cross-AZ network transfer (which incurs additional cost)
  • For inference: STRICT_SPREAD across AZs for better fault tolerance

The inference case has a bit more nuance in that if you're doing multi-host serving, you want each group of nodes to be STRICT_PACK but each "replica" of the model to be STRICT_SPREAD. This is maybe irrelevant if we always run a placement group per model replica?
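The multi-host inference case described above can be sketched in plain Python, without calling Ray. This is only an illustration under stated assumptions: `place_replicas`, `nodes_by_az`, and the AZ names are hypothetical and are not part of the REP or of Ray's API. Each replica is packed into a single AZ, and no two replicas share an AZ.

```python
from typing import Dict, List, Optional


def place_replicas(
    nodes_by_az: Dict[str, List[str]],
    num_replicas: int,
    hosts_per_replica: int,
) -> Optional[Dict[int, List[str]]]:
    """Pack each replica into one AZ and spread replicas across distinct AZs.

    Hypothetical helper: returns {replica_index: [node_id, ...]},
    or None when the request is infeasible.
    """
    # STRICT_PACK within a replica: only AZs with enough nodes qualify.
    candidates = sorted(
        az for az, nodes in nodes_by_az.items() if len(nodes) >= hosts_per_replica
    )
    # STRICT_SPREAD across replicas: every replica needs its own AZ.
    if len(candidates) < num_replicas:
        return None
    return {
        index: nodes_by_az[az][:hosts_per_replica]
        for index, az in enumerate(candidates[:num_replicas])
    }


# Hypothetical cluster: AZ name -> free node ids.
cluster = {
    "us-west-2a": ["a1", "a2"],
    "us-west-2b": ["b1", "b2", "b3"],
    "us-west-2c": ["c1"],
}
two_replicas = place_replicas(cluster, num_replicas=2, hosts_per_replica=2)
three_replicas = place_replicas(cluster, num_replicas=3, hosts_per_replica=2)
```

With one placement group per model replica, the inner constraint would apply inside each placement group and the outer one across placement groups, which a single placement group cannot express today.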

Member

Either way, we should add the first scenario (strict pack in single AZ) as a use-case as it comes up a lot

@aaronscalene May 28, 2026

Author

Sounds good, I added an example of a Strict Pack AZ + Strict Pack Rack hierarchical topology. For your inference case, I am not entirely sure about the model replica implementation, but I believe you could define a group of bundles for each replica and use hierarchical scheduling to STRICT_SPREAD each group of bundles across AZs. You could then probably target each group of bundles using their bundle IDs (we should probably discuss this more).

Comment thread reps/2026-06-18-topology-strategy/2026-05-18-topology-strategy.md
```
ray start --head --labels="rack_id=1"
ray start --labels="rack_id=1" # rack 1 nodes
ray start --labels="rack_id=2" # rack 2 nodes
```

Member

Worth noting that in KubeRay we started to add better metadata for multi-host scenarios. Specifically, if you set replicas > 0 and numOfHosts > 1, we set the following labels on every Pod:

ray.io/worker-group-replica-index=<replica index>
ray.io/replica-host-index=<host-index>

So in the NVL72 case, you set numOfHosts=18 and each "replica" is a rack. Then rack_id can just inherit whatever value is used for ray.io/worker-group-replica-index. This is a QoL improvement to the cluster operator because they don't need to manually set unique labels for every rack.
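As a rough sketch of that inheritance, the helper below derives a `rack_id` label argument from a Pod's labels. The function name `rack_label_arg` is hypothetical, and the output simply mirrors the `--labels="rack_id=..."` form used in the REP's `ray start` example. Nothing here is existing KubeRay or Ray behavior.

```python
from typing import Dict, Optional

# Pod label named in the comment above.
REPLICA_INDEX_LABEL = "ray.io/worker-group-replica-index"


def rack_label_arg(pod_labels: Dict[str, str]) -> Optional[str]:
    """Build a `ray start` label argument whose rack_id is the replica index.

    Hypothetical helper: returns None when the Pod carries no replica index,
    for example when the worker group is not multi-host.
    """
    replica_index = pod_labels.get(REPLICA_INDEX_LABEL)
    if replica_index is None:
        return None
    return f'--labels="rack_id={replica_index}"'


# Example Pod labels for host 7 of replica 3.
pod_labels = {
    "ray.io/worker-group-replica-index": "3",
    "ray.io/replica-host-index": "7",
}
arg = rack_label_arg(pod_labels)
```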

Member

@ryanaoleary do you know if it's possible now to set Ray labels based on Pod labels?

For example something like:

  workerGroupSpecs:
  - replicas: 1
    minReplicas: 1
    maxReplicas: 10
    numOfHosts: 18
    groupName: gb200
    labels:
      rack-id: ${WORKER_GROUP_REPLICA_INDEX}
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:nightly
          resources:
            limits:
              cpu: "2"
              memory: "4Gi"
            requests:
              cpu: "2"
              memory: "4Gi"
          env:
          - name: WORKER_GROUP_REPLICA_INDEX
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['worker-group-replica-index']

Contributor

The above example fails with:

Warning  InvalidRayClusterSpec  13s   raycluster-controller  The RayCluster spec is invalid default/raycluster-manual-label-test: invalid label value for key 'rack-id' in gb200 group: '${WORKER_GROUP_REPLICA_INDEX}', error: a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue',  or 'my_value',  or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')

I think it's not currently supported. If we wanted the label to be set by default when the env var is present, we could add it here: https://github.com/ray-project/ray/blob/7ddc3faa761ab533eaff081be4db7dcea683ea56/python/ray/_private/resource_and_label_spec.py#L271. We do currently set a label based on the TPU_NAME var, only for TPU, that serves a similar function (unique per multi-host replica): https://github.com/ray-project/ray/blob/7ddc3faa761ab533eaff081be4db7dcea683ea56/python/ray/_private/accelerators/tpu.py#L750.

I think we should set Ray node labels by default for k8s labels on the pod with the prefix ray.io/. I'll put out this change since we previously talked about adding this support.
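The validation failure quoted above can be reproduced locally with the regular expression from the error message. This is a minimal sketch: `is_valid_label_value` is a hypothetical name, and it checks only the pattern, not any length limit the controller may also enforce.

```python
import re

# Pattern copied from the InvalidRayClusterSpec error message above.
LABEL_VALUE_RE = re.compile(r"(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?")


def is_valid_label_value(value: str) -> bool:
    """Return True if the whole string matches the label-value pattern."""
    return LABEL_VALUE_RE.fullmatch(value) is not None


# The unexpanded placeholder contains '$', '{' and '}', so it is rejected.
placeholder_ok = is_valid_label_value("${WORKER_GROUP_REPLICA_INDEX}")
# A rendered replica index such as "0" would pass.
rendered_ok = is_valid_label_value("0")
```

The check sees the literal placeholder string, so it fails before any environment variable could be substituted.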

Member

Is that only failing due to validation though? If we skip validation does the env var render correctly at runtime?


In the first strategy, both groups of bundles have to be in the same availability zone. However, these groups of bundles can be on the same / different racks. The bundles themselves within each group must be spread on different nodes of the rack. In the second strategy, both groups of bundles have to be in the same availability zone AND same rack. However, these groups of bundles have to then be spread across different nodes of that rack.
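The difference between the two strategies can be checked with a small sketch in plain Python. It assumes that STRICT_PACK on a label key means every bundle lands on nodes sharing one value for that key, and that STRICT_SPREAD means every bundle sees a distinct value. The `satisfies` helper, the `az` label key, and the node names are hypothetical. This is not the REP's scheduler.

```python
from typing import Dict, List

NodeLabels = Dict[str, Dict[str, str]]


def satisfies(strategy: Dict[str, str], nodes: List[str], labels: NodeLabels) -> bool:
    """Check a placement (one node id per bundle) against a label strategy."""
    for key, mode in strategy.items():
        values = [labels[node][key] for node in nodes]
        distinct = len(set(values))
        if mode == "STRICT_PACK" and distinct > 1:
            return False
        if mode == "STRICT_SPREAD" and distinct != len(values):
            return False
    return True


# Hypothetical cluster: one AZ, two racks, two nodes per rack.
labels = {
    "n1": {"az": "az-1", "rack_id": "1", "ray.io/node-id": "n1"},
    "n2": {"az": "az-1", "rack_id": "1", "ray.io/node-id": "n2"},
    "n3": {"az": "az-1", "rack_id": "2", "ray.io/node-id": "n3"},
    "n4": {"az": "az-1", "rack_id": "2", "ray.io/node-id": "n4"},
}
group_a, group_b = ["n1", "n2"], ["n3", "n4"]
within_group = {"rack_id": "STRICT_PACK", "ray.io/node-id": "STRICT_SPREAD"}

# First strategy: same AZ overall; each group packed in a rack, spread on nodes.
first_ok = satisfies({"az": "STRICT_PACK"}, group_a + group_b, labels) and all(
    satisfies(within_group, group, labels) for group in (group_a, group_b)
)
# Second strategy: same AZ and same rack overall, spread across nodes.
second_ok = satisfies(
    {"az": "STRICT_PACK", "rack_id": "STRICT_PACK", "ray.io/node-id": "STRICT_SPREAD"},
    group_a + group_b,
    labels,
)
```

This placement puts the two groups on different racks, so it is valid under the first strategy but not under the second.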

## Compatibility, Deprecation, and Migration Plan

Contributor

Can we add a section discussing how autoscaler support will be implemented? I'm interested in how we plan to:

  1. Set default labels for multi-host / topology aware groups on the Ray node and in the autoscaling config.
  2. Integrate the label domain key into the autoscaler - will this be supported the same as bundle_label_selector? (i.e. after a bundle_label_selector has been chosen with a value that's the same across all bundles, does it just follow the same path as though a bundle_label_selector was provided explicitly?)
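One reading of the question in item 2 is sketched below: once the scheduler has picked a single value for the label domain key, that choice is rewritten as an explicit per-bundle selector. This is purely an assumption for discussion. `to_bundle_label_selector` is a hypothetical helper, and the REP does not state that the autoscaler works this way.

```python
from typing import Dict, List, Optional


def to_bundle_label_selector(
    num_bundles: int,
    domain_key: str,
    chosen_value: str,
    existing: Optional[List[Dict[str, str]]] = None,
) -> List[Dict[str, str]]:
    """Pin every bundle to the chosen domain value, keeping existing selectors."""
    selectors = existing if existing is not None else [{} for _ in range(num_bundles)]
    if len(selectors) != num_bundles:
        raise ValueError("expected one selector per bundle")
    # Copy each selector so the caller's input is not mutated.
    return [{**selector, domain_key: chosen_value} for selector in selectors]


# Four bundles packed on the rack whose rack_id label is "1".
selector = to_bundle_label_selector(4, "rack_id", "1")
```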

Unrelated but I'm also wondering if a future step will include supporting topology_strategy in the fallback_strategy argument.

```py
pg = ray.util.placement_group(
    bundles=[{"CPU": 2}] * 4,
    topology_strategy=[{"ray.io/node-id": "STRICT_SPREAD", "rack_id": "STRICT_PACK"}],
```

Contributor

This is cool. I think these semantics would support multi-host TPU scheduling well, i.e., strict pack within ray.io/tpu-slice-name and strict spread on ray.io/node-id or ray.io/tpu-worker-id.

However, one case I'm curious about is whether multi-slice semantics are supported here. This is where we want the placement group to span multiple TPU slices, each slice should be scheduled atomically (i.e. the PG claims all the Ray nodes assigned to the co-located TPU Pods). We can't do ray.io/tpu-slice-name STRICT_PACK because the name for each slice is different. This is a common use case for TPU training and potentially inference.

This is currently supported through the SlicePlacementGroup API by appending the ray.io/tpu-slice-name and related label selectors for each slice (after determining the labels associated with the free slices) to the bundle_label_selector that's used to create the worker PG spanning all the slices, i.e.:
bundle_label_selector: {"ray.io/tpu-slice-name": "X", ..., "ray.io/tpu-slice-name": "Y"}

Does this API provide a way to do that? It seems similar to the Spread Across Availability Zone and Within Rack section but I want to make sure it'd work.
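To make the multi-slice case concrete, here is a minimal sketch of how such selectors could be laid out for a placement group spanning several slices. It assumes one bundle per host and one selector dict per bundle. `multi_slice_selectors` is a hypothetical helper and does not reproduce the SlicePlacementGroup implementation.

```python
from typing import Dict, List

SLICE_LABEL = "ray.io/tpu-slice-name"


def multi_slice_selectors(slice_names: List[str], hosts_per_slice: int) -> List[Dict[str, str]]:
    """Return one label selector per bundle, grouping bundles by slice."""
    return [
        {SLICE_LABEL: name}
        for name in slice_names
        for _ in range(hosts_per_slice)
    ]


# Two free slices named "X" and "Y", each with two hosts.
selectors = multi_slice_selectors(["X", "Y"], hosts_per_slice=2)
```

Each slice name has to be known before the placement group is created, which is why a single STRICT_PACK on `ray.io/tpu-slice-name` cannot express this.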

Author

I believe this would be supported once we extend functionality to hierarchical scheduling using STRICT_SPREAD on ray.io/tpu-slice-name for each group of bundles. Thus, each group of bundles will be STRICT_SPREAD across tpu slices, and the user can define what scheduling strategy is used within each tpu slice. A thing to note here is that a conversation would need to be had on the exact fault tolerance functionality:

Let us say we have spread groups of bundles across tpu slices. What should we do when one bundle of a group goes down? What should we do when a group of bundles goes down since its tpu slice goes down?

I am leaning towards similar fault tolerance functionality as STRICT_PACK, where if one bundle of a group goes down, we will still keep that group on that tpu slice and try to reschedule that. If the entire group of bundles goes down on some tpu slice, we will probably want to clear the assignment for that tpu slice and move that group of bundles onto another tpu slice.
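The policy I am leaning towards could be sketched as the decision function below. The names `Recovery` and `recovery_for_group` are hypothetical, and the exact behavior is still open for discussion.

```python
from enum import Enum


class Recovery(Enum):
    NOTHING = "nothing"
    RESCHEDULE_ON_SAME_SLICE = "reschedule_on_same_slice"
    MOVE_GROUP_TO_NEW_SLICE = "move_group_to_new_slice"


def recovery_for_group(group_size: int, failed_bundles: int) -> Recovery:
    """Pick a recovery action for one group of bundles on one TPU slice."""
    if not 0 <= failed_bundles <= group_size:
        raise ValueError("failed_bundles must be between 0 and group_size")
    if failed_bundles == 0:
        return Recovery.NOTHING
    if failed_bundles < group_size:
        # Partial failure: keep the slice assignment and retry in place.
        return Recovery.RESCHEDULE_ON_SAME_SLICE
    # Whole group lost: clear the assignment and pick another slice.
    return Recovery.MOVE_GROUP_TO_NEW_SLICE


partial = recovery_for_group(group_size=4, failed_bundles=1)
total = recovery_for_group(group_size=4, failed_bundles=4)
```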

```py
pg = ray.util.placement_group(
    bundles=[{"CPU": 2, "GPU": 4}] * 16,
    # NEW FIELD
    topology_strategy=[{"ray.io/node-id": "STRICT_PACK", "rack_id": "STRICT_PACK"}],
```


This would likely be easier to use if we put the topology strategy itself in the bundle then have a way to have topology labels for a group of bundles.

Signed-off-by: aaronscalene <aaron.li@anyscale.com>

6 participants