REP: Topology Aware Scheduling #66
| Let us say one node within this placement fails. Since our placement group is not completely unplaced as in the previous example, we will still try to reschedule the unplaced bundle, to no avail, marking this placement group as infeasible. | ||
| ### Spread Across Availability Zone and Within Rack (Possible Future Steps) |
For AZ topology scheduling, I think the two main scenarios are something like:
- For training, STRICT_PACK in single AZ to avoid cross AZ network transfer (incurs additional cost)
- For inference: STRICT_SPREAD across AZs for better fault tolerance
The inference case has a bit more nuance in that if you're doing multi-host serving, you want each group of nodes to be STRICT_PACK but each "replica" of the model to be STRICT_SPREAD. This is maybe irrelevant if we always run a placement group per model replica?
Either way, we should add the first scenario (strict pack in single AZ) as a use-case as it comes up a lot
Sounds good, I added an example of Strict Pack AZ + Strict Pack Rack hierarchical topology. For your inference case, I am not entirely sure about the model replica implementation, but I believe you could define a group of bundles for each replica and use hierarchical scheduling to STRICT_SPREAD each group of bundles across AZs. You could then probably target each group of bundles using their bundle ids (we should probably discuss this more).
| ``` | ||
| ray start --head --labels="rack_id=1" | ||
| ray start --labels="rack_id=1" # rack 1 nodes | ||
| ray start --labels="rack_id=2" # rack 2 nodes |
Worth noting that in KubeRay we started to add better metadata for multi-host scenarios. Specifically, if you set replicas > 0 and numOfHosts > 1, we set the following labels on every Pod:
ray.io/worker-group-replica-index=<replica index>
ray.io/replica-host-index=<host-index>
So in the NVL72 case, you set numOfHosts=18 and each "replica" is a rack. Then rack_id can just inherit whatever value is used for ray.io/worker-group-replica-index. This is a QoL improvement for the cluster operator because they don't need to manually set unique labels for every rack.
@ryanaoleary do you know if it's possible now to set Ray labels based on Pod labels?
For example something like:
```yaml
workerGroupSpecs:
- replicas: 1
  minReplicas: 1
  maxReplicas: 10
  numOfHosts: 18
  groupName: gb200
  labels:
    rack-id: ${WORKER_GROUP_REPLICA_INDEX}
  rayStartParams: {}
  template:
    spec:
      containers:
      - name: ray-worker
        image: rayproject/ray:nightly
        resources:
          limits:
            cpu: "2"
            memory: "4Gi"
          requests:
            cpu: "2"
            memory: "4Gi"
        env:
        - name: WORKER_GROUP_REPLICA_INDEX
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['worker-group-replica-index']
```
The above example fails with:
Warning InvalidRayClusterSpec 13s raycluster-controller The RayCluster spec is invalid default/raycluster-manual-label-test: invalid label value for key 'rack-id' in gb200 group: '${WORKER_GROUP_REPLICA_INDEX}', error: a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue', or 'my_value', or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')
I think it's not currently supported. If we wanted the label to be set by default when the env var is present, we could add it here: https://github.com/ray-project/ray/blob/7ddc3faa761ab533eaff081be4db7dcea683ea56/python/ray/_private/resource_and_label_spec.py#L271. We do currently set a label based on the TPU_NAME var, for TPU only, that serves a similar function (unique per multi-host replica): https://github.com/ray-project/ray/blob/7ddc3faa761ab533eaff081be4db7dcea683ea56/python/ray/_private/accelerators/tpu.py#L750.
I think we should set Ray node labels by default for k8s labels on the pod with the prefix ray.io/. I'll put out this change since we previously talked about adding this support.
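A minimal sketch of that default, assuming the rule is simply "copy every Pod label whose key starts with ray.io/". The helper name `ray_labels_from_pod_labels` and the sample Pod labels are hypothetical and not part of Ray or KubeRay; the actual change may apply additional filtering or validation.

```python
from typing import Dict

# Prefix named in the comment above; everything else here is illustrative.
RAY_LABEL_PREFIX = "ray.io/"


def ray_labels_from_pod_labels(pod_labels: Dict[str, str]) -> Dict[str, str]:
    """Return the Pod labels that would be mirrored as Ray node labels."""
    return {
        key: value
        for key, value in pod_labels.items()
        if key.startswith(RAY_LABEL_PREFIX)
    }


# Hypothetical labels on a multi-host worker Pod.
pod_labels = {
    "ray.io/worker-group-replica-index": "3",
    "ray.io/replica-host-index": "7",
    "app.kubernetes.io/name": "kuberay",
}
node_labels = ray_labels_from_pod_labels(pod_labels)
```

With this rule, only the two ray.io/ labels are kept, so a rack-level label could reuse the replica index without the operator setting it by hand.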
Is that only failing due to validation though? If we skip validation does the env var render correctly at runtime?
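The validation failure quoted above can be reproduced locally with the regex from the error message. This is only a sketch of the value check, using a hypothetical helper `is_valid_label_value`; it says nothing about whether the env var would render at runtime if validation were skipped.

```python
import re

# Regex copied verbatim from the controller's error message.
LABEL_VALUE_RE = re.compile(r"(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?")


def is_valid_label_value(value: str) -> bool:
    """Check a label value against the quoted regex (whole-string match)."""
    return LABEL_VALUE_RE.fullmatch(value) is not None


# The templated value is rejected because '$', '{' and '}' are not allowed.
checks = {
    value: is_valid_label_value(value)
    for value in ["${WORKER_GROUP_REPLICA_INDEX}", "3", "my_value", ""]
}
```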
| In the first strategy, both groups of bundles have to be in the same availability zone. However, these groups of bundles can be on the same / different racks. The bundles themselves within each group must be spread on different nodes of the rack. In the second strategy, both groups of bundles have to be in the same availability zone AND same rack. However, these groups of bundles have to then be spread across different nodes of that rack. | ||
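To make the per-label semantics concrete, here is a minimal sketch of checking one group of bundles against a strategy. It assumes STRICT_PACK means every bundle lands on nodes sharing one value for the label, and STRICT_SPREAD means every bundle lands on a distinct value. The `satisfies` helper and the `az` label key are hypothetical, and this is not the scheduler's implementation.

```python
from typing import Dict, List

NodeLabels = Dict[str, str]


def satisfies(strategy: Dict[str, str], placement: List[NodeLabels]) -> bool:
    """Return True if the nodes chosen for each bundle satisfy every label rule.

    `placement[i]` holds the labels of the node that bundle i was assigned to.
    """
    for key, mode in strategy.items():
        values = [labels[key] for labels in placement]
        distinct = len(set(values))
        if mode == "STRICT_PACK":
            if distinct != 1:
                return False
        elif mode == "STRICT_SPREAD":
            if distinct != len(values):
                return False
        else:
            raise ValueError(f"unknown strategy {mode!r} for label {key!r}")
    return True


# Hypothetical nodes: two in rack 1, one in rack 2, all in the same AZ.
n1 = {"az": "a", "rack_id": "1", "ray.io/node-id": "n1"}
n2 = {"az": "a", "rack_id": "1", "ray.io/node-id": "n2"}
n3 = {"az": "a", "rack_id": "2", "ray.io/node-id": "n3"}

# Second strategy from the text: same AZ, same rack, different nodes.
strategy = {
    "az": "STRICT_PACK",
    "rack_id": "STRICT_PACK",
    "ray.io/node-id": "STRICT_SPREAD",
}

same_rack_ok = satisfies(strategy, [n1, n2])
cross_rack_ok = satisfies(strategy, [n1, n3])
same_node_ok = satisfies(strategy, [n1, n1])
```

Only the first placement passes: the second crosses racks and the third puts two bundles on one node.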
| ## Compatibility, Deprecation, and Migration Plan |
Can we add a section discussing how autoscaler support will be implemented? I'm interested in how we plan to:
- Set default labels for multi-host / topology aware groups on the Ray node and in the autoscaling config.
- Integrate the label domain key into the autoscaler - will this be supported the same as bundle_label_selector? (i.e. after a bundle_label_selector has been chosen with a value that's the same across all bundles, does it just follow the same path as though a bundle_label_selector was provided explicitly?)
Unrelated but I'm also wondering if a future step will include supporting topology_strategy in the fallback_strategy argument.
| ```py | ||
| pg = ray.util.placement_group( | ||
| bundles = [{"CPU": 2}] * 4, | ||
| topology_strategy = [{"ray.io/node-id" : "STRICT_SPREAD", "rack_id" : "STRICT_PACK"}], |
This is cool, I think these semantics would support multi-host TPU scheduling well, i.e.:
strict pack within ray.io/tpu-slice-name, strict spread across ray.io/node-id or ray.io/tpu-worker-id.
However, one case I'm curious about is whether multi-slice semantics are supported here. This is where we want the placement group to span multiple TPU slices, with each slice scheduled atomically (i.e. the PG claims all the Ray nodes assigned to the co-located TPU Pods). We can't do ray.io/tpu-slice-name STRICT_PACK because the name for each slice is different. This is a common use case for TPU training and potentially inference.
This is currently supported through the SlicePlacementGroup API by appending the ray.io/tpu-slice-name and related label selectors for each slice (after determining the labels associated with the free slices) to the bundle_label_selector that's used to create the worker PG spanning all the slices, i.e.:
bundle_label_selector: {"ray.io/tpu-slice-name": "X", ..., "ray.io/tpu-slice-name": "Y"}
Does this API provide a way to do that? It seems similar to the Spread Across Availability Zone and Within Rack section but I want to make sure it'd work.
I believe this would be supported once we extend functionality to hierarchical scheduling, using STRICT_SPREAD on ray.io/tpu-slice-name for each group of bundles. Each group of bundles will then be STRICT_SPREAD across TPU slices, and the user can define what scheduling strategy is used within each TPU slice. Note that we would still need to discuss the exact fault tolerance behavior:
Let us say we have spread groups of bundles across TPU slices. What should we do when one bundle of a group goes down? What should we do when a whole group of bundles goes down because its TPU slice goes down?
I am leaning towards fault tolerance behavior similar to STRICT_PACK: if one bundle of a group goes down, we keep that group on its TPU slice and try to reschedule the bundle there. If the entire group of bundles on some TPU slice goes down, we will probably want to clear the assignment for that TPU slice and move that group of bundles onto another TPU slice.
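The policy described above could be sketched as a small decision function. This is only an illustration of the two cases under discussion, not an agreed design; the `Recovery` enum and `recovery_action` helper are hypothetical names.

```python
from enum import Enum
from typing import Set


class Recovery(Enum):
    NONE = "none"
    RESCHEDULE_ON_SAME_SLICE = "reschedule_on_same_slice"
    MOVE_GROUP_TO_NEW_SLICE = "move_group_to_new_slice"


def recovery_action(group_bundles: Set[int], failed_bundles: Set[int]) -> Recovery:
    """Pick a recovery action for one group of bundles pinned to one TPU slice."""
    failed_in_group = group_bundles & failed_bundles
    if not failed_in_group:
        # Nothing in this group failed.
        return Recovery.NONE
    if failed_in_group == group_bundles:
        # The whole group is down (e.g. the slice is gone): clear the slice
        # assignment and place the group on another slice.
        return Recovery.MOVE_GROUP_TO_NEW_SLICE
    # Partial failure: keep the group on its slice and retry the lost bundles.
    return Recovery.RESCHEDULE_ON_SAME_SLICE


group = {0, 1, 2, 3}  # bundle indices of one group
partial = recovery_action(group, {2})
total = recovery_action(group, {0, 1, 2, 3})
unaffected = recovery_action(group, {7})
```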
| pg = ray.util.placement_group( | ||
| bundles = [{"CPU": 2, "GPU": 4}] * 16, | ||
| # NEW FIELD | ||
| topology_strategy = [{"ray.io/node-id": "STRICT_PACK", "rack_id" : "STRICT_PACK"}], |
This would likely be easier to use if we put the topology strategy itself in the bundle and then had a way to set topology labels for a group of bundles.
This REP proposes topology aware scheduling for placement groups. This allows functionality such as rack aware placement group scheduling, which in turn enables rack aware fault tolerance. It is motivated by new GB300 racks that are connected through the NVLink domain, which necessitates topology aware scheduling to take advantage of faster memory bandwidth.
Requested reviewers: @edoakes @MengjinYan @Sparks0219