PodGroup / Workload API Integration with ComputeDomain #934

@johnbelamaric

Description

Today, the ComputeDomain CR is created by the user, and in turn it creates the necessary ResourceClaimTemplates. For workloads that need a ComputeDomain for a group of Pods (e.g., LWS, JobSet, TrainJob), this means there is no way to match the lifecycle of the ComputeDomain to the lifecycle of those groups of Pods (e.g., an LWS replica).

SIG Scheduling has been working on a PodGroup concept, which models a group of related Pods. The PodGroup provides a few things:

  • Gang scheduling policy for the pods in the group
  • Topology constraints for the pods in the group
  • Resource claims for resources that are shared across pods in a PodGroup
  • A single integration point for workload controllers that operate on groups of pods

One idea that I think we should explore is treating a ComputeDomain as a logical multi-node device that is allocatable via a ResourceClaim. This would allow us to tie the lifecycle of the ComputeDomain to the lifecycle of those PodGroups. This is a totally different flow from today's. For example, with LWS it would work something like this:

  • User creates:
    • ResourceClaimTemplate asking for a ComputeDomain (this would be a new DeviceClass). All it really needs is a name and the request for the DeviceClass.
    • Workload object, which contains a PodGroupTemplate that references the ResourceClaimTemplate
    • An LWS that references the Workload object / PodGroupTemplate
  • LWS controller creates a replica, which means it creates a PodGroup using the PodGroupTemplate
  • ResourceClaimController sees the PodGroup with the ResourceClaimTemplate, and uses it to create a ResourceClaim for the ComputeDomain
  • ComputeDomainController sees that and either creates a ComputeDomain CR and just follows today's flow, or reads the RC directly and bypasses the CR as it may no longer be needed
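The user-created ResourceClaimTemplate in the flow above could be as small as a name plus a request against the new ComputeDomain DeviceClass. A sketch, using the current DRA API shape; the DeviceClass name is an assumption here, since the actual class would be published by the ComputeDomain driver:

```yaml
# Sketch only: "compute-domain.example.com" is a placeholder
# for whatever DeviceClass the ComputeDomain driver publishes.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: compute-domain-template
spec:
  spec:
    devices:
      requests:
      - name: domain
        deviceClassName: compute-domain.example.com
```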

Later, when the LWS is scaled down:

  • LWS controller deletes the replica - meaning the PodGroup will be deleted, which is an explicit signal from the LWS controller that the group is no longer needed
  • ResourceClaimController can de-allocate the RC and delete it
  • ComputeDomainController sees that and deletes the CR and/or any RCTs it created

Overall the idea is to have these upstream abstractions (PodGroup, ResourceClaim) which workload controllers can create and use to represent relationships between them, and then controllers/drivers can actuate the real things following the lifecycle of those abstractions.

The ResourceClaim has a "config" section directly in it, as does the DeviceClass. That gives us a way to stuff the implementation-specific details needed by those controllers into those objects. Upstream components then only need to understand the relationships and lifecycles of the abstractions, while the implementation details stay opaque to them.
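For illustration, the DRA API already allows opaque, driver-specific configuration to be attached at the DeviceClass level. A sketch, where the driver name and the parameters object are invented for this example:

```yaml
# Sketch: the driver name and ComputeDomainConfig parameters
# are hypothetical, standing in for whatever the ComputeDomain
# driver actually defines.
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: compute-domain.example.com
spec:
  config:
  - opaque:
      driver: compute-domain.example.com
      parameters:
        apiVersion: example.com/v1
        kind: ComputeDomainConfig
        numNodes: 4   # illustrative implementation detail
```

The same opaque config mechanism is available on the ResourceClaim itself, so per-claim details can override or extend what the class provides.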

This enables things like ComputeDomains and TPU slices to be lifecycled along with the groups of pods that use them.

See also:

cc @mortent @klueska @nojnhuh @kannon92 @helayoty @wojtek-t @andreyvelich

Labels

kind/feature — Categorizes issue or PR as related to a new feature.