PodGroup / Workload API Integration with ComputeDomain #934

@johnbelamaric

Description

Today, the ComputeDomain CR is created by the user, and in turn it creates the necessary ResourceClaimTemplates. For workloads that need a ComputeDomain for a group of Pods (e.g., LWS, JobSet, TrainJob), this means there is no way to match the lifecycle of the ComputeDomain to the lifecycle of those groups of Pods (e.g., an LWS replica).

SIG Scheduling has been working on a PodGroup concept, which models a group of related Pods. The PodGroup provides a few things:

  • Gang scheduling policy for the pods in the group
  • Topology constraints for the pods in the group
  • Resource claims for resources that are shared across pods in a PodGroup
  • A single integration point for workload controllers that operate on groups of pods

One idea that I think we should explore is treating a ComputeDomain as a logical multi-node device that is allocatable via a ResourceClaim. This would allow us to tie the lifecycle of the ComputeDomain to the lifecycle of those PodGroups. This is a totally different flow from today's. For example, with LWS it would work something like this:

  • User creates:
    • ResourceClaimTemplate asking for a ComputeDomain (this would be a new DeviceClass). All it really needs is a name and the request for the DeviceClass.
    • Workload object, which contains a PodGroupTemplate that references the ResourceClaimTemplate
    • An LWS that references the Workload object / PodGroupTemplate
  • LWS controller creates a replica, which means it creates a PodGroup using the PodGroupTemplate
  • ResourceClaimController sees the PodGroup with the ResourceClaimTemplate, and uses it to create a ResourceClaim for the ComputeDomain
  • ComputeDomainController sees that and either creates a ComputeDomain CR and just follows today's flow, or reads the RC directly and bypasses the CR as it may no longer be needed
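The user-created ResourceClaimTemplate in the flow above could be as small as a name plus a request against the new ComputeDomain DeviceClass. A sketch, using the current DRA API shape; the DeviceClass name is an assumption here, since the actual class would be published by the ComputeDomain driver:

```yaml
# Sketch only: "compute-domain.example.com" is a placeholder
# for whatever DeviceClass the ComputeDomain driver publishes.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: compute-domain-template
spec:
  spec:
    devices:
      requests:
      - name: domain
        deviceClassName: compute-domain.example.com
```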

Later, when the LWS is scaled down:

  • LWS controller deletes the replica - meaning the PodGroup will be deleted, which is an explicit signal from the LWS controller that the group is no longer needed
  • ResourceClaimController can de-allocate the RC and delete it
  • ComputeDomainController sees that and deletes the CR and/or any RCTs it created

Overall the idea is to have these upstream abstractions (PodGroup, ResourceClaim) which workload controllers can create and use to represent relationships between them, and then controllers/drivers can actuate the real things following the lifecycle of those abstractions.

The ResourceClaim has a "config" section directly in it, as does the DeviceClass. That gives us a way to stuff the implementation-specific details needed by those controllers into those objects. Upstream components then only need to understand the relationships and lifecycles of the abstractions, while the implementation details stay opaque to them.
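For illustration, the DRA API already allows opaque, driver-specific configuration to be attached at the DeviceClass level. A sketch, where the driver name and the parameters object are invented for this example:

```yaml
# Sketch: the driver name and ComputeDomainConfig parameters
# are hypothetical, standing in for whatever the ComputeDomain
# driver actually defines.
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: compute-domain.example.com
spec:
  config:
  - opaque:
      driver: compute-domain.example.com
      parameters:
        apiVersion: example.com/v1
        kind: ComputeDomainConfig
        numNodes: 4   # illustrative implementation detail
```

The same opaque config mechanism is available on the ResourceClaim itself, so per-claim details can override or extend what the class provides.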

This enables things like ComputeDomains and TPU slices to be lifecycled along with the groups of pods that use them.

See also:

cc @mortent @klueska @nojnhuh @kannon92 @helayoty @wojtek-t @andreyvelich

Labels

kind/feature — Categorizes issue or PR as related to a new feature.