Commit bcc4121

Rename PodGangSet to PodCliqueSet (#186)
* Update operator/api to now use PodCliqueSet instead of PodGangSet.
* Update doc strings for scheduler/api to use PodCliqueSet instead of PodGangSet.
* Regenerated CRDs, clientset and deepcopy functions.
* Regenerated API docs.
* Initial commit on adaption of PodGangSet to PodCliqueSet in the rest of the code base.
* adapted charts to PodCliqueSet and the changed OperatorConfig
* adapted operator/samples/simple to use PodCliqueSet
* Adapted operator/internal/components to use PodCliqueSet
* Changed docs/installation.md to reflect PodCliqueSet changes.
* Renamed component/podgangset to component/podcliqueset
* Adapted operator/internal/controller to use PodCliqueSet.
* Adapted rest of the operator/internal packages to use PodCliqueSet.
* Adapted operator/test to use PodCliqueSet
* Adapted docs to use PodCliqueSet
* Corrected charts to now use PodCliqueSet
* Corrected Dockerfile label to use PodCliqueSet
* Changed docs/assets to use PodCliqueSet

---------

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
1 parent 7274227 commit bcc4121

128 files changed: 2589 additions & 2591 deletions

README.md

Lines changed: 11 additions & 11 deletions
@@ -4,7 +4,7 @@
 
 # Grove
 
-Grove is a Kubernetes API purpose-built for orchestrating AI workloads in GPU clusters, where a single custom resource allows you to hierarchically compose multiple AI components with flexible gang-scheduling and auto-scaling specfications at multiple levels. Through native support for network topology-aware gang scheduling, multi-dimensional auto-scaling and prescriptive startup ordering, Grove enables developers to define complex AI stacks in a concise, declarative, and framework-agnostic manner.
+Grove is a Kubernetes API purpose-built for orchestrating AI workloads in GPU clusters, where a single custom resource allows you to hierarchically compose multiple AI components with flexible gang-scheduling and auto-scaling specification at multiple levels. Through native support for network topology-aware gang scheduling, multidimensional auto-scaling and prescriptive startup ordering, Grove enables developers to define complex AI stacks in a concise, declarative, and framework-agnostic manner.
 
 Grove was originally motivated by the challenges of orchestrating multinode, disaggregated inference systems. It provides a consistent and unified API that allows users to define, configure, and scale prefill, decode, and any other components like routing within a single custom resource. However, it is flexible enough to map naturally to the roles, scaling behaviors, and dependencies of any real-world inference systems, from "traditional" single node aggregated inference to agentic pipelines with multiple models.
 
@@ -15,24 +15,24 @@ Modern inference systems are often no longer single-pod workloads. They involve
 
 ## Core Concepts
 
-The Grove API consists of a user API and a scheduling API. While the user API (`PodGangSet`, `PodClique`, `PodCliqueScalingGroup`) allows users to represent their AI workloads, the scheduling API (`PodGang`) enables scheduler integration to support the network topology-optimized gang-scheduling and auto-scaling requirements of the workload.
+The Grove API consists of a user API and a scheduling API. While the user API (`PodCliqueSet`, `PodClique`, `PodCliqueScalingGroup`) allows users to represent their AI workloads, the scheduling API (`PodGang`) enables scheduler integration to support the network topology-optimized gang-scheduling and auto-scaling requirements of the workload.
 
-| Concept | Description |
-| ------------------------------------------------------------ | ------------------------------------------------------------ |
-| [PodGangSet](operator/api/core/v1alpha1/podgangset.go) | The top-level Grove object that defines a group of components managed and colocated together. Also supports autoscaling with topology aware spread of PodGangSet replicas for availability. |
-| [PodClique](operator/api/core/v1alpha1/podclique.go) | A group of pods representing a specific role (e.g., leader, worker, frontend). Each clique has an independent configuration and supports custom scaling logic. |
-| [PodCliqueScalingGroup](operator/api/core/v1alpha1/scalinggroup.go) | A set of PodCliques that scale and are scheduled together. Ideal for tightly coupled roles like prefill leader and worker. |
-| [PodGang](scheduler/api/core/v1alpha1/podgang.go) | The scheduler API that defines a unit of gang-scheduling. A PodGang is a collection of groups of similar pods, where each pod group defines a minimum number of replicas guaranteed for gang-scheduling. |
+| Concept | Description |
+|---------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| [PodCliqueSet](operator/api/core/v1alpha1/podcliqueset.go) | The top-level Grove object that defines a group of components managed and colocated together. Also supports autoscaling with topology aware spread of PodCliqueSet replicas for availability. |
+| [PodClique](operator/api/core/v1alpha1/podclique.go) | A group of pods representing a specific role (e.g., leader, worker, frontend). Each clique has an independent configuration and supports custom scaling logic. |
+| [PodCliqueScalingGroup](operator/api/core/v1alpha1/scalinggroup.go) | A set of PodCliques that scale and are scheduled together. Ideal for tightly coupled roles like prefill leader and worker. |
+| [PodGang](scheduler/api/core/v1alpha1/podgang.go) | The scheduler API that defines a unit of gang-scheduling. A PodGang is a collection of groups of similar pods, where each pod group defines a minimum number of replicas guaranteed for gang-scheduling. |
 
 
 ## Key Capabilities
 
 - **Declarative composition of Role-Based Pod Groups**
-`PodGangSet` API provides users a capability to declaratively compose tightly coupled group of pods with explicit role based logic, e.g. disaggregated roles in a model serving stack such as `prefill`, `decode` and `routing`.
+`PodCliqueSet` API provides users a capability to declaratively compose tightly coupled group of pods with explicit role based logic, e.g. disaggregated roles in a model serving stack such as `prefill`, `decode` and `routing`.
 - **Flexible Gang Scheduling**
-`PodClique`'s and `PodCliqueScalingGroup`s allow users to specify flexible gang-scheduling requirements at multiple levels within a `PodGangSet` to prevent resource deadlocks.
+`PodClique`'s and `PodCliqueScalingGroup`s allow users to specify flexible gang-scheduling requirements at multiple levels within a `PodCliqueSet` to prevent resource deadlocks.
 - **Multi-level Horizontal Auto-Scaling**
-Supports pluggable horizontal auto-scaling solutions to scale `PodGangSet`, `PodClique` and `PodCliqueScalingGroup` custom resources.
+Supports pluggable horizontal auto-scaling solutions to scale `PodCliqueSet`, `PodClique` and `PodCliqueScalingGroup` custom resources.
 - **Network Topology-Aware Scheduling**
 Allows specifying network topology pack and spread constraints to optimize for both network performance and service availability.
 - **Custom Startup Dependencies**
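
The rename shown in this diff follows a mechanical pattern: `PodGangSet` becomes `PodCliqueSet`, the lowercase path form `podgangset` becomes `podcliqueset`, and the scheduler-side `PodGang` is left alone. For downstream docs or manifests that still use the old name, a minimal sketch of that substitution could look like the following. The helper name `rename_pod_gang_set` and the `RENAMES` table are hypothetical and not part of the Grove repository, and the sketch assumes the change is purely textual; it does not account for any field or schema changes made elsewhere in the commit.

```python
# Hypothetical helper, not part of Grove. The mapping only mirrors the
# renames visible in the README diff above.
RENAMES = {
    "PodGangSet": "PodCliqueSet",  # type / kind name
    "podgangset": "podcliqueset",  # lowercase form used in file and package paths
}


def rename_pod_gang_set(text: str) -> str:
    """Return text with PodGangSet identifiers replaced by PodCliqueSet.

    Plain substring replacement is used, so derived names that embed the
    old identifier are also rewritten. The scheduler API name "PodGang"
    is untouched because it does not contain either search string.
    """
    for old, new in RENAMES.items():
        text = text.replace(old, new)
    return text


if __name__ == "__main__":
    sample = (
        "[PodGangSet](operator/api/core/v1alpha1/podgangset.go) "
        "is scheduled through PodGang"
    )
    print(rename_pod_gang_set(sample))
```

Because the replacement is a blind substring match, review the result before committing it, especially in files where the old name appears inside unrelated identifiers.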
