
Commit 2ae5347

committed
Added grove example
1 parent 6b2d466 commit 2ae5347


3 files changed: +214 -0 lines changed


examples/grove/README.md

Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
# Grove Integration with KAI Scheduler

## What is Grove?

[Grove](https://github.com/ai-dynamo/grove) is a Kubernetes-native workload orchestrator designed for AI/ML inference workloads. It introduces the concept of **PodCliqueSets** - a higher-level abstraction for managing groups of related pods with topology-aware scheduling capabilities.

Key concepts:

- **PodCliqueSet (PCS)**: A collection of pod groups (cliques) that are deployed and scaled together
- **PodCliqueScalingGroup (PCSG)**: A logical grouping within a PCS that shares scaling behavior
- **Clique**: A group of pods with a specific role (e.g., worker, leader, router)
- **Topology Constraints**: Pack pods within specific topology domains (block, rack) for optimal network locality

Grove is particularly useful for disaggregated inference workloads where different components (prefill workers, decode workers, routers) need to be co-located for performance.
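
A minimal sketch of how these concepts nest, trimmed down from the full example later in this README (the name is a placeholder and only the fields used in that example are shown):

```yaml
apiVersion: grove.io/v1alpha1
kind: PodCliqueSet
metadata:
  name: my-pcs                  # hypothetical name
spec:
  replicas: 1
  template:
    topologyConstraint:
      packDomain: block         # pack the whole set within one block
    podCliqueScalingGroups:     # PCSGs: cliques that scale together
    - name: decoder
      replicas: 2
      topologyConstraint:
        packDomain: rack        # pack each group replica within one rack
      cliqueNames: [dworker, dleader]
    cliques:                    # each clique is a group of pods with one role
    - name: dworker
      spec:
        roleName: dworker
        replicas: 1
        podSpec: {}             # regular Kubernetes pod template fields go here
```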

## Installation

Install Grove using Helm with topology-aware scheduling enabled:

```sh
helm upgrade -i grove oci://ghcr.io/ai-dynamo/grove/grove-charts:v0.1.0-alpha.5 \
  --set topologyAwareScheduling.enabled=true
```

For more installation options and configuration details, see the [Grove Installation Guide](https://github.com/ai-dynamo/grove/blob/main/docs/installation.md#installation).
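
To sanity-check the installation, the Grove CRDs should be registered under the `grove.io` API group (the group used by the manifests in this directory). The namespace below is an assumption; adjust it to wherever Grove was installed:

```sh
# PodCliqueSet and related CRDs should appear under the grove.io group
kubectl get crds | grep grove.io

# The operator pods should be Running (namespace is an assumption)
kubectl get pods -n grove-system
```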

## Integrating Grove with KAI Scheduler

To use Grove workloads with KAI Scheduler, configure each clique in your PodCliqueSet with two key settings:

### 1. Set the Scheduler Name

In each clique's `podSpec`, set the `schedulerName` to `kai-scheduler`:

```yaml
cliques:
- name: my-clique
  spec:
    podSpec:
      schedulerName: kai-scheduler # Use KAI Scheduler
      containers:
      - name: worker
        # ...
```

### 2. Add the Queue Label

Add the `kai.scheduler/queue` label to each clique to specify which KAI queue the pods should be scheduled to:

```yaml
cliques:
- name: my-clique
  labels:
    kai.scheduler/queue: default-queue # Assign to KAI queue
  spec:
    # ...
```

### Complete Example

The [podcliqueset.yaml](./podcliqueset.yaml) file in this directory demonstrates a complete disaggregated inference setup with:
- A PodCliqueSet with a topology constraint at the `block` level (see the node-label note after this list)
- Two PodCliqueScalingGroups (`decoder` and `prefill`) constrained at the `rack` level
- Multiple cliques (workers, leaders, router), all configured for KAI Scheduler
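
The `block` and `rack` pack domains refer to the levels defined by the [topology.yaml](./topology.yaml) resource in this directory, so cluster nodes need matching labels. A hypothetical labeling of a single node (the node name and label values are placeholders):

```sh
kubectl label node worker-node-1 \
  topology.kai.io/block=block-a \
  topology.kai.io/rack=rack-1
```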

Each clique follows the integration pattern:

```yaml
cliques:
- name: dworker
  labels:
    kai.scheduler/queue: default-queue # KAI queue assignment
  spec:
    roleName: dworker
    replicas: 1
    minAvailable: 1
    podSpec:
      schedulerName: kai-scheduler # KAI scheduler
      containers:
      - name: worker
        image: nginx:alpine-slim
        resources:
          requests:
            memory: 30Mi
```

## Summary of Required Changes

| Setting | Location | Value | Purpose |
|---------|----------|-------|---------|
| `schedulerName` | `cliques[*].spec.podSpec` | `kai-scheduler` | Route pods to KAI Scheduler |
| `kai.scheduler/queue` | `cliques[*].labels` | Queue name (e.g., `default-queue`) | Assign workload to a KAI queue for resource management and fair-sharing |
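
With both settings in place, the example in this directory can be applied and checked roughly like this (assuming the `default-queue` queue and the topology node labels already exist in the cluster):

```sh
kubectl apply -f topology.yaml
kubectl apply -f podcliqueset.yaml

# The generated pods should be placed by kai-scheduler; filtering on the
# PodCliqueSet name assumes Grove derives the pod names from it
kubectl get pods -o wide | grep tas-disagg-inference
```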

examples/grove/podcliqueset.yaml

Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
# Workload: Disaggregated Inference - PCS with PCSG and multiple cliques
# Test scenario: PCS (block) with 2 PCSGs (rack) containing disaggregated inference components
---
apiVersion: grove.io/v1alpha1
kind: PodCliqueSet
metadata:
  name: tas-disagg-inference
  labels:
    app: tas-disagg-inference
spec:
  replicas: 1
  template:
    topologyConstraint:
      packDomain: block
    podCliqueScalingGroups:
    - name: decoder
      replicas: 2
      minAvailable: 1
      topologyConstraint:
        packDomain: rack
      cliqueNames:
      - dworker
      - dleader
    - name: prefill
      replicas: 2
      minAvailable: 1
      topologyConstraint:
        packDomain: rack
      cliqueNames:
      - pworker
      - pleader
    cliques:
    - name: dworker
      labels:
        kai.scheduler/queue: default-queue
      spec:
        roleName: dworker
        replicas: 1
        minAvailable: 1
        podSpec:
          schedulerName: kai-scheduler
          containers:
          - name: worker
            image: registry:5001/nginx:alpine-slim
            resources:
              requests:
                memory: 30Mi
    - name: dleader
      labels:
        kai.scheduler/queue: default-queue
      spec:
        roleName: dleader
        replicas: 1
        minAvailable: 1
        podSpec:
          schedulerName: kai-scheduler
          containers:
          - name: leader
            image: registry:5001/nginx:alpine-slim
            resources:
              requests:
                memory: 30Mi
    - name: pworker
      labels:
        kai.scheduler/queue: default-queue
      spec:
        roleName: pworker
        replicas: 1
        minAvailable: 1
        podSpec:
          schedulerName: kai-scheduler
          containers:
          - name: worker
            image: registry:5001/nginx:alpine-slim
            resources:
              requests:
                memory: 30Mi
    - name: pleader
      labels:
        kai.scheduler/queue: default-queue
      spec:
        roleName: pleader
        replicas: 1
        minAvailable: 1
        podSpec:
          schedulerName: kai-scheduler
          containers:
          - name: leader
            image: registry:5001/nginx:alpine-slim
            resources:
              requests:
                memory: 30Mi
    - name: router
      labels:
        kai.scheduler/queue: default-queue
      spec:
        roleName: router
        replicas: 2
        minAvailable: 2
        podSpec:
          schedulerName: kai-scheduler
          containers:
          - name: router
            image: registry:5001/nginx:alpine-slim
            resources:
              requests:
                memory: 30Mi

examples/grove/topology.yaml

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
# Topology resource defining the hierarchy for topology-aware scheduling
# Levels are ordered from top (broadest) to bottom (narrowest)
# Your nodes must have labels matching these nodeLabel values
---
apiVersion: kai.scheduler/v1alpha1
kind: Topology
metadata:
  name: default
spec:
  levels:
  # Top level - block (e.g., a group of racks)
  - nodeLabel: topology.kai.io/block
  # Second level - rack
  - nodeLabel: topology.kai.io/rack
  # Lowest level - individual node (optional, but recommended)
  - nodeLabel: kubernetes.io/hostname

0 commit comments
