# Grove Integration with KAI Scheduler

## What is Grove?

[Grove](https://github.com/ai-dynamo/grove) is a Kubernetes-native workload orchestrator designed for AI/ML inference workloads. It introduces the concept of **PodCliqueSets** - a higher-level abstraction for managing groups of related pods with topology-aware scheduling capabilities.

Key concepts:
- **PodCliqueSet (PCS)**: A collection of pod groups (cliques) that are deployed and scaled together
- **PodCliqueScalingGroup (PCSG)**: Logical groupings within a PCS that share scaling behavior
- **Clique**: A group of pods with a specific role (e.g., worker, leader, router)
- **Topology Constraints**: Pack pods within specific topology domains (block, rack) for optimal network locality

Grove is particularly useful for disaggregated inference workloads where different components (prefill workers, decode workers, routers) need to be co-located for performance.

## Installation

Install Grove using Helm with topology-aware scheduling enabled:

```sh
helm upgrade -i grove oci://ghcr.io/ai-dynamo/grove/grove-charts:v0.1.0-alpha.5 \
  --set topologyAwareScheduling.enabled=true
```

For more installation options and configuration details, see the [Grove Installation Guide](https://github.com/ai-dynamo/grove/blob/main/docs/installation.md#installation).
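
After the install completes, it is worth confirming that the Grove CRDs and controller are actually present before creating workloads. The filters below are illustrative - CRD names and the controller namespace depend on the chart version, so adjust them to match your installation:

```sh
# Check that the Grove CRDs (PodCliqueSet and friends) were registered
kubectl get crds | grep -i clique

# Check that the Grove controller pods are running
# (the namespace depends on how the chart was installed)
kubectl get pods -A | grep -i grove
```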

## Integrating Grove with KAI Scheduler

To use Grove workloads with KAI Scheduler, you need to configure each clique in your PodCliqueSet with two key settings:

### 1. Set the Scheduler Name

In each clique's `podSpec`, set the `schedulerName` to `kai-scheduler`:

```yaml
cliques:
  - name: my-clique
    spec:
      podSpec:
        schedulerName: kai-scheduler  # Use KAI Scheduler
        containers:
          - name: worker
            # ...
```
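
To confirm the setting took effect on a running workload, inspect the `schedulerName` of one of the pods created from the clique (replace the placeholder pod name with a real one):

```sh
# Print the scheduler responsible for a given pod
kubectl get pod <pod-name> -o jsonpath='{.spec.schedulerName}'
```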

### 2. Add the Queue Label

Add the `kai.scheduler/queue` label to each clique to specify which KAI queue the pods should be scheduled to:

```yaml
cliques:
  - name: my-clique
    labels:
      kai.scheduler/queue: default-queue  # Assign to KAI queue
    spec:
      # ...
```
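
The queue referenced by the label is expected to exist as a KAI Scheduler queue in the cluster; see the KAI Scheduler documentation for how queues are created. Once pods are up, you can group them by queue using the same label. The resource name `queues` below is an assumption - check `kubectl api-resources` for the exact name exposed by your KAI Scheduler version:

```sh
# List pods assigned to the default-queue KAI queue
kubectl get pods -l kai.scheduler/queue=default-queue

# List the queues themselves, if the Queue CRD exposes this resource name
kubectl get queues
```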

### Complete Example

The [podcliqueset.yaml](./podcliqueset.yaml) file in this directory demonstrates a complete disaggregated inference setup with:
- A PodCliqueSet with topology constraint at the `block` level
- Two PodCliqueScalingGroups (`decoder` and `prefill`) constrained at the `rack` level
- Multiple cliques (workers, leaders, router) all configured for KAI Scheduler

Each clique follows the integration pattern:

```yaml
cliques:
  - name: dworker
    labels:
      kai.scheduler/queue: default-queue  # KAI queue assignment
    spec:
      roleName: dworker
      replicas: 1
      minAvailable: 1
      podSpec:
        schedulerName: kai-scheduler  # KAI scheduler
        containers:
          - name: worker
            image: nginx:alpine-slim
            resources:
              requests:
                memory: 30Mi
```
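
A minimal way to try the example end to end is to apply the manifest and watch the pods come up; scheduling events (via `kubectl describe pod <pod-name>`) show which scheduler bound each pod:

```sh
# Deploy the example PodCliqueSet from this directory
kubectl apply -f podcliqueset.yaml

# Watch the resulting pods being created and scheduled
kubectl get pods -w
```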

## Summary of Required Changes

| Setting | Location | Value | Purpose |
|---------|----------|-------|---------|
| `schedulerName` | `cliques[*].spec.podSpec` | `kai-scheduler` | Route pods to KAI Scheduler |
| `kai.scheduler/queue` | `cliques[*].labels` | Queue name (e.g., `default-queue`) | Assign workload to a KAI queue for resource management and fair-sharing |