**README.md**

Grove introduces four simple concepts:
| [PodGang](scheduler/api/core/v1alpha1/podgang.go) | The scheduler API that defines a unit of gang-scheduling. A PodGang is a collection of groups of similar pods, where each pod group defines a minimum number of replicas guaranteed for gang-scheduling. |

Get started with a step-by-step hands-on Grove tutorial here
**→ [Core Concepts Overview](docs/user-guide/01_core-concepts/01_overview.md)**

Refer to all Grove APIs here
**→ [API Reference](docs/api-reference/operator-api.md)**
**docs/quickstart.md**

Only the Grove operator pod should remain.

Now that you understand the basics, explore:

- **[Installation Guide](installation.md)** - Learn more about local and remote cluster deployment
- **[Core Concepts Tutorial](user-guide/01_core-concepts/01_overview.md)** - Step-by-step hands-on tutorial on Grove application development
- **[API Reference](api-reference/operator-api.md)** - Deep dive into all configuration options
- **[Samples](../operator/samples/)** - Explore more examples

Grove provides three levels of scaling to match different operational needs:

- **Scale PodClique replicas** (`kubectl scale pclq ...`) - Adjust the number of pods in a specific role. Use this for fine-tuning individual components (e.g., add more workers to an existing leader-worker group).
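
As a concrete sketch (the PodClique name below is hypothetical; run `kubectl get pclq` to see the names generated in your cluster):

```bash
# Scale a single role without touching the rest of the system.
kubectl scale pclq single-node-aggregated-0-model-worker --replicas=3

# Verify the new replica count.
kubectl get pclq
```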

In the [next guide](./02_pcs_and_pclq_intro.md) we go through some examples showcasing PodCliqueSet and PodClique.

In this guide we go over some hands-on examples showcasing how to use a PodCliqueSet and PodCliques.

Refer to [Overview](./01_overview.md) for instructions on how to run the examples in this guide.

## Example 1: Single-Node Aggregated Inference

```yaml
# ...
            containers:
              - name: model-worker
                image: nginx:latest
                command: ["/bin/sh"]
                args: ["-c", "echo 'Model Worker (Aggregated) on node:' && hostname && sleep infinity"]
                resources:
                  requests:
                    cpu: "10m"
                    memory: "32Mi"
```

### **Key Points:**

### **Deploy:**

In this example, we will deploy the file: [single-node-aggregated.yaml](../../../operator/samples/user-guide/01_core-concepts/single-node-aggregated.yaml)
```bash
# NOTE: Run the following commands from the `/path/to/grove/operator` directory,
# where `/path/to/grove` is the root of your cloned Grove repository.
kubectl apply -f samples/user-guide/01_core-concepts/single-node-aggregated.yaml
kubectl get pods -l app.kubernetes.io/part-of=single-node-aggregated -o wide
```
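
If the pods are still starting, you can stream updates until every replica is `Running` (standard kubectl; press `Ctrl-C` to stop watching):

```bash
kubectl get pods -l app.kubernetes.io/part-of=single-node-aggregated -w
```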

## Example 2: Single-Node Disaggregated Inference

```yaml
# ...
            containers:
              - name: prefill
                image: nginx:latest
                command: ["/bin/sh"]
                args: ["-c", "echo 'Prefill Worker on node:' && hostname && sleep infinity"]
                resources:
                  requests:
                    cpu: "10m"
                    memory: "32Mi"
      - name: decode
        spec:
          roleName: decode
          # ...
              - name: decode
                image: nginx:latest
                command: ["/bin/sh"]
                args: ["-c", "echo 'Decode Worker on node:' && hostname && sleep infinity"]
                resources:
                  requests:
                    cpu: "10m"
                    memory: "32Mi"
```

### **Key Points:**

### **Deploy**

In this example, we will deploy the file: [single-node-disaggregated.yaml](../../../operator/samples/user-guide/01_core-concepts/single-node-disaggregated.yaml)
```bash
# NOTE: Run the following commands from the `/path/to/grove/operator` directory,
# where `/path/to/grove` is the root of your cloned Grove repository.
kubectl apply -f samples/user-guide/01_core-concepts/single-node-disaggregated.yaml
kubectl get pods -l app.kubernetes.io/part-of=single-node-disaggregated -o wide
```

Expand All @@ -193,7 +193,7 @@ You can scale the `prefill` and `decode` PodCliques the same way the [`model-wor

Additionally, the `single-node-disaggregated` PodCliqueSet can be scaled the same way the `single-node-aggregated` PodCliqueSet was scaled in the previous example. When a PodCliqueSet is scaled, all of its constituent PodCliques are replicated, which is why scaling a PodCliqueSet should be treated as scaling the entire system (useful for canary deployments, A/B testing, or high availability across zones):
```bash
kubectl scale pcs single-node-disaggregated --replicas=2
```
After running this, you will observe that a second copy of every constituent PodClique is created for the new PodCliqueSet replica.

To teardown the example, delete the `single-node-disaggregated` PodCliqueSet; the operator will clean up all of its constituent resources:
```bash
kubectl delete pcs single-node-disaggregated
```
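
Once deletion completes, the pod listing for this example should come back empty (illustrative check):

```bash
# Expect no pods once the operator finishes cleaning up.
kubectl get pods -l app.kubernetes.io/part-of=single-node-disaggregated
```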

In the [next guide](./03_pcsg_intro.md) we showcase how to use PodCliqueScalingGroup to represent multi-node components.
# PodCliqueScalingGroup

In the [previous guide](./02_pcs_and_pclq_intro.md) we covered some hands-on examples of how to use PodCliqueSet and PodCliques. In this guide we go over some hands-on examples of how to use PodCliqueScalingGroup to represent multi-node components.

Refer to [Overview](./01_overview.md) for instructions on how to run the examples in this guide.

## Example 3: Multi-Node Aggregated Inference

```yaml
# ...
            containers:
              - name: model-leader
                image: nginx:latest
                command: ["/bin/sh"]
                args: ["-c", "echo 'Model Leader (Aggregated) on node:' && hostname && sleep infinity"]
                resources:
                  requests:
                    cpu: "10m"
                    memory: "32Mi"
      - name: worker
        spec:
          roleName: worker
          # ...
              - name: model-worker
                image: nginx:latest
                command: ["/bin/sh"]
                args: ["-c", "echo 'Model Worker (Aggregated) on node:' && hostname && sleep infinity"]
                resources:
                  requests:
                    cpu: "10m"
                    memory: "32Mi"
    podCliqueScalingGroups:
      - name: model-instance
        cliqueNames: [leader, worker]
```

### **Deploy:**

In this example, we will deploy the file: [multi-node-aggregated.yaml](../../../operator/samples/user-guide/01_core-concepts/multi-node-aggregated.yaml)
```bash
# NOTE: Run the following commands from the `/path/to/grove/operator` directory,
# where `/path/to/grove` is the root of your cloned Grove repository.
kubectl apply -f samples/user-guide/01_core-concepts/multi-node-aggregated.yaml
kubectl get pods -l app.kubernetes.io/part-of=multinode-aggregated -o wide
```
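
Each `model-instance` replica groups the leader and worker cliques, so their pods are intended to land together as one gang. Sorting the listing by node makes the placement easier to inspect (plain kubectl, illustrative only):

```bash
kubectl get pods -l app.kubernetes.io/part-of=multinode-aggregated -o wide \
  --sort-by=.spec.nodeName
```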

## Example 4: Multi-Node Disaggregated Inference

```yaml
# ...
            containers:
              - name: prefill-leader
                image: nginx:latest
                command: ["/bin/sh"]
                args: ["-c", "echo 'Prefill Leader on node:' && hostname && sleep infinity"]
                resources:
                  requests:
                    cpu: "10m"
                    memory: "32Mi"
      - name: pworker
        spec:
          roleName: pworker
          # ...
              - name: prefill-worker
                image: nginx:latest
                command: ["/bin/sh"]
                args: ["-c", "echo 'Prefill Worker on node:' && hostname && sleep infinity"]
                resources:
                  requests:
                    cpu: "10m"
                    memory: "32Mi"
      - name: dleader
        spec:
          roleName: dleader
          # ...
              - name: decode-leader
                image: nginx:latest
                command: ["/bin/sh"]
                args: ["-c", "echo 'Decode Leader on node:' && hostname && sleep infinity"]
                resources:
                  requests:
                    cpu: "10m"
                    memory: "32Mi"
      - name: dworker
        spec:
          roleName: dworker
          # ...
              - name: decode-worker
                image: nginx:latest
                command: ["/bin/sh"]
                args: ["-c", "echo 'Decode Worker on node:' && hostname && sleep infinity"]
                resources:
                  requests:
                    cpu: "10m"
                    memory: "32Mi"
    podCliqueScalingGroups:
      - name: prefill
        cliqueNames: [pleader, pworker]
      # ...
```

### **Deploy**

In this example, we will deploy the file: [multi-node-disaggregated.yaml](../../../operator/samples/user-guide/01_core-concepts/multi-node-disaggregated.yaml)
```bash
# NOTE: Run the following commands from the `/path/to/grove/operator` directory,
# where `/path/to/grove` is the root of your cloned Grove repository.
kubectl apply -f samples/user-guide/01_core-concepts/multi-node-disaggregated.yaml
kubectl get pods -l app.kubernetes.io/part-of=multinode-disaggregated -o wide
```
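
To scale out an entire prefill group (leader plus workers together), scale its PodCliqueScalingGroup rather than the individual PodCliques. This is a hypothetical sketch: the `pcsg` short name and the generated name `multinode-disaggregated-0-prefill` are assumptions, so verify the actual resource and object names in your cluster first:

```bash
# Hypothetical names; confirm with `kubectl api-resources` before running.
kubectl scale pcsg multinode-disaggregated-0-prefill --replicas=2
kubectl get pods -l app.kubernetes.io/part-of=multinode-disaggregated -o wide
```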

To teardown the example, delete the `multinode-disaggregated` PodCliqueSet; the operator will clean up all of its constituent resources:
```bash
kubectl delete pcs multinode-disaggregated
```
In the [next guide](./04_takeaways.md) we showcase how Grove can represent an arbitrary number of components and summarize the key takeaways.
# Takeaways

Refer to [Overview](./01_overview.md) for instructions on how to run the examples in this guide.

## Example 5: Complete Inference Pipeline

The [previous examples](./03_pcsg_intro.md) focused on mapping various inference workloads onto Grove primitives, centering on the model instances. The primitives are generic, however, and Grove lets you represent as many components as you'd like. To illustrate this, we now add components such as a frontend and a vision encoder: you simply add more PodCliques and PodCliqueScalingGroups to the PodCliqueSet.

```yaml
apiVersion: grove.io/v1alpha1
# ...
            containers:
              - name: frontend
                image: nginx:latest
                command: ["/bin/sh"]
                args: ["-c", "echo 'Frontend Service on node:' && hostname && sleep infinity"]
                resources:
                  requests:
                    cpu: "10m"
                    memory: "32Mi"
      - name: vision-encoder
        spec:
          roleName: vision-encoder
          # ...
              - name: vision-encoder
                image: nginx:latest
                command: ["/bin/sh"]
                args: ["-c", "echo 'Vision Encoder on node:' && hostname && sleep infinity"]
                resources:
                  requests:
                    cpu: "10m"
                    memory: "32Mi"
      # Multi-node components
      - name: pleader
        spec:
          # ...
              - name: prefill-leader
                image: nginx:latest
                command: ["/bin/sh"]
                args: ["-c", "echo 'Prefill Leader on node:' && hostname && sleep infinity"]
                resources:
                  requests:
                    cpu: "10m"
                    memory: "32Mi"
      - name: pworker
        spec:
          roleName: pworker
          # ...
              - name: prefill-worker
                image: nginx:latest
                command: ["/bin/sh"]
                args: ["-c", "echo 'Prefill Worker on node:' && hostname && sleep infinity"]
                resources:
                  requests:
                    cpu: "10m"
                    memory: "32Mi"
      - name: dleader
        spec:
          roleName: dleader
          # ...
              - name: decode-leader
                image: nginx:latest
                command: ["/bin/sh"]
                args: ["-c", "echo 'Decode Leader on node:' && hostname && sleep infinity"]
                resources:
                  requests:
                    cpu: "10m"
                    memory: "32Mi"
      - name: dworker
        spec:
          roleName: dworker
          # ...
              - name: decode-worker
                image: nginx:latest
                command: ["/bin/sh"]
                args: ["-c", "echo 'Decode Worker on node:' && hostname && sleep infinity"]
                resources:
                  requests:
                    cpu: "10m"
                    memory: "32Mi"
    podCliqueScalingGroups:
      - name: prefill
        cliqueNames: [pleader, pworker]
      # ...
```

**Deploy and explore:**

In this example, we will deploy the file: [complete-inference-pipeline.yaml](../../../operator/samples/user-guide/01_core-concepts/complete-inference-pipeline.yaml)
```bash
# NOTE: Run the following commands from the `/path/to/grove/operator` directory,
# where `/path/to/grove` is the root of your cloned Grove repository.
kubectl apply -f samples/user-guide/01_core-concepts/complete-inference-pipeline.yaml
kubectl get pods -l app.kubernetes.io/part-of=comp-inf-ppln -o wide
```
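
To see how the pipeline decomposes into Grove resources, list them by kind (`pcs` and `pclq` are the short names used elsewhere in this guide; the full plural name for PodCliqueScalingGroup below is an assumption):

```bash
kubectl get pcs
kubectl get pclq
kubectl get podcliquescalinggroups
```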

# Pod and Resource Naming Conventions

This section explains Grove's hierarchical naming scheme for pods and resources. Grove's naming convention is designed to be **self-documenting**: when you run `kubectl get pods`, the pod names immediately tell you which PodCliqueSet, PodCliqueScalingGroup (if applicable), and PodClique each pod belongs to.
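
As a quick illustration (generic kubectl; the exact names you see will follow the patterns described in the guides below):

```bash
# Pod names encode the owning PodCliqueSet, PodCliqueScalingGroup, and PodClique.
kubectl get pods -o wide

# Cross-reference the Grove resources that own them.
kubectl get pcs
kubectl get pclq
```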

## Prerequisites

Before starting this section:
- Review the [core concepts tutorial](../01_core-concepts/01_overview.md) to understand Grove's primitives
- Set up a cluster following the [installation guide](../../installation.md); the two options are:
  - [A local KIND demo cluster](../../installation.md#local-kind-cluster-set-up): create the cluster with `make kind-up FAKE_NODES=40`, set the `KUBECONFIG` environment variable as directed, and run `make deploy`
  - [A remote Kubernetes cluster](../../installation.md#remote-cluster-set-up) with [Grove installed from package](../../installation.md#install-grove-from-package)

## Guides in This Section

1. **[Naming Conventions](./02_naming-conventions.md)**: Learn the naming patterns, best practices, and how to plan names for your resources.

2. **[Hands-On Example](./03_hands-on-example.md)**: Deploy an example system with the structure of a multi-node disaggregated inference system and observe the naming hierarchy in action.