ai-dynamo
diff --git a/docs/user-guide/environment-variables.md b/docs/user-guide/environment-variables.md
diff --git a/docs/user-guide/pod-and-resource-naming-conventions/01_overview.md b/docs/user-guide/pod-and-resource-naming-conventions/01_overview.md
diff --git a/docs/user-guide/pod-and-resource-naming-conventions/02_naming-conventions.md b/docs/user-guide/pod-and-resource-naming-conventions/02_naming-conventions.md
@@ -6,7 +6,7 @@ This guide explains the environment variables that Grove automatically injects i
 
 Before starting this guide:
 - Review the [core concepts tutorial](./core-concepts/overview.md) to understand Grove's primitives
-- Read the [Pod Naming guide](./pod-naming.md) to understand Grove's naming conventions
+- Read the [Pod and Resource Naming Conventions guide](./pod-and-resource-naming-conventions/01_overview.md) to understand Grove's naming conventions
 - Set up a cluster following the [installation guide](../installation.md), the two options are:
   - [A local KIND demo cluster](../installation.md#local-kind-cluster-set-up): Create the cluster with `make kind-up FAKE_NODES=40`, set `KUBECONFIG` env variable as directed, and run `make deploy`
   - [A remote Kubernetes cluster](../installation.md#remote-cluster-set-up) with [Grove installed from package](../installation.md#install-grove-from-package)
 
@@ -0,0 +1,10 @@
+# Pod and Resource Naming Conventions
+
+This section explains Grove's hierarchical naming scheme for pods and resources. Grove's naming convention is designed to be **self-documenting**: when you run `kubectl get pods`, the pod names immediately tell you which PodCliqueSet, PodCliqueScalingGroup (if applicable), and PodClique each pod belongs to.
+
+## Guides in This Section
+
+1. **[Naming Conventions](./02_naming-conventions.md)**: Learn the naming patterns, best practices, and how to plan names for your resources.
+
+2. **[Hands-On Example](./03_hands-on-example.md)**: Deploy an example with the structure of a multi-node disaggregated inference system and observe the naming hierarchy in action.
+
@@ -0,0 +1,188 @@
+# Naming Conventions
+
+This guide explains Grove's hierarchical pod and resource naming scheme and best practices for naming your resources.
+
+## Why Hierarchical Naming Matters
+
+Grove's naming scheme serves two critical purposes:
+
+1. **Immediate Visual Understanding**: Pod names encode the complete hierarchy, so `kubectl get pods` output is self-explanatory. You can instantly see which pods belong together and how they're organized.
+
+2. **Programmatic Service Discovery**: The hierarchical structure enables pods to discover and communicate with each other using fully qualified domain names (FQDNs). The [Environment Variables guide](../environment-variables.md) demonstrates how to programmatically construct these FQDNs using Grove's injected environment variables.
+
+## Prerequisites
+
+Before starting this guide:
+- Review the [core concepts tutorial](../core-concepts/overview.md) to understand Grove's primitives
+- Set up a cluster following the [installation guide](../../installation.md). The two options are:
+  - [A local KIND demo cluster](../../installation.md#local-kind-cluster-set-up): Create the cluster with `make kind-up FAKE_NODES=40`, set the `KUBECONFIG` environment variable as directed, and run `make deploy`
+  - [A remote Kubernetes cluster](../../installation.md#remote-cluster-set-up) with [Grove installed from package](../../installation.md#install-grove-from-package)
+
+## Pod Naming Patterns
+
+### Standalone PodCliques
+
+For PodCliques that are **not** part of a PodCliqueScalingGroup, the pod naming follows this pattern:
+
+```
+<pcs-name>-<pcs-replica-index>-<pclq-name>-<random-suffix>
+```
+
+**Components:**
+- `<pcs-name>`: The name of the PodCliqueSet
+- `<pcs-replica-index>`: The replica index of the PodCliqueSet (0-based)
+- `<pclq-name>`: The name of the PodClique template defined in the PodCliqueSet spec
+- `<random-suffix>`: A random 5-character suffix generated by Kubernetes
+
+**Example:** `multinode-disaggregated-0-frontend-a7b3c`
+
+Looking at this name, you can immediately tell:
+- It belongs to the `multinode-disaggregated` PodCliqueSet
+- It's part of PodCliqueSet replica 0
+- It's from the `frontend` PodClique
+
+### PodCliques in a PodCliqueScalingGroup
+
+For PodCliques that **are** part of a PodCliqueScalingGroup, the pod naming includes the PCSG information:
+
+```
+<pcs-name>-<pcs-replica-index>-<pcsg-name>-<pcsg-replica-index>-<pclq-name>-<random-suffix>
+```
+
+**Components:**
+- `<pcs-name>`: The name of the PodCliqueSet
+- `<pcs-replica-index>`: The replica index of the PodCliqueSet (0-based)
+- `<pcsg-name>`: The name of the PodCliqueScalingGroup template
+- `<pcsg-replica-index>`: The replica index of the PodCliqueScalingGroup (0-based)
+- `<pclq-name>`: The name of the PodClique template within the PCSG
+- `<random-suffix>`: A random 5-character suffix generated by Kubernetes
+
+**Example:** `multinode-disaggregated-0-prefill-1-pworker-m9n0o`
+
+Looking at this name, you can immediately tell:
+- It belongs to the `multinode-disaggregated` PodCliqueSet (replica 0)
+- It's part of the `prefill` PodCliqueScalingGroup (replica 1)
+- It's from the `pworker` PodClique (prefill worker)
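The two patterns above can be sketched as a single helper. This is a minimal illustration, not Grove's implementation: the function name `pod_name` and its parameters are hypothetical, and the suffix is passed in explicitly because the real one is random.

```python
from typing import Optional


def pod_name(
    pcs: str,
    pcs_idx: int,
    pclq: str,
    suffix: str,
    pcsg: Optional[str] = None,
    pcsg_idx: Optional[int] = None,
) -> str:
    """Compose a pod name from its hierarchy components (hypothetical helper)."""
    parts = [pcs, str(pcs_idx)]
    if pcsg is not None:
        # PodCliques inside a scaling group carry the PCSG name and replica index.
        if pcsg_idx is None:
            raise ValueError("pcsg_idx is required when pcsg is given")
        parts += [pcsg, str(pcsg_idx)]
    parts += [pclq, suffix]
    return "-".join(parts)


# The two examples from this guide:
print(pod_name("multinode-disaggregated", 0, "frontend", "a7b3c"))
print(pod_name("multinode-disaggregated", 0, "pworker", "m9n0o", pcsg="prefill", pcsg_idx=1))
```

Going the other way (parsing a pod name back into components) is ambiguous in general, because user-chosen names such as `multinode-disaggregated` may themselves contain hyphens.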
+
+## Naming Best Practices
+
+### Kubernetes Name Length Limit
+
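+Kubernetes enforces a **63-character limit** on names that must be valid DNS labels (RFC 1123), and this guide treats that limit as the budget for every generated pod name. Since Grove constructs full pod names by combining multiple components, you need to be mindful of name lengths when choosing names for your resources.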
+
+**How Grove constructs names:**
+
+For standalone PodCliques, the final pod name is:
+```
+<pcs-name>-<pcs-replica-idx>-<pclq-name>-<5-char-suffix>
+```
+
+For PodCliques in a PCSG, the final pod name is:
+```
+<pcs-name>-<pcs-replica-idx>-<pcsg-name>-<pcsg-replica-idx>-<pclq-name>-<5-char-suffix>
+```
+
+**Character budget breakdown:**
+- `<5-char-suffix>`: 5 characters (fixed by Kubernetes)
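+- `-` separators: 3 characters for standalone PodCliques, 5 characters for PodCliques in a PCSG (hyphens inside your own names count toward your names, not here)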
+- Replica indices: 1+ characters each (single digit for 0-9, two digits for 10-99, etc.)
+- Your chosen names: Remaining characters
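As a rough sketch of this budget, the arithmetic can be written out directly. The helper `name_budget` and the constant names are hypothetical; the numbers (63-character limit, 5-character suffix, 3 or 5 separators) come from the patterns above.

```python
from typing import Optional

NAME_LIMIT = 63  # limit used throughout this guide
SUFFIX_LEN = 5   # random suffix appended to every pod name


def name_budget(pcs_idx: int, pcsg_idx: Optional[int] = None) -> int:
    """Characters left for the names you choose (PCS + optional PCSG + PodClique)."""
    # Standalone: pcs-idx-pclq-suffix has 3 hyphens; in a PCSG there are 5.
    separators = 3 if pcsg_idx is None else 5
    index_chars = len(str(pcs_idx))
    if pcsg_idx is not None:
        index_chars += len(str(pcsg_idx))
    return NAME_LIMIT - SUFFIX_LEN - separators - index_chars


print(name_budget(0))       # standalone, single-digit PCS index
print(name_budget(0, 0))    # in a PCSG, single-digit indices
print(name_budget(99, 99))  # in a PCSG, two-digit indices
```

The returned value is the combined length available for the PodCliqueSet name, the PCSG name (if any), and the PodClique name.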
+
+### Naming Guidelines
+
+1. **Use Short, Descriptive Names**: Choose concise but meaningful names
+   - ✅ Good: `frontend`, `api`, `db`, `cache`
+   - ❌ Avoid: `frontend-service-component`, `api-gateway-server` (too long)
+   - ❌ Avoid: `f`, `a`, `d`, `c` (too cryptic)
+
+2. **Use Abbreviations for Multi-Component Systems**: When you have multiple PodCliqueScalingGroups with similar roles, use prefixes or abbreviations
+   - ✅ Good: `pleader`, `pworker` (prefill), `dleader`, `dworker` (decode)
+   - ❌ Avoid: `prefill-leader`, `prefill-worker`, `decode-leader`, `decode-worker`
+
+3. **Keep PodCliqueSet Names Short**: Remember that the PCS name is included in every pod name
+   - ✅ Good: `ml-inference`, `web-app`, `data-pipeline`
+   - ❌ Avoid: `machine-learning-inference-service`, `web-application-stack`
+
+4. **Plan for Scaling**: Each additional digit in a replica index adds 1 character to every pod name that contains it
+   - If you plan to scale to 10+, 100+, or 1000+ replicas, budget for 2-, 3-, or 4-digit indices accordingly
+
+5. **Unique PodClique Names Within a PodCliqueSet**: All PodClique names must be unique within a PodCliqueSet. We explain the rationale for this further in the [Hands-On Example](./03_hands-on-example.md#why-unique-podclique-names-matter).
+   - If you have leader/worker patterns in multiple PCSGs, you **must** use different names (e.g., `pleader`/`pworker` and `dleader`/`dworker`)
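The uniqueness rule in guideline 5 is easy to check before applying a manifest. The sketch below assumes you have already collected every PodClique template name in a PodCliqueSet into a plain list; `duplicate_names` is a hypothetical helper, not part of Grove.

```python
from collections import Counter
from typing import Iterable, List


def duplicate_names(names: Iterable[str]) -> List[str]:
    """Return the PodClique names that appear more than once, sorted."""
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


# Reusing leader/worker in both scaling groups violates the rule:
print(duplicate_names(["frontend", "leader", "worker", "leader", "worker"]))

# Prefixed names keep every PodClique unique:
print(duplicate_names(["frontend", "pleader", "pworker", "dleader", "dworker"]))
```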
+
+### Example: Planning Names for a Complex System
+
+Let's plan names for a multi-node disaggregated inference system with a frontend:
+
+**Requirements:**
+- 1 standalone frontend component
+- 2 multi-node components: prefill and decode
+- Each multi-node component has leader/worker roles
+- All PodClique names must be unique within the PodCliqueSet
+- Names should be short to allow for scaling headroom while remaining descriptive
+
+**Name choices:**
+- PodCliqueSet: `mn-disagg` (short, 9 chars)
+- Standalone PodClique: `frontend` (8 chars)
+- PCSG for prefill: `prefill` (7 chars)
+  - Leader PodClique: `pleader` (7 chars)
+  - Worker PodClique: `pworker` (7 chars)
+- PCSG for decode: `decode` (6 chars)
+  - Leader PodClique: `dleader` (7 chars)
+  - Worker PodClique: `dworker` (7 chars)
+
+**Resulting pod names:**
+- Frontend: `mn-disagg-0-frontend-a7b3c` (26 chars) ✅
+- Prefill leader: `mn-disagg-0-prefill-0-pleader-a7b3c` (35 chars) ✅
+- Prefill worker: `mn-disagg-0-prefill-0-pworker-a7b3c` (35 chars) ✅
+- Decode leader: `mn-disagg-0-decode-0-dleader-a7b3c` (34 chars) ✅
+- Decode worker: `mn-disagg-0-decode-0-dworker-a7b3c` (34 chars) ✅
+
+**Scaling headroom:** The longest name (`mn-disagg-0-prefill-0-pworker-a7b3c`) is 35 characters, leaving 28 characters of headroom. Each additional digit in a replica index adds 1 character:
+- 2-digit indices for PCS and PCSG (10-99): 37 chars → scales to 100 PCS replicas × 100 PCSG replicas ✅
+- 3-digit indices for PCS and PCSG (100-999): 39 chars → scales to 1,000 × 1,000 replicas ✅
+- 7-digit indices for PCS and PCSG: 47 chars → scales to 10 million replicas on each dimension ✅
+
+With these name choices, every pod name stays well under the 63-character limit even at millions of replicas on both dimensions, and the names remain descriptive.
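The headroom arithmetic can be reproduced with a few lines. This is only a sketch of the calculation: it starts from the longest example name and assumes both replica indices grow to the same number of digits. The names `BASE`, `length_with_digits`, and `max_digits` are illustrative.

```python
NAME_LIMIT = 63
# Longest pod name in the example, with single-digit PCS and PCSG indices.
BASE = "mn-disagg-0-prefill-0-pworker-a7b3c"


def length_with_digits(digits: int) -> int:
    """Pod name length when both replica indices have `digits` digits."""
    # Each index already contributes 1 digit, so only the extra digits add length.
    return len(BASE) + 2 * (digits - 1)


for digits in (1, 2, 3, 7):
    print(digits, length_with_digits(digits))

# Largest digit count per index that still fits within the limit.
max_digits = max(d for d in range(1, NAME_LIMIT + 1) if length_with_digits(d) <= NAME_LIMIT)
print("max digits per index:", max_digits)
```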
+
+To deploy a PodCliqueSet with this structure and explore the naming hierarchy through `kubectl`, continue to the [Hands-On Example](./03_hands-on-example.md).
+
+---
+
+## Resource Naming Reference
+
+### Grove Resources and Their Naming
+
+| Resource | You Name | Grove Generates | Pattern |
+|----------|----------|-----------------|---------|
+| **PodCliqueSet** | ✅ | - | `<your-pcs-name>` |
+| **PodClique (template)** | ✅ | - | `<your-pclq-name>` (in `spec.template.cliques`) |
+| **PCSG (template)** | ✅ | - | `<your-pcsg-name>` (in `spec.template.podCliqueScalingGroups`) |
+| **PodClique (resource, standalone)** | - | ✅ | `<pcs-name>-<pcs-idx>-<pclq-name>` |
+| **PodClique (resource, in PCSG)** | - | ✅ | `<pcs-name>-<pcs-idx>-<pcsg-name>-<pcsg-idx>-<pclq-name>` |
+| **PCSG (resource)** | - | ✅ | `<pcs-name>-<pcs-idx>-<pcsg-name>` |
+| **Pod (standalone)** | - | ✅ | `<pcs-name>-<pcs-idx>-<pclq-name>-<suffix>` |
+| **Pod (in PCSG)** | - | ✅ | `<pcs-name>-<pcs-idx>-<pcsg-name>-<pcsg-idx>-<pclq-name>-<suffix>` |
+
+**You control:** PodCliqueSet name, PodClique template names, PCSG template names  
+**Grove generates:** All resource instances with hierarchical naming
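The generated-resource rows of the table can be sketched as small formatting helpers. These functions are hypothetical and only mirror the patterns listed above; they are not Grove's code.

```python
from typing import Optional


def pcsg_resource_name(pcs: str, pcs_idx: int, pcsg: str) -> str:
    """PCSG resource: <pcs-name>-<pcs-idx>-<pcsg-name>."""
    return f"{pcs}-{pcs_idx}-{pcsg}"


def pclq_resource_name(
    pcs: str,
    pcs_idx: int,
    pclq: str,
    pcsg: Optional[str] = None,
    pcsg_idx: Optional[int] = None,
) -> str:
    """PodClique resource, standalone or inside a PCSG replica."""
    if pcsg is None:
        return f"{pcs}-{pcs_idx}-{pclq}"
    if pcsg_idx is None:
        raise ValueError("pcsg_idx is required when pcsg is given")
    return f"{pcsg_resource_name(pcs, pcs_idx, pcsg)}-{pcsg_idx}-{pclq}"


print(pcsg_resource_name("mn-disagg", 0, "prefill"))
print(pclq_resource_name("mn-disagg", 0, "frontend"))
print(pclq_resource_name("mn-disagg", 0, "pleader", pcsg="prefill", pcsg_idx=0))
```

Per the table, each pod name is its PodClique resource name followed by `-<suffix>`, so a pod's owning PodClique can be read off by dropping the final segment.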
+
+## Key Takeaways
+
+1. **Self-Documenting Hierarchy**: Pod names encode the complete hierarchy from PodCliqueSet → PCSG (if applicable) → PodClique → Pod, making `kubectl get pods` output immediately understandable.
+
+2. **63-Character Limit**: Generated names must fit within Kubernetes' 63-character limit. Use short, meaningful names for your resources, especially the PodCliqueSet name, which appears in every generated name, and PCSG names, which appear in the name of every resource and pod under that scaling group.
+
+3. **Unique PodClique Names**: All PodClique names must be unique within a PodCliqueSet. When you have similar roles in multiple PCSGs (e.g., leader/worker in both prefill and decode), use prefixes or abbreviations (e.g., `pleader`/`pworker` and `dleader`/`dworker`).
+
+4. **Predictable Patterns**: The naming scheme is consistent whether you're using standalone PodCliques or PodCliqueScalingGroups, making it easy to understand your system at a glance.
+
+5. **Planning is Key**: Before creating resources, plan your names considering the full hierarchy and potential scaling needs.
+
+## Next Steps
+
+Now that you understand Grove's naming scheme and best practices:
+
+- **See it in action**: Continue to the [Hands-On Example](./03_hands-on-example.md) to deploy an example system and observe the naming hierarchy firsthand.
+
+- **Learn programmatic discovery**: Head to the [Environment Variables guide](../environment-variables.md) to learn how to use these names programmatically for service discovery, including how Grove injects environment variables and how to construct FQDNs for pod-to-pod communication.
+