doc: Add pod-naming and environment-variables docs #355

nvrohanv · 2026-01-20T04:37:31Z

Add Pod Naming and Environment Variables Documentation

Summary

This PR adds comprehensive documentation for Grove's pod naming conventions and environment variables for pod discovery. The documentation is organized into two new user guide sections:

Pod and Resource Naming Conventions (docs/user-guide/02_pod-and-resource-naming-conventions/) - Explains Grove's hierarchical naming scheme and best practices for naming resources
Environment Variables for Pod Discovery (docs/user-guide/03_environment-variables-for-pod-discovery/) - Documents the environment variables Grove injects for pod discovery and coordination and shows examples of how to use them to construct pod fully qualified domain names (FQDN).

Changes

New Documentation Structure

docs/user-guide/02_pod-and-resource-naming-conventions/

01_overview.md - Prerequisites and guide overview
02_naming-conventions.md - Complete guide covering:
- Hierarchical pod naming patterns for standalone PodCliques and PodCliqueScalingGroups
- Kubernetes 63-character name length considerations with scaling headroom calculations
- Best practices for naming resources (short prefixes like pleader/pworker and dleader/dworker)
- Why PodClique names must be unique within a PodCliqueSet (reflects Grove's philosophy that each PodClique represents a distinct component with a unique role)
- Worked example planning names for a complex system
03_hands-on-example.md - Deployable multi-node disaggregated inference example demonstrating the naming hierarchy

docs/user-guide/03_environment-variables-for-pod-discovery/

01_overview.md - Prerequisites and guide overview
02_env_var_reference.md - Reference guide covering:
- Key distinction between pod names (random suffix) and hostnames (deterministic, DNS-resolvable)
- How Grove automatically creates headless services for each PodCliqueSet replica
- Complete reference of all Grove-injected environment variables (GROVE_PCS_NAME, GROVE_PCLQ_NAME, GROVE_PCSG_NAME, etc.)
03_hands-on-examples.md - Two worked examples with deployable YAMLs:
- Standalone PodClique environment variables
- PCSG with leader-worker pod discovery
04_common-patterns-and-takeaways.md - Practical patterns for constructing FQDNs, finding leaders, discovering peers, and key takeaways

New Sample Files

operator/samples/user-guide/02_pod-and-resource-naming-conventions/multinode-disaggregated-with-frontend.yaml
operator/samples/user-guide/03_environment-variables-for-pod-discovery/standalone-env-vars.yaml
operator/samples/user-guide/03_environment-variables-for-pod-discovery/pcsg-env-vars.yaml

Reorganized Existing Files

Documentation Directories - Added numeric prefixes for ordering:

docs/user-guide/core-concepts/ → docs/user-guide/01_core-concepts/
docs/user-guide/pod-and-resource-naming-conventions/ → docs/user-guide/02_pod-and-resource-naming-conventions/
docs/user-guide/environment-variables-for-pod-discovery/ → docs/user-guide/03_environment-variables-for-pod-discovery/

Core Concepts Files - Renamed with numeric prefixes:

overview.md → 01_overview.md
pcs_and_pclq_intro.md → 02_pcs_and_pclq_intro.md
pcsg_intro.md → 03_pcsg_intro.md
takeaways.md → 04_takeaways.md

Sample YAML Directories - Reorganized to match documentation structure:

operator/samples/user-guide/concept-overview/ → operator/samples/user-guide/01_core-concepts/
operator/samples/user-guide/pod-and-resource-naming-conventions/ → operator/samples/user-guide/02_pod-and-resource-naming-conventions/
operator/samples/user-guide/environment-variables-for-pod-discovery/ → operator/samples/user-guide/03_environment-variables-for-pod-discovery/

All sample YAMLs updated with:

Documentation reference comments pointing to new paths
Minimal resource requirements (cpu: 10m, memory: 32Mi) to work on a single real node in KIND clusters

Updated References

Updated README.md and docs/quickstart.md with links to new documentation structure
Updated all cross-references between documentation files
Updated all sample YAML documentation comments

Testing

All commands documented in the guides were tested end-to-end:

Created KIND cluster with make kind-up FAKE_NODES=40
Deployed Grove with make deploy
Successfully ran all kubectl apply, kubectl get pods/pclq/pcsg, kubectl logs, and kubectl exec -- nslookup commands
Verified environment variables are correctly injected
Verified DNS resolution works for pod discovery
Verified all internal documentation links resolve correctly

…s in user guide, update resource requirements in samples to work on one real node Signed-off-by: Rohan Varma <rohanv@nvidia.com>

shayasoolin

LGTM, added one comment

docs/user-guide/pod-naming.md

docs/user-guide/environment-variables.md

docs/user-guide/pod-naming.md

gflarity · 2026-01-20T19:25:30Z

Just a few minor things, otherwise looks good.

Co-authored-by: Geoff Flarity <geoff.flarity@gmail.com> Signed-off-by: Rohan Varma <rohanv@nvidia.com>

athreesh · 2026-01-21T06:12:41Z

Docs are quite long but it's not a bad thing. @nvrohanv would double check if there's any room to consolidate.

also, keep in mind 63-char Kubernetes limit

otherwise LGTM!

renormalize

A few nits, but overall the changes look good to me.

renormalize · 2026-01-22T12:44:14Z

docs/user-guide/environment-variables.md

+- Example: `my-service-0-model-instance`
+- **Note:** This does NOT include the PCSG replica index. To construct a sibling PodClique name within the same PCSG replica, use: `$GROVE_PCSG_NAME-$GROVE_PCSG_INDEX-<pclq-template-name>`
+
+**Note:** `GROVE_PCSG_TEMPLATE_NUM_PODS` represents the total number of pods defined in the PodCliqueScalingGroup template, calculated as the sum of replicas across all PodCliques in the PCSG. For example, if a PCSG has 1 leader replica and 3 worker replicas, this value would be 4. This value does not change based on scaling, so is only guaranteed to be accurate at startup.


Why is it only guaranteed to be accurate at startup if the value does not change due to scaling?

This is the sum of the number of pods across all PodCliques that belong to a particular PodCliqueScalingGroup. It is guaranteed to be accurate throughout. This value doesn't change when a PodCliqueScalingGroup scales either.

Suggested change

**Note:** `GROVE_PCSG_TEMPLATE_NUM_PODS` represents the total number of pods defined in the PodCliqueScalingGroup template, calculated as the sum of replicas across all PodCliques in the PCSG. For example, if a PCSG has 1 leader replica and 3 worker replicas, this value would be 4. This value does not change based on scaling, so is only guaranteed to be accurate at startup.

**Note:** `GROVE_PCSG_TEMPLATE_NUM_PODS` represents the total number of pods defined in the PodCliqueScalingGroup template, calculated as the sum of replicas across all PodCliques in the PCSG. For example, if a PCSG has 1 leader replica and 3 worker replicas, this value would be 4. This value does not change based on scaling, so is guaranteed to be accurate through the entirety of the pod's lifecycle.

It would change if you scale the PodCliques within a PCSG replica right?

I can explain that it would only change if the PodClique within PCSG replica is scaled if thats correct.

docs/user-guide/environment-variables.md

operator/samples/user-guide/naming-and-env-vars/standalone-env-vars.yaml

docs/user-guide/pod-naming.md

Signed-off-by: Rohan Varma <rohanv@nvidia.com>

…adability Signed-off-by: Rohan Varma <rohanv@nvidia.com>

Signed-off-by: Rohan Varma <rohanv@nvidia.com>

shayasoolin · 2026-01-26T13:12:58Z

docs/user-guide/02_pod-and-resource-naming-conventions/02_naming-conventions.md

+   - ✅ Good: `pleader`, `pworker` (prefill), `dleader`, `dworker` (decode)
+   - ❌ Avoid: `prefill-leader`, `prefill-worker`, `decode-leader`, `decode-worker`
+
+3. **Keep PodCliqueSet Names Short**: Remember that the PCS name is included in every pod name


I wonder if the above example of multinode-disaggregated as PCS name is short enough as an example.. maybe mn-disagg is better.

agreed will put mn-disagg everywhere

shayasoolin · 2026-01-26T13:14:45Z

docs/user-guide/02_pod-and-resource-naming-conventions/03_hands-on-example.md

+apiVersion: grove.io/v1alpha1
+kind: PodCliqueSet
+metadata:
+  name: multinode-disaggregated


If mn-disagg is recommended above, let's use it here.

shayasoolin · 2026-01-26T13:16:44Z

docs/user-guide/02_pod-and-resource-naming-conventions/03_hands-on-example.md

+
+PodClique names **must be unique within a PodCliqueSet**. Reusing generic names like `leader` and `worker` across multiple PodCliqueScalingGroups is therefore **not allowed**.
+
+More importantly, this reflects **Grove's core philosophy**.


I think the philosophy part is more suitable to the main readme file, not the hands-on example.

I'm not sure the uniqueness issue requires so much attention. Due to the nature of the PCS yaml, where PCLQs are all specified in the same hierarchy level, it's almost self-explanatory that names are expected to be unique. You rarely expect a list to allow repeating the same names.
So worth mentioning, but I'd consider shortening it.

Agreed that it is really calling it out. Still the first reason I'm wanting to go into such detail is that when people see a name like prefill-pleader they tend to wonder why its not just prefill-leader so I wanted to go into detail about why we are requiring this. The second reason is that a good chunk of users will probably just ask AI about things these days so in general i'm putting in stuff like so that the AI picks up a clear explanation when it searches grove repo. Let me know if you have strong opinion on this.

Regarding the philosophy part I agree that main readme probably should have general philosophy. I'm thinking i can maybe keep this for now and then in a separate pr (since this one is already so big) reference the core philosophy in a previous doc and then just call it out again.

gflarity · 2026-01-26T14:58:43Z

For next update:

If I was looking to find the exact names of the various PCS components I'd probably just use labels rather than futz around with constructing a name:

Grove Labels Reference

This document summarizes the Kubernetes labels that Grove sets on pods and resources, enabling programmatic queries via kubectl label selectors.

Source: operator/api/common/labels.go

Grove-Specific Labels (`grove.io/`)

Label Key	Description
`grove.io/podclique`	The name of the PodClique the pod belongs to
`grove.io/podcliqueset-replica-index`	The replica index of the PodCliqueSet (0-based)
`grove.io/podcliquescalinggroup`	The name of the PodCliqueScalingGroup (if applicable)
`grove.io/podcliquescalinggroup-replica-index`	The replica index of the PCSG (0-based)
`grove.io/podgang`	The name of the PodGang for gang scheduling
`grove.io/base-podgang`	Base PodGang name for scaled PodGangs (beyond MinAvailable)
`grove.io/pod-template-hash`	Hash of the PodSpec (used for rolling updates)

Standard Kubernetes Labels (`app.kubernetes.io/`)

Label Key	Value	Description
`app.kubernetes.io/part-of`	PodCliqueSet name	All resources belonging to a PodCliqueSet share this label
`app.kubernetes.io/managed-by`	`grove-operator`	Identifies resources managed by Grove
`app.kubernetes.io/component`	varies	Component type (see below)
`app.kubernetes.io/name`	resource name	Name of the specific resource

Component Values (`app.kubernetes.io/component`)

Value	Description
`pcs-podclique`	PodClique owned directly by PodCliqueSet (standalone)
`pcsg-podclique`	PodClique owned by a PodCliqueScalingGroup
`pcs-podcliquescalinggroup`	PodCliqueScalingGroup resource
`pcs-headless-service`	Headless service for a PodCliqueSet replica
`podgang`	PodGang resource
`pod-role`	Role for pods in the PodCliqueSet
`pod-role-binding`	RoleBinding for pods
`pod-service-account`	ServiceAccount for pods
`pcs-hpa`	HorizontalPodAutoscaler for scaling

Usage Examples

Get All Pods in a PodCliqueSet

kubectl get pods -l app.kubernetes.io/part-of=my-inference-app

Get Pods in a Specific PodClique

kubectl get pods -l grove.io/podclique=prefill-0-pleader

Get Pods in a PodCliqueScalingGroup

kubectl get pods -l grove.io/podcliquescalinggroup=prefill

Get Pods in a Specific PCSG Replica

kubectl get pods -l grove.io/podcliquescalinggroup=prefill,grove.io/podcliquescalinggroup-replica-index=0

Get Pods in a Specific PodCliqueSet Replica

kubectl get pods -l app.kubernetes.io/part-of=my-inference-app,grove.io/podcliqueset-replica-index=0

Get All Grove-Managed Resources

kubectl get pods -l app.kubernetes.io/managed-by=grove-operator

Get All Standalone PodCliques (not in a PCSG)

kubectl get pclq -l app.kubernetes.io/component=pcs-podclique

Get All PodCliques in PodCliqueScalingGroups

kubectl get pclq -l app.kubernetes.io/component=pcsg-podclique

Combining Labels with Environment Variables

Grove also injects environment variables into pods for programmatic discovery from within containers. See the Environment Variables documentation for details on:

GROVE_PCS_NAME - PodCliqueSet name
GROVE_PCS_INDEX - PodCliqueSet replica index
GROVE_PCLQ_NAME - PodClique name
GROVE_PCSG_NAME - PodCliqueScalingGroup name
GROVE_PCSG_INDEX - PodCliqueScalingGroup replica index
GROVE_HEADLESS_SERVICE - Headless service for DNS resolution
GROVE_PCLQ_POD_INDEX - Pod index within the PodClique

Labels vs Environment Variables:

Use labels for external queries (kubectl, monitoring, alerting)
Use environment variables for internal pod-to-pod discovery and coordination

gflarity

I think using the labels I mentioned might make it a lot easier for those who need to find/communication with different components. So I threw them in there for consideration. Regardless this is a huge improvement 👍

nvrohanv · 2026-01-27T05:51:43Z

I think using the labels I mentioned might make it a lot easier for those who need to find/communication with different components. So I threw them in there for consideration. Regardless this is a huge improvement 👍

Yes this is great, I'll add this in as a separate pr since this one is already quite large.

add pod-naming and environment-variables docs, add docs to all sample…

db60826

…s in user guide, update resource requirements in samples to work on one real node Signed-off-by: Rohan Varma <rohanv@nvidia.com>

nvrohanv requested review from Ronkahn21, gflarity, sanjaychatterjee, shayasoolin and unmarshall as code owners January 20, 2026 04:37

nvrohanv requested a review from athreesh January 20, 2026 04:37

nvrohanv added the area/documentation Categorizes the Issue/PR as related to documentation label Jan 20, 2026

shayasoolin reviewed Jan 20, 2026

View reviewed changes

docs/user-guide/pod-naming.md Outdated Show resolved Hide resolved