Merged
3 changes: 3 additions & 0 deletions .gitignore
@@ -35,3 +35,6 @@ llmd-infra/

*.tgz
actionlint

# AI
.claude
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -56,7 +56,7 @@ The repository uses AI-powered workflows to automate repetitive tasks:
- **Workflow Creation**: Interactive designer for new workflows
- **Workflow Debugging**: Assists with troubleshooting

Learn more in the [Agentic Workflows Guide](docs/developer-guide/agentic-workflows.md).
Learn more in the [Developer Guide](docs/developer-guide/development.md).

## WVA Project Structure

13 changes: 3 additions & 10 deletions README.md
@@ -8,7 +8,7 @@ The Workload Variant Autoscaler (WVA) is a Kubernetes-based global autoscaler fo

### What is a variant?

In WVA, a **variant** is a way of serving a given model: a scale target (Deployment, StatefulSet, or LWS) with a particular combination of hardware, runtimes, and serving approach. Variants for the same model share the same base model (e.g. meta/llama-3.1-8b); LoRA adapters can differ per variant. Each variant is a distinct setup—e.g. different accelerators (A100, H100, L4), parallelism, or performance requirements. Create one `VariantAutoscaling` per variant; when several variants serve the same model, WVA chooses which to scale (e.g. add capacity on the cheapest variant, remove it from the most expensive). See [Configuration](docs/user-guide/configuration.md) and [Saturation Analyzer](docs/saturation-analyzer.md) for details.
In WVA, a **variant** is a way of serving a given model: a scale target (Deployment, StatefulSet, or LWS) with a particular combination of hardware, runtimes, and serving approach. Variants for the same model share the same base model (e.g. meta/llama-3.1-8b); LoRA adapters can differ per variant. Each variant is a distinct setup—e.g. different accelerators (A100, H100, L4), parallelism, or performance requirements. Create one `VariantAutoscaling` per variant; when several variants serve the same model, WVA chooses which to scale (e.g. add capacity on the cheapest variant, remove it from the most expensive). See [Configuration](docs/user-guide/configuration.md) and [Saturation Analyzer](docs/user-guide/saturation-analyzer.md) for details.

<!--
<![Architecture](docs/design/diagrams/inferno-WVA-design.png)>
@@ -29,16 +29,9 @@ In WVA, a **variant** is a way of serving a given model: a scale target (Deploym
- [CRD Reference](docs/user-guide/crd-reference.md)
- [Multi-Controller Isolation](docs/user-guide/multi-controller-isolation.md)

<!--

### Tutorials
- [Quick Start Demo](docs/tutorials/demo.md)
- [Parameter Estimation](docs/tutorials/parameter-estimation.md)
- [vLLM Server Setup](docs/tutorials/vllm-samples.md)
-->
### Integrations
- [HPA Integration](docs/integrations/hpa-integration.md)
- [KEDA Integration](docs/integrations/keda-integration.md)
- [HPA Integration](docs/user-guide/hpa-integration.md)
- [KEDA Integration](docs/user-guide/keda-integration.md)
- [Prometheus Metrics](docs/integrations/prometheus.md)

<!--
2 changes: 1 addition & 1 deletion charts/workload-variant-autoscaler/README.md
@@ -248,7 +248,7 @@ HPA_STABILIZATION_SECONDS=120 ./deploy/install.sh
- **Development**: Use 30-60 seconds for faster iteration
- **E2E Tests**: Use 30 seconds for rapid validation

See [HPA Integration Guide](../../docs/integrations/hpa-integration.md) for detailed information.
See [HPA Integration Guide](../../docs/user-guide/hpa-integration.md) for detailed information.

### Usage Examples

File renamed without changes.
7 changes: 7 additions & 0 deletions config/samples/hpa/kustomization.yaml
@@ -0,0 +1,7 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
metadata:
  name: hpa-sample
resources:
  - va.yaml
  - hpa.yaml
14 changes: 14 additions & 0 deletions config/samples/hpa/va.yaml
@@ -0,0 +1,14 @@
# Example VariantAutoscaling for HPA/KEDA integration.
# Ensure a Deployment named sample-deployment exists in llm-d-sim (e.g. from kind-emulator or e2e).
apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
  name: sample-deployment
  namespace: llm-d-sim
  labels:
    inference.optimization/acceleratorName: A100
spec:
  scaleTargetRef:
    kind: Deployment
    name: sample-deployment
  modelID: default/default
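The companion `hpa.yaml` referenced by the kustomization is not shown in this diff. As a rough sketch only, an HPA pairing with the sample above might consume WVA's desired-replica signal as an external metric surfaced through a Prometheus adapter; the metric name, label selector, and target value here are illustrative assumptions, not taken from this PR:

```yaml
# Hypothetical hpa.yaml companion; metric name and labels are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sample-deployment
  namespace: llm-d-sim
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sample-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: inferno_desired_replicas  # assumed metric name, exposed via an adapter
          selector:
            matchLabels:
              variant_name: sample-deployment  # assumed label
        target:
          type: AverageValue
          averageValue: "1"  # scale until replicas match the desired count
```

See the [HPA Integration Guide](docs/user-guide/hpa-integration.md) referenced elsewhere in this PR for the authoritative setup.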
7 changes: 7 additions & 0 deletions config/samples/keda/kustomization.yaml
@@ -0,0 +1,7 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
metadata:
  name: keda-sample
resources:
  - va.yaml
  - scaledobject.yaml
14 changes: 14 additions & 0 deletions config/samples/keda/va.yaml
@@ -0,0 +1,14 @@
# Example VariantAutoscaling for HPA/KEDA integration.
# Ensure a Deployment named sample-deployment exists in llm-d-sim (e.g. from kind-emulator or e2e).
apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
  name: sample-deployment
  namespace: llm-d-sim
  labels:
    inference.optimization/acceleratorName: A100
spec:
  scaleTargetRef:
    kind: Deployment
    name: sample-deployment
  modelID: default/default
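Likewise, the `scaledobject.yaml` referenced by the KEDA kustomization is not shown here. A minimal sketch of what it could look like, assuming WVA's desired-replica count is queryable from Prometheus — the server address, query, and threshold are illustrative assumptions, not contents of this PR:

```yaml
# Hypothetical scaledobject.yaml companion; query and address are illustrative.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sample-deployment
  namespace: llm-d-sim
spec:
  scaleTargetRef:
    name: sample-deployment  # the Deployment created alongside the VariantAutoscaling
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-k8s.monitoring.svc:9090  # assumed address
        query: inferno_desired_replicas{variant_name="sample-deployment"}  # assumed metric/labels
        threshold: "1"  # one replica per unit of desired count
```

See the [KEDA Integration Guide](docs/user-guide/keda-integration.md) referenced elsewhere in this PR for the authoritative setup.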
2 changes: 1 addition & 1 deletion deploy/README.md
@@ -706,7 +706,7 @@ The `VLLM_MAX_NUM_SEQS` variable controls the maximum number of concurrent seque

**Use cases:**
- **E2E Testing**: Set to low values (e.g., `8` or `16`) to quickly trigger saturation and test autoscaling
- **Parameter Estimation**: Match this to your desired maximum batch size (see [Parameter Estimation Guide](../docs/tutorials/parameter-estimation.md))
- **Parameter Estimation**: Match this to your desired maximum batch size (see [Configuration Guide](../docs/user-guide/configuration.md))
- **Production**: Leave unset to use vLLM's default based on available GPU memory

**Example:**
4 changes: 2 additions & 2 deletions docs/CHANGELOG-v0.5.0.md
@@ -41,7 +41,7 @@ T+90s: variant-1 ready (PendingReplicas=0), eligible again
- ✅ Maintains cost-optimized scaling across variants

**Documentation:**
- [Saturation Analyzer - Cascade Scaling Prevention](saturation-analyzer.md#cascade-scaling-prevention)
- [Saturation Analyzer - Cascade Scaling Prevention](user-guide/saturation-analyzer.md#cascade-scaling-prevention)
- [Saturation Scaling Config](saturation-scaling-config.md#how-scale-up-triggers-work)

### 2. Prometheus Configuration via Environment Variables
@@ -250,6 +250,6 @@ E2E tests updated:
---

For detailed implementation, see:
- [Saturation Analyzer Documentation](saturation-analyzer.md)
- [Saturation Analyzer Documentation](user-guide/saturation-analyzer.md)
- [Prometheus Integration](integrations/prometheus.md)
- [Configuration Guide](user-guide/configuration.md)
15 changes: 2 additions & 13 deletions docs/README.md
@@ -14,21 +14,12 @@ Getting started and using WVA:
- **[Multi-Controller Isolation](user-guide/multi-controller-isolation.md)** - Running multiple WVA controller instances
- **[LeaderWorkerSet Support](user-guide/LeaderWorkerSet-support.md)** - Supporting LeaderWorkerSets as scale targets

### Tutorials

Step-by-step guides:

- **[Quick Start Demo](tutorials/demo.md)** - Getting started with WVA
- **[Parameter Estimation](tutorials/parameter-estimation.md)** - Estimating model parameters
- **[vLLM Samples](tutorials/vllm-samples.md)** - Working with vLLM servers
- **[GuideLLM Sample](tutorials/guidellm-sample.md)** - Using GuideLLM for benchmarking

### Integrations

Integration with other systems:

- **[HPA Integration](integrations/hpa-integration.md)** - Using WVA with Horizontal Pod Autoscaler
- **[KEDA Integration](integrations/keda-integration.md)** - Using WVA with KEDA
- **[HPA Integration](user-guide/hpa-integration.md)** - Using WVA with Horizontal Pod Autoscaler
- **[KEDA Integration](user-guide/keda-integration.md)** - Using WVA with KEDA
- **[Prometheus Integration](integrations/prometheus.md)** - Custom metrics and monitoring

### Design & Architecture
@@ -45,7 +36,6 @@ Contributing to WVA:

- **[Development Setup](developer-guide/development.md)** - Setting up your dev environment
- **[Testing](developer-guide/testing.md)** - Running tests and CI workflows
- **[Agentic Workflows](developer-guide/agentic-workflows.md)** - AI-powered automation workflows
- **[Debugging](developer-guide/debugging.md)** - Debugging techniques and tools
- **[Contributing](../CONTRIBUTING.md)** - How to contribute to the project

@@ -71,4 +61,3 @@ Contributing to WVA:
---

**Note:** Documentation is continuously being improved. If you find errors or have suggestions, please open an issue or submit a PR!
