Merged
3 changes: 3 additions & 0 deletions .gitignore
@@ -35,3 +35,6 @@ llmd-infra/

*.tgz
actionlint

# AI
.claude
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -56,7 +56,7 @@ The repository uses AI-powered workflows to automate repetitive tasks:
- **Workflow Creation**: Interactive designer for new workflows
- **Workflow Debugging**: Assists with troubleshooting

Learn more in the [Agentic Workflows Guide](docs/developer-guide/agentic-workflows.md).
Learn more in the [Developer Guide](docs/developer-guide/development.md).

## WVA Project Structure

13 changes: 3 additions & 10 deletions README.md
@@ -8,7 +8,7 @@ The Workload Variant Autoscaler (WVA) is a Kubernetes-based global autoscaler fo

### What is a variant?

In WVA, a **variant** is a way of serving a given model: a scale target (Deployment, StatefulSet, or LWS) with a particular combination of hardware, runtimes, and serving approach. Variants for the same model share the same base model (e.g. meta/llama-3.1-8b); LoRA adapters can differ per variant. Each variant is a distinct setup—e.g. different accelerators (A100, H100, L4), parallelism, or performance requirements. Create one `VariantAutoscaling` per variant; when several variants serve the same model, WVA chooses which to scale (e.g. add capacity on the cheapest variant, remove it from the most expensive). See [Configuration](docs/user-guide/configuration.md) and [Saturation Analyzer](docs/saturation-analyzer.md) for details.
In WVA, a **variant** is a way of serving a given model: a scale target (Deployment, StatefulSet, or LWS) with a particular combination of hardware, runtimes, and serving approach. Variants for the same model share the same base model (e.g. meta/llama-3.1-8b); LoRA adapters can differ per variant. Each variant is a distinct setup—e.g. different accelerators (A100, H100, L4), parallelism, or performance requirements. Create one `VariantAutoscaling` per variant; when several variants serve the same model, WVA chooses which to scale (e.g. add capacity on the cheapest variant, remove it from the most expensive). See [Configuration](docs/user-guide/configuration.md) and [Saturation Analyzer](docs/user-guide/saturation-analyzer.md) for details.

<!--
<![Architecture](docs/design/diagrams/inferno-WVA-design.png)>
@@ -29,16 +29,9 @@ In WVA, a **variant** is a way of serving a given model: a scale target (Deploym
- [CRD Reference](docs/user-guide/crd-reference.md)
- [Multi-Controller Isolation](docs/user-guide/multi-controller-isolation.md)

<!--

### Tutorials
- [Quick Start Demo](docs/tutorials/demo.md)
- [Parameter Estimation](docs/tutorials/parameter-estimation.md)
- [vLLM Server Setup](docs/tutorials/vllm-samples.md)
-->
### Integrations
- [HPA Integration](docs/integrations/hpa-integration.md)
- [KEDA Integration](docs/integrations/keda-integration.md)
- [HPA Integration](docs/user-guide/hpa-integration.md)
- [KEDA Integration](docs/user-guide/keda-integration.md)
- [Prometheus Metrics](docs/integrations/prometheus.md)

<!--
2 changes: 1 addition & 1 deletion charts/workload-variant-autoscaler/README.md
@@ -248,7 +248,7 @@ HPA_STABILIZATION_SECONDS=120 ./deploy/install.sh
- **Development**: Use 30-60 seconds for faster iteration
- **E2E Tests**: Use 30 seconds for rapid validation

See [HPA Integration Guide](../../docs/integrations/hpa-integration.md) for detailed information.
See [HPA Integration Guide](../../docs/user-guide/hpa-integration.md) for detailed information.

### Usage Examples

File renamed without changes.
7 changes: 7 additions & 0 deletions config/samples/hpa/kustomization.yaml
@@ -0,0 +1,7 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
metadata:
  name: hpa-sample
resources:
  - va.yaml
  - hpa.yaml
14 changes: 14 additions & 0 deletions config/samples/hpa/va.yaml
@@ -0,0 +1,14 @@
# Example VariantAutoscaling for HPA/KEDA integration.
# Ensure a Deployment named sample-deployment exists in llm-d-sim (e.g. from kind-emulator or e2e).
apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
  name: sample-deployment
  namespace: llm-d-sim
  labels:
    inference.optimization/acceleratorName: A100
spec:
  scaleTargetRef:
    kind: Deployment
    name: sample-deployment
  modelID: default/default
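The companion `hpa.yaml` referenced by the kustomization is not shown in this diff. As a rough sketch only, an HPA pairing with the sample above might consume WVA's desired-replica signal as an external metric surfaced through a Prometheus adapter; the metric name, label selector, and target value here are illustrative assumptions, not taken from this PR:

```yaml
# Hypothetical hpa.yaml companion; metric name and labels are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sample-deployment
  namespace: llm-d-sim
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sample-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: inferno_desired_replicas  # assumed metric name, exposed via an adapter
          selector:
            matchLabels:
              variant_name: sample-deployment  # assumed label
        target:
          type: AverageValue
          averageValue: "1"  # scale until replicas match the desired count
```

See the [HPA Integration Guide](docs/user-guide/hpa-integration.md) referenced elsewhere in this PR for the authoritative setup.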
7 changes: 7 additions & 0 deletions config/samples/keda/kustomization.yaml
@@ -0,0 +1,7 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
metadata:
  name: keda-sample
resources:
  - va.yaml
  - scaledobject.yaml
14 changes: 14 additions & 0 deletions config/samples/keda/va.yaml
@@ -0,0 +1,14 @@
# Example VariantAutoscaling for HPA/KEDA integration.
# Ensure a Deployment named sample-deployment exists in llm-d-sim (e.g. from kind-emulator or e2e).
apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
  name: sample-deployment
  namespace: llm-d-sim
  labels:
    inference.optimization/acceleratorName: A100
spec:
  scaleTargetRef:
    kind: Deployment
    name: sample-deployment
  modelID: default/default
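Likewise, the `scaledobject.yaml` referenced by the KEDA kustomization is not shown here. A minimal sketch of what it could look like, assuming WVA's desired-replica count is queryable from Prometheus — the server address, query, and threshold are illustrative assumptions, not contents of this PR:

```yaml
# Hypothetical scaledobject.yaml companion; query and address are illustrative.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sample-deployment
  namespace: llm-d-sim
spec:
  scaleTargetRef:
    name: sample-deployment  # the Deployment created alongside the VariantAutoscaling
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-k8s.monitoring.svc:9090  # assumed address
        query: inferno_desired_replicas{variant_name="sample-deployment"}  # assumed metric/labels
        threshold: "1"  # one replica per unit of desired count
```

See the [KEDA Integration Guide](docs/user-guide/keda-integration.md) referenced elsewhere in this PR for the authoritative setup.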
2 changes: 1 addition & 1 deletion deploy/README.md
@@ -706,7 +706,7 @@ The `VLLM_MAX_NUM_SEQS` variable controls the maximum number of concurrent seque

**Use cases:**
- **E2E Testing**: Set to low values (e.g., `8` or `16`) to quickly trigger saturation and test autoscaling
- **Parameter Estimation**: Match this to your desired maximum batch size (see [Parameter Estimation Guide](../docs/tutorials/parameter-estimation.md))
- **Parameter Estimation**: Match this to your desired maximum batch size (see [Configuration Guide](../docs/user-guide/configuration.md))
- **Production**: Leave unset to use vLLM's default based on available GPU memory

**Example:**
4 changes: 2 additions & 2 deletions docs/CHANGELOG-v0.5.0.md
@@ -41,7 +41,7 @@ T+90s: variant-1 ready (PendingReplicas=0), eligible again
- ✅ Maintains cost-optimized scaling across variants

**Documentation:**
- [Saturation Analyzer - Cascade Scaling Prevention](saturation-analyzer.md#cascade-scaling-prevention)
- [Saturation Analyzer - Cascade Scaling Prevention](user-guide/saturation-analyzer.md#cascade-scaling-prevention)
- [Saturation Scaling Config](saturation-scaling-config.md#how-scale-up-triggers-work)

### 2. Prometheus Configuration via Environment Variables
@@ -250,6 +250,6 @@ E2E tests updated:
---

For detailed implementation, see:
- [Saturation Analyzer Documentation](saturation-analyzer.md)
- [Saturation Analyzer Documentation](user-guide/saturation-analyzer.md)
- [Prometheus Integration](integrations/prometheus.md)
- [Configuration Guide](user-guide/configuration.md)
15 changes: 2 additions & 13 deletions docs/README.md
@@ -14,21 +14,12 @@ Getting started and using WVA:
- **[Multi-Controller Isolation](user-guide/multi-controller-isolation.md)** - Running multiple WVA controller instances
- **[LeaderWorkerSet Support](user-guide/LeaderWorkerSet-support.md)** - Supporting LeaderWorkerSets as scale targets

### Tutorials

Step-by-step guides:

- **[Quick Start Demo](tutorials/demo.md)** - Getting started with WVA
- **[Parameter Estimation](tutorials/parameter-estimation.md)** - Estimating model parameters
- **[vLLM Samples](tutorials/vllm-samples.md)** - Working with vLLM servers
- **[GuideLLM Sample](tutorials/guidellm-sample.md)** - Using GuideLLM for benchmarking

### Integrations

Integration with other systems:

- **[HPA Integration](integrations/hpa-integration.md)** - Using WVA with Horizontal Pod Autoscaler
- **[KEDA Integration](integrations/keda-integration.md)** - Using WVA with KEDA
- **[HPA Integration](user-guide/hpa-integration.md)** - Using WVA with Horizontal Pod Autoscaler
- **[KEDA Integration](user-guide/keda-integration.md)** - Using WVA with KEDA
- **[Prometheus Integration](integrations/prometheus.md)** - Custom metrics and monitoring

### Design & Architecture
@@ -45,7 +36,6 @@ Contributing to WVA:

- **[Development Setup](developer-guide/development.md)** - Setting up your dev environment
- **[Testing](developer-guide/testing.md)** - Running tests and CI workflows
- **[Agentic Workflows](developer-guide/agentic-workflows.md)** - AI-powered automation workflows
- **[Debugging](developer-guide/debugging.md)** - Debugging techniques and tools
- **[Contributing](../CONTRIBUTING.md)** - How to contribute to the project

@@ -71,4 +61,3 @@ Contributing to WVA:
---

**Note:** Documentation is continuously being improved. If you find errors or have suggestions, please open an issue or submit a PR!
