Helm chart for Workload-Variant-Autoscaler (WVA) - GPU-aware autoscaler for LLM inference workloads
The chart is published to GitHub Container Registry under the llm-d org (not llm-d-incubation). Use this OCI URL in Helm or Helmfile:
- OCI URL: oci://ghcr.io/llm-d/workload-variant-autoscaler
- Example: helm pull oci://ghcr.io/llm-d/workload-variant-autoscaler --version 0.5.1
Helm is the recommended installation method. Before installing, be sure to delete any previous Helm releases of workload-variant-autoscaler and prometheus-adapter. To list all Helm releases installed in the cluster, run helm ls -A.
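The pre-install check above can be scripted. A minimal sketch, shown against a captured sample of `helm ls -A` output so it runs without a cluster (in a real cluster, replace `echo "$sample"` with `helm ls -A`):

```shell
# Sample output standing in for a live `helm ls -A` call.
sample='NAME                         NAMESPACE
prometheus-adapter           openshift-user-workload-monitoring
workload-variant-autoscaler  workload-variant-autoscaler-system'

# Print any leftover releases that must be deleted before reinstalling.
echo "$sample" | awk 'NR>1 && ($1=="workload-variant-autoscaler" || $1=="prometheus-adapter") {print $1}'
```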
export OWNER="llm-d"
export WVA_PROJECT="llm-d-workload-variant-autoscaler"
export WVA_RELEASE="v0.5.1"
export WVA_NS="workload-variant-autoscaler-system"
export MON_NS="openshift-user-workload-monitoring"
kubectl get secret thanos-querier-tls -n openshift-monitoring -o jsonpath='{.data.tls\.crt}' | base64 -d > /tmp/prometheus-ca.crt
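An optional sanity check: confirm the extracted file parses as a PEM certificate before handing it to --set-file. Demonstrated here on a throwaway self-signed certificate standing in for /tmp/prometheus-ca.crt (requires openssl):

```shell
# Generate a throwaway self-signed cert as a stand-in for the extracted CA.
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo-ca.key \
  -out /tmp/demo-ca.crt -days 1 -subj "/CN=demo-ca" 2>/dev/null
# If this prints a subject, the file is a valid PEM certificate.
openssl x509 -in /tmp/demo-ca.crt -noout -subject
```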
git clone -b $WVA_RELEASE -- https://github.com/$OWNER/$WVA_PROJECT.git $WVA_PROJECT
cd $WVA_PROJECT
export WVA_PROJECT=$PWD
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Important: The following helm upgrade command updates the global prometheus-adapter ConfigMap. If this is a shared cluster, you may want to fetch the current settings, manually append the values from config/samples/prometheus-adapter-values-ocp.yaml, and then run helm upgrade with the merged values. Here's an example of how to get the current values: kubectl get configmap prometheus-adapter -n $MON_NS -o yaml
helm upgrade -i prometheus-adapter prometheus-community/prometheus-adapter \
-n $MON_NS \
-f config/samples/prometheus-adapter-values-ocp.yaml
kubectl apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
name: $WVA_NS
labels:
app.kubernetes.io/name: workload-variant-autoscaler
control-plane: controller-manager
openshift.io/user-monitoring: "true"
EOF
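The manifest can also be rendered to a file first and inspected before applying; the openshift.io/user-monitoring label is what enables user-workload monitoring for the namespace. A sketch that runs without a cluster:

```shell
WVA_NS="workload-variant-autoscaler-system"
# Render the namespace manifest locally instead of piping straight to kubectl.
cat > /tmp/wva-ns.yaml <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: $WVA_NS
  labels:
    app.kubernetes.io/name: workload-variant-autoscaler
    control-plane: controller-manager
    openshift.io/user-monitoring: "true"
EOF
# Verify the monitoring label made it into the rendered manifest.
grep 'openshift.io/user-monitoring' /tmp/wva-ns.yaml
```

Then apply with kubectl apply -f /tmp/wva-ns.yaml.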
cd $WVA_PROJECT/charts
helm upgrade -i workload-variant-autoscaler ./workload-variant-autoscaler \
-n $WVA_NS \
--set-file wva.prometheus.caCert=/tmp/prometheus-ca.crt \
--set controller.enabled=true \
--set va.enabled=false \
--set hpa.enabled=false \
--set vllmService.enabled=false
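The same --set flags can instead be kept in a values file; a sketch mirroring the flags above (the file name is arbitrary):

```yaml
# controller-values.yaml: controller-only install, per-model resources disabled
controller:
  enabled: true
va:
  enabled: false
hpa:
  enabled: false
vllmService:
  enabled: false
```

Install with: helm upgrade -i workload-variant-autoscaler ./workload-variant-autoscaler -n $WVA_NS --set-file wva.prometheus.caCert=/tmp/prometheus-ca.crt -f controller-values.yaml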
After a WVA controller has been installed, you can add one or more models running in LLMD namespaces as scale targets. As an example, the following command adds a model named my-model-a with model ID meta-llama/Llama-3.1-8 running in the team-a LLMD namespace. The command creates the corresponding VA and HPA resources in the team-a namespace.
helm install wva-model-a ./workload-variant-autoscaler \
-n $WVA_NS \
--set controller.enabled=false \
--set va.enabled=true \
--set hpa.enabled=true \
--set llmd.namespace=team-a \
--set llmd.modelName=my-model-a \
--set llmd.modelID="meta-llama/Llama-3.1-8"
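Equivalently, the per-model flags can live in a values file; a sketch mirroring the flags above (file name arbitrary):

```yaml
# model-a-values.yaml: per-model release, controller disabled
controller:
  enabled: false
va:
  enabled: true
hpa:
  enabled: true
llmd:
  namespace: team-a
  modelName: my-model-a
  modelID: "meta-llama/Llama-3.1-8"
```

Install with: helm install wva-model-a ./workload-variant-autoscaler -n $WVA_NS -f model-a-values.yaml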
Here is an example to add another model to the same WVA controller:
helm install wva-model-b ./workload-variant-autoscaler \
-n $WVA_NS \
--set controller.enabled=false \
--set va.enabled=true \
--set hpa.enabled=true \
--set llmd.namespace=team-a \
--set llmd.modelName=my-model-b \
--set llmd.modelID="Qwen/Qwen3-0.6B"
Notes:
- When multiple WVA controllers are installed in different namespaces, it is possible to accidentally add models in an LLMD namespace as scale targets using the same release name. If helm install was used, there will be a clear message such as: INSTALLATION FAILED: cannot re-use a name that is still in use. However, if helm upgrade -i (combined upgrade and install) was used, the message is less clear, as shown below. In this case, use different release names: Error: UPGRADE FAILED: Unable to continue with update: Service "workload-variant-autoscaler-vllm" in namespace "xyz" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-namespace" must equal "abc": current value is "xyz"
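One way to avoid such clashes is to derive the release name from the model and LLMD namespace; a sketch (names illustrative):

```shell
# Build a collision-free release name by suffixing model and namespace.
LLMD_NS="team-a"
MODEL_NAME="my-model-b"
RELEASE="wva-${MODEL_NAME}-${LLMD_NS}"
echo "$RELEASE"
# then: helm install "$RELEASE" ./workload-variant-autoscaler -n $WVA_NS ...
```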
| Key | Type | Default | Description |
|---|---|---|---|
| hpa.behavior.scaleDown.policies | list | [{"periodSeconds":150,"type":"Pods","value":10}] | Scale-down policies |
| hpa.behavior.scaleDown.selectPolicy | string | "Max" | Scale-down policy selection |
| hpa.behavior.scaleDown.stabilizationWindowSeconds | int | 240 | Scale-down stabilization window |
| hpa.behavior.scaleUp.policies | list | [{"periodSeconds":150,"type":"Pods","value":10}] | Scale-up policies |
| hpa.behavior.scaleUp.selectPolicy | string | "Max" | Scale-up policy selection |
| hpa.behavior.scaleUp.stabilizationWindowSeconds | int | 240 | Scale-up stabilization window |
| hpa.enabled | bool | true | |
| hpa.maxReplicas | int | 10 | |
| hpa.targetAverageValue | string | "1" | |
| llmd.modelID | string | "unsloth/Meta-Llama-3.1-8B" | |
| llmd.modelName | string | "ms-inference-scheduling-llm-d-modelservice" | |
| llmd.namespace | string | "llm-d-autoscaler" | |
| va.accelerator | string | "H100" | |
| va.enabled | bool | true | |
| va.sloTpot | int | 10 | |
| va.sloTtft | int | 1000 | |
| vllmService.enabled | bool | true | |
| vllmService.interval | string | "15s" | |
| vllmService.nodePort | int | 30000 | |
| vllmService.scheme | string | "http" | |
| wva.configMap.immutable | bool | false | If true, makes the main controller ConfigMap ({release-name}-variantautoscaling-config) immutable (cannot be updated after creation). Provides security benefits by preventing accidental or malicious configuration changes, but disables runtime config updates. Note: this only affects the main ConfigMap; other ConfigMaps (saturation scaling, scale-to-zero) are not affected. See Configuration Guide |
| wva.controllerInstance | string | "" | Controller instance label for multi-controller isolation. When set, adds a controller_instance label to all metrics and filters VariantAutoscaling resources by matching label. Use for parallel testing or multi-tenant environments. See Multi-Controller Isolation |
| wva.enabled | bool | true | |
| wva.image.repository | string | "ghcr.io/llm-d/llm-d-workload-variant-autoscaler" | |
| wva.image.tag | string | "latest" | |
| wva.imagePullPolicy | string | "Always" | |
| wva.metrics.enabled | bool | true | |
| wva.metrics.port | int | 8443 | |
| wva.metrics.secure | bool | true | |
| wva.prometheus.baseURL | string | "https://thanos-querier.openshift-monitoring.svc.cluster.local:9091" | |
| wva.prometheus.monitoringNamespace | string | "openshift-user-workload-monitoring" | |
| wva.prometheus.tls.caCertPath | string | "/etc/ssl/certs/prometheus-ca.crt" | |
| wva.prometheus.tls.insecureSkipVerify | bool | true | |
| wva.reconcileInterval | string | "60s" | |
| wva.scaleToZero | bool | false | |

Autogenerated from chart metadata using helm-docs v1.14.2
The Helm chart provides different configuration files for different environments:

Production (values.yaml):
- TLS Verification: Enabled (insecureSkipVerify: false)
- Logging Level: Production (LOG_LEVEL: info)
- Security: Strict security settings for production use
- Saturation-based Scaling: Conservative thresholds for production stability

Development (values-dev.yaml):
- TLS Verification: Relaxed (insecureSkipVerify: true) for easier development
- Logging Level: Debug (LOG_LEVEL: debug) for detailed development logging
- Security: Relaxed settings for development and testing
- Saturation Scaling: Aggressive thresholds for faster iteration
The chart includes saturation-based scaling thresholds that determine when replicas are saturated and when to scale up:

Global Defaults (applied to all models):

wva:
  capacityScaling:
    default:
      kvCacheThreshold: 0.80   # Replica saturated if KV cache ≥ 80%
      queueLengthThreshold: 5  # Replica saturated if queue ≥ 5 requests
      kvSpareTrigger: 0.1      # Scale-up if spare KV capacity < 10%
      queueSpareTrigger: 3     # Scale-up if spare queue < 3

Per-Model Overrides (customize specific models):

wva:
  capacityScaling:
    overrides:
      llm-d:
        modelID: "Qwen/Qwen3-0.6B"
        namespace: "llm-d-autoscaler"
        kvCacheThreshold: 0.70  # Lower threshold for production
        kvSpareTrigger: 0.35    # Scale-up if avg spare KV < 35%

See docs/saturation-scaling-config.md for detailed configuration documentation.
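To make the threshold semantics concrete, here is an illustrative-only classification of a single replica against the default thresholds (the real decision logic lives in the WVA controller):

```shell
# Illustrative only: a replica counts as saturated when either the
# KV-cache threshold (0.80) or the queue-length threshold (5) is reached.
kv_util=0.85   # observed KV-cache utilization
queue_len=2    # observed request queue length
awk -v k="$kv_util" -v q="$queue_len" 'BEGIN {
  s = (k >= 0.80 || q >= 5) ? "saturated" : "ok"
  print s
}'
```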
The chart provides full control over HPA scaling behavior through the hpa.behavior section. This allows you to configure stabilization windows and scaling policies without post-deployment patching.
Default Configuration:

hpa:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 240  # Wait 240s before scaling up
      selectPolicy: Max
      policies:
      - type: Pods
        value: 10
        periodSeconds: 150
    scaleDown:
      stabilizationWindowSeconds: 240  # Wait 240s before scaling down
      selectPolicy: Max
      policies:
      - type: Pods
        value: 10
        periodSeconds: 150

You may want to set scaleUp.stabilizationWindowSeconds to a low number to trigger quicker scale-up, especially for models with long startup times. Similarly, set scaleDown.stabilizationWindowSeconds to a high number to slow down scale-down (reducing capacity), since once scaled down, it takes longer to restore capacity for slow-starting models. Another setting that also affects scale-down is the pod termination grace period, described at https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination.
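Putting that advice together, a slow-starting model might use settings like the following (the numbers are illustrative, not chart defaults):

```yaml
hpa:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30   # react quickly; startup time is the bottleneck
    scaleDown:
      stabilizationWindowSeconds: 600  # hold capacity longer before scaling down
```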
Configuration via Helm:
# Production: Conservative scaling (240s stabilization)
helm install workload-variant-autoscaler ./workload-variant-autoscaler \
--set hpa.behavior.scaleUp.stabilizationWindowSeconds=240 \
--set hpa.behavior.scaleDown.stabilizationWindowSeconds=240
# E2E Testing: Fast scaling (30s stabilization)
helm install workload-variant-autoscaler ./workload-variant-autoscaler \
--set hpa.behavior.scaleUp.stabilizationWindowSeconds=30 \
--set hpa.behavior.scaleDown.stabilizationWindowSeconds=30

Configuration via install.sh:
# Set stabilization window via environment variable
HPA_STABILIZATION_SECONDS=120 ./deploy/install.sh
# Production default (240s)
./deploy/install.sh

Key Parameters:
- stabilizationWindowSeconds: Time to wait before applying scaling decisions (prevents flapping)
- selectPolicy: How to choose among multiple policies (Max, Min, Disabled)
- policies: List of scaling policies defining rate limits
Best Practices:
- Production: Use 120-300 seconds for stability
- Development: Use 30-60 seconds for faster iteration
- E2E Tests: Use 30 seconds for rapid validation
See HPA Integration Guide for detailed information.
# Use production values (secure by default)
helm install workload-variant-autoscaler ./workload-variant-autoscaler \
-n workload-variant-autoscaler-system \
--values values.yaml

# Use development values (relaxed security)
helm install workload-variant-autoscaler ./workload-variant-autoscaler \
-n workload-variant-autoscaler-system \
--values values-dev.yaml

# Override specific values
helm install workload-variant-autoscaler ./workload-variant-autoscaler \
-n workload-variant-autoscaler-system \
--values values.yaml \
--set wva.prometheus.tls.insecureSkipVerify=true \
--set wva.image.tag=v0.0.1-dev

# Enable immutable ConfigMap for enhanced security
# This prevents accidental or malicious configuration changes
# Note: Disables runtime config updates (requires ConfigMap recreation for changes)
helm install workload-variant-autoscaler ./workload-variant-autoscaler \
-n workload-variant-autoscaler-system \
--values values.yaml \
--set wva.configMap.immutable=true

Security Benefits:
- Prevents accidental configuration changes
- Protects against malicious modifications
- Ensures configuration integrity
- Reduces attack surface
Trade-offs:
- Runtime config updates are disabled
- Configuration changes require:
- Deleting the ConfigMap
- Updating Helm values and upgrading
- Restarting the controller pod
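Those three steps can be sketched as a hypothetical helper (the ConfigMap name assumes the default release name workload-variant-autoscaler; defined only, since running it requires cluster access):

```shell
WVA_NS="workload-variant-autoscaler-system"
# Hypothetical helper: rotate an immutable ConfigMap by recreating it.
rotate_wva_config() {
  # 1. Delete the immutable ConfigMap so Helm can recreate it.
  kubectl delete configmap workload-variant-autoscaler-variantautoscaling-config -n "$WVA_NS"
  # 2. Re-render it from updated Helm values.
  helm upgrade workload-variant-autoscaler ./workload-variant-autoscaler \
    -n "$WVA_NS" -f updated-values.yaml
  # 3. Restart the controller so it picks up the new ConfigMap.
  kubectl rollout restart deployment -n "$WVA_NS"
}
```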
When running multiple WVA controllers in the same cluster (e.g., for parallel e2e tests or multi-tenant environments), use the controllerInstance configuration to prevent metrics conflicts between controllers. See Multi-Controller Isolation for detailed configuration.
For parallel e2e tests, each test run can use a unique controller instance:
# Each PR/run uses its namespace as the controller instance
CONTROLLER_INSTANCE="llm-d-autoscaler-pr-123" ./deploy/install.sh

This ensures that:
- Metrics from PR-123's controller have controller_instance="llm-d-autoscaler-pr-123"
- The HPA for PR-123 only considers metrics with that label
- Stale controllers from other PRs don't affect HPA decisions
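The label-based filtering corresponds to an external-metric selector on the generated HPA. A sketch using autoscaling/v2 field names (the metric name appears in the troubleshooting section; the label value and target are illustrative):

```yaml
metrics:
- type: External
  external:
    metric:
      name: wva_desired_replicas
      selector:
        matchLabels:
          controller_instance: "llm-d-autoscaler-pr-123"
    target:
      type: AverageValue
      averageValue: "1"
```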
When wva.controllerInstance is not set (empty string):
- No controller_instance label is added to metrics
- The HPA selector does not filter by controller_instance
- Behavior is identical to previous versions
export MON_NS="openshift-user-workload-monitoring"
export WVA_NS="workload-variant-autoscaler-system"
helm delete prometheus-adapter -n $MON_NS
helm delete workload-variant-autoscaler -n $WVA_NS
kubectl delete ns $WVA_NS
- Check for 'error' in the workload-variant-autoscaler-controller-manager-xxxxx pod in the workload-variant-autoscaler-system namespace
kubectl logs workload-variant-autoscaler-controller-manager-xxxxx -n workload-variant-autoscaler-system | grep error
- Check for '404' in the prometheus-adapter pod in the openshift-user-workload-monitoring namespace
kubectl logs prometheus-adapter-xxxxx -n openshift-user-workload-monitoring | grep 404
- Check, a few minutes after installation, that metrics are being collected (here $NAMESPACE is the LLMD namespace containing the scale targets, e.g. team-a)
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/$NAMESPACE/wva_desired_replicas" | jq
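The endpoint returns an ExternalMetricValueList. The sample below mimics that shape (field layout assumed from the external metrics API) so the value extraction can be tried without a cluster; a sed fallback is shown for environments without jq:

```shell
# Sample response standing in for the live external-metrics API call.
cat > /tmp/wva-metric.json <<'EOF'
{"kind":"ExternalMetricValueList","apiVersion":"external.metrics.k8s.io/v1beta1","items":[{"metricName":"wva_desired_replicas","value":"3"}]}
EOF
# With jq: jq -r '.items[0].value' /tmp/wva-metric.json
# sed fallback to pull out the desired-replica count:
sed -n 's/.*"value":"\([0-9]*\)".*/\1/p' /tmp/wva-metric.json
```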