This guide explains how to configure Workload Variant Autoscaler for your workloads.
- Enabling Autoscaling for a Model Deployment
- Operating mode overview
- ConfigMaps
- Configuration options
- Cost configuration
- Advanced options
- Best practices
- Monitoring configuration
- Multi-controller environments
- Troubleshooting configuration
- Next steps
You can enable autoscaling for your model deployment by creating a VariantAutoscaling resource that references your deployment and model ID, paired with a backend autoscaler (HPA or KEDA).
Choose between the following approaches:
- With HPA - Use Kubernetes HPA for autoscaling based on WVA's custom metrics
- With KEDA - Use KEDA for autoscaling based on WVA's custom metrics
WVA operates in saturation mode.
- Behavior: Reactive scaling based on saturation detection
- How It Works: Monitors KV cache usage and queue lengths; scales when thresholds are exceeded
- Configuration: Uses the `capacity-scaling-config` ConfigMap
- Pros: Fast response (<30s), predictable, no model training needed
- Cons: Reactive (scales after saturation is detected)

See the Saturation Analyzer Documentation for configuration details.
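The thresholded decision described above can be sketched in a few lines of shell. This is illustrative only; the values mirror the defaults shown later in this guide, and the real logic lives inside the controller:

```shell
# Illustrative saturation check only; not the controller's actual code.
kv_usage=85         # observed KV cache usage, percent
queue_len=7         # observed request queue length
kv_threshold=80     # cf. kvCacheThreshold: 0.80
queue_threshold=5   # cf. queueLengthThreshold: 5

decision="hold"
if [ "$kv_usage" -gt "$kv_threshold" ] || [ "$queue_len" -gt "$queue_threshold" ]; then
  decision="scale-up"   # either signal crossing its threshold triggers scaling
fi
echo "$decision"
```

Because either signal alone can trigger scaling, a deployment with a healthy KV cache but a long queue still scales up.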
WVA uses ConfigMaps for cluster-wide configuration.

Configuration values are resolved with the following precedence (highest to lowest):
- CLI Flags — only when explicitly set on the command line (highest priority)
- Environment Variables
- ConfigMap (in the `workload-variant-autoscaler-system` namespace)
- Defaults (lowest priority)

Note: CLI flag defaults do not override environment variables or ConfigMap values. Only flags that are explicitly passed on the command line take precedence. For example, if `--leader-elect` is not passed but `LEADER_ELECT=true` is set in the environment, the environment value (`true`) is used.
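The precedence chain behaves like nested fallbacks. As a rough shell sketch (the variable names are invented for illustration; the controller resolves this internally):

```shell
# Illustrative only: resolve one setting with CLI > env > ConfigMap > default.
cli_flag=""                      # empty: flag not passed on the command line
METRICS_BIND_ADDRESS=":8080"     # environment variable is set
configmap_value=":9090"          # value read from the ConfigMap
default_value="0"

resolved="${cli_flag:-${METRICS_BIND_ADDRESS:-${configmap_value:-$default_value}}}"
echo "$resolved"                 # the environment wins when the flag is unset
```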
Example:

```
# CLI flag explicitly set (highest priority)
--metrics-bind-address=":8443"

# Environment variable (used when flag is not explicitly set)
export METRICS_BIND_ADDRESS=":8080"

# ConfigMap (used when neither flag nor env is set)
# wva-variantautoscaling-config
data:
  METRICS_BIND_ADDRESS: ":9090"

# Default (used if none of the above are set)
# Default: "0" (disabled)
```

These settings cannot be changed at runtime via ConfigMap updates. Attempts to change them will:
- Be rejected by the controller
- Emit a Warning Kubernetes event
- Require a controller restart to take effect
Immutable Parameters:
- `PROMETHEUS_BASE_URL` - Prometheus connection endpoint
- `METRICS_BIND_ADDRESS` - Metrics bind address
- `HEALTH_PROBE_BIND_ADDRESS` - Health probe bind address
- `LEADER_ELECTION_ID` - Leader election coordination ID
- TLS certificate paths (webhook and metrics certificates)
Example - Attempting to Change an Immutable Parameter:

```yaml
# This will be rejected and emit a Warning event
apiVersion: v1
kind: ConfigMap
metadata:
  name: wva-variantautoscaling-config
  namespace: workload-variant-autoscaler-system
data:
  PROMETHEUS_BASE_URL: "https://new-prometheus:9090" # Requires restart
```

Check for Rejected Changes:

```shell
# View Warning events
kubectl get events -n workload-variant-autoscaler-system \
  --field-selector reason=ImmutableConfigChangeRejected

# Controller logs
kubectl logs -n workload-variant-autoscaler-system \
  deployment/workload-variant-autoscaler-controller-manager | \
  grep "Attempted to change immutable parameters"
```

These settings can be changed at runtime via ConfigMap updates without restarting the controller:
Mutable Parameters:
- `GLOBAL_OPT_INTERVAL` - Optimization interval (default: `60s`)
- Saturation scaling configuration (via the `wva-saturation-scaling-config` ConfigMap)
- Scale-to-zero configuration (via the `wva-model-scale-to-zero-config` ConfigMap)
- Prometheus cache settings
Example - Runtime Configuration Update:
```yaml
# This will be applied immediately without restart
apiVersion: v1
kind: ConfigMap
metadata:
  name: wva-variantautoscaling-config
  namespace: workload-variant-autoscaler-system
data:
  GLOBAL_OPT_INTERVAL: "120s" # Applied immediately
```

For enhanced security, you can make the entire ConfigMap immutable using the Helm chart option `wva.configMap.immutable: true`. This provides additional protection beyond the controller's runtime validation.
Security Benefits:
- Prevents accidental changes: Kubernetes will reject any update attempts
- Protects against malicious modifications: Even with RBAC access, the ConfigMap cannot be modified
- Ensures configuration integrity: Configuration can only be changed through controlled Helm upgrades
- Reduces attack surface: Eliminates runtime configuration as a potential attack vector
Trade-offs:
- Runtime updates disabled: All configuration changes (including mutable parameters) require ConfigMap recreation
- Change process: To update the configuration:
  1. Delete the ConfigMap: `kubectl delete configmap <name> -n <namespace>`
  2. Update Helm values and upgrade: `helm upgrade ... --set wva.configMap.immutable=false ...`
  3. Restart the controller pod
Enable Immutable ConfigMap:
```shell
# Via Helm values
helm install workload-variant-autoscaler ./charts/workload-variant-autoscaler \
  -n workload-variant-autoscaler-system \
  --set wva.configMap.immutable=true
```

When to Use:
- Production environments with strict security requirements
- Multi-tenant clusters where configuration tampering is a concern
- Compliance requirements that mandate immutable infrastructure
- High-security deployments where configuration changes should be audited and controlled
When NOT to Use:
- Development environments where rapid iteration is needed
- Scenarios requiring frequent runtime config updates (e.g., A/B testing, dynamic tuning)
- Environments where ConfigMap updates are part of normal operations
WVA supports namespace-local ConfigMap overrides that allow different namespaces to have different configuration settings without requiring separate controller instances. This provides a middle ground between global configuration and full multi-controller isolation.
Use Cases:
- Different teams sharing a cluster with different SLO requirements
- Staging vs production namespaces with different scaling thresholds
- Gradual rollout of new thresholds in one namespace before applying cluster-wide
- Environment-specific tuning without operational overhead
How It Works:
- Global ConfigMap (in controller namespace): Provides default configuration for all namespaces
- Namespace-Local ConfigMap (in target namespace): Overrides global settings for that namespace only
- Resolution Order: Namespace-local > Global (automatic fallback if namespace-local doesn't exist)
Well-Known ConfigMap Names:
The following ConfigMap names are recognized for namespace-local overrides:
- `wva-saturation-scaling-config` - Saturation scaling thresholds
- `wva-model-scale-to-zero-config` - Scale-to-zero configuration
Example: Namespace-Local Saturation Config
```yaml
# Global ConfigMap (in workload-variant-autoscaler-system namespace)
apiVersion: v1
kind: ConfigMap
metadata:
  name: wva-saturation-scaling-config
  namespace: workload-variant-autoscaler-system
data:
  default: |
    kvCacheThreshold: 0.80
    queueLengthThreshold: 5
    kvSpareTrigger: 0.10
    queueSpareTrigger: 3
```

```yaml
# Namespace-Local Override (in production namespace)
apiVersion: v1
kind: ConfigMap
metadata:
  name: wva-saturation-scaling-config # Same well-known name
  namespace: production # Different namespace
data:
  default: |
    kvCacheThreshold: 0.70 # More aggressive for production
    queueLengthThreshold: 3
    kvSpareTrigger: 0.20
    queueSpareTrigger: 5
```

Result: VAs in the `production` namespace use the production thresholds (0.70), while VAs in other namespaces use the global defaults (0.80).
Example: Namespace-Local Scale-to-Zero Config
```yaml
# Global ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: wva-model-scale-to-zero-config
  namespace: workload-variant-autoscaler-system
data:
  model1: |
    model_id: model1
    enable_scale_to_zero: true
    retention_period: 10m
```

```yaml
# Namespace-Local Override
apiVersion: v1
kind: ConfigMap
metadata:
  name: wva-model-scale-to-zero-config
  namespace: staging
data:
  model1: |
    model_id: model1
    enable_scale_to_zero: false # Disable scale-to-zero in staging
    retention_period: 5m
```

ConfigMap Deletion:
When a namespace-local ConfigMap is deleted, WVA automatically falls back to the global configuration. No restart is required; the fallback happens immediately.
```shell
# Delete the namespace-local ConfigMap
kubectl delete configmap wva-saturation-scaling-config -n production

# VAs in the production namespace now use the global config
```

Namespace Discovery:
WVA uses a hybrid approach to discover namespaces for namespace-local ConfigMap watching:
- Automatic (VA-based): WVA automatically tracks namespaces that have VariantAutoscaling resources. This is the default behavior - no configuration needed.
- Explicit Opt-in (Label-based): You can opt in a namespace by adding the label `wva.llmd.ai/config-enabled=true` to it. This enables namespace-local ConfigMap watching even before VariantAutoscaling resources are created, avoiding race conditions.
Example: Opt-in a namespace for namespace-local ConfigMaps:
```shell
# Label a namespace to enable namespace-local ConfigMap watching
kubectl label namespace production wva.llmd.ai/config-enabled=true

# Now you can create namespace-local ConfigMaps before VAs exist
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: wva-saturation-scaling-config
  namespace: production
data:
  default: |
    kvCacheThreshold: 0.70
    queueLengthThreshold: 3
EOF
```

When to use label-based opt-in:
- Creating namespace-local ConfigMaps before VariantAutoscaling resources exist
- Explicitly controlling which namespaces can have overrides (security/audit)
- Multi-controller isolation (each controller can watch different label values)
Limitations:
- Main ConfigMap (`wva-variantautoscaling-config`) is only supported globally, not as a namespace-local override
- Optimization interval (`GLOBAL_OPT_INTERVAL`) is global only
- Prometheus cache settings are global only
Relationship with Multi-Controller Isolation:
Namespace-local ConfigMaps are complementary to multi-controller isolation:
- Namespace-local ConfigMaps: Single controller, configuration isolation only
- Multi-controller isolation: Multiple controllers, complete operational isolation
They can be used together - you can have multiple controller instances, each using namespace-local configs within their scope.
The main configuration ConfigMap (`wva-variantautoscaling-config`) supports both static and dynamic settings:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: wva-variantautoscaling-config
  namespace: workload-variant-autoscaler-system
data:
  # Mutable: Optimization interval (can be changed at runtime)
  GLOBAL_OPT_INTERVAL: "60s"
  # Immutable: Prometheus connection (requires restart if changed)
  PROMETHEUS_BASE_URL: "https://prometheus:9090"
  # Immutable: Feature flags (require restart if changed)
  WVA_SCALE_TO_ZERO: "true"
  WVA_LIMITED_MODE: "false"
```

Note: The ConfigMap name is auto-generated by Helm based on the release name. For Kustomize deployments, set the `CONFIG_MAP_NAME` environment variable in the deployment manifest.
Many settings can be configured via environment variables (useful for containerized deployments):
```yaml
# Deployment manifest
env:
  # Prometheus connection (immutable - requires restart to change)
  - name: PROMETHEUS_BASE_URL
    value: "https://prometheus:9090"
  # Optional: Override ConfigMap name
  - name: CONFIG_MAP_NAME
    value: "my-custom-config"
  # Optional: Override namespace
  - name: POD_NAMESPACE
    value: "workload-variant-autoscaler-system"
```

See: Prometheus Integration for complete Prometheus configuration options.
Infrastructure settings can be configured via CLI flags. Only flags explicitly passed on the command line take highest precedence; unset flags fall through to environment variables, ConfigMap, and then defaults.
```shell
# Start controller with custom settings
./manager \
  --metrics-bind-address=":8443" \
  --health-probe-bind-address=":8081" \
  --leader-elect \
  --leader-election-lease-duration=60s \
  --leader-election-renew-deadline=50s \
  --leader-election-retry-period=10s \
  --rest-client-timeout=60s
```

The following table lists all static configuration parameters with their CLI flag, environment variable, ConfigMap key, type, and default value. All three sources share the same key name (except CLI flags, which use kebab-case).
Note: CLI flags are typically set in the Helm chart or deployment manifest, not directly.
| Parameter | CLI Flag | Env Var / ConfigMap Key | Type | Default | Description |
|---|---|---|---|---|---|
| Metrics bind address | `--metrics-bind-address` | `METRICS_BIND_ADDRESS` | string | `0` | Metrics endpoint bind address (`:8443` for HTTPS, `:8080` for HTTP, `0` to disable) |
| Health probe address | `--health-probe-bind-address` | `HEALTH_PROBE_BIND_ADDRESS` | string | `:8081` | Health probe endpoint bind address |
| Leader election | `--leader-elect` | `LEADER_ELECT` | bool | `false` | Enable leader election for HA |
| Leader election ID | — | `LEADER_ELECTION_ID` | string | `72dd1cf1.llm-d.ai` | Leader election coordination ID |
| Lease duration | `--leader-election-lease-duration` | `LEADER_ELECTION_LEASE_DURATION` | duration | `60s` | Duration non-leaders wait before force-acquiring leadership |
| Renew deadline | `--leader-election-renew-deadline` | `LEADER_ELECTION_RENEW_DEADLINE` | duration | `50s` | Duration the leader retries refreshing before giving up |
| Retry period | `--leader-election-retry-period` | `LEADER_ELECTION_RETRY_PERIOD` | duration | `10s` | Duration between retry attempts |
| REST timeout | `--rest-client-timeout` | `REST_CLIENT_TIMEOUT` | duration | `60s` | Timeout for Kubernetes API server REST calls |
| Secure metrics | `--metrics-secure` | `METRICS_SECURE` | bool | `true` | Serve metrics endpoint via HTTPS |
| Enable HTTP/2 | `--enable-http2` | `ENABLE_HTTP2` | bool | `false` | Enable HTTP/2 for metrics and webhook servers |
| Watch namespace | `--watch-namespace` | `WATCH_NAMESPACE` | string | `""` | Namespace to watch (empty = all namespaces) |
| Log verbosity | `-v` | `V` | int | `2` | Log level verbosity |
| Webhook cert path | `--webhook-cert-path` | `WEBHOOK_CERT_PATH` | string | `""` | Directory containing the webhook certificate |
| Webhook cert name | `--webhook-cert-name` | `WEBHOOK_CERT_NAME` | string | `tls.crt` | Webhook certificate file name |
| Webhook cert key | `--webhook-cert-key` | `WEBHOOK_CERT_KEY` | string | `tls.key` | Webhook key file name |
| Metrics cert path | `--metrics-cert-path` | `METRICS_CERT_PATH` | string | `""` | Directory containing the metrics server certificate |
| Metrics cert name | `--metrics-cert-name` | `METRICS_CERT_NAME` | string | `tls.crt` | Metrics server certificate file name |
| Metrics cert key | `--metrics-cert-key` | `METRICS_CERT_KEY` | string | `tls.key` | Metrics key file name |
| Scale to zero | — | `WVA_SCALE_TO_ZERO` | bool | `false` | Enable scale-to-zero feature |
| Limited mode | — | `WVA_LIMITED_MODE` | bool | `false` | Enable limited mode |
| Scale-from-zero concurrency | — | `SCALE_FROM_ZERO_ENGINE_MAX_CONCURRENCY` | int | `10` | Max concurrent scale-from-zero operations |
WVA implements fail-fast validation: if required configuration is missing or invalid, the controller will:
- Not start (exits with error code 1)
- Log clear error messages indicating what's missing
- Prevent running with invalid configuration
Required Configuration:
- `PROMETHEUS_BASE_URL` - Must be set via environment variable or ConfigMap
Check Startup Errors:
```shell
# View controller logs for validation errors
kubectl logs -n workload-variant-autoscaler-system \
  deployment/workload-variant-autoscaler-controller-manager | \
  grep -i "config\|validation\|error"

# Check pod status
kubectl get pods -n workload-variant-autoscaler-system
# If CrashLoopBackOff, check logs for config errors
```

Static Config Updates:
- Changes to immutable parameters are rejected at runtime
- Controller emits Warning events and logs errors
- Action Required: Restart the controller to apply changes
Dynamic Config Updates:
- Changes to mutable parameters are applied immediately
- Controller logs the changes (old → new values)
- No restart required
Monitor Configuration Changes:
```shell
# Watch for config update logs
kubectl logs -n workload-variant-autoscaler-system \
  deployment/workload-variant-autoscaler-controller-manager -f | \
  grep "Updated.*config"

# Example output:
# "Updated optimization interval" old=60s new=120s
# "Updated saturation config" oldEntries=2 newEntries=3
```

The VariantAutoscaling CR has the following required fields:
- scaleTargetRef: Reference to the target Deployment to scale (follows HPA pattern)
- kind: Resource kind (e.g., "Deployment")
- name: Name of the deployment
- modelID: OpenAI API compatible identifier for your model (e.g., "meta/llama-3.1-8b")
- variantCost: Cost per replica for saturation-based cost optimization (default: `"10.0"`)
  - Must be a string matching the pattern `^\d+(\.\d+)?$` (numeric string)
  - Used by the capacity analyzer when multiple variants can handle the load
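Putting the required fields together, a minimal VariantAutoscaling manifest might look like the following sketch. The `apiVersion` group/version and all names here are illustrative assumptions; check the CRD installed in your cluster for the exact values:

```yaml
# Sketch only: apiVersion group/version and all names are assumptions.
apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
  name: llama-31-8b-a100            # hypothetical name: model + accelerator
  namespace: production
spec:
  scaleTargetRef:
    kind: Deployment
    name: llama-31-8b-a100          # must match the target Deployment exactly
  modelID: "meta/llama-3.1-8b"      # OpenAI API compatible model identifier
  variantCost: "10.0"               # numeric string; default "10.0"
```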
Specifies the cost per replica for this variant, used in saturation-based cost optimization.
```yaml
spec:
  modelID: "meta/llama-3.1-8b"
  variantCost: "15.5" # Cost per replica (default: "10.0")
```

Default: `"10.0"`
Validation: Must be a string matching the pattern `^\d+(\.\d+)?$` (numeric string)
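As a quick local sanity check, a candidate value can be tested against this pattern with `grep -E` (POSIX ERE has no `\d`, so the class is spelled `[0-9]`):

```shell
# Validate a candidate variantCost against ^\d+(\.\d+)?$ (POSIX spelling).
value="15.5"
if printf '%s' "$value" | grep -Eq '^[0-9]+(\.[0-9]+)?$'; then
  verdict="valid"
else
  verdict="invalid"    # e.g. "15.5x" or "-3" would land here
fi
echo "$verdict"
```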
Use Cases:
- Differentiated Pricing: Higher cost for premium accelerators (H100) vs. standard (A100)
- Multi-Tenant Cost Tracking: Assign different costs per customer/tenant
- Cost-Based Optimization: Saturation analyzer prefers lower-cost variants when multiple variants can handle load
Example:
```yaml
# Premium variant (H100, higher cost)
spec:
  modelID: "meta/llama-3.1-70b"
  variantCost: "80.0"
```

```yaml
# Standard variant (A100, lower cost)
spec:
  modelID: "meta/llama-3.1-70b"
  variantCost: "40.0"
```

Behavior:
- Saturation analyzer uses `variantCost` when deciding which variant to scale
- If costs are equal, it chooses the variant with the most available capacity
- Does not affect model-based optimization
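The selection rule above (lowest cost first, ties broken by most available capacity) can be mimicked with a simple sort over hypothetical variant data; the variant names and numbers here are invented for illustration:

```shell
# Illustrative ranking: cost ascending (column 2), spare capacity descending (column 3).
# Columns: variant cost spare_capacity (all values hypothetical)
chosen=$(printf '%s\n' \
  "h100-variant 80.0 4" \
  "a100-variant-a 40.0 2" \
  "a100-variant-b 40.0 6" |
  sort -k2,2n -k3,3nr | head -n 1 | cut -d' ' -f1)
echo "$chosen"
```

Both A100 variants outrank the H100 on cost, and the tie between them is broken in favor of the one with more spare capacity.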
See CRD Reference for advanced configuration options.
WVA supports configuration via environment variables for operational settings:
Prometheus Configuration:
- `PROMETHEUS_BASE_URL`: Prometheus server URL (required for metrics collection)
- `PROMETHEUS_TLS_INSECURE_SKIP_VERIFY`: Skip TLS verification (development only)
- `PROMETHEUS_CA_CERT_PATH`: CA certificate path for TLS
- `PROMETHEUS_CLIENT_CERT_PATH`: Client certificate for mutual TLS
- `PROMETHEUS_CLIENT_KEY_PATH`: Client key for mutual TLS
- `PROMETHEUS_SERVER_NAME`: Expected server name in the TLS certificate
- `PROMETHEUS_BEARER_TOKEN`: Bearer token for authentication
Other Configuration:
- `CONFIG_MAP_NAME`: ConfigMap name (default: auto-generated from the Helm release)
- `POD_NAMESPACE`: Controller namespace (auto-injected by Kubernetes)
See Prometheus Integration for detailed Prometheus configuration.
- Assign higher costs to premium accelerators (H100) and lower costs to standard ones (A100)
- Use consistent cost values across variants of the same model to enable fair comparison
- The saturation analyzer will prefer scaling lower-cost variants when multiple can handle the load
- Always specify `scaleTargetRef` explicitly to avoid ambiguity
- Use descriptive names that indicate the model and accelerator type
- Add labels to deployments and VAs for easier operational management
- Monitor VA status conditions to detect issues with target deployments
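For the labeling suggestion above, a small sketch of label metadata that could be applied to both the Deployment and its VA; the label keys here are conventions and suggestions, not anything WVA requires:

```yaml
# Illustrative labels only; WVA does not mandate any particular keys.
metadata:
  name: llama-31-8b-a100
  labels:
    app.kubernetes.io/name: llama-3.1-8b
    app.kubernetes.io/component: model-server
    accelerator: a100   # hypothetical key indicating accelerator type
```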
WVA exposes metrics for monitoring and integrates with HPA for automatic scaling.
For complete documentation, see Multi-Controller Isolation Guide.
Deployment Not Found:
- Verify the deployment name in `scaleTargetRef` matches exactly
- Check that the deployment exists in the same namespace as the VA
- Review VA status conditions: `kubectl get va <name> -o yaml`
Metrics Not Available:
- Ensure Prometheus is properly configured and scraping vLLM metrics
- Verify a ServiceMonitor is created for the vLLM deployment
- Check the VA status condition `MetricsAvailable`
Cost Optimization Not Working:
- Verify `variantCost` is specified for all variants of the same model
- Check that variants have different costs to enable cost-based selection
- Review saturation analyzer logs for the decision-making process
- Check whether min replicas can be reduced