Configuration Guide

This guide explains how to configure Workload Variant Autoscaler for your workloads.

Enabling Autoscaling for a Model Deployment

You can enable autoscaling for a model deployment by creating a VariantAutoscaling resource that references the deployment and its model ID, paired with a backend autoscaler (HPA or KEDA) that scales on WVA's custom metrics.

Choose between the following approaches:

  • With HPA - Use Kubernetes HPA for autoscaling based on WVA's custom metrics
  • With KEDA - Use KEDA for autoscaling based on WVA's custom metrics
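
In either case, the VariantAutoscaling resource itself looks the same. A minimal sketch is shown below; the apiVersion is an assumption (verify the exact group/version in the CRD Reference), and all names are illustrative:

apiVersion: llmd.ai/v1alpha1     # assumed group/version; verify via the CRD Reference
kind: VariantAutoscaling
metadata:
  name: llama-8b-a100            # illustrative name
  namespace: production
spec:
  modelID: "meta/llama-3.1-8b"   # OpenAI API compatible model identifier
  scaleTargetRef:
    kind: Deployment
    name: llama-8b-a100          # the Deployment to scale
  variantCost: "10.0"            # optional; defaults to "10.0"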

Operating Mode

WVA operates in saturation mode.

Saturation Mode

  • Behavior: Reactive scaling based on saturation detection
  • How It Works: Monitors KV cache usage and queue lengths; scales when thresholds are exceeded
  • Configuration: Uses the wva-saturation-scaling-config ConfigMap
  • Pros: Fast response (<30s), predictable, no model training needed
  • Cons: Reactive (scales after saturation detected)

See Saturation Analyzer Documentation for configuration details.

ConfigMaps

WVA uses ConfigMaps for cluster-wide configuration.

Configuration Precedence

Configuration values are resolved with the following precedence (highest to lowest):

  1. CLI Flags — only when explicitly set on the command line (highest priority)
  2. Environment Variables
  3. ConfigMap (in workload-variant-autoscaler-system namespace)
  4. Defaults (lowest priority)

Note: CLI flag defaults do not override environment variables or ConfigMap values. Only flags that are explicitly passed on the command line take precedence. For example, if --leader-elect is not passed but LEADER_ELECT=true is set in the environment, the environment value (true) is used.

Example:

# CLI flag explicitly set (highest priority)
--metrics-bind-address=":8443"

# Environment variable (used when flag is not explicitly set)
export METRICS_BIND_ADDRESS=":8080"

# ConfigMap (used when neither flag nor env is set)
# wva-variantautoscaling-config
data:
  METRICS_BIND_ADDRESS: ":9090"

# Default (used if none of the above are set)
# Default: "0" (disabled)

Immutable vs Mutable Parameters

Immutable Parameters (Require Restart)

These settings cannot be changed at runtime via ConfigMap updates. Attempts to change them will:

  • Be rejected by the controller
  • Emit a Warning Kubernetes event
  • Require a controller restart to take effect

Immutable Parameters:

  • PROMETHEUS_BASE_URL - Prometheus connection endpoint
  • METRICS_BIND_ADDRESS - Metrics bind address
  • HEALTH_PROBE_BIND_ADDRESS - Health probe bind address
  • LEADER_ELECTION_ID - Leader election coordination ID
  • TLS certificate paths (webhook and metrics certificates)

Example - Attempting to Change Immutable Parameter:

# This will be rejected and emit a Warning event
apiVersion: v1
kind: ConfigMap
metadata:
  name: wva-variantautoscaling-config
  namespace: workload-variant-autoscaler-system
data:
  PROMETHEUS_BASE_URL: "https://new-prometheus:9090"  # Requires restart

Check for Rejected Changes:

# View Warning events
kubectl get events -n workload-variant-autoscaler-system \
  --field-selector reason=ImmutableConfigChangeRejected

# Controller logs
kubectl logs -n workload-variant-autoscaler-system \
  deployment/workload-variant-autoscaler-controller-manager | \
  grep "Attempted to change immutable parameters"

Mutable Parameters (Runtime Updates)

These settings can be changed at runtime via ConfigMap updates without restarting the controller:

Mutable Parameters:

  • GLOBAL_OPT_INTERVAL - Optimization interval (default: 60s)
  • Saturation scaling configuration (via wva-saturation-scaling-config ConfigMap)
  • Scale-to-zero configuration (via wva-model-scale-to-zero-config ConfigMap)
  • Prometheus cache settings

Example - Runtime Configuration Update:

# This will be applied immediately without restart
apiVersion: v1
kind: ConfigMap
metadata:
  name: wva-variantautoscaling-config
  namespace: workload-variant-autoscaler-system
data:
  GLOBAL_OPT_INTERVAL: "120s"  # Applied immediately

Immutable ConfigMap (Security Hardening)

For enhanced security, you can make the entire ConfigMap immutable using the Helm chart option wva.configMap.immutable: true. This provides additional protection beyond the controller's runtime validation.

Security Benefits:

  • Prevents accidental changes: Kubernetes will reject any update attempts
  • Protects against malicious modifications: Even with RBAC access, the ConfigMap cannot be modified
  • Ensures configuration integrity: Configuration can only be changed through controlled Helm upgrades
  • Reduces attack surface: Eliminates runtime configuration as a potential attack vector

Trade-offs:

  • Runtime updates disabled: All configuration changes (including mutable parameters) require ConfigMap recreation
  • Change process: To update configuration:
    1. Delete the ConfigMap: kubectl delete configmap <name> -n <namespace>
    2. Update Helm values and upgrade: helm upgrade ... --set wva.configMap.immutable=false ...
    3. Restart the controller pod
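
Put together, the recreation flow might look like the following sketch (the ConfigMap and deployment names match those used elsewhere in this guide; add your own --set/-f value overrides to the upgrade):

# 1. Delete the immutable ConfigMap
kubectl delete configmap wva-variantautoscaling-config \
  -n workload-variant-autoscaler-system

# 2. Upgrade the release so Helm recreates the ConfigMap with updated values
helm upgrade workload-variant-autoscaler ./charts/workload-variant-autoscaler \
  -n workload-variant-autoscaler-system

# 3. Restart the controller to pick up the recreated ConfigMap
kubectl rollout restart deployment/workload-variant-autoscaler-controller-manager \
  -n workload-variant-autoscaler-system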

Enable Immutable ConfigMap:

# Via Helm values
helm install workload-variant-autoscaler ./charts/workload-variant-autoscaler \
  -n workload-variant-autoscaler-system \
  --set wva.configMap.immutable=true

When to Use:

  • Production environments with strict security requirements
  • Multi-tenant clusters where configuration tampering is a concern
  • Compliance requirements that mandate immutable infrastructure
  • High-security deployments where configuration changes should be audited and controlled

When NOT to Use:

  • Development environments where rapid iteration is needed
  • Scenarios requiring frequent runtime config updates (e.g., A/B testing, dynamic tuning)
  • Environments where ConfigMap updates are part of normal operations

Namespace-Local ConfigMap Overrides

WVA supports namespace-local ConfigMap overrides that allow different namespaces to have different configuration settings without requiring separate controller instances. This provides a middle ground between global configuration and full multi-controller isolation.

Use Cases:

  • Different teams sharing a cluster with different SLO requirements
  • Staging vs production namespaces with different scaling thresholds
  • Gradual rollout of new thresholds in one namespace before applying cluster-wide
  • Environment-specific tuning without operational overhead

How It Works:

  1. Global ConfigMap (in controller namespace): Provides default configuration for all namespaces
  2. Namespace-Local ConfigMap (in target namespace): Overrides global settings for that namespace only
  3. Resolution Order: Namespace-local > Global (automatic fallback if namespace-local doesn't exist)

Well-Known ConfigMap Names:

The following ConfigMap names are recognized for namespace-local overrides:

  • wva-saturation-scaling-config - Saturation scaling thresholds
  • wva-model-scale-to-zero-config - Scale-to-zero configuration

Example: Namespace-Local Saturation Config

# Global ConfigMap (in workload-variant-autoscaler-system namespace)
apiVersion: v1
kind: ConfigMap
metadata:
  name: wva-saturation-scaling-config
  namespace: workload-variant-autoscaler-system
data:
  default: |
    kvCacheThreshold: 0.80
    queueLengthThreshold: 5
    kvSpareTrigger: 0.10
    queueSpareTrigger: 3
---
# Namespace-Local Override (in production namespace)
apiVersion: v1
kind: ConfigMap
metadata:
  name: wva-saturation-scaling-config  # Same well-known name
  namespace: production  # Different namespace
data:
  default: |
    kvCacheThreshold: 0.70  # More aggressive for production
    queueLengthThreshold: 3
    kvSpareTrigger: 0.20
    queueSpareTrigger: 5

Result: VAs in the production namespace use production thresholds (0.70), while VAs in other namespaces use global defaults (0.80).

Example: Namespace-Local Scale-to-Zero Config

# Global ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: wva-model-scale-to-zero-config
  namespace: workload-variant-autoscaler-system
data:
  model1: |
    model_id: model1
    enable_scale_to_zero: true
    retention_period: 10m
---
# Namespace-Local Override
apiVersion: v1
kind: ConfigMap
metadata:
  name: wva-model-scale-to-zero-config
  namespace: staging
data:
  model1: |
    model_id: model1
    enable_scale_to_zero: false  # Disable scale-to-zero in staging
    retention_period: 5m

ConfigMap Deletion:

When a namespace-local ConfigMap is deleted, WVA automatically falls back to the global configuration. No restart required - the fallback happens immediately.

# Delete namespace-local ConfigMap
kubectl delete configmap wva-saturation-scaling-config -n production

# VAs in production namespace now use global config

Namespace Discovery:

WVA uses a hybrid approach to discover namespaces for namespace-local ConfigMap watching:

  1. Automatic (VA-based): WVA automatically tracks namespaces that have VariantAutoscaling resources. This is the default behavior - no configuration needed.

  2. Explicit Opt-in (Label-based): You can opt-in namespaces by adding the label wva.llmd.ai/config-enabled=true to a namespace. This enables namespace-local ConfigMap watching even before VariantAutoscaling resources are created, avoiding race conditions.

Example: Opt-in a namespace for namespace-local ConfigMaps:

# Label a namespace to enable namespace-local ConfigMap watching
kubectl label namespace production wva.llmd.ai/config-enabled=true

# Now you can create namespace-local ConfigMaps before VAs exist
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: wva-saturation-scaling-config
  namespace: production
data:
  default: |
    kvCacheThreshold: 0.70
    queueLengthThreshold: 3
EOF

When to use label-based opt-in:

  • Creating namespace-local ConfigMaps before VariantAutoscaling resources exist
  • Explicitly controlling which namespaces can have overrides (security/audit)
  • Multi-controller isolation (each controller can watch different label values)

Limitations:

  • Main ConfigMap (wva-variantautoscaling-config) is only supported globally, not as a namespace-local override
  • Optimization interval (GLOBAL_OPT_INTERVAL) is global only
  • Prometheus cache settings are global only

Relationship with Multi-Controller Isolation:

Namespace-local ConfigMaps are complementary to multi-controller isolation:

  • Namespace-local ConfigMaps: Single controller, configuration isolation only
  • Multi-controller isolation: Multiple controllers, complete operational isolation

They can be used together - you can have multiple controller instances, each using namespace-local configs within their scope.

Main Configuration ConfigMap

The main configuration ConfigMap (wva-variantautoscaling-config) supports both static and dynamic settings:

apiVersion: v1
kind: ConfigMap
metadata:
  name: wva-variantautoscaling-config
  namespace: workload-variant-autoscaler-system
data:
  # Mutable: Optimization interval (can be changed at runtime)
  GLOBAL_OPT_INTERVAL: "60s"

  # Immutable: Prometheus connection (requires restart if changed)
  PROMETHEUS_BASE_URL: "https://prometheus:9090"

  # Immutable: Feature flags (require restart if changed)
  WVA_SCALE_TO_ZERO: "true"
  WVA_LIMITED_MODE: "false"

Note: The ConfigMap name is auto-generated by Helm based on the release name. For Kustomize deployments, set the CONFIG_MAP_NAME environment variable in the deployment manifest.

Configuration via Environment Variables

Many settings can be configured via environment variables (useful for containerized deployments):

# Deployment manifest
env:
  # Prometheus connection (immutable - requires restart to change)
  - name: PROMETHEUS_BASE_URL
    value: "https://prometheus:9090"

  # Optional: Override ConfigMap name
  - name: CONFIG_MAP_NAME
    value: "my-custom-config"

  # Optional: Override namespace
  - name: POD_NAMESPACE
    value: "workload-variant-autoscaler-system"

See: Prometheus Integration for complete Prometheus configuration options.

Configuration via CLI Flags

Infrastructure settings can be configured via CLI flags. Only flags explicitly passed on the command line take highest precedence; unset flags fall through to environment variables, ConfigMap, and then defaults.

# Start controller with custom settings
./manager \
  --metrics-bind-address=":8443" \
  --health-probe-bind-address=":8081" \
  --leader-elect \
  --leader-election-lease-duration=60s \
  --leader-election-renew-deadline=50s \
  --leader-election-retry-period=10s \
  --rest-client-timeout=60s

Configuration Parameter Reference

The following table lists all static configuration parameters with their CLI flag, environment variable, ConfigMap key, type, and default value. All three sources share the same key name (except CLI flags which use kebab-case).

Note: CLI flags are typically set in the Helm chart or deployment manifest, not directly.

| Parameter | CLI Flag | Env Var / ConfigMap Key | Type | Default | Description |
|---|---|---|---|---|---|
| Metrics bind address | --metrics-bind-address | METRICS_BIND_ADDRESS | string | 0 | Metrics endpoint bind address (:8443 for HTTPS, :8080 for HTTP, 0 to disable) |
| Health probe address | --health-probe-bind-address | HEALTH_PROBE_BIND_ADDRESS | string | :8081 | Health probe endpoint bind address |
| Leader election | --leader-elect | LEADER_ELECT | bool | false | Enable leader election for HA |
| Leader election ID | (none) | LEADER_ELECTION_ID | string | 72dd1cf1.llm-d.ai | Leader election coordination ID |
| Lease duration | --leader-election-lease-duration | LEADER_ELECTION_LEASE_DURATION | duration | 60s | Duration non-leaders wait before force-acquiring leadership |
| Renew deadline | --leader-election-renew-deadline | LEADER_ELECTION_RENEW_DEADLINE | duration | 50s | Duration the leader retries refreshing before giving up |
| Retry period | --leader-election-retry-period | LEADER_ELECTION_RETRY_PERIOD | duration | 10s | Duration between retry attempts |
| REST timeout | --rest-client-timeout | REST_CLIENT_TIMEOUT | duration | 60s | Timeout for Kubernetes API server REST calls |
| Secure metrics | --metrics-secure | METRICS_SECURE | bool | true | Serve metrics endpoint via HTTPS |
| Enable HTTP/2 | --enable-http2 | ENABLE_HTTP2 | bool | false | Enable HTTP/2 for metrics and webhook servers |
| Watch namespace | --watch-namespace | WATCH_NAMESPACE | string | "" | Namespace to watch (empty = all namespaces) |
| Log verbosity | -v | V | int | 2 | Log level verbosity |
| Webhook cert path | --webhook-cert-path | WEBHOOK_CERT_PATH | string | "" | Directory containing the webhook certificate |
| Webhook cert name | --webhook-cert-name | WEBHOOK_CERT_NAME | string | tls.crt | Webhook certificate file name |
| Webhook cert key | --webhook-cert-key | WEBHOOK_CERT_KEY | string | tls.key | Webhook key file name |
| Metrics cert path | --metrics-cert-path | METRICS_CERT_PATH | string | "" | Directory containing the metrics server certificate |
| Metrics cert name | --metrics-cert-name | METRICS_CERT_NAME | string | tls.crt | Metrics server certificate file name |
| Metrics cert key | --metrics-cert-key | METRICS_CERT_KEY | string | tls.key | Metrics key file name |
| Scale to zero | (none) | WVA_SCALE_TO_ZERO | bool | false | Enable scale-to-zero feature |
| Limited mode | (none) | WVA_LIMITED_MODE | bool | false | Enable limited mode |
| Scale-from-zero concurrency | (none) | SCALE_FROM_ZERO_ENGINE_MAX_CONCURRENCY | int | 10 | Max concurrent scale-from-zero operations |

Fail-Fast Validation

WVA implements fail-fast validation: if required configuration is missing or invalid, the controller will:

  • Not start (exits with error code 1)
  • Log clear error messages indicating what's missing
  • Prevent running with invalid configuration

Required Configuration:

  • PROMETHEUS_BASE_URL - Must be set via environment variable or ConfigMap

Check Startup Errors:

# View controller logs for validation errors
kubectl logs -n workload-variant-autoscaler-system \
  deployment/workload-variant-autoscaler-controller-manager | \
  grep -i "config\|validation\|error"

# Check pod status
kubectl get pods -n workload-variant-autoscaler-system
# If CrashLoopBackOff, check logs for config errors

Configuration Update Behavior

Static Config Updates:

  • Changes to immutable parameters are rejected at runtime
  • Controller emits Warning events and logs errors
  • Action Required: Restart the controller to apply changes

Dynamic Config Updates:

  • Changes to mutable parameters are applied immediately
  • Controller logs the changes (old → new values)
  • No restart required

Monitor Configuration Changes:

# Watch for config update logs
kubectl logs -n workload-variant-autoscaler-system \
  deployment/workload-variant-autoscaler-controller-manager -f | \
  grep "Updated.*config"

# Example output:
# "Updated optimization interval" old=60s new=120s
# "Updated saturation config" oldEntries=2 newEntries=3

Configuration Options

Required Fields

The VariantAutoscaling CR has the following required fields:

  • scaleTargetRef: Reference to the target Deployment to scale (follows HPA pattern)
    • kind: Resource kind (e.g., "Deployment")
    • name: Name of the deployment
  • modelID: OpenAI API compatible identifier for your model (e.g., "meta/llama-3.1-8b")

Optional Fields

  • variantCost: Cost per replica for saturation-based cost optimization (default: "10.0")
    • Must be a string matching pattern ^\d+(\.\d+)?$ (numeric string)
    • Used by capacity analyzer when multiple variants can handle the load

Cost Configuration

variantCost (Optional)

Specifies the cost per replica for this variant, used in saturation-based cost optimization.

spec:
  modelID: "meta/llama-3.1-8b"
  variantCost: "15.5"  # Cost per replica (default: "10.0")

Default: "10.0" Validation: Must be a string matching pattern ^\d+(\.\d+)?$ (numeric string)

Use Cases:

  • Differentiated Pricing: Higher cost for premium accelerators (H100) vs. standard (A100)
  • Multi-Tenant Cost Tracking: Assign different costs per customer/tenant
  • Cost-Based Optimization: Saturation analyzer prefers lower-cost variants when multiple variants can handle load

Example:

# Premium variant (H100, higher cost)
spec:
  modelID: "meta/llama-3.1-70b"
  variantCost: "80.0"

# Standard variant (A100, lower cost)
spec:
  modelID: "meta/llama-3.1-70b"
  variantCost: "40.0"

Behavior:

  • Saturation analyzer uses variantCost when deciding which variant to scale
  • If costs are equal, chooses variant with most available capacity
  • Does not affect model-based optimization

Advanced Options

See CRD Reference for advanced configuration options.

Best Practices

Environment Variables

WVA supports configuration via environment variables for operational settings:

Prometheus Configuration:

  • PROMETHEUS_BASE_URL: Prometheus server URL (required for metrics collection)
  • PROMETHEUS_TLS_INSECURE_SKIP_VERIFY: Skip TLS verification (development only)
  • PROMETHEUS_CA_CERT_PATH: CA certificate path for TLS
  • PROMETHEUS_CLIENT_CERT_PATH: Client certificate for mutual TLS
  • PROMETHEUS_CLIENT_KEY_PATH: Client key for mutual TLS
  • PROMETHEUS_SERVER_NAME: Expected server name in TLS certificate
  • PROMETHEUS_BEARER_TOKEN: Bearer token for authentication
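
For example, a TLS-hardened Prometheus connection could be wired into the deployment manifest as follows (the mount path and server name are illustrative assumptions; the variable names come from the list above):

env:
  - name: PROMETHEUS_BASE_URL
    value: "https://prometheus:9090"
  - name: PROMETHEUS_CA_CERT_PATH
    value: "/etc/prometheus-tls/ca.crt"  # illustrative path where a CA cert is mounted
  - name: PROMETHEUS_SERVER_NAME
    value: "prometheus.monitoring.svc"   # illustrative; must match the server certificate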

Other Configuration:

  • CONFIG_MAP_NAME: ConfigMap name (default: auto-generated from Helm release)
  • POD_NAMESPACE: Controller namespace (auto-injected by Kubernetes)

See Prometheus Integration for detailed Prometheus configuration.

Cost Optimization

  • Assign higher costs to premium accelerators (H100) and lower costs to standard ones (A100)
  • Use consistent cost values across variants of the same model to enable fair comparison
  • The saturation analyzer will prefer scaling lower-cost variants when multiple can handle the load

Deployment Configuration

  • Always specify scaleTargetRef explicitly to avoid ambiguity
  • Use descriptive names that indicate the model and accelerator type
  • Add labels to deployments and VAs for easier operational management
  • Monitor VA status conditions to detect issues with target deployments

Monitoring Configuration

WVA exposes metrics for monitoring and integrates with HPA for automatic scaling.

Prometheus Metrics

See Prometheus Integration for details on the metrics exposed by WVA.

Multi-Controller Environments

For complete documentation, see Multi-Controller Isolation Guide.

Troubleshooting Configuration

Common Issues

Deployment Not Found:

  • Verify the deployment name in scaleTargetRef matches exactly
  • Check that the deployment exists in the same namespace as the VA
  • Review VA status conditions: kubectl get va <name> -o yaml

Metrics Not Available:

  • Ensure Prometheus is properly configured and scraping vLLM metrics
  • Verify ServiceMonitor is created for the vLLM deployment
  • Check VA status condition MetricsAvailable
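
As a quick check, the condition can be read directly (a sketch; the MetricsAvailable condition type is taken from the bullet above):

# Print the MetricsAvailable condition of a VariantAutoscaling resource
kubectl get va <name> -n <namespace> \
  -o jsonpath="{.status.conditions[?(@.type=='MetricsAvailable')]}"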

Cost Optimization Not Working:

  • Verify variantCost is specified for all variants of the same model
  • Check that variants have different costs to enable cost-based selection
  • Review saturation analyzer logs for decision-making process
  • Check if min replicas can be reduced
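
When digging into the analyzer's decisions, a log search along these lines can help (the grep pattern is illustrative, not a guaranteed log format):

# Look for saturation/cost decision logs from the controller
kubectl logs -n workload-variant-autoscaler-system \
  deployment/workload-variant-autoscaler-controller-manager | \
  grep -i "saturation"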

Next Steps