The Workload Variant Autoscaler supports saturation-based scaling using KV cache utilization and queue length metrics. This feature is enabled by default and configured via a ConfigMap.
Key features:
- ✅ ConfigMap-based configuration with global defaults and per-model overrides
- ✅ Efficient caching with single read on startup (zero API calls during reconciliation)
- ✅ Automatic reload via ConfigMap watch (immediate response to changes)
- ✅ Thread-safe concurrent access with RWMutex
- ✅ Graceful degradation to defaults if ConfigMap missing
The saturation scaling configuration is stored in a ConfigMap named `capacity-scaling-config` in the Workload Variant Autoscaler controller's namespace.
Location: `deploy/configmap-capacity-scaling.yaml`
| Parameter | Type | Description | Default |
|---|---|---|---|
| `kvCacheThreshold` | float64 | Replica is considered saturated if KV cache utilization ≥ threshold (0.0-1.0) | 0.80 |
| `queueLengthThreshold` | int | Replica is considered saturated if queue length ≥ threshold | 5 |
| `kvSpareTrigger` | float64 | Scale-up signal if average spare KV capacity < trigger (0.0-1.0) | 0.10 |
| `queueSpareTrigger` | int | Scale-up signal if average spare queue capacity < trigger | 3 |
The default configuration is automatically used if:
- The ConfigMap is not deployed
- The ConfigMap exists but has no `default` entry
- An entry fails validation
Default values:
```yaml
kvCacheThreshold: 0.80
queueLengthThreshold: 5
kvSpareTrigger: 0.1
queueSpareTrigger: 3
```

The saturation analyzer uses a spare-capacity model to determine when to scale up. Instead of waiting for replicas to become fully saturated, WVA proactively scales when the average spare capacity across non-saturated replicas falls below the configured triggers.
Scale-up logic:

- Calculate spare capacity for each non-saturated replica:
  - Spare KV capacity = `kvCacheThreshold - current_kv_usage`
  - Spare queue capacity = `queueLengthThreshold - current_queue_length`
- Average across non-saturated replicas:
  - WVA computes the average spare capacity across all healthy (non-saturated) replicas
- Trigger scale-up when spare capacity is low:
  - If `avg_spare_kv < kvSpareTrigger` OR `avg_spare_queue < queueSpareTrigger`
  - Scale-up is triggered to add capacity before existing replicas saturate
- Cascade scaling prevention:
  - Variants with pending replicas (pods that exist but aren't ready yet) are skipped during scale-up
  - This prevents repeatedly scaling the same variant while previous scale-up operations complete
  - Pod startup can take 2-7 minutes (model loading, health checks)
Example scenario:
With `kvCacheThreshold = 0.80` and `kvSpareTrigger = 0.10`:

- Replica A: 65% KV cache usage → spare capacity: 0.15
- Replica B: 72% KV cache usage → spare capacity: 0.08
- Average spare KV: (0.15 + 0.08) / 2 = 0.115
- Since 0.115 > 0.10, no scale-up yet
- If Replica B increases to 76%: average spare = (0.15 + 0.04) / 2 = 0.095 < 0.10 → scale-up triggered
This proactive approach ensures adequate headroom and prevents request drops by scaling before saturation occurs.
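The scale-up decision described above can be sketched in Go. This is a minimal illustration of the documented spare-capacity logic, not the actual WVA implementation; the type and function names are invented for the example:

```go
package main

import "fmt"

// capacityConfig mirrors the documented parameters (names illustrative).
type capacityConfig struct {
	KvCacheThreshold     float64
	QueueLengthThreshold int
	KvSpareTrigger       float64
	QueueSpareTrigger    int
}

// replicaMetrics holds the per-replica signals the analyzer consumes.
type replicaMetrics struct {
	KvUsage     float64 // current KV cache utilization (0.0-1.0)
	QueueLength int     // current request queue length
}

// shouldScaleUp averages spare capacity across non-saturated replicas and
// signals scale-up when either average falls below its trigger.
func shouldScaleUp(cfg capacityConfig, replicas []replicaMetrics) bool {
	var spareKv, spareQueue float64
	n := 0
	for _, r := range replicas {
		// Saturated replicas are excluded from the spare-capacity average.
		if r.KvUsage >= cfg.KvCacheThreshold || r.QueueLength >= cfg.QueueLengthThreshold {
			continue
		}
		spareKv += cfg.KvCacheThreshold - r.KvUsage
		spareQueue += float64(cfg.QueueLengthThreshold - r.QueueLength)
		n++
	}
	if n == 0 {
		return true // every replica is saturated: scale up
	}
	return spareKv/float64(n) < cfg.KvSpareTrigger ||
		spareQueue/float64(n) < float64(cfg.QueueSpareTrigger)
}

func main() {
	cfg := capacityConfig{KvCacheThreshold: 0.80, QueueLengthThreshold: 5, KvSpareTrigger: 0.10, QueueSpareTrigger: 3}
	// Replicas A (65%) and B (72%): average spare KV = 0.115, no scale-up.
	fmt.Println(shouldScaleUp(cfg, []replicaMetrics{{0.65, 0}, {0.72, 0}})) // false
	// B rises to 76%: average spare KV = 0.095 < 0.10, scale-up.
	fmt.Println(shouldScaleUp(cfg, []replicaMetrics{{0.65, 0}, {0.76, 0}})) // true
}
```

Note that the example uses a strict `<` comparison against the triggers, matching the parameter descriptions above.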
For detailed implementation, see: Saturation Analyzer Documentation
The End Point Picker (EPP) is an intelligent request routing component in the InferenceScheduler that selects the optimal inference server replica to handle each incoming request. EPP monitors replica capacity metrics (KV cache utilization, queue depth) along with other replica metrics, and uses scoring algorithms to route each request to a suitable replica.
EPP Deployment Model: Each model has a 1-to-1 relationship with its EPP instance. Every model served by the inference infrastructure has a dedicated EPP component that routes requests specifically to that model's replicas.
Example deployment pattern:
- Model `Qwen/Qwen3-0.6B` in namespace `llm-d-autoscaler` → dedicated EPP instance `gaie-workload-autoscaler-epp`
- Model `ibm/granite-13b` in namespace `production` → dedicated EPP instance `gaie-production-epp`
- Each model deployment has its own EPP instance (naming follows namespace/workload convention)
This 1-to-1 architecture means that saturation detection and request routing decisions are model-specific, with each EPP instance monitoring only its associated model's replicas.
For optimal cluster performance, we strongly recommend using the same threshold values for both WVA (Workload Variant Autoscaler) and InferenceScheduler (End Point Picker) for each model deployment.
Using aligned thresholds ensures consistent capacity management across the cluster and prevents request drop situations.
Why threshold alignment matters:
-
Reduced Request Drop Rates: When WVA and EPP use the same saturation thresholds, the scheduler will avoid routing requests to replicas that WVA already considers saturated. This prevents the scheduler from overloading replicas that are about to trigger scale-up.
-
Consistent Capacity Assessment: Both components evaluate replica capacity using the same criteria (KV cache utilization and queue length), ensuring coordinated behavior across the entire inference stack.
-
Improved GPU Utilization: Aligned thresholds allow the cluster to maintain optimal GPU utilization without oversaturation. The scheduler respects the same capacity boundaries that drive autoscaling decisions.
-
Faster Response to Load Changes: When both components agree on saturation thresholds, the system responds more quickly to load changes with coordinated routing and scaling actions.
```yaml
# WVA Configuration (capacity-scaling-config ConfigMap)
apiVersion: v1
kind: ConfigMap
metadata:
  name: capacity-scaling-config
  namespace: <workload-variant-autoscaler-namespace>
data:
  default: |
    kvCacheThreshold: 0.80      # Should match EPP kvCacheUtilThreshold
    queueLengthThreshold: 5     # Should match EPP queueDepthThreshold
    kvSpareTrigger: 0.10        # WVA-specific (scale-up trigger)
    queueSpareTrigger: 3        # WVA-specific (scale-up trigger)
```

The InferenceScheduler EPP component uses the gateway-api-inference-extension saturation detector to identify cluster overload.
Per-Model Configuration: Since each model has its own dedicated EPP instance, saturation detection is configured per model deployment. This allows different models to have different saturation thresholds based on their specific characteristics and SLO requirements.
```yaml
# EPP Saturation Detector Configuration (per-model EPP instance)
saturationDetector:
  ...
  queueDepthThreshold: 5     # Default: 5 - Backend waiting queue size threshold
  kvCacheUtilThreshold: 0.8  # Default: 0.8 - KV cache utilization threshold (0.0-1.0)
  ...
```

Configuration Notes:
- All parameters are optional; omitting them applies the documented defaults
- EPP configuration is read only at startup; changes require an EPP pod restart
- Unlike WVA, EPP does not currently support live ConfigMap updates
- Each EPP instance (one per model) can have different threshold values
| Concept | WVA Field | EPP Field | Aligned Default | Description |
|---|---|---|---|---|
| KV Cache Saturation | `kvCacheThreshold` | `kvCacheUtilThreshold` | 0.80 (80%) | Replica is saturated when KV cache ≥ threshold |
| Queue Saturation | `queueLengthThreshold` | `queueDepthThreshold` | 5 | Replica is saturated when queue length ≥ threshold |
| Scale-Up Trigger (KV) | `kvSpareTrigger` | (not applicable) | 0.10 (10%) | WVA-only: trigger scale-up when spare KV < threshold |
| Scale-Up Trigger (Queue) | `queueSpareTrigger` | (not applicable) | 3 | WVA-only: trigger scale-up when spare queue < threshold |
Choose thresholds based on your workload characteristics and SLO requirements:
| Workload Type | kvCacheThreshold | queueLengthThreshold | Rationale |
|---|---|---|---|
| Conservative (Default) | 0.80 | 5 | Balanced performance and utilization |
| Aggressive (High GPU utilization) | 0.90 | 15 | Maximize GPU usage, higher latency variance |
| Strict (Low latency SLO) | 0.70 | 3 | Prioritize responsiveness, lower utilization |
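For example, the strict low-latency profile from the table above could be expressed in the ConfigMap's `default` entry as follows. The spare triggers are not part of the table; the values shown are illustrative starting points, not recommendations:

```yaml
data:
  default: |
    kvCacheThreshold: 0.70      # Strict profile: mark replicas saturated earlier
    queueLengthThreshold: 3     # Strict profile: keep queues short
    kvSpareTrigger: 0.10        # Illustrative; tune per workload
    queueSpareTrigger: 3        # Illustrative; tune per workload
```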
Update the `capacity-scaling-config` ConfigMap:

```shell
kubectl edit cm capacity-scaling-config -n <workload-variant-autoscaler-namespace>
```

Changes take effect immediately (WVA watches the ConfigMap and auto-reloads).
Important: Since each model has its own dedicated EPP instance (1-to-1 relationship), you must configure the EPP instance for each specific model deployment separately.
Current approach:

1. Identify the EPP instance for your target model:
   ```shell
   # Example: find the EPP deployment for a specific model's namespace
   kubectl get deployments -n llm-d-autoscaler | grep epp
   ```
2. Update that EPP instance's environment variables or configuration file for the specific model
3. Restart the EPP pod for that model:
   ```shell
   # Restart the specific model's EPP instance
   kubectl rollout restart deployment/gaie-<model-name>-epp -n <namespace>
   ```
Example for multiple models:

```shell
# Model 1: granite-13b in production
kubectl rollout restart deployment/gaie-granite-13b-epp -n production

# Model 2: llama-70b in lab
kubectl rollout restart deployment/gaie-llama-70b-epp -n lab
```

WVA verification:

```shell
kubectl get cm capacity-scaling-config -n <workload-variant-autoscaler-namespace> -o yaml
```

EPP verification (per-model instance):

```shell
# Check a specific model's EPP pod logs for the loaded configuration
kubectl logs -n <namespace> deployment/gaie-<model-name>-epp | grep -i "saturation\|threshold"

# Example: verify EPP configuration for the granite-13b model in production
kubectl logs -n production deployment/gaie-granite-13b-epp | grep -i "saturation\|threshold"
```
- Core Thresholds Must Match Per Model:
  - `kvCacheThreshold` (WVA) = `kvCacheUtilThreshold` (EPP)
  - `queueLengthThreshold` (WVA) = `queueDepthThreshold` (EPP)
  - Important: since each model has its own EPP instance, ensure thresholds align for each model deployment individually
- Per-Model Configuration Strategy:
  - Use WVA's per-model override feature to set model-specific thresholds
  - Configure the corresponding EPP instance with matching thresholds
  - Document the threshold mapping for each model deployment
  - Example: if `ibm/granite-13b` uses `kvCacheThreshold: 0.85` in WVA, its dedicated EPP must use `kvCacheUtilThreshold: 0.85`
- WVA-Specific Parameters (`kvSpareTrigger`, `queueSpareTrigger`):
  - These control WVA's scale-up aggressiveness
  - Should be set lower than the saturation thresholds
  - Provide headroom before replicas become saturated
  - Recommended: `kvSpareTrigger = kvCacheThreshold - 0.1 to 0.2`
- Testing Threshold Changes:
  - Test in a development environment first
  - Monitor impact on request drop rate and latency for the specific model
  - Adjust based on observed behavior
  - Remember to update both WVA and the model's EPP instance
Simply deploy the controller without the ConfigMap. The system will log a warning and use hardcoded defaults:
```
WARN Saturation scaling ConfigMap not found, using hardcoded defaults
```
Edit `deploy/configmap-capacity-scaling.yaml`:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: capacity-scaling-config
  namespace: <workload-variant-autoscaler-namespace>
data:
  default: |
    kvCacheThreshold: 0.75
    queueLengthThreshold: 10
    kvSpareTrigger: 0.15
    queueSpareTrigger: 5
```

Apply the ConfigMap:

```shell
kubectl apply -f deploy/configmap-capacity-scaling.yaml
```

Note: Changes take effect immediately! The controller watches the ConfigMap and automatically:
- Reloads the cache when changes are detected
- Triggers reconciliation of all VariantAutoscaling resources
- Applies the new configuration without requiring pod restart
Add model-specific configuration entries to override defaults for specific model/namespace pairs:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: capacity-scaling-config
  namespace: <workload-variant-autoscaler-namespace>
data:
  default: |
    kvCacheThreshold: 0.80
    queueLengthThreshold: 5
    kvSpareTrigger: 0.1
    queueSpareTrigger: 3
  # Override for granite model in production namespace
  granite-13b-production: |
    model_id: ibm/granite-13b
    namespace: production
    kvCacheThreshold: 0.85
    kvSpareTrigger: 0.15
  # Override for llama model in lab namespace
  llama-70b-lab: |
    model_id: meta/llama-70b
    namespace: lab
    queueLengthThreshold: 20
    queueSpareTrigger: 10
```

Key points:
- Entry keys (e.g., `granite-13b-production`) can be any descriptive name
- Each override must include `model_id` and `namespace` fields
- Only specified fields are overridden; others inherit from `default`
- Multiple overrides can exist for different model/namespace combinations
You can override only specific parameters while inheriting the rest from defaults:
```yaml
my-model-override: |
  model_id: my-org/my-model
  namespace: my-namespace
  kvCacheThreshold: 0.90
  # Other fields inherit from default
```

The controller validates all configuration entries on load. Invalid entries are logged and skipped:
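One plausible sketch of the partial-override semantics in Go. This is an assumption, not the actual implementation: the version below treats zero-valued fields as "unset", so a field cannot be explicitly overridden to 0; the real controller may track presence differently (for example, by unmarshalling the override YAML into a copy of the defaults):

```go
package main

import "fmt"

// CapacityScalingConfig mirrors the documented fields; this sketch omits
// YAML tags for brevity.
type CapacityScalingConfig struct {
	ModelID              string
	Namespace            string
	KvCacheThreshold     float64
	QueueLengthThreshold int
	KvSpareTrigger       float64
	QueueSpareTrigger    int
}

// Merge returns a copy of base with any non-zero fields from override applied;
// fields left at their zero value in the override inherit from base.
func (base CapacityScalingConfig) Merge(override CapacityScalingConfig) CapacityScalingConfig {
	out := base
	if override.KvCacheThreshold != 0 {
		out.KvCacheThreshold = override.KvCacheThreshold
	}
	if override.QueueLengthThreshold != 0 {
		out.QueueLengthThreshold = override.QueueLengthThreshold
	}
	if override.KvSpareTrigger != 0 {
		out.KvSpareTrigger = override.KvSpareTrigger
	}
	if override.QueueSpareTrigger != 0 {
		out.QueueSpareTrigger = override.QueueSpareTrigger
	}
	return out
}

func main() {
	def := CapacityScalingConfig{KvCacheThreshold: 0.80, QueueLengthThreshold: 5, KvSpareTrigger: 0.10, QueueSpareTrigger: 3}
	override := CapacityScalingConfig{KvCacheThreshold: 0.90} // only one field set
	merged := def.Merge(override)
	fmt.Printf("%+v\n", merged) // queue settings inherited from the default
}
```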
- `KvCacheThreshold`: must be between 0.0 and 1.0
- `QueueLengthThreshold`: must be ≥ 0
- `KvSpareTrigger`: must be between 0.0 and 1.0
- `QueueSpareTrigger`: must be ≥ 0
- Consistency: `kvCacheThreshold` must be ≥ `kvSpareTrigger`
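The validation rules above can be sketched as a `Validate` method. This is a hedged reconstruction from the documented rules and log format, not the controller's actual code:

```go
package main

import "fmt"

// CapacityScalingConfig mirrors the documented fields (sketch only).
type CapacityScalingConfig struct {
	KvCacheThreshold     float64
	QueueLengthThreshold int
	KvSpareTrigger       float64
	QueueSpareTrigger    int
}

// Validate checks each documented rule and returns the first violation.
func (c CapacityScalingConfig) Validate() error {
	if c.KvCacheThreshold < 0 || c.KvCacheThreshold > 1 {
		return fmt.Errorf("kvCacheThreshold must be between 0 and 1, got %.2f", c.KvCacheThreshold)
	}
	if c.QueueLengthThreshold < 0 {
		return fmt.Errorf("queueLengthThreshold must be >= 0, got %d", c.QueueLengthThreshold)
	}
	if c.KvSpareTrigger < 0 || c.KvSpareTrigger > 1 {
		return fmt.Errorf("kvSpareTrigger must be between 0 and 1, got %.2f", c.KvSpareTrigger)
	}
	if c.QueueSpareTrigger < 0 {
		return fmt.Errorf("queueSpareTrigger must be >= 0, got %d", c.QueueSpareTrigger)
	}
	if c.KvCacheThreshold < c.KvSpareTrigger {
		return fmt.Errorf("kvCacheThreshold (%.2f) must be >= kvSpareTrigger (%.2f)", c.KvCacheThreshold, c.KvSpareTrigger)
	}
	return nil
}

func main() {
	bad := CapacityScalingConfig{KvCacheThreshold: 1.5, QueueLengthThreshold: 5, KvSpareTrigger: 0.1, QueueSpareTrigger: 3}
	fmt.Println(bad.Validate()) // rejected: kvCacheThreshold out of range
}
```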
Invalid entry (logged and skipped):
```yaml
invalid-config: |
  model_id: test/model
  namespace: test
  kvCacheThreshold: 1.5  # ERROR: Must be ≤ 1.0
```

Log output:
```
WARN Invalid saturation scaling config entry, skipping key=invalid-config error=kvCacheThreshold must be between 0 and 1, got 1.50
```
The controller uses an efficient caching mechanism with ConfigMap watch for optimal performance:
Initialization (on controller startup):
```go
// cmd/main.go
reconciler := &controller.VariantAutoscalingReconciler{...}
reconciler.SetupWithManager(mgr) // Sets up ConfigMap watch

// Initialize cache on startup
if err := reconciler.InitializeCapacityConfigCache(context.Background()); err != nil {
    setupLog.Warn("Failed to load initial saturation scaling config, will use defaults")
}
```

Reconciliation (zero API calls):
```go
// In the Reconcile loop - uses cached config (fast, no API call)
capacityConfigs := r.getCapacityConfigFromCache()

// For a specific VariantAutoscaling resource
capacityConfig := r.getCapacityScalingConfigForVariant(
    capacityConfigs,
    va.Spec.ModelID,
    va.Namespace,
)

// Use capacityConfig for saturation-based scaling decisions
if currentKvUtil >= capacityConfig.KvCacheThreshold {
    // Apply saturation scaling logic
}
```

The controller watches the `capacity-scaling-config` ConfigMap for changes:
- ConfigMap change detected → Watch event triggered
- Cache automatically reloaded → New configuration loaded
- All VariantAutoscaling resources reconciled → New config applied immediately
Log output on ConfigMap change:
```
INFO Saturation scaling ConfigMap changed, reloading cache
INFO Saturation scaling config cache updated entries=3 has_default=true
INFO Triggering reconciliation for all VariantAutoscaling resources due to ConfigMap change count=5
```
| Operation | Before (Without Cache) | After (With Cache) |
|---|---|---|
| Startup | N/A | Single ConfigMap read |
| Per Reconciliation | ConfigMap API call | Memory read only |
| Config Change | Manual pod restart needed | Automatic reload + reconcile |
| Latency Impact | Network round-trip per reconcile | Zero (memory access) |
| Concurrency | Serial API calls | Thread-safe concurrent reads |
Cache benefits:
- ✅ Single read on startup instead of per-reconciliation
- ✅ Zero API calls during reconciliation (cached access)
- ✅ Event-driven updates (immediate response to changes)
- ✅ Thread-safe concurrent access (RWMutex)
- ✅ Defensive copying prevents external modification
Symptom: Warning log message
```
WARN Saturation scaling ConfigMap not found, using hardcoded defaults configmap=capacity-scaling-config namespace=<workload-variant-autoscaler-namespace>
```
Solution: Deploy the ConfigMap:
```shell
kubectl apply -f deploy/configmap-capacity-scaling.yaml
```

Symptom: Warning log message
```
WARN Invalid saturation scaling config entry, skipping key=my-config error=...
```
Solution: Fix the validation error in the ConfigMap entry and reapply.
Symptom: Warning log message
```
WARN No 'default' entry in saturation scaling ConfigMap, using hardcoded defaults
```
Solution: Add a default entry to the ConfigMap:
```yaml
data:
  default: |
    kvCacheThreshold: 0.80
    queueLengthThreshold: 5
    kvSpareTrigger: 0.1
    queueSpareTrigger: 3
```

Symptom: Model-specific override is not being used
Checklist:
- Verify `model_id` exactly matches `va.Spec.ModelID`
- Verify `namespace` exactly matches the VariantAutoscaling resource namespace
- Check controller logs for validation errors
- Ensure entry passed validation (check for WARN logs)
Debug log (when override is applied):
```
DEBUG Applied saturation scaling override key=my-override modelID=ibm/granite-13b namespace=production config={...}
```
Symptom: Updated ConfigMap but controller still uses old values
Solution: The controller watches for ConfigMap changes and automatically reloads. Check:
1. Verify the ConfigMap was updated:
   ```shell
   kubectl get cm capacity-scaling-config -n <workload-variant-autoscaler-namespace> -o yaml
   ```
2. Check controller logs for reload confirmation:
   ```shell
   kubectl logs -n <workload-variant-autoscaler-namespace> deployment/wva-controller | grep "Saturation scaling"
   ```
   Expected logs:
   ```
   INFO Saturation scaling ConfigMap changed, reloading cache
   INFO Saturation scaling config cache updated entries=2 has_default=true
   INFO Triggering reconciliation for all VariantAutoscaling resources
   ```
3. If no logs appear, verify the watch is working:
   - Check the controller pod is running: `kubectl get pods -n <workload-variant-autoscaler-namespace>`
   - Check for errors: `kubectl logs -n <workload-variant-autoscaler-namespace> deployment/wva-controller --tail=100`
4. Manual restart (last resort):
   ```shell
   kubectl rollout restart deployment/wva-controller -n <workload-variant-autoscaler-namespace>
   ```
Symptom: Warning on controller startup
```
WARN Failed to load initial saturation scaling config, will use defaults
```
Solution: This is non-fatal. The controller continues with hardcoded defaults. To fix:
1. Deploy the ConfigMap:
   ```shell
   kubectl apply -f deploy/configmap-capacity-scaling.yaml
   ```
2. The watch mechanism will automatically reload the cache once the ConfigMap is available
3. Verify the cache loaded:
   ```shell
   kubectl logs -n <workload-variant-autoscaler-namespace> deployment/wva-controller | grep "Saturation scaling configuration loaded"
   ```
`deploy/configmap-capacity-scaling.yaml`:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: capacity-scaling-config
  namespace: <workload-variant-autoscaler-namespace>
data:
  # Conservative defaults for most workloads
  default: |
    kvCacheThreshold: 0.80
    queueLengthThreshold: 5
    kvSpareTrigger: 0.1
    queueSpareTrigger: 3
  # High-priority production workload - scale aggressively
  granite-prod: |
    model_id: ibm/granite-13b
    namespace: production
    kvCacheThreshold: 0.70
    queueLengthThreshold: 3
    kvSpareTrigger: 0.20
    queueSpareTrigger: 5
  # Development workload - allow higher saturation
  llama-dev: |
    model_id: meta/llama-70b
    namespace: development
    kvCacheThreshold: 0.90
    queueLengthThreshold: 15
    kvSpareTrigger: 0.05
    queueSpareTrigger: 2
```

Apply the configuration:

```shell
kubectl apply -f deploy/configmap-capacity-scaling.yaml
```

Verify deployment:

```shell
kubectl get cm capacity-scaling-config -n <workload-variant-autoscaler-namespace>
kubectl describe cm capacity-scaling-config -n <workload-variant-autoscaler-namespace>
```

`CapacityScalingConfig`:
```go
type CapacityScalingConfig struct {
    ModelID              string  `yaml:"model_id,omitempty"`
    Namespace            string  `yaml:"namespace,omitempty"`
    KvCacheThreshold     float64 `yaml:"kvCacheThreshold"`
    QueueLengthThreshold int     `yaml:"queueLengthThreshold"`
    KvSpareTrigger       float64 `yaml:"kvSpareTrigger"`
    QueueSpareTrigger    int     `yaml:"queueSpareTrigger"`
}
```

Methods:

- `DefaultCapacityScalingConfig() CapacityScalingConfig` - returns hardcoded defaults
- `Validate() error` - validates configuration values
- `Merge(override CapacityScalingConfig)` - applies a partial override
The caching mechanism uses the following components:
Thread Safety:

- Uses `sync.RWMutex` for concurrent access control
- Multiple reconciliation loops can read the cache simultaneously
- Write operations (cache reload) are exclusive

Defensive Copy:

- `getCapacityConfigFromCache()` returns a deep copy
- Prevents external code from modifying cached configuration
- Each caller gets an independent copy
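The thread-safety and defensive-copy pattern can be illustrated with a small Go sketch. The names are invented for the example; the controller's actual cache holds the full configuration set keyed by ConfigMap entry:

```go
package main

import (
	"fmt"
	"sync"
)

// CapacityScalingConfig is trimmed to one field for brevity; since all real
// fields are value types, copying map entries yields an effective deep copy.
type CapacityScalingConfig struct {
	KvCacheThreshold float64
}

// configCache guards its entries with an RWMutex: many concurrent readers,
// exclusive writers.
type configCache struct {
	mu      sync.RWMutex
	entries map[string]CapacityScalingConfig
}

// get returns a defensive copy so callers cannot mutate the cached state.
func (c *configCache) get() map[string]CapacityScalingConfig {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make(map[string]CapacityScalingConfig, len(c.entries))
	for k, v := range c.entries {
		out[k] = v
	}
	return out
}

// reload replaces the cache contents under the write lock, e.g. when a
// ConfigMap watch event fires.
func (c *configCache) reload(entries map[string]CapacityScalingConfig) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries = entries
}

func main() {
	cache := &configCache{}
	cache.reload(map[string]CapacityScalingConfig{"default": {KvCacheThreshold: 0.80}})
	snapshot := cache.get()
	snapshot["default"] = CapacityScalingConfig{KvCacheThreshold: 0.99} // mutates the copy only
	fmt.Println(cache.get()["default"].KvCacheThreshold)               // still the original 0.80
}
```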
Watch Mechanism:
- Kubernetes watch on the `capacity-scaling-config` ConfigMap
- Predicate filters to only relevant ConfigMap events
- Event handler reloads cache and triggers reconciliation
Graceful Degradation:
- Controller starts successfully even if ConfigMap missing
- Uses hardcoded defaults as fallback
- Automatically loads config once ConfigMap becomes available