The Workload Variant Autoscaler supports saturation-based scaling using KV cache utilization and queue length metrics. This feature is enabled by default and configured via a ConfigMap.
Key features:
- ✅ ConfigMap-based configuration with global defaults and per-model overrides
- ✅ Efficient caching with single read on startup (zero API calls during reconciliation)
- ✅ Automatic reload via ConfigMap watch (immediate response to changes)
- ✅ Thread-safe concurrent access with RWMutex
- ✅ Graceful degradation to defaults if ConfigMap missing
The saturation scaling configuration is stored in a ConfigMap named `capacity-scaling-config` in the Workload Variant Autoscaler controller's namespace.
Location: `deploy/configmap-capacity-scaling.yaml`
| Parameter | Type | Description | Default |
|---|---|---|---|
| `kvCacheThreshold` | float64 | Replica is considered saturated if KV cache utilization ≥ threshold (0.0-1.0) | 0.80 |
| `queueLengthThreshold` | int | Replica is considered saturated if queue length ≥ threshold | 5 |
| `kvSpareTrigger` | float64 | Scale-up signal if average spare KV capacity < trigger (0.0-1.0) | 0.10 |
| `queueSpareTrigger` | int | Scale-up signal if average spare queue capacity < trigger | 3 |
The default configuration is automatically used if:
- The ConfigMap is not deployed
- The ConfigMap exists but has no `default` entry
- An entry fails validation
Default values:
```yaml
kvCacheThreshold: 0.80
queueLengthThreshold: 5
kvSpareTrigger: 0.1
queueSpareTrigger: 3
```

The saturation analyzer uses a spare-capacity model to determine when to scale up. Instead of waiting for replicas to become fully saturated, WVA proactively scales when the average spare capacity across non-saturated replicas falls below the configured triggers.
Scale-up logic:

- Calculate spare capacity for each non-saturated replica:
  - Spare KV capacity = `kvCacheThreshold - current_kv_usage`
  - Spare queue capacity = `queueLengthThreshold - current_queue_length`
- Average across non-saturated replicas:
  - WVA computes the average spare capacity across all healthy (non-saturated) replicas
- Trigger scale-up when spare capacity is low:
  - If `avg_spare_kv < kvSpareTrigger` OR `avg_spare_queue < queueSpareTrigger`
  - Scale-up is triggered to add capacity before existing replicas saturate
- Cascade scaling prevention:
  - Variants with pending replicas (pods that exist but aren't ready yet) are skipped during scale-up
  - This prevents repeatedly scaling the same variant while previous scale-up operations complete
  - Pod startup can take 2-7 minutes (model loading, health checks)
Example scenario:
With `kvCacheThreshold = 0.80` and `kvSpareTrigger = 0.10`:

- Replica A: 65% KV cache usage → spare capacity: 0.15
- Replica B: 72% KV cache usage → spare capacity: 0.08
- Average spare KV: (0.15 + 0.08) / 2 = 0.115
- Since 0.115 > 0.10, no scale-up yet
- If Replica B increases to 76%: average spare = (0.15 + 0.04) / 2 = 0.095 < 0.10 → scale-up triggered
This proactive approach ensures adequate headroom and prevents request drops by scaling before saturation occurs.
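The scale-up decision described above can be sketched in Go. This is a minimal illustration of the documented spare-capacity logic, not the actual WVA implementation; the type and function names are invented for the example:

```go
package main

import "fmt"

// capacityConfig mirrors the documented parameters (names illustrative).
type capacityConfig struct {
	KvCacheThreshold     float64
	QueueLengthThreshold int
	KvSpareTrigger       float64
	QueueSpareTrigger    int
}

// replicaMetrics holds the per-replica signals the analyzer consumes.
type replicaMetrics struct {
	KvUsage     float64 // current KV cache utilization (0.0-1.0)
	QueueLength int     // current request queue length
}

// shouldScaleUp averages spare capacity across non-saturated replicas and
// signals scale-up when either average falls below its trigger.
func shouldScaleUp(cfg capacityConfig, replicas []replicaMetrics) bool {
	var spareKv, spareQueue float64
	n := 0
	for _, r := range replicas {
		// Saturated replicas are excluded from the spare-capacity average.
		if r.KvUsage >= cfg.KvCacheThreshold || r.QueueLength >= cfg.QueueLengthThreshold {
			continue
		}
		spareKv += cfg.KvCacheThreshold - r.KvUsage
		spareQueue += float64(cfg.QueueLengthThreshold - r.QueueLength)
		n++
	}
	if n == 0 {
		return true // every replica is saturated: scale up
	}
	return spareKv/float64(n) < cfg.KvSpareTrigger ||
		spareQueue/float64(n) < float64(cfg.QueueSpareTrigger)
}

func main() {
	cfg := capacityConfig{KvCacheThreshold: 0.80, QueueLengthThreshold: 5, KvSpareTrigger: 0.10, QueueSpareTrigger: 3}
	// Replicas A (65%) and B (72%): average spare KV = 0.115, no scale-up.
	fmt.Println(shouldScaleUp(cfg, []replicaMetrics{{0.65, 0}, {0.72, 0}})) // false
	// B rises to 76%: average spare KV = 0.095 < 0.10, scale-up.
	fmt.Println(shouldScaleUp(cfg, []replicaMetrics{{0.65, 0}, {0.76, 0}})) // true
}
```

Note that the example uses a strict `<` comparison against the triggers, matching the parameter descriptions above.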
For detailed implementation, see: Saturation Analyzer Documentation
The End Point Picker (EPP) is an intelligent request routing component in the InferenceScheduler that selects the optimal inference server replica to handle each incoming request. EPP monitors replica capacity metrics (KV cache utilization, queue depth) along with other replica metrics, and uses scoring algorithms to route each request to a suitable replica.
EPP Deployment Model: Each model has a 1-to-1 relationship with its EPP instance. Every model served by the inference infrastructure has a dedicated EPP component that routes requests specifically to that model's replicas.
Example deployment pattern:
- Model `Qwen/Qwen3-0.6B` in namespace `llm-d-autoscaler` → dedicated EPP instance `gaie-workload-autoscaler-epp`
- Model `ibm/granite-13b` in namespace `production` → dedicated EPP instance `gaie-production-epp`
- Each model deployment has its own EPP instance (naming follows namespace/workload convention)
This 1-to-1 architecture means that saturation detection and request routing decisions are model-specific, with each EPP instance monitoring only its associated model's replicas.
For optimal cluster performance, we strongly recommend using the same threshold values for both WVA (Workload Variant Autoscaler) and InferenceScheduler (End Point Picker) for each model deployment.
Using aligned thresholds ensures consistent capacity management across the cluster and prevents request drop situations.
Why threshold alignment matters:
-
Reduced Request Drop Rates: When WVA and EPP use the same saturation thresholds, the scheduler will avoid routing requests to replicas that WVA already considers saturated. This prevents the scheduler from overloading replicas that are about to trigger scale-up.
-
Consistent Capacity Assessment: Both components evaluate replica capacity using the same criteria (KV cache utilization and queue length), ensuring coordinated behavior across the entire inference stack.
-
Improved GPU Utilization: Aligned thresholds allow the cluster to maintain optimal GPU utilization without oversaturation. The scheduler respects the same capacity boundaries that drive autoscaling decisions.
-
Faster Response to Load Changes: When both components agree on saturation thresholds, the system responds more quickly to load changes with coordinated routing and scaling actions.
```yaml
# WVA Configuration (capacity-scaling-config ConfigMap)
apiVersion: v1
kind: ConfigMap
metadata:
  name: capacity-scaling-config
  namespace: <workload-variant-autoscaler-namespace>
data:
  default: |
    kvCacheThreshold: 0.80      # Should match EPP kvCacheUtilThreshold
    queueLengthThreshold: 5     # Should match EPP queueDepthThreshold
    kvSpareTrigger: 0.10        # WVA-specific (scale-up trigger)
    queueSpareTrigger: 3        # WVA-specific (scale-up trigger)
```

The InferenceScheduler EPP component uses the gateway-api-inference-extension saturation detector to identify cluster overload.
Per-Model Configuration: Since each model has its own dedicated EPP instance, saturation detection is configured per model deployment. This allows different models to have different saturation thresholds based on their specific characteristics and SLO requirements.
```yaml
# EPP Saturation Detector Configuration (per-model EPP instance)
saturationDetector:
  ...
  queueDepthThreshold: 5     # Default: 5 - Backend waiting queue size threshold
  kvCacheUtilThreshold: 0.8  # Default: 0.8 - KV cache utilization threshold (0.0-1.0)
  ...
```

Configuration Notes:
- All parameters are optional; omitting them applies the documented defaults
- EPP configuration is read only at startup; changes require an EPP pod restart
- Unlike WVA, EPP does not currently support live ConfigMap updates
- Each EPP instance (one per model) can have different threshold values
| Concept | WVA Field | EPP Field | Aligned Default | Description |
|---|---|---|---|---|
| KV Cache Saturation | `kvCacheThreshold` | `kvCacheUtilThreshold` | 0.80 (80%) | Replica is saturated when KV cache ≥ threshold |
| Queue Saturation | `queueLengthThreshold` | `queueDepthThreshold` | 5 | Replica is saturated when queue length ≥ threshold |
| Scale-Up Trigger (KV) | `kvSpareTrigger` | (not applicable) | 0.10 (10%) | WVA-only: trigger scale-up when spare KV < threshold |
| Scale-Up Trigger (Queue) | `queueSpareTrigger` | (not applicable) | 3 | WVA-only: trigger scale-up when spare queue < threshold |
Choose thresholds based on your workload characteristics and SLO requirements:
| Workload Type | kvCacheThreshold | queueLengthThreshold | Rationale |
|---|---|---|---|
| Conservative (Default) | 0.80 | 5 | Balanced performance and utilization |
| Aggressive (High GPU utilization) | 0.90 | 15 | Maximize GPU usage, higher latency variance |
| Strict (Low latency SLO) | 0.70 | 3 | Prioritize responsiveness, lower utilization |
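For example, the strict low-latency profile from the table above could be expressed in the ConfigMap's `default` entry as follows. The spare triggers are not part of the table; the values shown are illustrative starting points, not recommendations:

```yaml
data:
  default: |
    kvCacheThreshold: 0.70      # Strict profile: mark replicas saturated earlier
    queueLengthThreshold: 3     # Strict profile: keep queues short
    kvSpareTrigger: 0.10        # Illustrative; tune per workload
    queueSpareTrigger: 3        # Illustrative; tune per workload
```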
Update the `capacity-scaling-config` ConfigMap:

```shell
kubectl edit cm capacity-scaling-config -n <workload-variant-autoscaler-namespace>
```

Changes take effect immediately (WVA watches the ConfigMap and auto-reloads).
Important: Since each model has its own dedicated EPP instance (1-to-1 relationship), you must configure the EPP instance for each specific model deployment separately.
Current approach:

1. Identify the EPP instance for your target model:
   ```shell
   # Example: find the EPP deployment for a specific model's namespace
   kubectl get deployments -n llm-d-autoscaler | grep epp
   ```
2. Update that EPP instance's environment variables or configuration file for the specific model
3. Restart the EPP pod for that model:
   ```shell
   # Restart the specific model's EPP instance
   kubectl rollout restart deployment/gaie-<model-name>-epp -n <namespace>
   ```
Example for multiple models:

```shell
# Model 1: granite-13b in production
kubectl rollout restart deployment/gaie-granite-13b-epp -n production

# Model 2: llama-70b in lab
kubectl rollout restart deployment/gaie-llama-70b-epp -n lab
```

WVA verification:

```shell
kubectl get cm capacity-scaling-config -n <workload-variant-autoscaler-namespace> -o yaml
```

EPP verification (per-model instance):

```shell
# Check a specific model's EPP pod logs for the loaded configuration
kubectl logs -n <namespace> deployment/gaie-<model-name>-epp | grep -i "saturation\|threshold"

# Example: verify EPP configuration for the granite-13b model in production
kubectl logs -n production deployment/gaie-granite-13b-epp | grep -i "saturation\|threshold"
```
- Core Thresholds Must Match Per Model:
  - `kvCacheThreshold` (WVA) = `kvCacheUtilThreshold` (EPP)
  - `queueLengthThreshold` (WVA) = `queueDepthThreshold` (EPP)
  - Important: since each model has its own EPP instance, ensure thresholds align for each model deployment individually
- Per-Model Configuration Strategy:
  - Use WVA's per-model override feature to set model-specific thresholds
  - Configure the corresponding EPP instance with matching thresholds
  - Document the threshold mapping for each model deployment
  - Example: if `ibm/granite-13b` uses `kvCacheThreshold: 0.85` in WVA, its dedicated EPP must use `kvCacheUtilThreshold: 0.85`
- WVA-Specific Parameters (`kvSpareTrigger`, `queueSpareTrigger`):
  - These control WVA's scale-up aggressiveness
  - Should be set lower than the saturation thresholds
  - Provide headroom before replicas become saturated
  - Recommended: `kvSpareTrigger = kvCacheThreshold - 0.1 to 0.2`
- Testing Threshold Changes:
  - Test in a development environment first
  - Monitor impact on request drop rate and latency for the specific model
  - Adjust based on observed behavior
  - Remember to update both WVA and the model's EPP instance
Simply deploy the controller without the ConfigMap. The system will log a warning and use hardcoded defaults:
```
WARN Saturation scaling ConfigMap not found, using hardcoded defaults
```
Edit `deploy/configmap-capacity-scaling.yaml`:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: capacity-scaling-config
  namespace: <workload-variant-autoscaler-namespace>
data:
  default: |
    kvCacheThreshold: 0.75
    queueLengthThreshold: 10
    kvSpareTrigger: 0.15
    queueSpareTrigger: 5
```

Apply the ConfigMap:

```shell
kubectl apply -f deploy/configmap-capacity-scaling.yaml
```

Note: Changes take effect immediately! The controller watches the ConfigMap and automatically:
- Reloads the cache when changes are detected
- Triggers reconciliation of all VariantAutoscaling resources
- Applies the new configuration without requiring pod restart
Add model-specific configuration entries to override defaults for specific model/namespace pairs:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: capacity-scaling-config
  namespace: <workload-variant-autoscaler-namespace>
data:
  default: |
    kvCacheThreshold: 0.80
    queueLengthThreshold: 5
    kvSpareTrigger: 0.1
    queueSpareTrigger: 3
  # Override for granite model in production namespace
  granite-13b-production: |
    model_id: ibm/granite-13b
    namespace: production
    kvCacheThreshold: 0.85
    kvSpareTrigger: 0.15
  # Override for llama model in lab namespace
  llama-70b-lab: |
    model_id: meta/llama-70b
    namespace: lab
    queueLengthThreshold: 20
    queueSpareTrigger: 10
```

Key points:
- Entry keys (e.g., `granite-13b-production`) can be any descriptive name
- Each override must include `model_id` and `namespace` fields
- Only specified fields are overridden; others inherit from `default`
- Multiple overrides can exist for different model/namespace combinations
You can override only specific parameters while inheriting the rest from defaults:
```yaml
my-model-override: |
  model_id: my-org/my-model
  namespace: my-namespace
  kvCacheThreshold: 0.90
  # Other fields inherit from default
```

The controller validates all configuration entries on load. Invalid entries are logged and skipped:
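One plausible sketch of the partial-override semantics in Go. This is an assumption, not the actual implementation: the version below treats zero-valued fields as "unset", so a field cannot be explicitly overridden to 0; the real controller may track presence differently (for example, by unmarshalling the override YAML into a copy of the defaults):

```go
package main

import "fmt"

// CapacityScalingConfig mirrors the documented fields; this sketch omits
// YAML tags for brevity.
type CapacityScalingConfig struct {
	ModelID              string
	Namespace            string
	KvCacheThreshold     float64
	QueueLengthThreshold int
	KvSpareTrigger       float64
	QueueSpareTrigger    int
}

// Merge returns a copy of base with any non-zero fields from override applied;
// fields left at their zero value in the override inherit from base.
func (base CapacityScalingConfig) Merge(override CapacityScalingConfig) CapacityScalingConfig {
	out := base
	if override.KvCacheThreshold != 0 {
		out.KvCacheThreshold = override.KvCacheThreshold
	}
	if override.QueueLengthThreshold != 0 {
		out.QueueLengthThreshold = override.QueueLengthThreshold
	}
	if override.KvSpareTrigger != 0 {
		out.KvSpareTrigger = override.KvSpareTrigger
	}
	if override.QueueSpareTrigger != 0 {
		out.QueueSpareTrigger = override.QueueSpareTrigger
	}
	return out
}

func main() {
	def := CapacityScalingConfig{KvCacheThreshold: 0.80, QueueLengthThreshold: 5, KvSpareTrigger: 0.10, QueueSpareTrigger: 3}
	override := CapacityScalingConfig{KvCacheThreshold: 0.90} // only one field set
	merged := def.Merge(override)
	fmt.Printf("%+v\n", merged) // queue settings inherited from the default
}
```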
- `KvCacheThreshold`: must be between 0.0 and 1.0
- `QueueLengthThreshold`: must be ≥ 0
- `KvSpareTrigger`: must be between 0.0 and 1.0
- `QueueSpareTrigger`: must be ≥ 0
- Consistency: `kvCacheThreshold` must be ≥ `kvSpareTrigger`
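The validation rules above can be sketched as a `Validate` method. This is a hedged reconstruction from the documented rules and log format, not the controller's actual code:

```go
package main

import "fmt"

// CapacityScalingConfig mirrors the documented fields (sketch only).
type CapacityScalingConfig struct {
	KvCacheThreshold     float64
	QueueLengthThreshold int
	KvSpareTrigger       float64
	QueueSpareTrigger    int
}

// Validate checks each documented rule and returns the first violation.
func (c CapacityScalingConfig) Validate() error {
	if c.KvCacheThreshold < 0 || c.KvCacheThreshold > 1 {
		return fmt.Errorf("kvCacheThreshold must be between 0 and 1, got %.2f", c.KvCacheThreshold)
	}
	if c.QueueLengthThreshold < 0 {
		return fmt.Errorf("queueLengthThreshold must be >= 0, got %d", c.QueueLengthThreshold)
	}
	if c.KvSpareTrigger < 0 || c.KvSpareTrigger > 1 {
		return fmt.Errorf("kvSpareTrigger must be between 0 and 1, got %.2f", c.KvSpareTrigger)
	}
	if c.QueueSpareTrigger < 0 {
		return fmt.Errorf("queueSpareTrigger must be >= 0, got %d", c.QueueSpareTrigger)
	}
	if c.KvCacheThreshold < c.KvSpareTrigger {
		return fmt.Errorf("kvCacheThreshold (%.2f) must be >= kvSpareTrigger (%.2f)", c.KvCacheThreshold, c.KvSpareTrigger)
	}
	return nil
}

func main() {
	bad := CapacityScalingConfig{KvCacheThreshold: 1.5, QueueLengthThreshold: 5, KvSpareTrigger: 0.1, QueueSpareTrigger: 3}
	fmt.Println(bad.Validate()) // rejected: kvCacheThreshold out of range
}
```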
Invalid entry (logged and skipped):
```yaml
invalid-config: |
  model_id: test/model
  namespace: test
  kvCacheThreshold: 1.5  # ERROR: Must be ≤ 1.0
```

Log output:
```
WARN Invalid saturation scaling config entry, skipping key=invalid-config error=kvCacheThreshold must be between 0 and 1, got 1.50
```
The controller uses an efficient caching mechanism with ConfigMap watch for optimal performance:
Initialization (on controller startup):
```go
// cmd/main.go
reconciler := &controller.VariantAutoscalingReconciler{...}
reconciler.SetupWithManager(mgr) // Sets up ConfigMap watch

// Initialize cache on startup
if err := reconciler.InitializeCapacityConfigCache(context.Background()); err != nil {
    setupLog.Warn("Failed to load initial saturation scaling config, will use defaults")
}
```

Reconciliation (zero API calls):
```go
// In the Reconcile loop - uses cached config (fast, no API call)
capacityConfigs := r.getCapacityConfigFromCache()

// For a specific VariantAutoscaling resource
capacityConfig := r.getCapacityScalingConfigForVariant(
    capacityConfigs,
    va.Spec.ModelID,
    va.Namespace,
)

// Use capacityConfig for saturation-based scaling decisions
if currentKvUtil >= capacityConfig.KvCacheThreshold {
    // Apply saturation scaling logic
}
```

The controller watches the `capacity-scaling-config` ConfigMap for changes:
- ConfigMap change detected → Watch event triggered
- Cache automatically reloaded → New configuration loaded
- All VariantAutoscaling resources reconciled → New config applied immediately
Log output on ConfigMap change:
```
INFO Saturation scaling ConfigMap changed, reloading cache
INFO Saturation scaling config cache updated entries=3 has_default=true
INFO Triggering reconciliation for all VariantAutoscaling resources due to ConfigMap change count=5
```
| Operation | Before (Without Cache) | After (With Cache) |
|---|---|---|
| Startup | N/A | Single ConfigMap read |
| Per Reconciliation | ConfigMap API call | Memory read only |
| Config Change | Manual pod restart needed | Automatic reload + reconcile |
| Latency Impact | Network round-trip per reconcile | Zero (memory access) |
| Concurrency | Serial API calls | Thread-safe concurrent reads |
Cache benefits:
- ✅ Single read on startup instead of per-reconciliation
- ✅ Zero API calls during reconciliation (cached access)
- ✅ Event-driven updates (immediate response to changes)
- ✅ Thread-safe concurrent access (RWMutex)
- ✅ Defensive copying prevents external modification
Symptom: Warning log message
```
WARN Saturation scaling ConfigMap not found, using hardcoded defaults configmap=capacity-scaling-config namespace=<workload-variant-autoscaler-namespace>
```
Solution: Deploy the ConfigMap:
```shell
kubectl apply -f deploy/configmap-capacity-scaling.yaml
```

Symptom: Warning log message
```
WARN Invalid saturation scaling config entry, skipping key=my-config error=...
```
Solution: Fix the validation error in the ConfigMap entry and reapply.
Symptom: Warning log message
```
WARN No 'default' entry in saturation scaling ConfigMap, using hardcoded defaults
```
Solution: Add a default entry to the ConfigMap:
```yaml
data:
  default: |
    kvCacheThreshold: 0.80
    queueLengthThreshold: 5
    kvSpareTrigger: 0.1
    queueSpareTrigger: 3
```

Symptom: Model-specific override is not being used
Checklist:
- Verify `model_id` exactly matches `va.Spec.ModelID`
- Verify `namespace` exactly matches the VariantAutoscaling resource namespace
- Check controller logs for validation errors
- Ensure entry passed validation (check for WARN logs)
Debug log (when override is applied):
```
DEBUG Applied saturation scaling override key=my-override modelID=ibm/granite-13b namespace=production config={...}
```
Symptom: Updated ConfigMap but controller still uses old values
Solution: The controller watches for ConfigMap changes and automatically reloads. Check:
1. Verify the ConfigMap was updated:
   ```shell
   kubectl get cm capacity-scaling-config -n <workload-variant-autoscaler-namespace> -o yaml
   ```
2. Check controller logs for reload confirmation:
   ```shell
   kubectl logs -n <workload-variant-autoscaler-namespace> deployment/wva-controller | grep "Saturation scaling"
   ```
   Expected logs:
   ```
   INFO Saturation scaling ConfigMap changed, reloading cache
   INFO Saturation scaling config cache updated entries=2 has_default=true
   INFO Triggering reconciliation for all VariantAutoscaling resources
   ```
3. If no logs appear, verify the watch is working:
   - Check the controller pod is running: `kubectl get pods -n <workload-variant-autoscaler-namespace>`
   - Check for errors: `kubectl logs -n <workload-variant-autoscaler-namespace> deployment/wva-controller --tail=100`
4. Manual restart (last resort):
   ```shell
   kubectl rollout restart deployment/wva-controller -n <workload-variant-autoscaler-namespace>
   ```
Symptom: Warning on controller startup
```
WARN Failed to load initial saturation scaling config, will use defaults
```
Solution: This is non-fatal. The controller continues with hardcoded defaults. To fix:
1. Deploy the ConfigMap:
   ```shell
   kubectl apply -f deploy/configmap-capacity-scaling.yaml
   ```
2. The watch mechanism will automatically reload the cache once the ConfigMap is available
3. Verify the cache loaded:
   ```shell
   kubectl logs -n <workload-variant-autoscaler-namespace> deployment/wva-controller | grep "Saturation scaling configuration loaded"
   ```
`deploy/configmap-capacity-scaling.yaml`:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: capacity-scaling-config
  namespace: <workload-variant-autoscaler-namespace>
data:
  # Conservative defaults for most workloads
  default: |
    kvCacheThreshold: 0.80
    queueLengthThreshold: 5
    kvSpareTrigger: 0.1
    queueSpareTrigger: 3
  # High-priority production workload - scale aggressively
  granite-prod: |
    model_id: ibm/granite-13b
    namespace: production
    kvCacheThreshold: 0.70
    queueLengthThreshold: 3
    kvSpareTrigger: 0.20
    queueSpareTrigger: 5
  # Development workload - allow higher saturation
  llama-dev: |
    model_id: meta/llama-70b
    namespace: development
    kvCacheThreshold: 0.90
    queueLengthThreshold: 15
    kvSpareTrigger: 0.05
    queueSpareTrigger: 2
```

Apply the configuration:

```shell
kubectl apply -f deploy/configmap-capacity-scaling.yaml
```

Verify deployment:

```shell
kubectl get cm capacity-scaling-config -n <workload-variant-autoscaler-namespace>
kubectl describe cm capacity-scaling-config -n <workload-variant-autoscaler-namespace>
```

`CapacityScalingConfig`:
```go
type CapacityScalingConfig struct {
    ModelID              string  `yaml:"model_id,omitempty"`
    Namespace            string  `yaml:"namespace,omitempty"`
    KvCacheThreshold     float64 `yaml:"kvCacheThreshold"`
    QueueLengthThreshold int     `yaml:"queueLengthThreshold"`
    KvSpareTrigger       float64 `yaml:"kvSpareTrigger"`
    QueueSpareTrigger    int     `yaml:"queueSpareTrigger"`
}
```

Methods:

- `DefaultCapacityScalingConfig() CapacityScalingConfig` - returns hardcoded defaults
- `Validate() error` - validates configuration values
- `Merge(override CapacityScalingConfig)` - applies a partial override
The caching mechanism uses the following components:
Thread Safety:

- Uses `sync.RWMutex` for concurrent access control
- Multiple reconciliation loops can read the cache simultaneously
- Write operations (cache reload) are exclusive

Defensive Copy:

- `getCapacityConfigFromCache()` returns a deep copy
- Prevents external code from modifying cached configuration
- Each caller gets an independent copy
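The thread-safety and defensive-copy pattern can be illustrated with a small Go sketch. The names are invented for the example; the controller's actual cache holds the full configuration set keyed by ConfigMap entry:

```go
package main

import (
	"fmt"
	"sync"
)

// CapacityScalingConfig is trimmed to one field for brevity; since all real
// fields are value types, copying map entries yields an effective deep copy.
type CapacityScalingConfig struct {
	KvCacheThreshold float64
}

// configCache guards its entries with an RWMutex: many concurrent readers,
// exclusive writers.
type configCache struct {
	mu      sync.RWMutex
	entries map[string]CapacityScalingConfig
}

// get returns a defensive copy so callers cannot mutate the cached state.
func (c *configCache) get() map[string]CapacityScalingConfig {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make(map[string]CapacityScalingConfig, len(c.entries))
	for k, v := range c.entries {
		out[k] = v
	}
	return out
}

// reload replaces the cache contents under the write lock, e.g. when a
// ConfigMap watch event fires.
func (c *configCache) reload(entries map[string]CapacityScalingConfig) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries = entries
}

func main() {
	cache := &configCache{}
	cache.reload(map[string]CapacityScalingConfig{"default": {KvCacheThreshold: 0.80}})
	snapshot := cache.get()
	snapshot["default"] = CapacityScalingConfig{KvCacheThreshold: 0.99} // mutates the copy only
	fmt.Println(cache.get()["default"].KvCacheThreshold)               // still the original 0.80
}
```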
Watch Mechanism:
- Kubernetes watch on the `capacity-scaling-config` ConfigMap
- Predicate filters to only relevant ConfigMap events
- Event handler reloads cache and triggers reconciliation
Graceful Degradation:
- Controller starts successfully even if ConfigMap missing
- Uses hardcoded defaults as fallback
- Automatically loads config once ConfigMap becomes available