This skill helps configure and optimize the Workload Variant Autoscaler (WVA) for llm-d inference deployments. Rather than automating installation, it focuses on configuration guidance.
This skill is designed to help users:
- Configure WVA based on specific requirements (aggressive scaling, cost optimization, etc.)
- Tune saturation thresholds for optimal performance
- Set up multi-variant deployments with cost-aware scaling
- Troubleshoot WVA behavior and scaling decisions
- Align WVA with Inference Scheduler (EPP) for optimal cluster performance
It does not:
- ❌ Create new automation scripts (it uses existing ones from the llm-d repos)
- ❌ Duplicate installation documentation (it links to official guides)
- ❌ Provide generic configurations (it asks questions to understand your requirements)
This skill requires access to three repositories:
- **llm-d**: Main repository with installation guides
  - GitHub: https://github.com/llm-d/llm-d
  - Contains: `guides/workload-autoscaling/` with helmfiles and templates
- **llm-d-workload-variant-autoscaler**: WVA controller repository
  - GitHub: https://github.com/llm-d/llm-d-workload-variant-autoscaler
  - Contains: documentation, configuration samples, CRDs
- **llm-d-benchmark**: Testing and benchmarking tools
  - GitHub: https://github.com/llm-d/llm-d-benchmark
  - Contains: testing scripts, workload profiles
When you first use this skill, it will:
- Check for repositories in common locations
- Ask you to specify custom locations if not found
- Offer to clone missing repositories to a location of your choice
The skill uses environment variables to reference repository paths:
- `LLMD_REPO_PATH`: Path to the llm-d repository
- `WVA_REPO_PATH`: Path to the llm-d-workload-variant-autoscaler repository
- `BENCHMARK_REPO_PATH`: Path to the llm-d-benchmark repository
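For example, the variables might be set like this (the checkout locations under `$HOME/repos` are assumptions; point them at wherever the repositories actually live):

```shell
# Hypothetical checkout locations -- adjust to your actual clone paths
export LLMD_REPO_PATH="$HOME/repos/llm-d"
export WVA_REPO_PATH="$HOME/repos/llm-d-workload-variant-autoscaler"
export BENCHMARK_REPO_PATH="$HOME/repos/llm-d-benchmark"
```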
```
skills/wva-autoscale-llm-d-workers/
├── SKILL.md            # Main skill with configuration guidance
├── Troubleshooting.md  # Quick troubleshooting reference
└── README.md           # This file
```
The skill provides guidance for common configuration patterns:
- **Aggressive scaling**:
  - Lower saturation thresholds (`kvCacheThreshold: 0.70`)
  - Higher spare triggers (`kvSpareTrigger: 0.15`)
  - Faster HPA stabilization (60s scale-up)
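A sketch of how these aggressive-scaling values might fit together; the placement of `kvCacheThreshold`/`kvSpareTrigger` in the WVA configuration is an assumption (check the WVA configuration samples for the exact layout), while the `behavior` block is standard `autoscaling/v2` HPA syntax:

```yaml
# Assumed WVA configuration fragment -- scale earlier and keep more headroom
kvCacheThreshold: 0.70
kvSpareTrigger: 0.15
---
# HPA behavior for faster scale-up (autoscaling/v2)
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60
```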
- **Cost-aware multi-variant scaling**:
  - Multiple VariantAutoscaling resources with different costs
  - WVA preferentially scales cheaper variants
  - Example: H100 (`variantCost: "80.0"`) vs A100 (`variantCost: "40.0"`)
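A hedged sketch of two VariantAutoscaling resources for the same model; the `apiVersion` and exact field layout are assumptions based on the fields mentioned above, so verify against the CRD samples in the WVA repository:

```yaml
# Hypothetical apiVersion/schema -- check the WVA CRD samples for the real one
apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
  name: llama-70b-h100
spec:
  variantCost: "80.0"   # more expensive variant
---
apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
  name: llama-70b-a100
spec:
  variantCost: "40.0"   # cheaper variant, scaled preferentially
```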
- **Conservative scaling**:
  - Higher saturation thresholds (`kvCacheThreshold: 0.85`)
  - Lower spare triggers (`kvSpareTrigger: 0.05`)
  - Longer HPA stabilization (300s scale-up, 600s scale-down)
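The longer stabilization windows translate directly to the standard `autoscaling/v2` HPA `behavior` block:

```yaml
# HPA behavior for conservative scaling -- longer windows prevent flapping
behavior:
  scaleUp:
    stabilizationWindowSeconds: 300
  scaleDown:
    stabilizationWindowSeconds: 600
```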
- **Scale-to-zero**:
  - Enable in WVA values: `scaleToZero: true`
  - Configure the retention period in the ConfigMap
  - Set HPA `minReplicas: 0`
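A minimal sketch of the two settings involved; the Helm values key placement is an assumption, and note that `minReplicas: 0` requires the `HPAScaleToZero` feature gate to be enabled on the cluster:

```yaml
# WVA Helm values (assumed key placement)
scaleToZero: true
---
# HPA spec fragment -- minReplicas: 0 needs the HPAScaleToZero feature gate
minReplicas: 0
maxReplicas: 4
```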
WVA (Workload Variant Autoscaler) provides intelligent autoscaling for LLM inference using:
- KV Cache Saturation: Memory pressure in the inference server
- Queue Depth: Request backlog monitoring
- Cost Optimization: Preferential scaling of cheaper variants
- Spare Capacity Model: Proactive scaling before saturation
A variant is a way of serving a model with a specific combination of:
- Hardware (H100, A100, L4 GPUs)
- Runtime configuration
- Serving approach (parallelism, LoRA adapters)
Multiple variants of the same model enable cost-aware scaling decisions.
WVA and the Inference Scheduler (Endpoint Picker, EPP) must use the same saturation thresholds for optimal performance. Misalignment causes:
- EPP marks a replica saturated → stops routing to it → WVA sees spare capacity → doesn't scale up
- Or: WVA scales up → EPP keeps routing to the old replicas → the new capacity goes unused
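In practice, alignment means the same number appears in both components' configuration. A hedged sketch, where the WVA key name comes from the patterns above and the EPP-side key name is purely an assumption:

```yaml
# WVA side
kvCacheThreshold: 0.80
---
# EPP / Inference Scheduler side -- must match the WVA value
# (key name below is an assumption; check the EPP configuration reference)
kvCacheSaturationThreshold: 0.80
```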
User: "Set up autoscaling for Qwen/Qwen3-32B on H100s"
Skill Response:
- Asks clarifying questions (scaling goals, platform, etc.)
- Provides VariantAutoscaling YAML
- Provides HPA configuration
- Explains monitoring commands
- Links to relevant documentation
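The HPA configuration the skill produces for a setup like this could look roughly as follows; the deployment name and especially the WVA-published metric name are assumptions for illustration:

```yaml
# Hypothetical HPA wired to a WVA-published external metric
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: qwen3-32b-h100
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: qwen3-32b-h100
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: External
      external:
        metric:
          name: wva_desired_replicas   # assumed metric name
        target:
          type: AverageValue
          averageValue: "1"
```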
User: "I have Llama-70B on H100 and A100, optimize for cost"
Skill Response:
- Creates two VariantAutoscaling resources with different costs
- Explains how WVA will prefer A100 (cheaper)
- Provides HPA configurations for both
- Shows monitoring commands to verify behavior
User: "WVA isn't scaling, replicas stay at 1"
Skill Response:
- Checks current saturation levels
- Reviews WVA controller logs
- Verifies HPA metrics
- Suggests threshold adjustments if needed
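Typical commands behind those checks might look like this; the controller namespace and deployment name are assumptions, so substitute the names used in your installation:

```
# Inspect WVA controller logs (namespace/deployment names are assumptions)
kubectl logs -n workload-variant-autoscaler deploy/workload-variant-autoscaler -f

# Check HPA status and current metric values
kubectl describe hpa -n <namespace>

# Inspect the VariantAutoscaling resource status
kubectl get variantautoscaling -n <namespace> -o yaml
```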
Use llm-d-benchmark to test WVA configurations:
```shell
# Deploy with WVA
cd ${BENCHMARK_REPO_PATH}
./setup/standup.sh -p <namespace> -m <model-id> -c inference-scheduling --wva

# Run workload
./run.sh -l guidellm -w chatbot_synthetic -p <namespace> -m <model-id> -c inference-scheduling

# Teardown
./setup/teardown.sh -p <namespace> -d -c inference-scheduling
```

- Start with defaults: Use default thresholds, tune based on observed behavior
- Align thresholds: Keep WVA and EPP synchronized
- Monitor first: Observe saturation patterns before aggressive tuning
- Stabilization windows: Use longer windows to prevent flapping
- Test with load: Validate with llm-d-benchmark under realistic load
- Cost accuracy: Set variantCost to reflect actual infrastructure costs
- Scale-to-zero: Only enable in dev/test environments, not in production
- **Focus shifted from installation to configuration**
  - Removed custom installation scripts
  - Links to official installation guides instead
  - Emphasizes configuration patterns and tuning
- **Portable repository references**
  - Uses environment variables instead of hardcoded paths
  - Detects and clones repositories as needed
  - Works across different users and machines
- **Configuration-centric approach**
  - Provides configuration patterns for common use cases
  - Explains the "why" behind configuration choices
  - Helps translate user requirements into WVA config
- **Simplified structure**
  - Removed scripts/ and templates/ directories
  - Uses existing scripts from llm-d repositories
  - Focuses on guidance, not automation
- Troubleshooting guidance (updated with dynamic paths)
- Focus on WVA for llm-d deployments
- Support for multi-variant cost optimization
- Emphasis on threshold alignment with EPP
When improving this skill:
- Keep focus on configuration guidance, not installation
- Use existing scripts from llm-d repositories
- Provide specific examples based on user requirements
- Explain the reasoning behind configuration choices
- Link to official documentation for detailed information
- WVA GitHub: https://github.com/llm-d/llm-d-workload-variant-autoscaler
- llm-d GitHub: https://github.com/llm-d/llm-d
- llm-d-benchmark GitHub: https://github.com/llm-d/llm-d-benchmark