docs: Add User Documentation for GPU Limiter and Scale-to-Zero Features by ev-shindin · Pull Request #638 · llm-d/llm-d-workload-variant-autoscaler

ev-shindin · 2026-01-26T14:30:32Z

Summary

Adds comprehensive user-facing documentation for the GPU Limiter and Scale-to-Zero features, along with cross-references throughout existing documentation to ensure discoverability.

Changes

New Documentation Files

| File | Description |
|------|-------------|
| `docs/user-guide/gpu-limiter.md` | Complete user guide for GPU Limiter (Experimental) - 307 lines |
| `docs/user-guide/scale-to-zero.md` | Complete user guide for Scale-to-Zero - 405 lines |
| `docs/user-guide/scale-from-zero.md` | Placeholder for Scale-from-Zero documentation - 30 lines |
| `config/samples/saturation-scaling-config.yaml` | Sample ConfigMap with `enableLimiter: true` |

Updated Documentation

| File | Changes |
|------|---------|
| `README.md` | Added GPU Limiter and Scale-to-Zero to Key Features section and User Guide links |
| `docs/README.md` | Added links to new user guides in documentation index |
| `charts/workload-variant-autoscaler/README.md` | Added configuration sections for GPU Limiter and Scale-to-Zero with Helm examples |
| `charts/workload-variant-autoscaler/values.yaml` | Improved comments for `limitedMode` and `scaleToZero` with doc references |

Documentation Content

GPU Limiter (Experimental)

The GPU Limiter documentation covers:

Problem statement: unfulfillable scale-up requests, unfair resource distribution
Solution: resource-aware scaling that constrains decisions based on GPU availability
Scaling pipeline diagram showing where limiter fits
Greedy-by-saturation allocation algorithm explanation
GPU type awareness (H100, A100, MI300X tracked separately)
Configuration via saturation-scaling-config ConfigMap with enableLimiter: true
Prerequisites: GPU operator, deployment GPU requirements
Example scenarios: single model, multiple models competing, heterogeneous GPU types
Troubleshooting and best practices
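
As a rough illustration of the greedy-by-saturation idea summarized above, the following Python sketch grants scale-up requests in order of saturation, drawing from a separate GPU pool per accelerator type. It is not the project's implementation: `ScaleRequest`, `limit_scale_up`, the `saturation` score, and the `gpus_per_replica` field are hypothetical names chosen for this example, and the sketch assumes that scale-down requests pass through the limiter unchanged.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ScaleRequest:
    """Hypothetical scale-up request for one model variant."""
    model: str
    gpu_type: str           # e.g. "H100"; each type is tracked as its own pool
    saturation: float       # assumed score: higher means more saturated
    current_replicas: int
    desired_replicas: int
    gpus_per_replica: int   # must be >= 1


def limit_scale_up(requests: List[ScaleRequest],
                   available_gpus: Dict[str, int]) -> Dict[str, int]:
    """Return the replica count granted to each model.

    Requests are served from most to least saturated. Each request may only
    consume free GPUs of its own accelerator type.
    """
    remaining = dict(available_gpus)  # copy so the caller's dict is untouched
    granted: Dict[str, int] = {}
    for req in sorted(requests, key=lambda r: r.saturation, reverse=True):
        if req.desired_replicas <= req.current_replicas:
            # Assumption: the limiter never blocks scale-down or no-op requests.
            granted[req.model] = req.desired_replicas
            continue
        wanted = req.desired_replicas - req.current_replicas
        free = remaining.get(req.gpu_type, 0)
        affordable = free // req.gpus_per_replica
        extra = min(wanted, affordable)
        remaining[req.gpu_type] = free - extra * req.gpus_per_replica
        granted[req.model] = req.current_replicas + extra
    return granted
```

With 8 free H100s and no free A100s, a highly saturated H100 model is served first, a less saturated H100 model receives whatever is left, and an A100 model stays at its current size regardless of how many H100s remain.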

Scale-to-Zero

The Scale-to-Zero documentation covers:

Problem statement: wasted GPU resources, higher costs for idle models
Solution: automatic scaling to zero after configurable retention period
Decision logic and retention period explanation
Configuration via model-scale-to-zero-config ConfigMap
Namespace-scoped overrides for multi-environment deployments
Prerequisites: HPA feature gate, Prometheus metrics
Example scenarios: development, production with mixed criticality, cost optimization
Troubleshooting and best practices
Reference to Scale-from-Zero for scaling back up
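
The retention-period decision described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the controller's code: `resolve_retention`, `should_scale_to_zero`, and the shape of the override mapping are hypothetical, and the sketch assumes a namespace-scoped value simply replaces the global default.

```python
from datetime import datetime, timedelta
from typing import Dict


def resolve_retention(namespace: str,
                      default: timedelta,
                      namespace_overrides: Dict[str, timedelta]) -> timedelta:
    """Pick the retention period for a namespace.

    Assumption: a namespace-scoped override, when present, wins over the default.
    """
    return namespace_overrides.get(namespace, default)


def should_scale_to_zero(enabled: bool,
                         last_request_at: datetime,
                         now: datetime,
                         retention_period: timedelta) -> bool:
    """True when the model has been idle for at least the retention period."""
    if not enabled:
        return False
    return now - last_request_at >= retention_period
```

For example, a model whose last request arrived 15 minutes ago would scale to zero under a 10-minute retention period in a development namespace, but stay up under a 30-minute default.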

Scale-from-Zero (Placeholder)

Basic placeholder documentation noting:

Feature monitors EPP queue for pending requests
Automatically triggers scale-up when traffic arrives
Works in conjunction with Scale-to-Zero
Full documentation coming soon

Helm Chart Updates

Added configuration sections to chart README:

# GPU Limiter
wva:
  capacityScaling:
    default:
      enableLimiter: true

# Scale to Zero
wva:
  scaleToZero: true

Improved values.yaml comments:

# GPU Limiter mode - constrains scaling based on actual GPU availability (Experimental)
# When enabled, scale-up decisions are limited by available GPUs per accelerator type
# Uses greedy-by-saturation algorithm to prioritize most saturated models
# See docs/user-guide/gpu-limiter.md for details
limitedMode: false

# Scale to Zero - automatically scale idle models to zero replicas
# When enabled, models with no traffic for the retention period will scale to 0
# Requires: Kubernetes 1.27+ (HPAScaleToZero feature gate) and HPA minReplicas: 0
# Configure retention periods via model-scale-to-zero-config ConfigMap
# See docs/user-guide/scale-to-zero.md for details
scaleToZero: false

Related Issues

Addresses documentation gap for GPU Limiter feature
Addresses documentation gap for Scale-to-Zero feature
Improves feature discoverability in main README

Checklist

User guide for GPU Limiter created
User guide for Scale-to-Zero created
Placeholder for Scale-from-Zero created
Main README updated with feature highlights
docs/README.md index updated
Helm chart README updated with configuration examples
Helm values.yaml comments improved
Sample ConfigMap added to config/samples/

…ero features

Add comprehensive user documentation for three scaling features:

- GPU Limiter (Experimental): Resource-aware scaling that constrains autoscaling decisions based on actual GPU availability. Documents the greedy-by-saturation allocation algorithm and per-accelerator type tracking.
- Scale to Zero: Automatic scaling of idle model deployments to zero replicas after a configurable retention period. Includes namespace-scoped configuration for multi-environment deployments.
- Scale from Zero (placeholder): Documents the complementary feature that automatically scales models back up when requests arrive.

All documents follow consistent structure with overview, configuration, prerequisites, example scenarios, troubleshooting, and best practices.

Update documentation to reference the new GPU Limiter and Scale-to-Zero features consistently across all user-facing docs:

- README.md: Add features to Key Features section and User Guide links
- docs/README.md: Add links to GPU Limiter, Scale-to-Zero, Scale-from-Zero
- Helm Chart README: Add configuration sections for both features with examples showing how to enable via Helm values and ConfigMaps
- Helm values.yaml: Improve comments for limitedMode and scaleToZero options with references to documentation
- config/samples: Add saturation-scaling-config.yaml sample with enableLimiter example

lionelvillard

LGTM. Added a few comments.

lionelvillard · 2026-01-29T19:56:38Z

+  Result: Each model limited by its GPU type availability
+```
+
+## Troubleshooting


We now have a troubleshooting guide. Can you move this section over there?

I moved this to saturation-scaling-config.md. Or should it be in another file?

- scale-to-zero: clarify retention_period is unrelated to HPA stabilizationWindowSeconds
- scale-to-zero: remove cold start limitation (applies to scale-from-zero, not scale-to-zero)
- scale-to-zero: remove broken links to removed scale-from-zero.md
- gpu-limiter: move troubleshooting section to saturation-scaling-config.md
- saturation-scaling-config: add GPU Limiter Issues troubleshooting subsection

lionelvillard · 2026-02-10T15:34:12Z

+
+### Scaling Pipeline
+
+```


I would make this pipeline more generic. Basically there is a pipeline block before the GPU limiter and one after it. The block before computes a desired number of replicas, and the saturation analyzer is just one example of such a block.
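
One way to read this suggestion, as a sketch only: the pipeline is a sequence of interchangeable stages, each mapping per-model replica counts to new replica counts, with the GPU limiter as one stage among them. All names below (`run_pipeline`, `saturation_analyzer`, `gpu_limiter`, `apply_bounds`) and the toy rules inside each stage are hypothetical and are not taken from the project.

```python
from typing import Callable, Dict, Iterable

Replicas = Dict[str, int]                 # model name -> replica count
Stage = Callable[[Replicas], Replicas]    # any block in the pipeline


def run_pipeline(current: Replicas, stages: Iterable[Stage]) -> Replicas:
    """Feed the output of each stage into the next one."""
    result = dict(current)
    for stage in stages:
        result = stage(result)
    return result


def saturation_analyzer(replicas: Replicas) -> Replicas:
    """Stand-in for any block that computes desired replicas (toy rule: +2)."""
    return {model: count + 2 for model, count in replicas.items()}


def gpu_limiter(replicas: Replicas) -> Replicas:
    """Stand-in limiter (toy rule: cap every model at 3 replicas)."""
    return {model: min(count, 3) for model, count in replicas.items()}


def apply_bounds(replicas: Replicas) -> Replicas:
    """Stand-in for a block after the limiter (toy rule: at least 1 replica)."""
    return {model: max(count, 1) for model, count in replicas.items()}
```

Because every stage shares the same signature, the block before the limiter can be swapped for a different analyzer without touching the limiter or the block after it.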

Signed-off-by: Lionel Villard <villard@us.ibm.com>

ev-shindin added 3 commits January 26, 2026 15:51

docs: remove trailing whitespace from scale-from-zero.md

6c2876e

ev-shindin requested a review from lionelvillard January 27, 2026 16:30

lionelvillard reviewed Jan 28, 2026


Comment thread docs/user-guide/scale-from-zero.md Outdated

lionelvillard reviewed Jan 28, 2026


Comment thread docs/README.md Outdated

lionelvillard reviewed Jan 28, 2026


ev-shindin added 2 commits January 28, 2026 21:00

docs: remove scale-from-zero user guide

e0273d2

fix(docs): address review comment

9bdee74

ev-shindin self-assigned this Jan 28, 2026

ev-shindin requested review from lionelvillard January 28, 2026 21:43

lionelvillard reviewed Jan 29, 2026


Comment thread docs/user-guide/scale-to-zero.md Outdated

lionelvillard reviewed Jan 29, 2026


Comment thread docs/user-guide/scale-to-zero.md Outdated

ev-shindin requested a review from lionelvillard February 3, 2026 09:16

lionelvillard reviewed Feb 10, 2026


lionelvillard previously approved these changes Mar 10, 2026


Merge branch 'main' into docs/user-guide-features

d928fe1

Signed-off-by: Lionel Villard <villard@us.ibm.com>

lionelvillard dismissed their stale review via d928fe1 March 10, 2026 18:58

lionelvillard enabled auto-merge (squash) March 10, 2026 18:59

ev-shindin wants to merge 7 commits into llm-d:main from ev-shindin:docs/user-guide-features

ev-shindin Feb 2, 2026

2 participants

