docs: Add User Documentation for GPU Limiter and Scale-to-Zero Features#638

Open
ev-shindin wants to merge 7 commits into llm-d:main from ev-shindin:docs/user-guide-features

@ev-shindin
Collaborator

Summary

Adds comprehensive user-facing documentation for the GPU Limiter and Scale-to-Zero features, along with cross-references throughout existing documentation to ensure discoverability.

Changes

New Documentation Files

| File | Description |
| --- | --- |
| docs/user-guide/gpu-limiter.md | Complete user guide for GPU Limiter (Experimental) - 307 lines |
| docs/user-guide/scale-to-zero.md | Complete user guide for Scale-to-Zero - 405 lines |
| docs/user-guide/scale-from-zero.md | Placeholder for Scale-from-Zero documentation - 30 lines |
| config/samples/saturation-scaling-config.yaml | Sample ConfigMap with enableLimiter: true |

Updated Documentation

| File | Changes |
| --- | --- |
| README.md | Added GPU Limiter and Scale-to-Zero to Key Features section and User Guide links |
| docs/README.md | Added links to new user guides in documentation index |
| charts/workload-variant-autoscaler/README.md | Added configuration sections for GPU Limiter and Scale-to-Zero with Helm examples |
| charts/workload-variant-autoscaler/values.yaml | Improved comments for limitedMode and scaleToZero with doc references |

Documentation Content

GPU Limiter (Experimental)

The GPU Limiter documentation covers:

  • Problem statement: unfulfillable scale-up requests, unfair resource distribution
  • Solution: resource-aware scaling that constrains decisions based on GPU availability
  • Scaling pipeline diagram showing where limiter fits
  • Greedy-by-saturation allocation algorithm explanation
  • GPU type awareness (H100, A100, MI300X tracked separately)
  • Configuration via saturation-scaling-config ConfigMap with enableLimiter: true
  • Prerequisites: GPU operator, deployment GPU requirements
  • Example scenarios: single model, multiple models competing, heterogeneous GPU types
  • Troubleshooting and best practices
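
To make the greedy-by-saturation idea concrete, here is a minimal Python sketch. It is not the controller's implementation: the `ScaleRequest` fields, the `limit_scale_up` function, and the sample numbers are all hypothetical. It only assumes what the list above states, namely that the most saturated models are served first and that each GPU type is tracked as a separate pool.

```python
from dataclasses import dataclass


@dataclass
class ScaleRequest:
    """One model's scale-up request (hypothetical shape, for illustration only)."""
    model: str
    gpu_type: str          # e.g. "H100", "A100", "MI300X"
    saturation: float      # higher means more saturated
    extra_replicas: int    # replicas the upstream analysis would like to add
    gpus_per_replica: int  # GPUs each replica of this model requires


def limit_scale_up(requests, free_gpus):
    """Grant extra replicas greedily, most saturated model first.

    free_gpus maps GPU type -> number of unallocated GPUs. Each type is a
    separate pool, so an H100 request never consumes A100 capacity.
    Returns a dict of model -> granted extra replicas.
    """
    remaining = dict(free_gpus)  # copy so the caller's dict is not mutated
    granted = {}
    for req in sorted(requests, key=lambda r: r.saturation, reverse=True):
        pool = remaining.get(req.gpu_type, 0)
        fit = min(req.extra_replicas, pool // req.gpus_per_replica)
        granted[req.model] = fit
        remaining[req.gpu_type] = pool - fit * req.gpus_per_replica
    return granted


requests = [
    ScaleRequest("model-a", "H100", saturation=0.95, extra_replicas=3, gpus_per_replica=2),
    ScaleRequest("model-b", "H100", saturation=0.80, extra_replicas=2, gpus_per_replica=2),
    ScaleRequest("model-c", "A100", saturation=0.60, extra_replicas=1, gpus_per_replica=4),
]
print(limit_scale_up(requests, {"H100": 8, "A100": 2}))
# {'model-a': 3, 'model-b': 1, 'model-c': 0}
```

In this toy run, model-a takes six of the eight H100s, model-b gets the remaining two (one replica instead of two), and model-c gets nothing because two free A100s cannot fit a four-GPU replica.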

Scale-to-Zero

The Scale-to-Zero documentation covers:

  • Problem statement: wasted GPU resources, higher costs for idle models
  • Solution: automatic scaling to zero after configurable retention period
  • Decision logic and retention period explanation
  • Configuration via model-scale-to-zero-config ConfigMap
  • Namespace-scoped overrides for multi-environment deployments
  • Prerequisites: HPA feature gate, Prometheus metrics
  • Example scenarios: development, production with mixed criticality, cost optimization
  • Troubleshooting and best practices
  • Reference to Scale-from-Zero for scaling back up
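
The decision logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped code: the function names, the retention values, and the namespace names are hypothetical, and the real settings live in the model-scale-to-zero-config ConfigMap, whose schema is not reproduced here.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention settings: a global default plus per-namespace overrides.
DEFAULT_RETENTION = timedelta(minutes=10)
NAMESPACE_RETENTION = {
    "dev": timedelta(minutes=2),
    "prod": timedelta(hours=1),
}


def retention_for(namespace):
    """Return the namespace override if one exists, else the global default."""
    return NAMESPACE_RETENTION.get(namespace, DEFAULT_RETENTION)


def should_scale_to_zero(namespace, last_request_at, now, enabled=True):
    """True when the feature is on and the model has been idle for at
    least the retention period that applies to its namespace."""
    if not enabled:
        return False
    return now - last_request_at >= retention_for(namespace)


now = datetime(2026, 1, 28, 12, 0, tzinfo=timezone.utc)
last_request = now - timedelta(minutes=5)
print(should_scale_to_zero("dev", last_request, now))      # True
print(should_scale_to_zero("prod", last_request, now))     # False
print(should_scale_to_zero("staging", last_request, now))  # False
```

The same five minutes of idleness scales a model to zero in the hypothetical dev namespace but not in prod or in a namespace that falls back to the default, which is the point of namespace-scoped overrides.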

Scale-from-Zero (Placeholder)

Basic placeholder documentation noting:

  • Feature monitors EPP queue for pending requests
  • Automatically triggers scale-up when traffic arrives
  • Works in conjunction with Scale-to-Zero
  • Full documentation coming soon
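
As a rough illustration of the behavior the placeholder describes, the wake-up rule might look like the sketch below. The function name, its parameters, and the choice of waking to a single replica are assumptions made for this example; the PR does not specify them.

```python
def replicas_after_wakeup(current_replicas, pending_requests, wake_replicas=1):
    """Return the replica count after checking the request queue.

    A model at zero replicas is woken up as soon as at least one request
    is pending. A model that is already running is left unchanged here,
    since regular autoscaling handles it.
    """
    if current_replicas == 0 and pending_requests > 0:
        return wake_replicas
    return current_replicas


print(replicas_after_wakeup(0, 3))  # 1
print(replicas_after_wakeup(0, 0))  # 0
print(replicas_after_wakeup(2, 5))  # 2
```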

Helm Chart Updates

Added configuration sections to chart README:

# GPU Limiter
wva:
  capacityScaling:
    default:
      enableLimiter: true

# Scale to Zero
wva:
  scaleToZero: true

Improved values.yaml comments:

# GPU Limiter mode - constrains scaling based on actual GPU availability (Experimental)
# When enabled, scale-up decisions are limited by available GPUs per accelerator type
# Uses greedy-by-saturation algorithm to prioritize most saturated models
# See docs/user-guide/gpu-limiter.md for details
limitedMode: false

# Scale to Zero - automatically scale idle models to zero replicas
# When enabled, models with no traffic for the retention period will scale to 0
# Requires: Kubernetes 1.27+ (HPAScaleToZero feature gate) and HPA minReplicas: 0
# Configure retention periods via model-scale-to-zero-config ConfigMap
# See docs/user-guide/scale-to-zero.md for details
scaleToZero: false

Related Issues

  • Addresses documentation gap for GPU Limiter feature
  • Addresses documentation gap for Scale-to-Zero feature
  • Improves feature discoverability in main README

Checklist

  • User guide for GPU Limiter created
  • User guide for Scale-to-Zero created
  • Placeholder for Scale-from-Zero created
  • Main README updated with feature highlights
  • docs/README.md index updated
  • Helm chart README updated with configuration examples
  • Helm values.yaml comments improved
  • Sample ConfigMap added to config/samples/

…ero features

Add comprehensive user documentation for three scaling features:

- GPU Limiter (Experimental): Resource-aware scaling that constrains
  autoscaling decisions based on actual GPU availability. Documents
  the greedy-by-saturation allocation algorithm and per-accelerator
  type tracking.

- Scale to Zero: Automatic scaling of idle model deployments to zero
  replicas after a configurable retention period. Includes namespace-
  scoped configuration for multi-environment deployments.

- Scale from Zero (placeholder): Documents the complementary feature
  that automatically scales models back up when requests arrive.

All documents follow consistent structure with overview, configuration,
prerequisites, example scenarios, troubleshooting, and best practices.

Update documentation to reference the new GPU Limiter and Scale-to-Zero
features consistently across all user-facing docs:

- README.md: Add features to Key Features section and User Guide links
- docs/README.md: Add links to GPU Limiter, Scale-to-Zero, Scale-from-Zero
- Helm Chart README: Add configuration sections for both features with
  examples showing how to enable via Helm values and ConfigMaps
- Helm values.yaml: Improve comments for limitedMode and scaleToZero
  options with references to documentation
- config/samples: Add saturation-scaling-config.yaml sample with
  enableLimiter example
Comment thread docs/user-guide/scale-from-zero.md Outdated
Comment thread docs/README.md Outdated
Collaborator

@lionelvillard left a comment

LGTM. Added a few comments.

@ev-shindin ev-shindin self-assigned this Jan 28, 2026
Result: Each model limited by its GPU type availability
```

## Troubleshooting
Collaborator

We now have a troubleshooting guide. Can you move this section over there?

Collaborator Author

I moved this to saturation-scaling-config.md. Or should it be in another file?

Comment thread docs/user-guide/scale-to-zero.md Outdated
Comment thread docs/user-guide/scale-to-zero.md Outdated
- scale-to-zero: clarify retention_period is unrelated to HPA stabilizationWindowSeconds
- scale-to-zero: remove cold start limitation (applies to scale-from-zero, not scale-to-zero)
- scale-to-zero: remove broken links to removed scale-from-zero.md
- gpu-limiter: move troubleshooting section to saturation-scaling-config.md
- saturation-scaling-config: add GPU Limiter Issues troubleshooting subsection

### Scaling Pipeline

```
Collaborator

I would make this pipeline more generic. Basically there is a pipeline block before GPU limiter and after it. The block before computes a desired number of replicas and the saturation analyzer is just an example.
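
One way to picture the more generic pipeline suggested here is the sketch below. It is only an illustration: `run_pipeline`, `example_analyzer`, and `make_gpu_limiter` are hypothetical names, the limiter assumes one GPU per replica in a single pool, and it grants in name order rather than by saturation to keep the example short.

```python
def run_pipeline(current, compute_desired, limiter, actuate):
    """Generic three-block pipeline: whatever block runs before the limiter
    produces desired replica counts, and whatever runs after applies them."""
    desired = compute_desired(current)
    limited = limiter(current, desired)
    return actuate(limited)


def example_analyzer(current):
    """Stand-in for the block before the limiter. A saturation analyzer is
    one possible implementation; this one just asks for two more replicas."""
    return {model: count + 2 for model, count in current.items()}


def make_gpu_limiter(free_gpus):
    """Build a limiter that caps total scale-up at free_gpus replicas."""
    def limiter(current, desired):
        budget = free_gpus
        limited = {}
        for model in sorted(desired):
            have = current.get(model, 0)
            want = desired[model]
            if want <= have:
                limited[model] = want  # scale-down or no change passes through
                continue
            grant = min(want - have, budget)
            budget -= grant
            limited[model] = have + grant
        return limited
    return limiter


result = run_pipeline(
    {"a": 1, "b": 1},
    compute_desired=example_analyzer,
    limiter=make_gpu_limiter(3),
    actuate=lambda replicas: replicas,  # stand-in for the block after the limiter
)
print(result)  # {'a': 3, 'b': 2}
```

Because the limiter only sees current and desired replica counts, the block before it can be swapped for any other source of desired replicas without touching the limiter itself.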

lionelvillard previously approved these changes Mar 10, 2026
Signed-off-by: Lionel Villard <villard@us.ibm.com>