Workload Variant Autoscaler (WVA) Integration

llm-d-benchmark integrates with the Workload Variant Autoscaler (WVA) so benchmarking scenarios can exercise model autoscaling end-to-end. This guide covers how WVA is wired in, how to enable it on a scenario, what each knob in the scenario YAML controls, what the smoketest validates, how to tear it down safely on a shared cluster, and how to debug the most common failure modes.

For background on the autoscaler itself, see:

llm-d/llm-d-workload-variant-autoscaler - the controller source
llm-d well-lit-path WVA guide - the upstream install reference our integration mirrors

Platform support: WVA install is currently only verified on OpenShift. On other platforms, every WVA-related step (install, smoketest, teardown) is deliberately skipped - the scenario YAML can still render the WVA blocks, but nothing is applied to the cluster.

Quick start

End-to-end on a fresh machine, against a logged-in OpenShift cluster:

# 1. Clone (replace the branch if you're targeting a specific one)
git clone https://github.com/llm-d/llm-d-benchmark.git
cd llm-d-benchmark

# 2. Install (creates .venv, installs the llmdbenchmark CLI + planner)
./install.sh

# Or one-shot via curl, optionally pinning a branch:
#   LLMDBENCH_BRANCH=<BRANCH_HERE> \
#     curl -sSL https://raw.githubusercontent.com/llm-d/llm-d-benchmark/main/install.sh | bash
# (the curl form clones into ./llm-d-benchmark/ for you)

# 3. Activate the venv created by install.sh
source .venv/bin/activate

# 4. Confirm you're pointed at the right cluster
oc whoami

# 5. Stand up the WVA-enabled scenario (substitute your namespace)
llmdbenchmark --spec guides/inference-scheduling-wva standup -p <namespace>

When standup completes, the smoketest will have already verified the full Namespaced WVA pipeline (controller, prometheus-adapter, VA, HPA, end-to-end metric flow). The HPA's TARGETS column should read a numeric value (e.g. q/1) rather than <unknown>:

oc get hpa -n <namespace>

To redeploy after editing the scenario YAML, tear down and then stand up again:

llmdbenchmark --spec guides/inference-scheduling-wva teardown -p <namespace>
llmdbenchmark --spec guides/inference-scheduling-wva standup  -p <namespace>

The shared cluster-wide infrastructure (prometheus-adapter, ClusterRole, prometheus-ca ConfigMap) survives teardown automatically - see Section 4 for the full preservation policy.

1. Architecture at a glance

When a scenario has wva.enabled: true and the cluster is OpenShift, standup provisions the following resources, in this order:

cluster-wide / shared
    prometheus-adapter      v5.2.0, in openshift-user-workload-monitoring
                            serves wva_desired_replicas via external-metrics API
    prometheus-ca           ConfigMap, same ns - CA cert for thanos-querier auth
    allow-thanos-querier-api-access
                            ClusterRole granting prometheus-adapter access
                            to OCP's monitoring stack

  <wva namespace>           = deploy namespace by default
      workload-variant-autoscaler   Helm chart v0.6.0, namespaced mode
                                    (reconciles only VAs in this namespace)

      per stack (per model)
          VariantAutoscaling/{model_id_label}-decode
              labels:
                  wva.llmd.ai/controller-instance = <wva.namespace>
              spec.scaleTargetRef -> Deployment/{model_id_label}-decode

          HorizontalPodAutoscaler/{model_id_label}-decode
              spec.scaleTargetRef -> Deployment/{model_id_label}-decode
              metric.selector.matchLabels:
                  variant_name        = {model_id_label}-decode
                  exported_namespace  = <wva.namespace>
                  controller_instance = <wva.namespace>

The data flow that turns this into actual pod scaling:

1. WVA controller reconciles each VariantAutoscaling it owns and queries thanos-querier for that variant's vLLM saturation metrics.
2. The controller emits wva_desired_replicas on its :8443/metrics endpoint (Prometheus scrapes via the chart's ServiceMonitor).
3. prometheus-adapter discovers wva_desired_replicas from user-workload-monitoring Prometheus and exposes it via the external.metrics.k8s.io/v1beta1 API.
4. The HorizontalPodAutoscaler polls that external-metrics API, matches its selector.matchLabels, and scales the decode Deployment between spec.minReplicas and spec.maxReplicas.

Our integration ensures that the three values joining this chain match exactly: the controller's controllerInstance value, the VA label, and the HPA selector. A mismatch in any one of them surfaces as TARGETS: <unknown> on the HPA - see the smoketest checks in Section 5 for what catches each case.
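The alignment requirement can be sketched as a small check over plain dictionaries. This is a minimal illustration, not the smoketest's actual code: the function name and the dict shapes are hypothetical, and only the label keys are taken from the diagram above. It also assumes exported_namespace equals the VA's own namespace.

```python
VA_LABEL = "wva.llmd.ai/controller-instance"  # label key from the diagram above


def find_misalignments(controller_instance, va, hpa_match_labels):
    """Return human-readable mismatches between controller, VA, and HPA.

    `va` is a dict with "name", "namespace", and "labels" keys. That shape
    is an assumption made for this sketch, not the real API object.
    """
    problems = []
    if va["labels"].get(VA_LABEL) != controller_instance:
        problems.append("VA label does not match controller instance")
    expected = {
        "variant_name": va["name"],
        "exported_namespace": va["namespace"],
        "controller_instance": controller_instance,
    }
    for key, want in expected.items():
        got = hpa_match_labels.get(key)
        if got != want:
            problems.append(f"HPA selector {key}={got!r}, expected {want!r}")
    return problems


va = {
    "name": "qwen-decode",
    "namespace": "bench",
    "labels": {VA_LABEL: "bench"},
}
selector = {
    "variant_name": "qwen-decode",
    "exported_namespace": "bench",
    "controller_instance": "bench",
}
print(find_misalignments("bench", va, selector))  # prints []
```

An empty list means all three joins line up; any entry names the specific key that would leave the HPA at <unknown>.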

2. Three ways to enable WVA on a scenario

Method	When to use
`-u / --wva` CLI flag on any existing scenario	Quick toggle without editing files; uses defaults from `config/templates/values/defaults.yaml`
`--spec guides/inference-scheduling-wva`	Dedicated scenario where every WVA knob is spelled out inline so you can tweak them per-experiment
`--spec guides/multi-model-wva`	Multi-model scenario: two or more pools under one gateway, each with its own VA + HPA, one shared WVA controller

2a. Via the CLI flag

llmdbenchmark --spec guides/inference-scheduling standup -p <namespace> --wva

That sets wva.enabled: true at render time. All other WVA settings come from defaults - fine for a quick test, but you can't tweak per-experiment HPA behavior without editing the defaults file.

2b. Via the dedicated `inference-scheduling-wva` scenario

llmdbenchmark --spec guides/inference-scheduling-wva standup -p <namespace>

Same model and inference setup as inference-scheduling, plus a fully spelled-out wva: block in the scenario YAML. The -u/--wva flag is not required here because wva.enabled: true is already set in the file.

You'd choose this scenario when you want to:

See/tweak every HPA knob in one place
Override the controller image tag, prometheus-adapter version, etc.
Author a DoE experiment that sweeps over wva.hpa.maxReplicas or wva.variantAutoscaling.variantCost

2c. Via the `multi-model-wva` scenario (multiple pools, one WVA controller)

llmdbenchmark --spec guides/multi-model-wva standup -p <namespace>

Deploys N models behind a single gateway, each with its own EPP + InferencePool + VariantAutoscaling + HPA. One WVA controller in the namespace watches every VA (deduplicated by wva.namespace), so the Prometheus/adapter/controller wiring is identical to the single-model case - only the number of autoscaling targets scales.

The scenario uses the top-level shared: block to hold scenario-wide settings (controller image, chart versions, EPP plugin config, shared HTTPRoute); per-stack blocks hold only model-specific knobs (model name, decode resources, VA + HPA min/max). To add a third model, copy one of the stack entries and change its name and model. See config/scenarios/guides/multi-model-wva.yaml.

Topology:

         Gateway (shared infra-llmdbench-inference-gateway)
           |
   +-------+----------- HTTPRoute multi-model-route -------------+
   | /qwen3-06b/*                                 /llama-31-8b/* |
   v                                                             v
EPP+InferencePool (qwen3-06b)           EPP+InferencePool (llama-31-8b)
   |                                                             |
vLLM decode + VA + HPA                   vLLM decode + VA + HPA
             ^                               ^
             +------- WVA controller (1) ----+

What the scenario layout buys you:

Shared control plane - one infra-llmdbench gateway release, one istio control plane, one WVA controller, one prometheus-adapter, one shared model PVC (sized for the sum of all models; each stack's weights live in its own model.path subdirectory). Rendered once in the scenario's "shared-infra-owner" stack (first non-standalone stack) and skipped on siblings to avoid parallel-helmfile races.
Per-stack scaling intent - each stack's VA caps what the controller is willing to compute (variantAutoscaling.{min,max}Replicas, variantCost) and each stack's HPA caps what actually gets applied to the Deployment. They're independent per pool, so pool A scaling up to its max doesn't push pool B past its own cap.
One routing URL per pool - the shared HTTPRoute uses httpRoute.pathPrefix: /{stack.name}, so every pool is reachable at http://<gateway>/{stack-name}/v1/.... The gateway strips the prefix before the request reaches upstream vLLM, so pods continue to see plain /v1/* paths.
flowControl feature gate on every pool - enabled in the shared.inferenceExtension.pluginsCustomConfig block and inherited by every stack. This is non-optional for WVA: the controller reads EPP queue depth to compute scale signals, and flow-control is what exposes queue depth in the metrics.
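The per-pool prefix rewrite described above can be pictured with a few lines of Python. This is only a sketch of the path transformation; the real rewrite is performed by the gateway from the HTTPRoute, and the function name here is hypothetical.

```python
def strip_stack_prefix(path, stack_name):
    """Drop a leading /{stack_name} segment, as the gateway rewrite is described to do."""
    prefix = f"/{stack_name}"
    if path == prefix:
        return "/"
    if path.startswith(prefix + "/"):
        return path[len(prefix):]
    return path  # not addressed to this pool; leave untouched


print(strip_stack_prefix("/qwen3-06b/v1/completions", "qwen3-06b"))  # prints /v1/completions
```

Matching on `prefix + "/"` rather than the bare prefix keeps a path such as /qwen3-06bx/... from being mistaken for this pool.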

3. The WVA knobs in the scenario YAML

All settings live under the wva: block in config/scenarios/guides/inference-scheduling-wva.yaml. Each is documented inline in the file too. Below is what each does and the typical reasons you'd touch it.

3.1 Top-level WVA controller settings

wva:
  enabled: true                  # master switch (same as -u/--wva on CLI)
  wellLitPath: inference-scheduling   # surfaced as `llm-d.ai/guide` label on the VA
  namespace: ""                  # empty = use the deploy namespace from -p
  replicaCount: 1                # WVA controller pod replicas

  controller:
    enabled: true                # disable to render VA+HPA without installing the controller

  namespaceScoped: true          # controller watches only its own ns; one per ns

  image:
    repository: ghcr.io/llm-d/llm-d-workload-variant-autoscaler
    tag: v0.6.0                  # NOTE: image tags use a leading "v"

  metrics:
    enabled: true
    port: 8443                   # /metrics port the controller exposes
    secure: true                 # HTTPS + bearer-token auth

  prometheus:
    baseUrl: https://thanos-querier.openshift-monitoring.svc.cluster.local
    port: 9091

Image tag note: the helm chart version is bare semver (chartVersions.wva: 0.6.0), whereas the container image tag uses a leading v (v0.6.0). They're set independently.

3.2 VariantAutoscaling spec - per-model scaling intent

wva:
  variantAutoscaling:
    enabled: true
    minReplicas: 1               # controller floor
    maxReplicas: 10              # controller ceiling
    variantCost: "10.0"          # relative GPU cost weight (H100=10, A100=8, L40S=5)
    slo:
      tpot: 10                   # Time-Per-Output-Token target (ms)
      ttft: 1000                 # Time-To-First-Token target (ms)

variantCost is what the WVA saturation solver uses to decide which model to scale when several share GPU capacity. Lower slo.tpot/slo.ttft targets make scale-up more aggressive under load.

3.3 HorizontalPodAutoscaler spec - what actually changes the replica count

wva:
  hpa:
    enabled: true
    minReplicas: 1               # never scale below this; must be >= 1
    maxReplicas: 10              # safety ceiling regardless of controller computation
    targetAvgValue: 1            # 1 = "match controller's desiredReplicas exactly"

    behavior:
      scaleUp:
        stabilizationWindowSeconds: 120
        policies:
          - type: Percent
            value: 100           # 100% per period = double replicas
            periodSeconds: 15
      scaleDown:
        stabilizationWindowSeconds: 120
        policies:
          - type: Percent
            value: 100           # 100% per period = may remove all replicas above minReplicas
            periodSeconds: 15

Keep wva.hpa.{min,max}Replicas aligned with wva.variantAutoscaling.{min,max}Replicas: the VA caps what the controller is willing to compute, and the HPA caps what actually gets applied to the Deployment.

For more behavior tuning options, see Kubernetes HPA: configurable scaling behavior.
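To see how targetAvgValue and the Percent policies interact, here is a deliberately simplified model of a single HPA evaluation for an external metric with an average-value target. It follows the documented Kubernetes behavior (desired = ceil(metric / target), and a Percent policy bounds the fraction of current replicas that may be added or removed per period), but it ignores stabilization windows, tolerance, and pod readiness. The function name and parameters are illustrative only.

```python
import math


def hpa_step(current, metric_value, target_avg, min_replicas, max_replicas,
             up_percent=100, down_percent=100):
    """One simplified HPA evaluation for an external metric (average-value target)."""
    # With target_avg == 1 this is simply the controller's desired replica count.
    desired = math.ceil(metric_value / target_avg)
    # Percent policies bound how far a single period may move away from `current`.
    up_limit = math.ceil(current * (1 + up_percent / 100))
    down_limit = math.floor(current * (1 - down_percent / 100))
    desired = max(down_limit, min(desired, up_limit))
    # minReplicas / maxReplicas are applied last.
    return max(min_replicas, min(desired, max_replicas))


# wva_desired_replicas = 7 while 2 pods are running:
print(hpa_step(current=2, metric_value=7, target_avg=1, min_replicas=1, max_replicas=10))  # prints 4
# Next period, starting from 4 pods:
print(hpa_step(current=4, metric_value=7, target_avg=1, min_replicas=1, max_replicas=10))  # prints 7
```

Under these assumptions, a jump from 2 to 7 desired replicas takes two 15-second periods, because a 100% scale-up policy at most doubles the replica count per period.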

3.4 Chart version pins

chartVersions:
  wva: 0.6.0                     # WVA controller chart (oci://ghcr.io/llm-d/workload-variant-autoscaler)
  prometheusAdapter: 5.2.0       # bumped charts have broken external-metric rule format

4. Cluster-wide vs per-tenant resources & teardown semantics

WVA installs a mix of cluster-wide and per-tenant resources. To keep multi-tenant clusters healthy, our standup and teardown follow this policy:

Resource	Scope	Standup	Plain teardown	`teardown -d/--deep`
`prometheus-adapter`	`openshift-user-workload-monitoring` (shared)	install if absent; reuse if any tenant already installed it	preserved	preserved
`prometheus-ca` ConfigMap	shared monitoring ns	created	preserved	preserved
`allow-thanos-querier-api-access` ClusterRole	cluster-scoped	applied	preserved	preserved
WVA controller helm release	per-namespace	installed	preserved	uninstalled
`VariantAutoscaling` for this stack	namespace-local	applied	removed	removed
`HorizontalPodAutoscaler` for this stack	namespace-local	applied	removed	removed

The principle: --deep only removes resources that live in the target namespace. Cluster-shared infrastructure (prometheus-adapter plus its supporting ClusterRole and ConfigMap) is never removed by us - it's used by every WVA tenant in the cluster, so its lifecycle belongs to the platform admin, not to a per-tenant teardown.
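The policy table can also be read as a lookup from resource to outcome. The sketch below copies the table's last two columns into a dictionary; the names TEARDOWN_POLICY and survivors are hypothetical and do not correspond to anything in the repo.

```python
# (plain teardown, teardown --deep) outcome per resource, copied from the table above
TEARDOWN_POLICY = {
    "prometheus-adapter": ("preserved", "preserved"),
    "prometheus-ca ConfigMap": ("preserved", "preserved"),
    "allow-thanos-querier-api-access ClusterRole": ("preserved", "preserved"),
    "WVA controller helm release": ("preserved", "uninstalled"),
    "VariantAutoscaling": ("removed", "removed"),
    "HorizontalPodAutoscaler": ("removed", "removed"),
}


def survivors(deep):
    """Names of resources still present after teardown (deep=True means --deep)."""
    column = 1 if deep else 0
    return sorted(
        name for name, outcome in TEARDOWN_POLICY.items()
        if outcome[column] == "preserved"
    )


print(len(survivors(deep=False)))  # prints 4
print(len(survivors(deep=True)))   # prints 3
```

The only resource whose fate differs between the two modes is the per-namespace WVA controller release.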

If you need to fully remove the shared adapter, do it explicitly:

helm uninstall -n openshift-user-workload-monitoring prometheus-adapter
oc delete clusterrole allow-thanos-querier-api-access
oc delete configmap -n openshift-user-workload-monitoring prometheus-ca

5. Smoketest checks

When wva.enabled: true, the smoketest runs nine extra checks (one of them best-effort) beyond the standard pod/inference validation. Each one tells you exactly what it's verifying so a failure points at the broken link rather than a vague "WVA is sad" symptom.

Check	What it verifies	Failure means
`wva_platform_gate`	Cluster is OpenShift (or stack is correctly skipped on other platforms)	informational only
`wva_controller_deployment`	Polls `Deployment/workload-variant-autoscaler-controller-manager` until `Available` with all replicas Ready (<=180s). Fails fast if the manager container's `restartCount` grows mid-wait	controller pod is crash-looping; check `oc logs -n <ns> deploy/workload-variant-autoscaler-controller-manager --previous`
`wva_prometheus_adapter`	`Deployment/prometheus-adapter` in the user-workload monitoring ns is `Available`	adapter wasn't installed, or another tenant's broken install is squatting the cluster role
`wva_variantautoscaling`	The per-stack `VariantAutoscaling/{model_id_label}-decode` exists	step_09 didn't apply the rendered VA
`wva_va_controller_instance_label`	The VA carries `wva.llmd.ai/controller-instance=<value>` matching the controller's `CONTROLLER_INSTANCE`	the controller's predicate filters out the VA -> "No active VariantAutoscalings found" loop, no metric ever emitted. The most subtle WVA gate.
`wva_hpa_target`	The HPA's `scaleTargetRef.name` equals `{model_id_label}-decode`	template drift between VA and HPA
`wva_hpa_selector_alignment`	The HPA's `metric.selector.matchLabels` has all three of `variant_name`, `exported_namespace`, `controller_instance` matching what the controller emits	HPA selector misalignment - controller's metric is in Prometheus but the HPA's selector matches zero rows
`wva_hpa_able_to_scale` (best-effort)	HPA's `AbleToScale` condition is `True`	HPA hasn't initialized yet, or scale subresource lookup failed
`wva_hpa_targets_resolved`	Polls the HPA's `.status.currentMetrics[*].external.current` until the value resolves from `<unknown>` to a number (<=180s). Includes the most recent `ScalingActive=False` reason in the failure message if it times out	full pipeline isn't producing a metric value - could be controller not reconciling, Prometheus not scraping, adapter rule missing, or selector mismatch

All polling timeouts are constants at the top of llmdbenchmark/smoketests/validators/wva.py (_WVA_CONTROLLER_TIMEOUT_SECS, _HPA_TARGETS_TIMEOUT_SECS). Bump them if your cluster is unusually slow.
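The polling checks above share one pattern: fetch a value repeatedly until it resolves or a timeout elapses. A minimal sketch of that pattern follows. It is not the code in wva.py; poll_until and its parameters are hypothetical, and the clock and sleep functions are injectable so the loop can be exercised without real waiting.

```python
import time


def poll_until(fetch, timeout_secs, interval_secs=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Call `fetch()` until it returns a non-None value or the timeout elapses.

    Returns the value, or raises TimeoutError once the deadline has passed.
    """
    deadline = clock() + timeout_secs
    while True:
        value = fetch()
        if value is not None:
            return value
        if clock() >= deadline:
            raise TimeoutError(f"no value after {timeout_secs}s")
        sleep(interval_secs)


# Example: the metric is unresolved twice, then resolves to "1".
answers = iter([None, None, "1"])
print(poll_until(lambda: next(answers), timeout_secs=5, interval_secs=0))  # prints 1
```

Raising the timeout on a slow cluster corresponds to passing a larger timeout_secs here.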

Replica-count check (in `BaseSmoketest.validate_role_pods`)

The standard decode_replicas check is HPA-aware: when WVA is enabled and the role is HPA-managed (currently only decode), the check passes if the actual pod count is within [wva.hpa.minReplicas, wva.hpa.maxReplicas], not just strictly equal to decode.replicas. Without this relaxation, an idle stack at minReplicas: 1 would falsely fail when decode.replicas: 2.

The role allow-list lives at module top in llmdbenchmark/smoketests/base.py as _WVA_HPA_MANAGED_ROLES = frozenset({"decode"}) - extend it if upstream WVA grows native prefill autoscaling.
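The relaxed comparison can be sketched as follows. Only _WVA_HPA_MANAGED_ROLES is taken from the text above; replica_count_ok and its parameters are a hypothetical stand-in for the relevant branch of BaseSmoketest.validate_role_pods.

```python
_WVA_HPA_MANAGED_ROLES = frozenset({"decode"})  # allow-list named in the text above


def replica_count_ok(role, actual, configured, wva_enabled, hpa_min, hpa_max):
    """Accept any count inside the HPA range for HPA-managed roles, else require equality."""
    if wva_enabled and role in _WVA_HPA_MANAGED_ROLES:
        return hpa_min <= actual <= hpa_max
    return actual == configured


# Idle stack sitting at minReplicas=1 while decode.replicas is 2:
print(replica_count_ok("decode", 1, 2, True, 1, 10))   # prints True
print(replica_count_ok("decode", 1, 2, False, 1, 10))  # prints False
```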

6. Quick verification commands

After a successful standup with WVA, these one-liners confirm the full chain is up. Run them in order; each one verifies a downstream link in the pipeline.

# 1. Controller pod is Available, no recent restarts
oc get pod -n <ns> -l control-plane=controller-manager -o wide

# 2. The VA is reconciled (METRICSREADY=True, OPTIMIZED=<a number>)
oc get va -n <ns>

# 3. The VA has the controller-instance label
oc get va -n <ns> <model-id>-decode -o jsonpath='{.metadata.labels}'

# 4. The controller actually emits the metric
oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- \
  wget -qO- 'http://localhost:9090/api/v1/query?query=wva_desired_replicas{controller_instance="<ns>"}' \
  | jq '.data.result'

# 5. prometheus-adapter exposes it via the external-metrics API
oc get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/<ns>/wva_desired_replicas" | jq

# 6. The HPA's TARGETS shows a numeric value (no longer <unknown>)
oc get hpa -n <ns>

If 1-3 are green but 4 is empty -> controller isn't reconciling (most likely wva.llmd.ai/controller-instance label mismatch - check #3 against the controller's CONTROLLER_INSTANCE env var).

If 4 has a value but 5 is empty -> prometheus-adapter doesn't have the rule (install was wrong namespace, or rule values file wasn't applied).

If 5 has a value but 6 is <unknown> -> HPA selector doesn't match the metric's labels.
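The three rules above amount to a short decision procedure. The sketch below encodes them, assuming checks 1-3 already passed; the function name and return strings are illustrative only.

```python
def diagnose(controller_emits, adapter_exposes, hpa_has_value):
    """Map the results of checks 4, 5, and 6 to the most likely broken link."""
    if not controller_emits:
        return "controller is not reconciling; compare the VA label with CONTROLLER_INSTANCE"
    if not adapter_exposes:
        return "prometheus-adapter has no rule for the metric"
    if not hpa_has_value:
        return "HPA selector does not match the metric's labels"
    return "pipeline healthy"


print(diagnose(controller_emits=True, adapter_exposes=True, hpa_has_value=False))
# prints: HPA selector does not match the metric's labels
```

The checks are evaluated upstream-first because a missing metric at step 4 makes steps 5 and 6 uninformative.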

7. Common failure modes & fixes

"No active VariantAutoscalings found" loop in controller logs

Controller is running but says it sees no VAs to reconcile, even though one exists in the watched namespace.

Cause: the VA is missing the wva.llmd.ai/controller-instance label (or it doesn't match the controller's CONTROLLER_INSTANCE env). The controller's predicate silently filters it out.

Fix:

oc label va -n <ns> <model-id>-decode \
  wva.llmd.ai/controller-instance=<controller-instance> --overwrite

...or re-run standup so the rendered template (which already includes the label) is applied.

HPA shows `TARGETS: <unknown>` indefinitely

Something in the metric pipeline isn't lining up. Walk the chain in Section 6 to find which link is broken.

`release: already exists` when installing prometheus-adapter

Another tenant installed prometheus-adapter in their own namespace (not openshift-user-workload-monitoring). The chart's cluster-scoped prometheus-adapter-resource-reader ClusterRole is helm-owned by their release, blocking ours.

Fix: ask that tenant to helm uninstall -n <their-ns> prometheus-adapter. Once the ClusterRole is freed, re-run our standup and we'll install it into the correct namespace per the upstream WVA guide.

Controller pod CrashLoopBackOff with `context deadline exceeded` in logs

The controller can't reach the Kubernetes API server reliably (leader-election lease renewal times out). This is a cluster network / CNI issue, not a WVA bug. Verify:

oc get clusteroperator network -o yaml | yq '.status.conditions'
oc get nodes -o custom-columns='NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status'

If the network operator is Degraded=True or any node is Ready=False, escalate to the cluster admin.

Image pull fails with `manifest unknown`

The image tag doesn't exist. Check what's actually published:

TOKEN=$(curl -sL "https://ghcr.io/token?scope=repository:llm-d/llm-d-workload-variant-autoscaler:pull" \
  | jq -r .token)
curl -sH "Authorization: Bearer $TOKEN" \
  https://ghcr.io/v2/llm-d/llm-d-workload-variant-autoscaler/tags/list | jq

Note that chart versions are bare semver (0.6.0) but image tags use a leading v (v0.6.0). Mixing them up is a common source of pull failures.
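A quick format check can catch this mix-up before anything is deployed. The sketch below is a hypothetical helper, not part of llmdbenchmark, and it assumes release-style pins only (it would also flag legitimate non-release tags such as a digest or a pre-release suffix).

```python
import re

_BARE_SEMVER = re.compile(r"^\d+\.\d+\.\d+$")


def check_version_pins(chart_version, image_tag):
    """Return a list of formatting problems; an empty list means both look right."""
    problems = []
    if not _BARE_SEMVER.match(chart_version):
        problems.append(f"chart version {chart_version!r} should be bare semver, e.g. 0.6.0")
    if not (image_tag.startswith("v") and _BARE_SEMVER.match(image_tag[1:])):
        problems.append(f"image tag {image_tag!r} should carry a leading v, e.g. v0.6.0")
    return problems


print(check_version_pins("0.6.0", "v0.6.0"))        # prints []
print(len(check_version_pins("v0.6.0", "0.6.0")))   # prints 2
```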

8. Multi-tenant cluster considerations

On a shared cluster, multiple groups may run their own WVA controllers and HPAs. Our integration is designed to coexist:

Namespaced mode (wva.namespaceScoped: true) keeps each controller scoped to its own namespace, so different tenants' controllers don't race on each other's VariantAutoscaling resources.
The controllerInstance value, the matching VA label, and the HPA selector together ensure that a tenant's HPA only consumes metrics from its own controller, even if another tenant's controller is incidentally watching the same namespace.
Shared prometheus-adapter is installed once and reused. Our standup detects an existing install (via the prometheus-adapter-resource-reader ClusterRole's helm-owner annotation) and skips re-installing.

Tenants who run a cluster-scoped WVA controller (i.e., namespaceScoped: false) will reconcile other tenants' VAs too. The controllerInstance label gate on the HPA selector prevents their emitted metrics from ever satisfying our HPA, but running cluster-scoped on a shared cluster is still poor cluster hygiene.

9. Where each piece lives in the repo

Artifact	File
Chart values rendered into the helm install	`config/templates/jinja/19_wva-values.yaml.j2`
WVA namespace label patch	`config/templates/jinja/23_wva-namespace.yaml.j2`
Per-stack `VariantAutoscaling`	`config/templates/jinja/27_wva-variantautoscaling.yaml.j2`
Per-stack `HorizontalPodAutoscaler`	`config/templates/jinja/28_wva-hpa.yaml.j2`
`prometheus-adapter` values	`config/templates/jinja/21_prometheus-adapter-values.yaml.j2`
`allow-thanos-querier-api-access` ClusterRole	`config/templates/jinja/22_prometheus-rbac.yaml.j2`
Cluster-wide WVA defaults (chart version, image, monitoring URL)	`config/templates/values/defaults.yaml` (`wva:` and `chartVersions.wva` blocks)
Standup admin install (controller + adapter)	`llmdbenchmark/standup/steps/step_03_workload_monitoring.py`
Standup per-stack VA/HPA apply	`llmdbenchmark/standup/steps/step_09_deploy_modelservice.py`
Shared install/teardown helpers	`llmdbenchmark/standup/wva.py`
Teardown logic	`llmdbenchmark/teardown/steps/step_01_uninstall_helm.py`
Smoketest WVA mixin	`llmdbenchmark/smoketests/validators/wva.py`
WVA-enabled scenario (the one to copy/edit for new experiments)	`config/scenarios/guides/inference-scheduling-wva.yaml`
Multi-model WVA scenario (N pools, 1 gateway, 1 controller)	`config/scenarios/guides/multi-model-wva.yaml`

10. Multi-model operations cookbook

Recipes for the day-to-day lifecycle + benchmarking against the multi-model-wva scenario. All commands assume you've installed llmdbenchmark and pointed KUBECONFIG at a cluster where you have (or will have) namespace admin in <namespace>. Stack names (qwen3-06b, llama-31-8b) mirror the shipped scenario; substitute your own if you've customized.

10.1 First-time standup

llmdbenchmark --spec guides/multi-model-wva standup -p <namespace>

Renders both stacks, installs shared infra (istio, Gateway, infra-llmdbench, WVA controller, prometheus-adapter, model PVC) once, then deploys each pool's -ms + -gaie + VA + HPA. Downloads run in parallel, so wall time is roughly that of the slowest model, not the sum. Standup auto-chains into the smoketest phase unless you pass --skip-smoketest.

10.2 Discover what's deployed (`--list-endpoints`)

llmdbenchmark --spec guides/multi-model-wva run -p <namespace> --list-endpoints

Prints a table of per-stack endpoint URLs + a copy-paste block of ready-to-run llmdbenchmark run invocations. Runs the full render pipeline (so the detected endpoints match exactly what standup would have produced) and exits before launching any harness pods.

10.3 Benchmark a single pool

Preferred - let --stack auto-resolve the endpoint:

llmdbenchmark --spec guides/multi-model-wva run -p <namespace> \
  --stack qwen3-06b \
  -l inference-perf -w sanity_random.yaml -j 1

With --stack qwen3-06b, step 03 auto-detects the gateway endpoint, bakes in the /qwen3-06b path prefix, and the harness pod hits http://<gateway>/qwen3-06b/v1/completions. The gateway rewrites /qwen3-06b/* -> /* so vLLM sees plain /v1/completions.

Alternative - pin --endpoint-url yourself (useful for run-only mode without the scenario file locally):

llmdbenchmark run \
  --endpoint-url http://<gateway>/qwen3-06b \
  --model Qwen/Qwen3-0.6B \
  --namespace <namespace> \
  -l guidellm -w sanity_random.yaml -j 2

10.4 Two parallel guidellm jobs against one pool

llmdbenchmark --spec guides/multi-model-wva run -p <namespace> \
  --stack qwen3-06b \
  -l guidellm -w sanity_random.yaml \
  -j 2

-j 2 launches two guidellm pods hitting the same endpoint simultaneously. Both run the same workload, but each writes to its own {experiment_id}_1 / {experiment_id}_2 results subdirectory on the workload PVC, so metrics don't collide. The harness wait step polls both pods; result collection pulls both directories back.
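The per-job results layout can be sketched in one line. The {experiment_id}_N naming comes from the paragraph above; the helper name and the sample experiment id are made up for illustration.

```python
def result_dirs(experiment_id, parallelism):
    """Results subdirectory names for jobs 1..parallelism, following {experiment_id}_N."""
    return [f"{experiment_id}_{i}" for i in range(1, parallelism + 1)]


print(result_dirs("exp-001", 2))  # prints ['exp-001_1', 'exp-001_2']
```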

10.5 Compare two pools side-by-side (two shells)

# Shell 1 - --workspace is a global option, placed before the subcommand
llmdbenchmark --spec guides/multi-model-wva --workspace /tmp/run-qwen run -p <namespace> \
  --stack qwen3-06b \
  -l guidellm -w sanity_random.yaml -j 2

# Shell 2 (in parallel)
llmdbenchmark --spec guides/multi-model-wva --workspace /tmp/run-llama run -p <namespace> \
  --stack llama-31-8b \
  -l guidellm -w sanity_random.yaml -j 2

Distinct --workspace dirs keep the two invocations' render plans, logs, and collected results fully isolated. The WVA controller will see load on both pools' VAs simultaneously and scale them independently. --workspace (and --spec, --base-dir, --dry-run, --verbose, --non-admin) are global options - they must appear before the subcommand name (run, standup, etc.), not after.

10.6 Rerun one pool against a different model

--stack NAME scopes -m/--models to that one stack; siblings keep their scenario-defined models untouched:

llmdbenchmark --spec guides/multi-model-wva run -p <namespace> \
  --stack qwen3-06b \
  --model meta-llama/Llama-3.2-3B \
  -l inference-perf -w sanity_random.yaml

Without --stack, -m applies to every stack and emits a warning - it would collapse the multi-model scenario into N copies of one model, which is rarely desired.

10.7 Re-deploy one pool after a scenario edit

Edit the scenario YAML's stack for llama-31-8b (e.g. bump decode.replicas, swap the model, tweak wva.hpa.maxReplicas), then:

llmdbenchmark --spec guides/multi-model-wva standup -p <namespace> \
  --stack llama-31-8b

Global steps (admin prereqs, shared-infra helmfile, WVA controller, model PVC) still run (they're scenario-wide and idempotent). Per-stack steps only fire for llama-31-8b - qwen3-06b's running pods and VA are left completely alone.

10.8 Tear down one pool, keep siblings running

llmdbenchmark --spec guides/multi-model-wva teardown -p <namespace> \
  --stack llama-31-8b

Uninstalls the llama-31-8b-ms and llama-31-8b-gaie Helm releases (plus their VA + HPA), leaves qwen3-06b and the shared infra-llmdbench + WVA controller + prometheus-adapter in place. Useful for cost management - shrink to one pool over a weekend without disturbing the other.

10.9 Observe scaling events

With both pools running, watch the VA + HPA state in real time:

# Per-pool VariantAutoscaling
kubectl get variantautoscaling -n <namespace> -w

# Per-pool HPA with current/target metric
kubectl get hpa -n <namespace> -w

# Controller logs (look for "Reconciling" per VA)
kubectl logs -n <namespace> -l control-plane=controller-manager -f

# Raw wva_desired_replicas metric for a pool
kubectl exec -n <namespace> \
  $(kubectl get pod -n <namespace> -l app.kubernetes.io/name=workload-variant-autoscaler -o name | head -1) \
  -- curl -sk https://localhost:8443/metrics \
  | grep wva_desired_replicas

Every query that takes a label selector can be filtered to one pool: -l wva.llmd.ai/controller-instance=<namespace>,llm-d.ai/model=qwen-qwen3-0-6b for VAs / HPAs; pods don't carry the routing-prefix label but do carry llm-d.ai/model keyed on each stack's model.shortName.

10.10 Full teardown

llmdbenchmark --spec guides/multi-model-wva teardown -p <namespace>

Removes every Helm release in both stacks plus shared infra and the WVA controller. Prometheus-adapter and istio control-plane persist by design (shared across tenants); add --deep to remove all cluster resources in the deploy + harness namespaces.
