e2e fix: Fix smoke_test by removing variants before test by shuynh2017 · Pull Request #870 · llm-d/llm-d-workload-variant-autoscaler

shuynh2017 · 2026-03-10T16:25:08Z

The issue:

This test keeps failing at "Basic VA lifecycle should scale up under load" when it tries to scale from 1 to 2. The cause is that, by default, the kind cluster is installed with a variant pointing to the same model ID. Two variants pointing to the same model ID prevent scaling; I will take a look at that next. For "Basic VA lifecycle", however, we can keep it simple and make sure we start with a clean environment by deleting existing VAs in the llm-d-sim namespace.
After the fix:

Smoke Tests - Infrastructure Readiness Basic VA lifecycle should scale up under load [smoke, full]
/home/shuynh/workload-variant-autoscaler-shuynh/workload-variant-autoscaler/test/e2e/smoke_test.go:479
  STEP: Waiting for VA to stabilize at minReplicas @ 03/10/26 12:12:04.803
  Waiting for VA to be ready: optimized=2, minReplicas=1
  STEP: Waiting for deployment to stabilize (no pods in transition) @ 03/10/26 12:12:04.805
  Waiting for deployment stability: spec=1, status=1, ready=1
  STEP: Waiting for VA to settle at minReplicas before recording initial state (best-effort) @ 03/10/26 12:12:04.806
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Initial optimized replicas (after stabilization): 1 (settled=true)
  STEP: Starting burst load generation to trigger scale-up @ 03/10/26 12:15:30.461
  STEP: Verifying load job was created @ 03/10/26 12:15:30.471
  Load job status: Active=0, Succeeded=0, Failed=0
  STEP: Waiting for load job pod to start @ 03/10/26 12:15:30.473
  Job exists but no pods yet. Job status: Active=0, Succeeded=0, Failed=0
  Load generation job is running
  STEP: Waiting for load generation to ramp up (30 seconds) @ 03/10/26 12:15:35.482
  STEP: Waiting for VA to detect saturation and recommend scale-up @ 03/10/26 12:16:06.506
  [⏳ Progress 1] 35s elapsed | 6m25s remaining
    VA: 1 replicas (initial: 1) | Metrics: True/MetricsFound | LastRun: 12:15:59
    HPA: Desired=2 | Current=2 | Deployment: Spec=2 | Ready=2
    Load: Phase=Running | Config: burst pattern, 3000 prompts | Active=1 | Succeeded=0 | Failed=0 (~11% of expected 5m10s)
  [⏳ Progress 2] 45s elapsed | 6m15s remaining
    VA: 1 replicas (initial: 1) | Metrics: True/MetricsFound | LastRun: 12:15:59
    HPA: Desired=2 | Current=2 | Deployment: Spec=2 | Ready=2
    Load: Phase=Running | Config: burst pattern, 3000 prompts | Active=1 | Succeeded=0 | Failed=0 (~14% of expected 5m10s)
  [⏳ Progress 3] 55s elapsed | 6m5s remaining
    VA: 1 replicas (initial: 1) | Metrics: True/MetricsFound | LastRun: 12:15:59
    HPA: Desired=1 | Current=2 | Deployment: Spec=1 | Ready=1
    Load: Phase=Running | Config: burst pattern, 3000 prompts | Active=1 | Succeeded=0 | Failed=0 (~17% of expected 5m10s)
    └─ Accelerator: H100 | Metrics: Saturation metrics data is available for scaling decisions | HPA: True/SucceededRescale
  [✓ Progress 4] 1m5s elapsed | 5m55s remaining
    VA: 2 replicas (initial: 1) | Metrics: True/MetricsFound | LastRun: 12:16:30
    HPA: Desired=1 | Current=2 | Deployment: Spec=1 | Ready=1
    Load: Phase=Running | Config: burst pattern, 3000 prompts | Active=1 | Succeeded=0 | Failed=0 (~20% of expected 5m10s)
    └─ Accelerator: H100 | Metrics: Saturation metrics data is available for scaling decisions | HPA: True/SucceededRescale
  ✓ VA detected saturation and recommended 2 replicas (took 1m5.083532136s)
    → VA scale-up detected! Now verifying HPA and deployment scaling...
  STEP: Verifying HPA reads the metric and updates desired replicas @ 03/10/26 12:16:37.701
    HPA check: Desired=1 | Current=2 (elapsed: 0s)
    HPA check: Desired=2 | Current=1 (elapsed: 5s)
  ✓ HPA updated desired replicas to > 1 (took 5.004875205s)
  STEP: Waiting for deployment to scale up and new pods to be ready @ 03/10/26 12:16:42.706
    Deployment check: Spec=2 | Replicas=2 | Ready=1 | VA recommended=2 (elapsed: 0s)
    Deployment check: Spec=2 | Replicas=2 | Ready=2 | VA recommended=2 (elapsed: 10s)
  ✓ Deployment successfully scaled up under load (took 10.015793538s)
    Final state: VA recommended 2 replicas, deployment has 2 ready pods
  STEP: Verifying at least one additional pod becomes ready @ 03/10/26 12:16:52.722
  Deployment successfully scaled up under load
  STEP: Cleaning up test resources @ 03/10/26 12:16:52.724
  Successfully deleted HPA smoke-test-hpa-hpa
  Successfully deleted VA smoke-test-va
  Successfully deleted Job smoke-scaleup-load-load
  Successfully deleted ServiceMonitor smoke-test-ms-monitor
  Successfully deleted Service smoke-test-ms-service
  Successfully deleted Deployment smoke-test-ms-decode
• [280.313 seconds]

lionelvillard

LGTM.

As discussed, the proper fix is to not install the VA in the first place (in install.sh script).

shuynh2017 · 2026-03-10T18:45:18Z

> LGTM.
>
> As discussed, the proper fix is to not install the VA in the first place (in install.sh script).

Agree, captured here: #872

lionelvillard · 2026-03-10T19:08:43Z

/ok-to-test

lionelvillard · 2026-03-10T19:08:52Z

/trigger-e2e-full

github-actions · 2026-03-10T19:08:53Z

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

github-actions · 2026-03-10T19:09:00Z

🚀 Kind E2E (full) triggered by /trigger-e2e-full

View the Kind E2E workflow run

github-actions · 2026-03-10T19:11:45Z

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

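| Resource | Total | Allocated | Available |
|----------|-------|-----------|-----------|
| GPUs     | 50    | 13        | 37        |

| Cluster       | Value                     |
|---------------|---------------------------|
| Nodes         | 16 (7 with GPUs)          |
| Total CPU     | 993 cores                 |
| Total Memory  | 10383 Gi                  |
| GPUs required | 4 (min) / 6 (recommended) |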

Fix smoke_test

b69de85

shuynh2017 mentioned this pull request Mar 10, 2026

feat: add support for podman + bump version for llm-d and all images #857

Open

shuynh2017 changed the title ~~Bug fix: Fix smoke_test by removing variants before test~~ e2e fix: Fix smoke_test by removing variants before test Mar 10, 2026

lionelvillard approved these changes Mar 10, 2026

lionelvillard enabled auto-merge (squash) March 10, 2026 19:08

lionelvillard merged commit b1b8707 into llm-d:main Mar 10, 2026
17 checks passed

lionelvillard merged 1 commit into llm-d:main from shuynh2017:shuynh_smoke_e2e_fix
