e2e fix: Fix smoke_test by removing variants before test #870

Merged
lionelvillard merged 1 commit into llm-d:main from
shuynh2017:shuynh_smoke_e2e_fix
Mar 10, 2026

Conversation

@shuynh2017
Collaborator

The issue:

  • This test keeps failing at "Basic VA lifecycle should scale up under load" when it tries to scale from 1 to 2 replicas. The cause is that, by default, the kind cluster is installed with a variant that points to the same model ID. Two variants pointing to the same model ID prevent scaling; I will take a look at that next. For "Basic VA lifecycle", to keep it simple, we can make sure we start with a clean environment by deleting the existing VAs in the llm-d-sim namespace.
  • After the fix:
Smoke Tests - Infrastructure Readiness Basic VA lifecycle should scale up under load [smoke, full]
/home/shuynh/workload-variant-autoscaler-shuynh/workload-variant-autoscaler/test/e2e/smoke_test.go:479
  STEP: Waiting for VA to stabilize at minReplicas @ 03/10/26 12:12:04.803
  Waiting for VA to be ready: optimized=2, minReplicas=1
  STEP: Waiting for deployment to stabilize (no pods in transition) @ 03/10/26 12:12:04.805
  Waiting for deployment stability: spec=1, status=1, ready=1
  STEP: Waiting for VA to settle at minReplicas before recording initial state (best-effort) @ 03/10/26 12:12:04.806
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Waiting for VA to settle: optimized=2, minReplicas=1
  Initial optimized replicas (after stabilization): 1 (settled=true)
  STEP: Starting burst load generation to trigger scale-up @ 03/10/26 12:15:30.461
  STEP: Verifying load job was created @ 03/10/26 12:15:30.471
  Load job status: Active=0, Succeeded=0, Failed=0
  STEP: Waiting for load job pod to start @ 03/10/26 12:15:30.473
  Job exists but no pods yet. Job status: Active=0, Succeeded=0, Failed=0
  Load generation job is running
  STEP: Waiting for load generation to ramp up (30 seconds) @ 03/10/26 12:15:35.482
  STEP: Waiting for VA to detect saturation and recommend scale-up @ 03/10/26 12:16:06.506
  [⏳ Progress 1] 35s elapsed | 6m25s remaining
    VA: 1 replicas (initial: 1) | Metrics: True/MetricsFound | LastRun: 12:15:59
    HPA: Desired=2 | Current=2 | Deployment: Spec=2 | Ready=2
    Load: Phase=Running | Config: burst pattern, 3000 prompts | Active=1 | Succeeded=0 | Failed=0 (~11% of expected 5m10s)
  [⏳ Progress 2] 45s elapsed | 6m15s remaining
    VA: 1 replicas (initial: 1) | Metrics: True/MetricsFound | LastRun: 12:15:59
    HPA: Desired=2 | Current=2 | Deployment: Spec=2 | Ready=2
    Load: Phase=Running | Config: burst pattern, 3000 prompts | Active=1 | Succeeded=0 | Failed=0 (~14% of expected 5m10s)
  [⏳ Progress 3] 55s elapsed | 6m5s remaining
    VA: 1 replicas (initial: 1) | Metrics: True/MetricsFound | LastRun: 12:15:59
    HPA: Desired=1 | Current=2 | Deployment: Spec=1 | Ready=1
    Load: Phase=Running | Config: burst pattern, 3000 prompts | Active=1 | Succeeded=0 | Failed=0 (~17% of expected 5m10s)
    └─ Accelerator: H100 | Metrics: Saturation metrics data is available for scaling decisions | HPA: True/SucceededRescale
  [✓ Progress 4] 1m5s elapsed | 5m55s remaining
    VA: 2 replicas (initial: 1) | Metrics: True/MetricsFound | LastRun: 12:16:30
    HPA: Desired=1 | Current=2 | Deployment: Spec=1 | Ready=1
    Load: Phase=Running | Config: burst pattern, 3000 prompts | Active=1 | Succeeded=0 | Failed=0 (~20% of expected 5m10s)
    └─ Accelerator: H100 | Metrics: Saturation metrics data is available for scaling decisions | HPA: True/SucceededRescale
  ✓ VA detected saturation and recommended 2 replicas (took 1m5.083532136s)
    → VA scale-up detected! Now verifying HPA and deployment scaling...
  STEP: Verifying HPA reads the metric and updates desired replicas @ 03/10/26 12:16:37.701
    HPA check: Desired=1 | Current=2 (elapsed: 0s)
    HPA check: Desired=2 | Current=1 (elapsed: 5s)
  ✓ HPA updated desired replicas to > 1 (took 5.004875205s)
  STEP: Waiting for deployment to scale up and new pods to be ready @ 03/10/26 12:16:42.706
    Deployment check: Spec=2 | Replicas=2 | Ready=1 | VA recommended=2 (elapsed: 0s)
    Deployment check: Spec=2 | Replicas=2 | Ready=2 | VA recommended=2 (elapsed: 10s)
  ✓ Deployment successfully scaled up under load (took 10.015793538s)
    Final state: VA recommended 2 replicas, deployment has 2 ready pods
  STEP: Verifying at least one additional pod becomes ready @ 03/10/26 12:16:52.722
  Deployment successfully scaled up under load
  STEP: Cleaning up test resources @ 03/10/26 12:16:52.724
  Successfully deleted HPA smoke-test-hpa-hpa
  Successfully deleted VA smoke-test-va
  Successfully deleted Job smoke-scaleup-load-load
  Successfully deleted ServiceMonitor smoke-test-ms-monitor
  Successfully deleted Service smoke-test-ms-service
  Successfully deleted Deployment smoke-test-ms-decode
• [280.313 seconds]

@shuynh2017 changed the title from "Bug fix: Fix smoke_test by removing variants before test" to "e2e fix: Fix smoke_test by removing variants before test" on Mar 10, 2026
Collaborator

@lionelvillard lionelvillard left a comment

LGTM.

As discussed, the proper fix is to not install the VA in the first place (in install.sh script).

@shuynh2017
Collaborator Author

> LGTM.
>
> As discussed, the proper fix is to not install the VA in the first place (in install.sh script).

Agree, captured here: #872

@lionelvillard lionelvillard enabled auto-merge (squash) March 10, 2026 19:08
@lionelvillard
Collaborator

/ok-to-test

@lionelvillard
Collaborator

/trigger-e2e-full

@github-actions
Contributor

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

@github-actions
Contributor

🚀 Kind E2E (full) triggered by /trigger-e2e-full

View the Kind E2E workflow run

@github-actions
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

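| Resource | Total | Allocated | Available |
|----------|-------|-----------|-----------|
| GPUs     | 50    | 13        | 37        |

| Cluster       | Value                     |
|---------------|---------------------------|
| Nodes         | 16 (7 with GPUs)          |
| Total CPU     | 993 cores                 |
| Total Memory  | 10383 Gi                  |
| GPUs required | 4 (min) / 6 (recommended) |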

@lionelvillard lionelvillard merged commit b1b8707 into llm-d:main Mar 10, 2026
17 checks passed