🐛 Enable scale-from-zero E2E on CKS and OCP with KEDA support#865

Merged
clubanderson merged 3 commits into llm-d:main from clubanderson:fix/disable-scale-from-zero-cks
Mar 13, 2026
Conversation

@clubanderson
Contributor

@clubanderson clubanderson commented Mar 10, 2026

Summary

  • Remove environment skip in scale_from_zero_test.go — test now runs on all platforms
  • Add retry logic to detect_inference_pool_api_group() (6 retries, 10s apart) to handle the race where InferencePool instances haven't been created yet after helmfile deploy
  • Make deploy_keda() skip helm install when KEDA CRD already exists (pre-installed on OCP via CMA operator, on CKS via helm)
  • Remove environment guard on SCALER_BACKEND=keda — now supported on all environments
  • Increase deploy wait timeout from 60s to 600s — the kubectl wait --timeout=60s for all deployments was too short for model-serving pods (vLLM) that need to download and load large models into GPU memory (e.g. Meta-Llama-3.1-8B). Both OCP and CKS nightly E2E were failing at "Deploy guide via WVA install.sh" due to this. Now defaults to 600s, overridable via DEPLOY_WAIT_TIMEOUT env var.
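The retry described in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's actual function body; only the retry count (6), interval (10s), and the idea of polling for InferencePool instances come from this PR, and the API group names and environment variable names here are assumptions:

```shell
# Hypothetical sketch of retry logic for detect_inference_pool_api_group().
# helmfile returns before the InferencePool CRs exist, so poll with retries.
detect_inference_pool_api_group() {
    local retries="${POOL_DETECT_RETRIES:-6}"
    local delay="${POOL_DETECT_DELAY:-10}"
    local attempt group
    for attempt in $(seq 1 "$retries"); do
        # Try candidate API groups until one has InferencePool instances.
        for group in inference.networking.k8s.io inference.networking.x-k8s.io; do
            if kubectl get "inferencepools.${group}" -A --no-headers 2>/dev/null | grep -q .; then
                echo "$group"
                return 0
            fi
        done
        [ "$attempt" -lt "$retries" ] && sleep "$delay"
    done
    return 1
}
```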

Context

PR #849 enabled scale-from-zero tests but they failed on CKS (nightly runs #51, #52) due to:

  1. Pool group detection race — InferencePools not yet created when detect_inference_pool_api_group() runs
  2. SCALER_BACKEND=keda was blocked for non-emulator environments
  3. HPAScaleToZero feature gate is disabled on both CKS and OCP — KEDA ScaledObject is the workaround

Additionally, both OCP and CKS nightly E2E runs (Mar 10-11) were failing because:
4. kubectl wait --for=condition=Available deployment --all --timeout=60s expired before vLLM finished loading the model
5. Script then hit the KEDA environment guard and exited with code 1
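The timeout change in point 4 amounts to making the wait budget env-overridable instead of hard-coded. A sketch under assumed names (the real change lives in deploy/install.sh; the function name here is hypothetical):

```shell
# Default to 600s, overridable via DEPLOY_WAIT_TIMEOUT, because vLLM pods
# can spend minutes downloading and loading a model into GPU memory.
DEPLOY_WAIT_TIMEOUT="${DEPLOY_WAIT_TIMEOUT:-600s}"

wait_for_deployments() {
    local ns="$1"
    kubectl wait --for=condition=Available deployment --all \
        -n "$ns" --timeout="$DEPLOY_WAIT_TIMEOUT"
}
```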

KEDA is now pre-installed on both clusters:

  • OCP (pok-prod-001): Custom Metrics Autoscaler operator v2.18.1-2 in openshift-keda
  • CKS (waldorf): Upstream KEDA helm chart v2.18.1 in keda namespace

Companion PR: llm-d/llm-d-infra#87

Test plan

  • CKS nightly E2E passes with scale-from-zero test enabled
  • OCP nightly E2E passes with scale-from-zero test enabled
  • kind-emulator CI continues to pass (KEDA installed at runtime as before)

Copilot AI review requested due to automatic review settings March 10, 2026 01:16
Contributor

Copilot AI left a comment


Pull request overview

This PR disables the scale-from-zero E2E test outside the kind-emulator environment to prevent recurring CKS nightly failures, aligning the test’s execution scope with the only environment where its prerequisites are currently reliably controlled.

Changes:

  • Restrict Scale-From-Zero Feature E2E coverage to ENVIRONMENT=kind-emulator.
  • Update skip rationale/commentary to reflect current prerequisite constraints and known CKS/OpenShift limitations.

@clubanderson clubanderson force-pushed the fix/disable-scale-from-zero-cks branch from 9206f44 to 7a39fb2 Compare March 10, 2026 01:41
@clubanderson clubanderson changed the title from "🐛 Disable scale-from-zero test on CKS" to "🐛 Enable scale-from-zero E2E on CKS and OCP with KEDA support" Mar 10, 2026
@clubanderson clubanderson force-pushed the fix/disable-scale-from-zero-cks branch from 7a39fb2 to 912e1e5 Compare March 10, 2026 01:45
Copilot AI review requested due to automatic review settings March 10, 2026 01:45
@clubanderson clubanderson force-pushed the fix/disable-scale-from-zero-cks branch from 912e1e5 to 75d41af Compare March 10, 2026 01:48
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment thread deploy/install.sh
Comment on lines 1706 to 1710

    else
        log_info "Skipping llm-d deployment (DEPLOY_LLM_D=false)"
    fi


Copilot AI Mar 10, 2026


The code comment says deploy_keda will "detect and skip" when KEDA is pre-installed, but deploy_keda currently always runs helm upgrade/install. With the ENVIRONMENT guard removed, this will attempt to install/upgrade KEDA on OpenShift/CKS where it may already be operator-managed or where the caller lacks permissions, potentially breaking the cluster or failing the deployment. Add an explicit pre-check (e.g., presence of ScaledObject CRD and/or keda-operator deployment in the target namespace) to no-op when KEDA already exists, or reintroduce a safe gating mechanism.

Comment thread test/e2e/scale_from_zero_test.go Outdated
Comment on lines +38 to +40
// Scale-from-zero requires GIE flow control, InferenceObjective, and KEDA
// (ScaledObject with minReplicas=0). KEDA must be pre-installed on the cluster.
// Only kind-emulator installs KEDA at runtime via install.sh.

Copilot AI Mar 10, 2026


This test no longer skips OpenShift, but it will still try to create an HPA with minReplicas=0 when SCALER_BACKEND is left at its default (prometheus-adapter). On clusters without the HPAScaleToZero feature gate (notably OpenShift per config.go), that HPA creation is expected to fail. Consider adding an explicit precondition in BeforeAll to require SCALER_BACKEND=keda when ScaleToZeroEnabled is false (or skip with a clear message) so the test runs reliably across environments.

Comment thread deploy/install.sh
Comment on lines 1708 to 1709
        log_info "Skipping llm-d deployment (DEPLOY_LLM_D=false)"
    fi

Copilot AI Mar 10, 2026


Removing the ENVIRONMENT=kind-emulator guard means deploy_keda can now run on any cluster where SCALER_BACKEND=keda, but deploy_keda fetches and installs the kedacore/keda Helm chart from https://kedacore.github.io/charts without pinning to an immutable version or verifying integrity. This makes cluster bootstrap depend on a mutable third-party artifact, so a compromise or hijack of that chart repository or its DNS could give an attacker code execution in the cluster with KEDA’s privileges. To reduce this supply-chain risk, restrict this remote Helm install to non-production/test environments and/or pin the chart to an immutable version/digest (plus checksum or signature verification) before broadening its use.
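One way to act on this suggestion is to pin the chart to an explicit version. A minimal sketch, assuming names: the function name is hypothetical, and the KEDA_CHART_VERSION variable with a 2.19.0 default matches what a later commit in this PR describes adopting:

```shell
# Pin the KEDA chart so a repo-side change cannot silently alter what
# gets deployed; consider adding provenance/signature checks as well.
KEDA_CHART_VERSION="${KEDA_CHART_VERSION:-2.19.0}"

install_keda_pinned() {
    helm repo add kedacore https://kedacore.github.io/charts 2>/dev/null || true
    helm repo update
    helm upgrade -i keda kedacore/keda \
        -n "${KEDA_NAMESPACE:-keda-system}" \
        --version "$KEDA_CHART_VERSION" \
        --wait --timeout=5m
}
```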

@clubanderson clubanderson force-pushed the fix/disable-scale-from-zero-cks branch from 75d41af to 0d180fa Compare March 10, 2026 02:10
Copilot AI review requested due to automatic review settings March 10, 2026 03:02
@clubanderson clubanderson force-pushed the fix/disable-scale-from-zero-cks branch from 0d180fa to dbab13c Compare March 10, 2026 03:02
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment thread deploy/install.sh Outdated
Comment on lines +515 to +519
# Search in the target namespace first (avoids cluster-wide RBAC issues), then fall back to -A.
local ns_flag="-A"
if [ -n "${LLMD_NS:-}" ]; then
    ns_flag="-n $LLMD_NS"
fi

Copilot AI Mar 10, 2026


The comment says the detection searches the target namespace first and then falls back to -A, but the implementation only ever uses a single ns_flag (LLMD_NS is always set by default), so there is no fallback. Either implement an actual fallback (try -n "$LLMD_NS" then retry with -A) or update the comment to match the behavior to avoid misleading future debugging.
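A fallback along the lines the comment asks for could look like this. This is a sketch, not the code that landed: the function name is hypothetical, and only the namespace-first-then-cluster-wide idea comes from the review:

```shell
# Try the target namespace first (cheaper, avoids cluster-wide RBAC),
# then genuinely fall back to a cluster-wide search.
find_inference_pools() {
    local out
    if [ -n "${LLMD_NS:-}" ]; then
        out="$(kubectl get inferencepools -n "$LLMD_NS" --no-headers 2>/dev/null)"
        if [ -n "$out" ]; then
            echo "$out"
            return 0
        fi
    fi
    # Fallback: cluster-wide listing (may require broader RBAC).
    kubectl get inferencepools -A --no-headers 2>/dev/null
}
```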

Comment thread deploy/install.sh Outdated
Comment on lines +1096 to +1100
# Skip install if KEDA ScaledObject CRD already exists (pre-installed on cluster)
if kubectl get crd scaledobjects.keda.sh >/dev/null 2>&1; then
    log_success "KEDA is already installed on this cluster — skipping helm install"
    return
fi

Copilot AI Mar 10, 2026


deploy_keda() now skips installation solely based on the ScaledObject CRD existing. However undeploy_keda() does not remove CRDs, so a prior uninstall can leave the CRD behind while the KEDA operator is gone; subsequent runs will incorrectly skip install and leave the cluster without KEDA. Consider tightening the skip condition to also verify that the KEDA operator is actually running (e.g., deployment/pods present, possibly across namespaces), or only skip when both CRD + controller are detected.

Comment thread test/e2e/suite_test.go Outdated
Comment on lines +56 to +57
// KEDA is supported on all environments — pre-installed on OCP (CMA operator)
// and CKS (helm), installed at runtime on kind-emulator via install.sh.

Copilot AI Mar 10, 2026


This comment says KEDA is pre-installed on OCP/CKS, but the e2e config defaults KEDA_NAMESPACE to "keda-system" while the PR description calls out OCP using "openshift-keda" and CKS using "keda". To avoid confusion (and failing the KEDA operator readiness check when SCALER_BACKEND=keda), it would help to mention that KEDA_NAMESPACE must be set appropriately per cluster (or adjust the defaults elsewhere).
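A per-environment default along these lines would resolve the mismatch. The helper name and the `cks` environment value are assumptions; the namespace values themselves come from this PR's description (openshift-keda for the CMA operator on OCP, keda for the upstream chart on CKS, keda-system as the e2e config default):

```shell
# Pick a sensible KEDA_NAMESPACE default per cluster, still overridable.
default_keda_namespace() {
    case "${ENVIRONMENT:-kind-emulator}" in
        openshift) echo "openshift-keda" ;;  # CMA operator install
        cks)       echo "keda" ;;            # upstream helm chart
        *)         echo "keda-system" ;;     # e2e config default
    esac
}
KEDA_NAMESPACE="${KEDA_NAMESPACE:-$(default_keda_namespace)}"
```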

- Remove environment skip in scale_from_zero_test.go — test now runs on
  all platforms (KEDA must be pre-installed on the cluster)
- Add retry logic to detect_inference_pool_api_group() to handle the race
  where InferencePool instances haven't been created yet after helmfile deploy
- Make deploy_keda() skip helm install when KEDA CRD already exists
  (pre-installed on OCP via CMA operator, on CKS via helm)
- Remove environment guard on SCALER_BACKEND=keda — supported everywhere

Signed-off-by: Andy Anderson <andy@clubanderson.com>
Signed-off-by: Andrew Anderson <andy@clubanderson.com>
@clubanderson clubanderson force-pushed the fix/disable-scale-from-zero-cks branch from dbab13c to a9fa856 Compare March 10, 2026 03:25
Comment thread test/e2e/scale_from_zero_test.go Outdated
if cfg.Environment == "openshift" {
    Skip("Scale-from-zero test is disabled on OpenShift")
}
// Scale-from-zero requires GIE flow control, InferenceObjective, and KEDA
Collaborator


Scale-from-zero does not require KEDA. Please update the misleading comment.

The kubectl wait --timeout=60s for all deployments in the llm-d
namespace was too short for model-serving pods (vLLM) that need to
download and load large models (e.g. Meta-Llama-3.1-8B) into GPU
memory. This caused both OCP and CKS nightly E2E to fail at the
"Deploy guide via WVA install.sh" step.

Default is now 600s (10 min), overridable via DEPLOY_WAIT_TIMEOUT
env var. The vLLM startupProbe already allows up to 30 minutes.

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Copilot AI review requested due to automatic review settings March 11, 2026 15:33
@clubanderson
Contributor Author

@lionelvillard — This PR is blocking both OCP and CKS nightly E2E runs. Both have been failing every night (Mar 9-11) at the "Deploy guide via WVA install.sh" step due to two issues fixed here:

  1. kubectl wait --timeout=60s is too short — vLLM needs several minutes to download and load Meta-Llama-3.1-8B into GPU memory. The 60s timeout expires while the model is still loading (pod shows 1/2 Running). This PR bumps the default to 600s (overridable via DEPLOY_WAIT_TIMEOUT).

  2. KEDA environment guard rejects non-emulator environments: install.sh errors with "KEDA scaler backend is only supported for kind-emulator environment", even though KEDA (Custom Metrics Autoscaler) is now installed on both pok-prod (openshift-keda) and waldorf (keda namespace). This PR removes that guard.

Failure logs:

  • OCP: error: timed out waiting for the condition on deployments/ms-workload-autoscaling-llm-d-modelservice-decode → KEDA error → exit 1
  • CKS: identical failure pattern

Would appreciate a review when you get a chance — every nightly run is failing until this lands.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

deploy/install.sh:1117

  • The deploy_keda flow installs a third-party Helm chart (kedacore/keda) directly from the public kedacore repo without pinning it to an immutable version or verifying its integrity. If the chart repository or its DNS is compromised, a malicious chart could be deployed with cluster-admin privileges, leading to full cluster compromise. Pin this dependency to a specific, trusted chart version (or content hash/digest) and add integrity verification or vendoring to ensure only vetted KEDA chart code is deployed.
    # Skip install if KEDA ScaledObject CRD already exists (pre-installed on cluster)
    if kubectl get crd scaledobjects.keda.sh >/dev/null 2>&1; then
        log_success "KEDA is already installed on this cluster — skipping helm install"
        return
    fi

    kubectl create namespace "$KEDA_NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -

    helm repo add kedacore https://kedacore.github.io/charts 2>/dev/null || true
    helm repo update

    if ! helm upgrade -i keda kedacore/keda \
        -n "$KEDA_NAMESPACE" \
        --set prometheus.metricServer.enabled=true \
        --set prometheus.operator.enabled=true \
        --wait \
        --timeout=5m; then

Comment thread deploy/install.sh Outdated
Comment on lines +1101 to +1104
# Skip install if KEDA ScaledObject CRD already exists (pre-installed on cluster)
if kubectl get crd scaledobjects.keda.sh >/dev/null 2>&1; then
    log_success "KEDA is already installed on this cluster — skipping helm install"
    return

Copilot AI Mar 11, 2026


deploy_keda() skips installation solely based on the ScaledObject CRD existing. A stale CRD can remain after an uninstall, which would cause the script to skip install but still leave KEDA non-functional (no operator / metrics adapter), leading to hard-to-debug E2E failures later. Consider also verifying KEDA is actually running (e.g., check for the relevant deployments/pods or APIService availability) before deciding to skip.

Suggested change
- # Skip install if KEDA ScaledObject CRD already exists (pre-installed on cluster)
- if kubectl get crd scaledobjects.keda.sh >/dev/null 2>&1; then
-     log_success "KEDA is already installed on this cluster — skipping helm install"
-     return
+ # If the ScaledObject CRD exists, also verify that KEDA components are actually running
+ # before deciding to skip installation. A stale CRD can remain after uninstall.
+ if kubectl get crd scaledobjects.keda.sh >/dev/null 2>&1; then
+     # Check for KEDA operator pods across all namespaces
+     if kubectl get pods -A 2>/dev/null | grep -q "keda-operator"; then
+         # Optionally verify the external metrics APIService is available
+         if kubectl get apiservice v1beta1.external.metrics.k8s.io >/dev/null 2>&1; then
+             log_success "KEDA CRD and runtime components detected — skipping helm install"
+             return
+         fi
+     fi
+     log_warning "KEDA ScaledObject CRD found but KEDA components not detected; proceeding with helm install/upgrade"

// Scale-from-zero requires GIE flow control, InferenceObjective, and KEDA
// (ScaledObject with minReplicas=0). KEDA must be pre-installed on the cluster.
// Only kind-emulator installs KEDA at runtime via install.sh.


Copilot AI Mar 11, 2026


The Scale-From-Zero test now runs on OpenShift, but when SCALER_BACKEND != "keda" it will create an HPA with minReplicas=0 (via fixtures.EnsureHPA) which is known to be rejected on OpenShift (HPAScaleToZero is forced off in config.go). Add an explicit guard in BeforeAll to Skip/Fail with a clear message unless cfg.ScalerBackend=="keda" (or cfg.ScaleToZeroEnabled is true) to avoid deterministic failures from misconfiguration.

Suggested change
+ // Guard against misconfiguration on platforms (e.g. OpenShift) that reject HPA minReplicas=0.
+ // This test requires either KEDA as the scaler backend or explicit scale-to-zero enablement.
+ if cfg.ScalerBackend != "keda" && !cfg.ScaleToZeroEnabled {
+     Skip("Scale-From-Zero test requires SCALER_BACKEND=\"keda\" or ENABLE_SCALE_TO_ZERO=true; current configuration does not support scale-to-zero HPAs.")
+ }

- deploy_keda(): Check operator pods + APIService, not just CRD, to
  avoid false skip when stale CRD remains after prior uninstall
- detect_inference_pool_api_group(): Implement actual namespace-first
  then cluster-wide fallback (comment said fallback but code didn't)
- Pin KEDA chart version (KEDA_CHART_VERSION, default 2.19.0) for
  reproducible installs
- Fix ENABLE_SCALE_TO_ZERO default inconsistency in helm --set
- Add Skip guard in scale-from-zero test for non-KEDA environments
  where HPA rejects minReplicas=0
- Fix misleading comment that said scale-from-zero requires KEDA
- Document per-environment KEDA_NAMESPACE values in suite_test.go

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
@lionelvillard
Collaborator

/ok-to-test

@lionelvillard
Collaborator

/trigger-e2e-full

@github-actions
Contributor

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

@github-actions
Contributor

🚀 Kind E2E (full) triggered by /trigger-e2e-full

View the Kind E2E workflow run

@lionelvillard lionelvillard enabled auto-merge (squash) March 12, 2026 14:10
@github-actions
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource       Total   Allocated   Available
GPUs           50      31          19

Cluster        Value
Nodes          16 (7 with GPUs)
Total CPU      993 cores
Total Memory   10383 Gi
GPUs required  4 (min) / 6 (recommended)

@clubanderson
Contributor Author

/lgtm
/approve

@github-actions github-actions bot added the lgtm Looks good to me, indicates that a PR is ready to be merged. label Mar 13, 2026
@clubanderson clubanderson merged commit 7196434 into llm-d:main Mar 13, 2026
15 checks passed