🐛 Enable scale-from-zero E2E on CKS and OCP with KEDA support#865

Merged
clubanderson merged 3 commits into llm-d:main from clubanderson:fix/disable-scale-from-zero-cks
Mar 13, 2026
Conversation

@clubanderson
Contributor

@clubanderson clubanderson commented Mar 10, 2026

Summary

  • Remove environment skip in scale_from_zero_test.go — test now runs on all platforms
  • Add retry logic to detect_inference_pool_api_group() (6 retries, 10s apart) to handle the race where InferencePool instances haven't been created yet after helmfile deploy
  • Make deploy_keda() skip helm install when KEDA CRD already exists (pre-installed on OCP via CMA operator, on CKS via helm)
  • Remove environment guard on SCALER_BACKEND=keda — now supported on all environments
  • Increase deploy wait timeout from 60s to 600s — the kubectl wait --timeout=60s for all deployments was too short for model-serving pods (vLLM) that need to download and load large models into GPU memory (e.g. Meta-Llama-3.1-8B). Both OCP and CKS nightly E2E were failing at "Deploy guide via WVA install.sh" due to this. Now defaults to 600s, overridable via DEPLOY_WAIT_TIMEOUT env var.
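The retry described in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's actual function body; only the retry count (6), interval (10s), and the idea of polling for InferencePool instances come from this PR, and the API group names and environment variable names here are assumptions:

```shell
# Hypothetical sketch of retry logic for detect_inference_pool_api_group().
# helmfile returns before the InferencePool CRs exist, so poll with retries.
detect_inference_pool_api_group() {
    local retries="${POOL_DETECT_RETRIES:-6}"
    local delay="${POOL_DETECT_DELAY:-10}"
    local attempt group
    for attempt in $(seq 1 "$retries"); do
        # Try candidate API groups until one has InferencePool instances.
        for group in inference.networking.k8s.io inference.networking.x-k8s.io; do
            if kubectl get "inferencepools.${group}" -A --no-headers 2>/dev/null | grep -q .; then
                echo "$group"
                return 0
            fi
        done
        [ "$attempt" -lt "$retries" ] && sleep "$delay"
    done
    return 1
}
```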

Context

PR #849 enabled scale-from-zero tests but they failed on CKS (nightly runs #51, #52) due to:

  1. Pool group detection race — InferencePools not yet created when detect_inference_pool_api_group() runs
  2. SCALER_BACKEND=keda was blocked for non-emulator environments
  3. HPAScaleToZero feature gate is disabled on both CKS and OCP — KEDA ScaledObject is the workaround

Additionally, both OCP and CKS nightly E2E runs (Mar 10-11) were failing because:
4. kubectl wait --for=condition=Available deployment --all --timeout=60s expired before vLLM finished loading the model
5. Script then hit the KEDA environment guard and exited with code 1
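The timeout change in point 4 amounts to making the wait budget env-overridable instead of hard-coded. A sketch under assumed names (the real change lives in deploy/install.sh; the function name here is hypothetical):

```shell
# Default to 600s, overridable via DEPLOY_WAIT_TIMEOUT, because vLLM pods
# can spend minutes downloading and loading a model into GPU memory.
DEPLOY_WAIT_TIMEOUT="${DEPLOY_WAIT_TIMEOUT:-600s}"

wait_for_deployments() {
    local ns="$1"
    kubectl wait --for=condition=Available deployment --all \
        -n "$ns" --timeout="$DEPLOY_WAIT_TIMEOUT"
}
```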

KEDA is now pre-installed on both clusters:

  • OCP (pok-prod-001): Custom Metrics Autoscaler operator v2.18.1-2 in openshift-keda
  • CKS (waldorf): Upstream KEDA helm chart v2.18.1 in keda namespace

Companion PR: llm-d/llm-d-infra#87

Test plan

  • CKS nightly E2E passes with scale-from-zero test enabled
  • OCP nightly E2E passes with scale-from-zero test enabled
  • kind-emulator CI continues to pass (KEDA installed at runtime as before)

Copilot AI review requested due to automatic review settings March 10, 2026 01:16
Contributor

Copilot AI left a comment


Pull request overview

This PR disables the scale-from-zero E2E test outside the kind-emulator environment to prevent recurring CKS nightly failures, aligning the test’s execution scope with the only environment where its prerequisites are currently reliably controlled.

Changes:

  • Restrict Scale-From-Zero Feature E2E coverage to ENVIRONMENT=kind-emulator.
  • Update skip rationale/commentary to reflect current prerequisite constraints and known CKS/OpenShift limitations.

@clubanderson clubanderson force-pushed the fix/disable-scale-from-zero-cks branch from 9206f44 to 7a39fb2 Compare March 10, 2026 01:41
@clubanderson clubanderson changed the title from "🐛 Disable scale-from-zero test on CKS" to "🐛 Enable scale-from-zero E2E on CKS and OCP with KEDA support" Mar 10, 2026
@clubanderson clubanderson force-pushed the fix/disable-scale-from-zero-cks branch from 7a39fb2 to 912e1e5 Compare March 10, 2026 01:45
Copilot AI review requested due to automatic review settings March 10, 2026 01:45
@clubanderson clubanderson force-pushed the fix/disable-scale-from-zero-cks branch from 912e1e5 to 75d41af Compare March 10, 2026 01:48
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment thread deploy/install.sh
Comment on lines 1706 to 1710

    else
        log_info "Skipping llm-d deployment (DEPLOY_LLM_D=false)"
    fi


Copilot AI Mar 10, 2026


The code comment says deploy_keda will "detect and skip" when KEDA is pre-installed, but deploy_keda currently always runs helm upgrade/install. With the ENVIRONMENT guard removed, this will attempt to install/upgrade KEDA on OpenShift/CKS where it may already be operator-managed or where the caller lacks permissions, potentially breaking the cluster or failing the deployment. Add an explicit pre-check (e.g., presence of ScaledObject CRD and/or keda-operator deployment in the target namespace) to no-op when KEDA already exists, or reintroduce a safe gating mechanism.

Comment thread test/e2e/scale_from_zero_test.go Outdated
Comment on lines +38 to +40
// Scale-from-zero requires GIE flow control, InferenceObjective, and KEDA
// (ScaledObject with minReplicas=0). KEDA must be pre-installed on the cluster.
// Only kind-emulator installs KEDA at runtime via install.sh.

Copilot AI Mar 10, 2026


This test no longer skips OpenShift, but it will still try to create an HPA with minReplicas=0 when SCALER_BACKEND is left at its default (prometheus-adapter). On clusters without the HPAScaleToZero feature gate (notably OpenShift per config.go), that HPA creation is expected to fail. Consider adding an explicit precondition in BeforeAll to require SCALER_BACKEND=keda when ScaleToZeroEnabled is false (or skip with a clear message) so the test runs reliably across environments.

Comment thread deploy/install.sh
Comment on lines 1708 to 1709
        log_info "Skipping llm-d deployment (DEPLOY_LLM_D=false)"
    fi

Copilot AI Mar 10, 2026


Removing the ENVIRONMENT=kind-emulator guard means deploy_keda can now run on any cluster where SCALER_BACKEND=keda, but deploy_keda fetches and installs the kedacore/keda Helm chart from https://kedacore.github.io/charts without pinning to an immutable version or verifying integrity. This makes cluster bootstrap depend on a mutable third-party artifact, so a compromise or hijack of that chart repository or its DNS could give an attacker code execution in the cluster with KEDA’s privileges. To reduce this supply-chain risk, restrict this remote Helm install to non-production/test environments and/or pin the chart to an immutable version/digest (plus checksum or signature verification) before broadening its use.
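One way to act on this suggestion is to pin the chart to an explicit version. A minimal sketch, assuming names: the function name is hypothetical, and the KEDA_CHART_VERSION variable with a 2.19.0 default matches what a later commit in this PR describes adopting:

```shell
# Pin the KEDA chart so a repo-side change cannot silently alter what
# gets deployed; consider adding provenance/signature checks as well.
KEDA_CHART_VERSION="${KEDA_CHART_VERSION:-2.19.0}"

install_keda_pinned() {
    helm repo add kedacore https://kedacore.github.io/charts 2>/dev/null || true
    helm repo update
    helm upgrade -i keda kedacore/keda \
        -n "${KEDA_NAMESPACE:-keda-system}" \
        --version "$KEDA_CHART_VERSION" \
        --wait --timeout=5m
}
```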

@clubanderson clubanderson force-pushed the fix/disable-scale-from-zero-cks branch from 75d41af to 0d180fa Compare March 10, 2026 02:10
Copilot AI review requested due to automatic review settings March 10, 2026 03:02
@clubanderson clubanderson force-pushed the fix/disable-scale-from-zero-cks branch from 0d180fa to dbab13c Compare March 10, 2026 03:02
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment thread deploy/install.sh Outdated
Comment on lines +515 to +519
# Search in the target namespace first (avoids cluster-wide RBAC issues), then fall back to -A.
local ns_flag="-A"
if [ -n "${LLMD_NS:-}" ]; then
    ns_flag="-n $LLMD_NS"
fi

Copilot AI Mar 10, 2026


The comment says the detection searches the target namespace first and then falls back to -A, but the implementation only ever uses a single ns_flag (LLMD_NS is always set by default), so there is no fallback. Either implement an actual fallback (try -n "$LLMD_NS" then retry with -A) or update the comment to match the behavior to avoid misleading future debugging.
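A fallback along the lines the comment asks for could look like this. This is a sketch, not the code that landed: the function name is hypothetical, and only the namespace-first-then-cluster-wide idea comes from the review:

```shell
# Try the target namespace first (cheaper, avoids cluster-wide RBAC),
# then genuinely fall back to a cluster-wide search.
find_inference_pools() {
    local out
    if [ -n "${LLMD_NS:-}" ]; then
        out="$(kubectl get inferencepools -n "$LLMD_NS" --no-headers 2>/dev/null)"
        if [ -n "$out" ]; then
            echo "$out"
            return 0
        fi
    fi
    # Fallback: cluster-wide listing (may require broader RBAC).
    kubectl get inferencepools -A --no-headers 2>/dev/null
}
```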

Comment thread deploy/install.sh Outdated
Comment on lines +1096 to +1100
# Skip install if KEDA ScaledObject CRD already exists (pre-installed on cluster)
if kubectl get crd scaledobjects.keda.sh >/dev/null 2>&1; then
    log_success "KEDA is already installed on this cluster — skipping helm install"
    return
fi

Copilot AI Mar 10, 2026


deploy_keda() now skips installation solely based on the ScaledObject CRD existing. However undeploy_keda() does not remove CRDs, so a prior uninstall can leave the CRD behind while the KEDA operator is gone; subsequent runs will incorrectly skip install and leave the cluster without KEDA. Consider tightening the skip condition to also verify that the KEDA operator is actually running (e.g., deployment/pods present, possibly across namespaces), or only skip when both CRD + controller are detected.

Comment thread test/e2e/suite_test.go Outdated
Comment on lines +56 to +57
// KEDA is supported on all environments — pre-installed on OCP (CMA operator)
// and CKS (helm), installed at runtime on kind-emulator via install.sh.

Copilot AI Mar 10, 2026


This comment says KEDA is pre-installed on OCP/CKS, but the e2e config defaults KEDA_NAMESPACE to "keda-system" while the PR description calls out OCP using "openshift-keda" and CKS using "keda". To avoid confusion (and failing the KEDA operator readiness check when SCALER_BACKEND=keda), it would help to mention that KEDA_NAMESPACE must be set appropriately per cluster (or adjust the defaults elsewhere).
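A per-environment default along these lines would resolve the mismatch. The helper name and the `cks` environment value are assumptions; the namespace values themselves come from this PR's description (openshift-keda for the CMA operator on OCP, keda for the upstream chart on CKS, keda-system as the e2e config default):

```shell
# Pick a sensible KEDA_NAMESPACE default per cluster, still overridable.
default_keda_namespace() {
    case "${ENVIRONMENT:-kind-emulator}" in
        openshift) echo "openshift-keda" ;;  # CMA operator install
        cks)       echo "keda" ;;            # upstream helm chart
        *)         echo "keda-system" ;;     # e2e config default
    esac
}
KEDA_NAMESPACE="${KEDA_NAMESPACE:-$(default_keda_namespace)}"
```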

- Remove environment skip in scale_from_zero_test.go — test now runs on
  all platforms (KEDA must be pre-installed on the cluster)
- Add retry logic to detect_inference_pool_api_group() to handle the race
  where InferencePool instances haven't been created yet after helmfile deploy
- Make deploy_keda() skip helm install when KEDA CRD already exists
  (pre-installed on OCP via CMA operator, on CKS via helm)
- Remove environment guard on SCALER_BACKEND=keda — supported everywhere

Signed-off-by: Andy Anderson <andy@clubanderson.com>
Signed-off-by: Andrew Anderson <andy@clubanderson.com>
@clubanderson clubanderson force-pushed the fix/disable-scale-from-zero-cks branch from dbab13c to a9fa856 Compare March 10, 2026 03:25
Comment thread test/e2e/scale_from_zero_test.go Outdated
if cfg.Environment == "openshift" {
    Skip("Scale-from-zero test is disabled on OpenShift")
}
// Scale-from-zero requires GIE flow control, InferenceObjective, and KEDA
Collaborator


Scale-from-zero does not require KEDA. Please update the misleading comment.

The kubectl wait --timeout=60s for all deployments in the llm-d
namespace was too short for model-serving pods (vLLM) that need to
download and load large models (e.g. Meta-Llama-3.1-8B) into GPU
memory. This caused both OCP and CKS nightly E2E to fail at the
"Deploy guide via WVA install.sh" step.

Default is now 600s (10 min), overridable via DEPLOY_WAIT_TIMEOUT
env var. The vLLM startupProbe already allows up to 30 minutes.

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Copilot AI review requested due to automatic review settings March 11, 2026 15:33
@clubanderson
Contributor Author

@lionelvillard — This PR is blocking both OCP and CKS nightly E2E runs. Both have been failing every night (Mar 9-11) at the "Deploy guide via WVA install.sh" step due to two issues fixed here:

  1. kubectl wait --timeout=60s is too short — vLLM needs several minutes to download and load Meta-Llama-3.1-8B into GPU memory. The 60s timeout expires while the model is still loading (pod shows 1/2 Running). This PR bumps the default to 600s (overridable via DEPLOY_WAIT_TIMEOUT).

  2. KEDA environment guard rejects non-emulator environments: install.sh errors with "KEDA scaler backend is only supported for kind-emulator environment", even though KEDA (Custom Metrics Autoscaler) is now installed on both pok-prod (openshift-keda) and waldorf (keda namespace). This PR removes that guard.

Failure logs:

  • OCP: error: timed out waiting for the condition on deployments/ms-workload-autoscaling-llm-d-modelservice-decode → KEDA error → exit 1
  • CKS: identical failure pattern

Would appreciate a review when you get a chance — every nightly run is failing until this lands.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

deploy/install.sh:1117

  • The deploy_keda flow installs a third-party Helm chart (kedacore/keda) directly from the public kedacore repo without pinning it to an immutable version or verifying its integrity. If the chart repository or its DNS is compromised, a malicious chart could be deployed with cluster-admin privileges, leading to full cluster compromise. Pin this dependency to a specific, trusted chart version (or content hash/digest) and add integrity verification or vendoring to ensure only vetted KEDA chart code is deployed.
    # Skip install if KEDA ScaledObject CRD already exists (pre-installed on cluster)
    if kubectl get crd scaledobjects.keda.sh >/dev/null 2>&1; then
        log_success "KEDA is already installed on this cluster — skipping helm install"
        return
    fi

    kubectl create namespace "$KEDA_NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -

    helm repo add kedacore https://kedacore.github.io/charts 2>/dev/null || true
    helm repo update

    if ! helm upgrade -i keda kedacore/keda \
        -n "$KEDA_NAMESPACE" \
        --set prometheus.metricServer.enabled=true \
        --set prometheus.operator.enabled=true \
        --wait \
        --timeout=5m; then

Comment thread deploy/install.sh Outdated
Comment on lines +1101 to +1104
# Skip install if KEDA ScaledObject CRD already exists (pre-installed on cluster)
if kubectl get crd scaledobjects.keda.sh >/dev/null 2>&1; then
    log_success "KEDA is already installed on this cluster — skipping helm install"
    return

Copilot AI Mar 11, 2026


deploy_keda() skips installation solely based on the ScaledObject CRD existing. A stale CRD can remain after an uninstall, which would cause the script to skip install but still leave KEDA non-functional (no operator / metrics adapter), leading to hard-to-debug E2E failures later. Consider also verifying KEDA is actually running (e.g., check for the relevant deployments/pods or APIService availability) before deciding to skip.

Suggested change
- # Skip install if KEDA ScaledObject CRD already exists (pre-installed on cluster)
- if kubectl get crd scaledobjects.keda.sh >/dev/null 2>&1; then
-     log_success "KEDA is already installed on this cluster — skipping helm install"
-     return
+ # If the ScaledObject CRD exists, also verify that KEDA components are actually running
+ # before deciding to skip installation. A stale CRD can remain after uninstall.
+ if kubectl get crd scaledobjects.keda.sh >/dev/null 2>&1; then
+     # Check for KEDA operator pods across all namespaces
+     if kubectl get pods -A 2>/dev/null | grep -q "keda-operator"; then
+         # Optionally verify the external metrics APIService is available
+         if kubectl get apiservice v1beta1.external.metrics.k8s.io >/dev/null 2>&1; then
+             log_success "KEDA CRD and runtime components detected — skipping helm install"
+             return
+         fi
+     fi
+     log_warning "KEDA ScaledObject CRD found but KEDA components not detected; proceeding with helm install/upgrade"

// Scale-from-zero requires GIE flow control, InferenceObjective, and KEDA
// (ScaledObject with minReplicas=0). KEDA must be pre-installed on the cluster.
// Only kind-emulator installs KEDA at runtime via install.sh.


Copilot AI Mar 11, 2026


The Scale-From-Zero test now runs on OpenShift, but when SCALER_BACKEND != "keda" it will create an HPA with minReplicas=0 (via fixtures.EnsureHPA) which is known to be rejected on OpenShift (HPAScaleToZero is forced off in config.go). Add an explicit guard in BeforeAll to Skip/Fail with a clear message unless cfg.ScalerBackend=="keda" (or cfg.ScaleToZeroEnabled is true) to avoid deterministic failures from misconfiguration.

Suggested change
+ // Guard against misconfiguration on platforms (e.g. OpenShift) that reject HPA minReplicas=0.
+ // This test requires either KEDA as the scaler backend or explicit scale-to-zero enablement.
+ if cfg.ScalerBackend != "keda" && !cfg.ScaleToZeroEnabled {
+     Skip("Scale-From-Zero test requires SCALER_BACKEND=\"keda\" or ENABLE_SCALE_TO_ZERO=true; current configuration does not support scale-to-zero HPAs.")
+ }

- deploy_keda(): Check operator pods + APIService, not just CRD, to
  avoid false skip when stale CRD remains after prior uninstall
- detect_inference_pool_api_group(): Implement actual namespace-first
  then cluster-wide fallback (comment said fallback but code didn't)
- Pin KEDA chart version (KEDA_CHART_VERSION, default 2.19.0) for
  reproducible installs
- Fix ENABLE_SCALE_TO_ZERO default inconsistency in helm --set
- Add Skip guard in scale-from-zero test for non-KEDA environments
  where HPA rejects minReplicas=0
- Fix misleading comment that said scale-from-zero requires KEDA
- Document per-environment KEDA_NAMESPACE values in suite_test.go

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
@lionelvillard
Collaborator

/ok-to-test

@lionelvillard
Collaborator

/trigger-e2e-full

@github-actions
Contributor

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

@github-actions
Contributor

🚀 Kind E2E (full) triggered by /trigger-e2e-full

View the Kind E2E workflow run

@lionelvillard lionelvillard enabled auto-merge (squash) March 12, 2026 14:10
@github-actions
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource       Total   Allocated   Available
GPUs           50      31          19

Cluster        Value
Nodes          16 (7 with GPUs)
Total CPU      993 cores
Total Memory   10383 Gi
GPUs required  4 (min) / 6 (recommended)

@clubanderson
Contributor Author

/lgtm
/approve

@github-actions github-actions bot added the lgtm Looks good to me, indicates that a PR is ready to be merged. label Mar 13, 2026
@clubanderson clubanderson merged commit 7196434 into llm-d:main Mar 13, 2026
15 checks passed