🐛 Enable scale-from-zero E2E on CKS and OCP with KEDA support #865
Conversation
Pull request overview
This PR disables the scale-from-zero E2E test outside the kind-emulator environment to prevent recurring CKS nightly failures, aligning the test’s execution scope with the only environment where its prerequisites are currently reliably controlled.
Changes:
- Restrict Scale-From-Zero Feature E2E coverage to `ENVIRONMENT=kind-emulator`.
- Update skip rationale/commentary to reflect current prerequisite constraints and known CKS/OpenShift limitations.
Force-pushed 9206f44 to 7a39fb2
Force-pushed 7a39fb2 to 912e1e5
Force-pushed 912e1e5 to 75d41af
```sh
else
  log_info "Skipping llm-d deployment (DEPLOY_LLM_D=false)"
fi
```
The code comment says deploy_keda will "detect and skip" when KEDA is pre-installed, but deploy_keda currently always runs helm upgrade/install. With the ENVIRONMENT guard removed, this will attempt to install/upgrade KEDA on OpenShift/CKS where it may already be operator-managed or where the caller lacks permissions, potentially breaking the cluster or failing the deployment. Add an explicit pre-check (e.g., presence of ScaledObject CRD and/or keda-operator deployment in the target namespace) to no-op when KEDA already exists, or reintroduce a safe gating mechanism.
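A minimal sketch of such a pre-check. The function name, the `keda-operator` Deployment name, and the `KUBECTL`/`KEDA_NAMESPACE` variables are illustrative assumptions, not the PR's actual code; `kubectl` is invoked through an overridable variable only so the logic can be exercised without a cluster:

```shell
# Hypothetical pre-check: treat KEDA as pre-installed only when both the
# ScaledObject CRD and a keda-operator Deployment are present.
# KUBECTL, KEDA_NAMESPACE, and the deployment name are illustrative assumptions.
keda_already_installed() {
  # CRD missing: KEDA is definitely not installed
  "${KUBECTL:-kubectl}" get crd scaledobjects.keda.sh >/dev/null 2>&1 || return 1
  # The CRD alone is not enough; also require the operator Deployment
  "${KUBECTL:-kubectl}" get deployment keda-operator \
    -n "${KEDA_NAMESPACE:-keda-system}" >/dev/null 2>&1
}
```

With this in place, `deploy_keda` could no-op early with `keda_already_installed && return` and fall through to the helm install otherwise.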
```go
// Scale-from-zero requires GIE flow control, InferenceObjective, and KEDA
// (ScaledObject with minReplicas=0). KEDA must be pre-installed on the cluster.
// Only kind-emulator installs KEDA at runtime via install.sh.
```
This test no longer skips OpenShift, but it will still try to create an HPA with minReplicas=0 when SCALER_BACKEND is left at its default (prometheus-adapter). On clusters without the HPAScaleToZero feature gate (notably OpenShift per config.go), that HPA creation is expected to fail. Consider adding an explicit precondition in BeforeAll to require SCALER_BACKEND=keda when ScaleToZeroEnabled is false (or skip with a clear message) so the test runs reliably across environments.
```sh
  log_info "Skipping llm-d deployment (DEPLOY_LLM_D=false)"
fi
```
Removing the ENVIRONMENT=kind-emulator guard means deploy_keda can now run on any cluster where SCALER_BACKEND=keda, but deploy_keda fetches and installs the kedacore/keda Helm chart from https://kedacore.github.io/charts without pinning to an immutable version or verifying integrity. This makes cluster bootstrap depend on a mutable third-party artifact, so a compromise or hijack of that chart repository or its DNS could give an attacker code execution in the cluster with KEDA’s privileges. To reduce this supply-chain risk, restrict this remote Helm install to non-production/test environments and/or pin the chart to an immutable version/digest (plus checksum or signature verification) before broadening its use.
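One way to pin the chart, sketched below. `KEDA_CHART_VERSION` and its 2.19.0 default mirror the variable a later commit in this PR adds; the function name and the `HELM` indirection (there only so the logic is testable without a cluster) are illustrative:

```shell
# Sketch: install the KEDA chart pinned to an explicit version for reproducibility.
# KEDA_CHART_VERSION (default 2.19.0) mirrors the variable added later in this PR;
# the HELM indirection is purely for testability and is an assumption.
install_keda_pinned() {
  local version="${KEDA_CHART_VERSION:-2.19.0}"
  "${HELM:-helm}" repo add kedacore https://kedacore.github.io/charts 2>/dev/null || true
  "${HELM:-helm}" repo update
  # --version pins the chart so the install cannot silently float to a newer release
  "${HELM:-helm}" upgrade -i keda kedacore/keda \
    -n "${KEDA_NAMESPACE:-keda-system}" \
    --version "$version" \
    --wait --timeout=5m
}
```

Version pinning alone does not verify integrity; for stronger guarantees, `helm pull --verify` against chart provenance, or vendoring the chart into the repo, would also be options.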
Force-pushed 75d41af to 0d180fa
Force-pushed 0d180fa to dbab13c
```sh
# Search in the target namespace first (avoids cluster-wide RBAC issues), then fall back to -A.
local ns_flag="-A"
if [ -n "${LLMD_NS:-}" ]; then
  ns_flag="-n $LLMD_NS"
fi
```
The comment says the detection searches the target namespace first and then falls back to -A, but the implementation only ever uses a single ns_flag (LLMD_NS is always set by default), so there is no fallback. Either implement an actual fallback (try -n "$LLMD_NS" then retry with -A) or update the comment to match the behavior to avoid misleading future debugging.
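A sketch of an actual namespace-first, then cluster-wide lookup. The function name and resource name are illustrative (the real helper is `detect_inference_pool_api_group()`), and `kubectl` is invoked through an overridable `KUBECTL` variable only for testability:

```shell
# Hypothetical fallback: query the target namespace first, and only if that
# returns nothing (or fails, e.g. due to RBAC) retry cluster-wide with -A.
detect_pools_with_fallback() {
  local out=""
  if [ -n "${LLMD_NS:-}" ]; then
    out=$("${KUBECTL:-kubectl}" get inferencepools -n "$LLMD_NS" -o name 2>/dev/null || true)
  fi
  if [ -z "$out" ]; then
    # Fallback: cluster-wide search (needs list permission across namespaces)
    out=$("${KUBECTL:-kubectl}" get inferencepools -A -o name 2>/dev/null || true)
  fi
  printf '%s\n' "$out"
}
```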
```sh
# Skip install if KEDA ScaledObject CRD already exists (pre-installed on cluster)
if kubectl get crd scaledobjects.keda.sh >/dev/null 2>&1; then
  log_success "KEDA is already installed on this cluster — skipping helm install"
  return
fi
```
deploy_keda() now skips installation solely based on the ScaledObject CRD existing. However undeploy_keda() does not remove CRDs, so a prior uninstall can leave the CRD behind while the KEDA operator is gone; subsequent runs will incorrectly skip install and leave the cluster without KEDA. Consider tightening the skip condition to also verify that the KEDA operator is actually running (e.g., deployment/pods present, possibly across namespaces), or only skip when both CRD + controller are detected.
```go
// KEDA is supported on all environments — pre-installed on OCP (CMA operator)
// and CKS (helm), installed at runtime on kind-emulator via install.sh.
```
This comment says KEDA is pre-installed on OCP/CKS, but the e2e config defaults KEDA_NAMESPACE to "keda-system" while the PR description calls out OCP using "openshift-keda" and CKS using "keda". To avoid confusion (and failing the KEDA operator readiness check when SCALER_BACKEND=keda), it would help to mention that KEDA_NAMESPACE must be set appropriately per cluster (or adjust the defaults elsewhere).
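One way to keep the default aligned per cluster is a small mapping, sketched below. The namespace values (openshift-keda on OCP, keda on CKS, keda-system otherwise) come from this PR's discussion; the function name is illustrative and the "cks" `ENVIRONMENT` string is an assumption:

```shell
# Illustrative per-environment default for KEDA_NAMESPACE, based on the values
# discussed in this PR. The "cks" environment string is an assumption.
default_keda_namespace() {
  case "${ENVIRONMENT:-}" in
    openshift) echo "openshift-keda" ;;  # pre-installed via CMA operator
    cks)       echo "keda" ;;            # pre-installed via helm
    *)         echo "keda-system" ;;     # kind-emulator: installed by install.sh
  esac
}
KEDA_NAMESPACE="${KEDA_NAMESPACE:-$(default_keda_namespace)}"
```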
- Remove environment skip in scale_from_zero_test.go — test now runs on all platforms (KEDA must be pre-installed on the cluster)
- Add retry logic to detect_inference_pool_api_group() to handle the race where InferencePool instances haven't been created yet after helmfile deploy
- Make deploy_keda() skip helm install when KEDA CRD already exists (pre-installed on OCP via CMA operator, on CKS via helm)
- Remove environment guard on SCALER_BACKEND=keda — supported everywhere

Signed-off-by: Andy Anderson <andy@clubanderson.com>
Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Force-pushed dbab13c to a9fa856
```go
if cfg.Environment == "openshift" {
	Skip("Scale-from-zero test is disabled on OpenShift")
}
// Scale-from-zero requires GIE flow control, InferenceObjective, and KEDA
```
Scale-from-zero does not require KEDA. Please update the misleading comment.
The kubectl wait --timeout=60s for all deployments in the llm-d namespace was too short for model-serving pods (vLLM) that need to download and load large models (e.g. Meta-Llama-3.1-8B) into GPU memory. This caused both OCP and CKS nightly E2E to fail at the "Deploy guide via WVA install.sh" step. Default is now 600s (10 min), overridable via DEPLOY_WAIT_TIMEOUT env var. The vLLM startupProbe already allows up to 30 minutes. Signed-off-by: Andrew Anderson <andy@clubanderson.com>
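The fix described above can be sketched as follows. `DEPLOY_WAIT_TIMEOUT` and the 600s default come from the commit message; the function name, namespace argument, and `KUBECTL` indirection (for testability) are illustrative:

```shell
# Wait for all deployments in a namespace to become Available, with an
# overridable timeout. Defaults to 600s (10 min) so vLLM pods have time to
# download and load large models into GPU memory.
wait_for_deployments() {
  local ns="${1:?namespace required}"
  "${KUBECTL:-kubectl}" wait --for=condition=Available deployment --all \
    -n "$ns" --timeout="${DEPLOY_WAIT_TIMEOUT:-600s}"
}
```

A nightly run could export `DEPLOY_WAIT_TIMEOUT=900s` for especially large models without touching the script.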
@lionelvillard — This PR is blocking both OCP and CKS nightly E2E runs. Both have been failing every night (Mar 9-11) at the "Deploy guide via WVA install.sh" step due to two issues fixed here:
Failure logs:
Would appreciate a review when you get a chance — every nightly run is failing until this lands.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
deploy/install.sh:1117
- The deploy_keda flow installs a third-party Helm chart (kedacore/keda) directly from the public kedacore repo without pinning it to an immutable version or verifying its integrity. If the chart repository or its DNS is compromised, a malicious chart could be deployed with cluster-admin privileges, leading to full cluster compromise. Pin this dependency to a specific, trusted chart version (or content hash/digest) and add integrity verification or vendoring to ensure only vetted KEDA chart code is deployed.
```sh
# Skip install if KEDA ScaledObject CRD already exists (pre-installed on cluster)
if kubectl get crd scaledobjects.keda.sh >/dev/null 2>&1; then
  log_success "KEDA is already installed on this cluster — skipping helm install"
  return
fi

kubectl create namespace "$KEDA_NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -
helm repo add kedacore https://kedacore.github.io/charts 2>/dev/null || true
helm repo update

if ! helm upgrade -i keda kedacore/keda \
  -n "$KEDA_NAMESPACE" \
  --set prometheus.metricServer.enabled=true \
  --set prometheus.operator.enabled=true \
  --wait \
  --timeout=5m; then
```
```sh
# Skip install if KEDA ScaledObject CRD already exists (pre-installed on cluster)
if kubectl get crd scaledobjects.keda.sh >/dev/null 2>&1; then
  log_success "KEDA is already installed on this cluster — skipping helm install"
  return
```
deploy_keda() skips installation solely based on the ScaledObject CRD existing. A stale CRD can remain after an uninstall, which would cause the script to skip install but still leave KEDA non-functional (no operator / metrics adapter), leading to hard-to-debug E2E failures later. Consider also verifying KEDA is actually running (e.g., check for the relevant deployments/pods or APIService availability) before deciding to skip.
Suggested change:

```diff
-# Skip install if KEDA ScaledObject CRD already exists (pre-installed on cluster)
-if kubectl get crd scaledobjects.keda.sh >/dev/null 2>&1; then
-  log_success "KEDA is already installed on this cluster — skipping helm install"
-  return
+# If the ScaledObject CRD exists, also verify that KEDA components are actually running
+# before deciding to skip installation. A stale CRD can remain after uninstall.
+if kubectl get crd scaledobjects.keda.sh >/dev/null 2>&1; then
+  # Check for KEDA operator pods across all namespaces
+  if kubectl get pods -A 2>/dev/null | grep -q "keda-operator"; then
+    # Optionally verify the external metrics APIService is available
+    if kubectl get apiservice v1beta1.external.metrics.k8s.io >/dev/null 2>&1; then
+      log_success "KEDA CRD and runtime components detected — skipping helm install"
+      return
+    fi
+  fi
+  log_warning "KEDA ScaledObject CRD found but KEDA components not detected; proceeding with helm install/upgrade"
```
```go
// Scale-from-zero requires GIE flow control, InferenceObjective, and KEDA
// (ScaledObject with minReplicas=0). KEDA must be pre-installed on the cluster.
// Only kind-emulator installs KEDA at runtime via install.sh.
```
The Scale-From-Zero test now runs on OpenShift, but when SCALER_BACKEND != "keda" it will create an HPA with minReplicas=0 (via fixtures.EnsureHPA) which is known to be rejected on OpenShift (HPAScaleToZero is forced off in config.go). Add an explicit guard in BeforeAll to Skip/Fail with a clear message unless cfg.ScalerBackend=="keda" (or cfg.ScaleToZeroEnabled is true) to avoid deterministic failures from misconfiguration.
Suggested change:

```go
// Guard against misconfiguration on platforms (e.g. OpenShift) that reject HPA minReplicas=0.
// This test requires either KEDA as the scaler backend or explicit scale-to-zero enablement.
if cfg.ScalerBackend != "keda" && !cfg.ScaleToZeroEnabled {
	Skip("Scale-From-Zero test requires SCALER_BACKEND=\"keda\" or ENABLE_SCALE_TO_ZERO=true; current configuration does not support scale-to-zero HPAs.")
}
```
- deploy_keda(): Check operator pods + APIService, not just CRD, to avoid false skip when stale CRD remains after prior uninstall
- detect_inference_pool_api_group(): Implement actual namespace-first then cluster-wide fallback (comment said fallback but code didn't)
- Pin KEDA chart version (KEDA_CHART_VERSION, default 2.19.0) for reproducible installs
- Fix ENABLE_SCALE_TO_ZERO default inconsistency in helm --set
- Add Skip guard in scale-from-zero test for non-KEDA environments where HPA rejects minReplicas=0
- Fix misleading comment that said scale-from-zero requires KEDA
- Document per-environment KEDA_NAMESPACE values in suite_test.go

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
/ok-to-test |
/trigger-e2e-full |
🚀 OpenShift E2E — approve and run
🚀 Kind E2E (full) triggered by |
GPU Pre-flight Check ✅ GPUs are available for e2e-openshift tests. Proceeding with deployment.
/lgtm |
Summary
- Remove environment skip in scale_from_zero_test.go — test now runs on all platforms
- Add retry logic to detect_inference_pool_api_group() (6 retries, 10s apart) to handle the race where InferencePool instances haven't been created yet after helmfile deploy
- Make deploy_keda() skip helm install when the KEDA CRD already exists (pre-installed on OCP via CMA operator, on CKS via helm)
- SCALER_BACKEND=keda — now supported on all environments
- kubectl wait --timeout=60s for all deployments was too short for model-serving pods (vLLM) that need to download and load large models into GPU memory (e.g. Meta-Llama-3.1-8B). Both OCP and CKS nightly E2E were failing at "Deploy guide via WVA install.sh" due to this. Now defaults to 600s, overridable via DEPLOY_WAIT_TIMEOUT env var.

Context
PR #849 enabled scale-from-zero tests but they failed on CKS (nightly runs #51, #52) due to:
1. A race where InferencePool instances haven't been created yet when detect_inference_pool_api_group() runs
2. SCALER_BACKEND=keda was blocked for non-emulator environments
3. The HPAScaleToZero feature gate is disabled on both CKS and OCP — KEDA ScaledObject is the workaround

Additionally, both OCP and CKS nightly E2E runs (Mar 10-11) were failing because:
4. kubectl wait --for=condition=Available deployment --all --timeout=60s expired before vLLM finished loading the model
5. The script then hit the KEDA environment guard and exited with code 1
KEDA is now pre-installed on both clusters:
- OCP: openshift-keda namespace (via the CMA operator)
- CKS: keda namespace (via helm)

Companion PR: llm-d/llm-d-infra#87
Test plan