e2e: add provider switch (upstream/azure), bump keda-kaito-scaler to v0.5.1, instrument timings #50

Merged
zhuangqh merged 4 commits into kaito-project:main from rambohe-ch:bump-scaler-to-v0.5.0 on May 12, 2026

@rambohe-ch commented on May 11, 2026

  1. Provider switch
  • New E2E_PROVIDER variable in versions.env (default: upstream).
  • 'upstream': everything via Helm/upstream manifests (current behavior).
  • 'azure': enable AKS managed add-ons at cluster create time (--enable-keda, --enable-gateway-api). install-components.sh skips the Helm KEDA install and the upstream Gateway API standard-install.yaml apply, falling back only if the managed CRDs/operator are absent.
  • Threaded through hack/e2e/scripts/{run-e2e-local,setup-cluster,install-components,validate-components}.sh, .github/actions/e2e-base-setup/action.yaml, and both .github/workflows/e2e*.yaml (workflow_dispatch input 'provider').
  • Derived KEDA_NAMESPACE: 'keda' for upstream, 'kube-system' for azure (managed KEDA add-on lives in kube-system; keda-kaito-scaler is installed in the same namespace so KEDA can resolve its ClusterTriggerAuthentication Secrets).
  • test/e2e/cases.go: kedaScalerNamespace() + init() rewrite of the scaling case's NetworkPolicyAllowedNamespaces so 'kube-system' replaces 'keda' under the azure provider.
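
A minimal sketch of how the provider switch could drive the namespace and the Helm fallback, assuming only the mapping described above. The helper names `derive_keda_namespace` and `needs_helm_keda` are hypothetical, and checking the `scaledobjects.keda.sh` CRD alone is a simplification of the "managed CRDs/operator are absent" test.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the actual scripts may structure this differently.

# Map the provider to the namespace that hosts KEDA and keda-kaito-scaler.
derive_keda_namespace() {
  local provider="${1:-upstream}"
  case "${provider}" in
    upstream) echo "keda" ;;
    azure)    echo "kube-system" ;;
    *)
      echo "unsupported E2E_PROVIDER: ${provider}" >&2
      return 1
      ;;
  esac
}

# Return 0 when KEDA still has to be installed via Helm.
# Under 'azure' Helm is only a fallback for a missing managed add-on.
needs_helm_keda() {
  local provider="$1"
  if [ "${provider}" = "azure" ] \
    && kubectl get crd scaledobjects.keda.sh >/dev/null 2>&1; then
    return 1  # managed KEDA CRD is served; skip the Helm install
  fi
  return 0
}

E2E_PROVIDER="${E2E_PROVIDER:-upstream}"
KEDA_NAMESPACE="$(derive_keda_namespace "${E2E_PROVIDER}")"
export KEDA_NAMESPACE
echo "provider=${E2E_PROVIDER} keda_namespace=${KEDA_NAMESPACE}"
```

Rejecting unknown provider values early keeps a typo in E2E_PROVIDER from silently falling through to the upstream path.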
  2. setup-cluster.sh azure prerequisites
  • Idempotently install/update the aks-preview Azure CLI extension.
  • Register Microsoft.ContainerService/ManagedGatewayAPIPreview feature flag and wait for it to reach 'Registered' (required by --enable-gateway-api).
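
The two prerequisite steps might look roughly like the sketch below, which relies on the standard `az extension add --upgrade`, `az feature register`, and `az feature show` commands. The function names, the 600s timeout, and the 15s poll interval are illustrative assumptions, not values taken from setup-cluster.sh.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; function names, timeout, and poll interval are assumptions.

# Poll the feature flag until it reports 'Registered' or the timeout elapses.
wait_for_feature_registered() {
  local namespace="$1" feature="$2" timeout="${3:-600}" interval="${4:-15}"
  local waited=0 state
  while [ "${waited}" -lt "${timeout}" ]; do
    state="$(az feature show --namespace "${namespace}" --name "${feature}" \
      --query properties.state --output tsv)"
    if [ "${state}" = "Registered" ]; then
      return 0
    fi
    sleep "${interval}"
    waited=$((waited + interval))
  done
  echo "feature ${namespace}/${feature} not Registered after ${timeout}s" >&2
  return 1
}

ensure_azure_prerequisites() {
  # --upgrade installs the extension if missing and updates it otherwise,
  # which makes the call safe to repeat.
  az extension add --name aks-preview --upgrade \
    && az feature register --namespace Microsoft.ContainerService \
         --name ManagedGatewayAPIPreview \
    && wait_for_feature_registered Microsoft.ContainerService ManagedGatewayAPIPreview
}
```

Calling `ensure_azure_prerequisites` before cluster creation would then block until the flag is usable by --enable-gateway-api, or fail with a clear message.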
  3. keda-kaito-scaler v0.4.1 -> v0.5.1 in versions.env.

  4. Install ordering / phasing

  • keda-kaito-scaler moved from phase2 to phase1 (no install-time dep on KEDA; only emits ScaledObjects later).
  • gpu-node-mocker moved from phase2 to phase1; phase2 now contains only Istio (renamed phase2-istio).
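
The phasing can be pictured with the sketch below: tasks inside a phase run concurrently, and the next phase starts only after the previous one succeeds. This `run_phase` and the `install_*` stand-ins are hypothetical and not copied from install-components.sh; placing the KAITO operator in phase1 is an assumption inferred from the description.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a phased installer; names are illustrative only.

# Run every task of a phase concurrently and fail the phase if any task fails.
run_phase() {
  local phase="$1"
  shift
  local pids="" failed=0 task pid
  for task in "$@"; do
    "${task}" &
    pids="${pids} $!"
  done
  for pid in ${pids}; do
    wait "${pid}" || failed=1
  done
  if [ "${failed}" -ne 0 ]; then
    echo "${phase}: at least one task failed" >&2
    return 1
  fi
  echo "${phase}: done"
}

# Stand-ins for the real installers.
install_kaito_operator()    { sleep 1; echo "kaito operator installed"; }
install_keda_kaito_scaler() { echo "keda-kaito-scaler installed"; }
install_gpu_node_mocker()   { echo "gpu-node-mocker installed"; }
install_istio()             { echo "istio installed"; }

run_phase phase1 install_kaito_operator install_keda_kaito_scaler install_gpu_node_mocker \
  && run_phase phase2-istio install_istio
```

Waiting on each PID individually, rather than a bare `wait`, is what lets the phase report a failure from any single task.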
  5. gpu-node-mocker CRD readiness gate
  • cmd/gpu-node-mocker/main.go: new checkRequiredCRDs() runs a discovery query for karpenter.sh/v1 'nodeclaims' before NewManager. If the CRD is not yet served the process exits non-zero so kubelet CrashLoopBackOff retries until the KAITO operator finishes installing it. This is what makes the parallel phase1 ordering safe.
  • install-components.sh: gpu-node-mocker rollout-status timeout bumped 120s -> 420s to absorb the kubelet backoff window.
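
One way to read the 120s -> 420s bump is to count how many container restarts fit in each timeout. The sketch below assumes the documented kubelet restart back-off (10s, doubling on each restart, capped at 300s) and ignores the time each attempt spends starting and exiting; `retries_within` is a hypothetical helper, and the PR does not state that this is how 420s was chosen.

```shell
#!/usr/bin/env bash
# Hypothetical model of CrashLoopBackOff delays; not taken from the PR.

# Count how many back-off delays fit entirely inside a timeout (in seconds).
retries_within() {
  local budget="$1" delay=10 cap=300 elapsed=0 retries=0
  while [ $((elapsed + delay)) -le "${budget}" ]; do
    elapsed=$((elapsed + delay))
    retries=$((retries + 1))
    delay=$((delay * 2))
    if [ "${delay}" -gt "${cap}" ]; then
      delay="${cap}"
    fi
  done
  echo "${retries}"
}

echo "120s allows $(retries_within 120) restarts"  # 10 + 20 + 40 = 70s
echo "420s allows $(retries_within 420) restarts"  # 10 + 20 + 40 + 80 + 160 = 310s
```

Under this model a 120s timeout covers three restarts and 420s covers five, so the longer timeout gives the KAITO operator several more chances to serve the CRD before the rollout check gives up.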
  6. Timing instrumentation
  • install-components.sh: run_phase records per-task and per-phase wall-clock; a final summary table prints every phase total, the sum, and every phase/task entry so the longest task in each parallel phase is visible.
  • run-e2e-local.sh: every top-level step (setup, build-push, install, validate, test, teardown) wrapped by time_step; a master summary is printed once via the cleanup trap (or directly for non-'all' runs).
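
A minimal sketch of the step-level timing, assuming durations are collected in a temporary file and printed from an EXIT trap. The `time_step` and `print_timing_summary` functions here only mirror the behavior described above; they are not the code in run-e2e-local.sh, and step names are assumed to contain no spaces.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the real scripts may record timings differently.

TIMINGS_FILE="$(mktemp)"

# Run a step, record its wall-clock duration, and preserve its exit status.
time_step() {
  local name="$1"
  shift
  local start end rc=0
  start="$(date +%s)"
  "$@" || rc=$?
  end="$(date +%s)"
  printf '%s %s %s\n' "${name}" "$((end - start))" "${rc}" >> "${TIMINGS_FILE}"
  return "${rc}"
}

# Print one row per recorded step plus the total.
print_timing_summary() {
  local name secs rc total=0
  printf '%-12s %8s %6s\n' "STEP" "SECONDS" "EXIT"
  while read -r name secs rc; do
    printf '%-12s %8s %6s\n' "${name}" "${secs}" "${rc}"
    total=$((total + secs))
  done < "${TIMINGS_FILE}"
  printf '%-12s %8s\n' "TOTAL" "${total}"
}

# Print the summary once on exit, even when a step fails part-way through.
trap 'print_timing_summary; rm -f "${TIMINGS_FILE}"' EXIT

time_step setup sleep 1
time_step install true
```

Recording the exit status next to the duration makes it easy to see in the summary which step failed and how long it ran before failing.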
  7. Change test region from Sweden Central to Australia East because of quota limitations.

  8. Disable the NetworkPolicy tests in the e2e suite so they do not block the other e2e tests. @tnsimon will improve the NetworkPolicy tests later.

Local usage:
E2E_PROVIDER=azure make e2e-up # AKS managed KEDA + Gateway API
E2E_PROVIDER=upstream make e2e-up # default, everything via Helm

Reason for Change:

Requirements

  • added unit tests and e2e tests (if applicable).

Issue Fixed:

Notes for Reviewers:

…v0.5.0, instrument timings

Provider switch
- New E2E_PROVIDER variable in versions.env (default: upstream).
- 'upstream': everything via Helm/upstream manifests (current behavior).
- 'azure': enable AKS managed add-ons at cluster create time
  (--enable-keda, --enable-gateway-api). install-components.sh skips
  the Helm KEDA install and the upstream Gateway API standard-install.yaml
  apply, falling back only if the managed CRDs/operator are absent.
- Threaded through hack/e2e/scripts/{run-e2e-local,setup-cluster,
  install-components,validate-components}.sh, .github/actions/
  e2e-base-setup/action.yaml, and both .github/workflows/e2e*.yaml
  (workflow_dispatch input 'provider').
- Derived KEDA_NAMESPACE: 'keda' for upstream, 'kube-system' for azure
  (managed KEDA add-on lives in kube-system; keda-kaito-scaler is
  installed in the same namespace so KEDA can resolve its
  ClusterTriggerAuthentication Secrets).
- test/e2e/cases.go: kedaScalerNamespace() + init() rewrite of the
  scaling case's NetworkPolicyAllowedNamespaces so 'kube-system'
  replaces 'keda' under the azure provider.

setup-cluster.sh azure prerequisites
- Idempotently install/update the aks-preview Azure CLI extension.
- Register Microsoft.ContainerService/ManagedGatewayAPIPreview feature
  flag and wait for it to reach 'Registered' (required by
  --enable-gateway-api).

keda-kaito-scaler v0.4.1 -> v0.5.0 in versions.env.

Install ordering / phasing
- keda-kaito-scaler moved from phase2 to phase1 (no install-time dep
  on KEDA; only emits ScaledObjects later).
- gpu-node-mocker moved from phase2 to phase1; phase2 now contains
  only Istio (renamed phase2-istio).

gpu-node-mocker CRD readiness gate
- cmd/gpu-node-mocker/main.go: new checkRequiredCRDs() runs a
  discovery query for karpenter.sh/v1 'nodeclaims' before NewManager.
  If the CRD is not yet served the process exits non-zero so kubelet
  CrashLoopBackOff retries until the KAITO operator finishes
  installing it. This is what makes the parallel phase1 ordering safe.
- install-components.sh: gpu-node-mocker rollout-status timeout
  bumped 120s -> 420s to absorb the kubelet backoff window.

Timing instrumentation
- install-components.sh: run_phase records per-task and per-phase
  wall-clock; a final summary table prints every phase total, the
  sum, and every phase/task entry so the longest task in each
  parallel phase is visible.
- run-e2e-local.sh: every top-level step (setup, build-push, install,
  validate, test, teardown) wrapped by time_step; a master summary is
  printed once via the cleanup trap (or directly for non-'all' runs).

Local usage:
  E2E_PROVIDER=azure   make e2e-up   # AKS managed KEDA + Gateway API
  E2E_PROVIDER=upstream make e2e-up  # default, everything via Helm
@rambohe-ch force-pushed the bump-scaler-to-v0.5.0 branch from ca2c438 to 2c86438 on May 11, 2026 23:34
@rambohe-ch changed the title from "e2e: add provider switch (upstream/azure), bump keda-kaito-scaler to v0.5.0, instrument timings" to "e2e: add provider switch (upstream/azure), bump keda-kaito-scaler to v0.5.1, instrument timings" on May 11, 2026
Signed-off-by: rambohe-ch <rambohe.ch@gmail.com>
Signed-off-by: rambohe-ch <rambohe.ch@gmail.com>
@zhuangqh merged commit 20fc3bf into kaito-project:main on May 12, 2026
7 checks passed