Skip to content

test(e2e): add GPU end-to-end suite for dynamo, vllm, kaito#330

Merged
robert-cronin merged 7 commits into
kaito-project:mainfrom
surajssd:self-hosted-runners
Jun 30, 2026
Merged

test(e2e): add GPU end-to-end suite for dynamo, vllm, kaito#330
robert-cronin merged 7 commits into
kaito-project:mainfrom
surajssd:self-hosted-runners

Conversation

@surajssd

@surajssd surajssd commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Description

Adds a consolidated, GPU-cluster end-to-end test suite (test/e2e/gpu/) that deploys each inference provider — dynamo, vllm, and kaito — through a real ModelDeployment, drives it to Running, and asserts that inference actually serves through the inference gateway. The suite is a zero-dependency Go module driven by a thin Bash orchestrator (scripts/gpu-e2e.sh), with its cluster-free decision logic carved into unit-testable packages that run in CI on a plain runner. It supersedes the old single-provider TestDynamoProviderE2E, porting its deep assertions into the new table-driven matrix. The workflow is documented in docs/development.md under GPU End-to-End Testing.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to change)
  • 📚 Documentation update
  • 🎨 UI/UX improvement
  • ♻️ Refactoring (no functional changes)
  • 🧪 Test update
  • 🔧 Build/CI configuration

Related Issues

  • Relates to kaito-project/airunway#334 — BBR builds its model registry only at startup, so the controller rolling-restarts the shared BBR Deployment once per new ModelDeployment (tracked by the airunway.ai/bbr-restarted annotation). The restart is not zero-downtime: during it, an in-flight request for an already-serving model can miss its X-Gateway-Model-Name header and mis-route. This is documented as a known gateway limitation in docs/gateway.md, and is why disaggregated Dynamo serving is excluded from the v1 matrix.

Changes Made

GPU end-to-end suite (test/e2e/gpu/)

  • New zero-dependency Go module gated by the e2e build tag. A single table-driven TestGPUProviders runs every (provider × scenario) case as a parallel subtest through a uniform lifecycle: apply fixture → wait for the rendered upstream CR → scheduling classification → Running + provider-name check → GatewayReady → provider-specific assertions → inference via the gateway → teardown.
  • v1 matrix covers three aggregated cases: vllm/agg (deployments.apps), kaito/agg (workspaces.kaito.sh), and dynamo/agg (dynamographdeployments.nvidia.com), each with a fixture under testdata/. Adding a new (provider × scenario) case is a data-only change to the cases table.
  • TestMain enforces cheap preconditions (≥1 allocatable nvidia.com/gpu, gateway Programmed) and fails fast.
  • Two-phase scheduling classifier produces three-state PASS/FAIL/SKIP outcomes: a static permanent-unschedulable check (per-pod GPU demand vs. largest node) plus a deadline-bounded poll that distinguishes "not scheduled" (PodScheduled=False) from "scheduled, pulling image."
  • Teardown runs as t.Cleanup so each parallel case frees its GPU as soon as it finishes; a graceful ModelDeployment delete is followed by assertNoOrphans (upstream CR, Dynamo PVC, and download Job are garbage-collected), with a timeout-only force-cascade fallback. Per-case logs and a result marker are written under the results directory.

Orchestration & build

  • scripts/gpu-e2e.sh builds and pushes the controller + provider images in parallel, gates setup-<provider> on operator health, deploys, then invokes the Go suite. It never creates or deletes the cluster. KAITO detection recognizes both the Helm chart and the AKS AI-toolchain add-on (kube-system), mirroring providers/kaito/upstream_health.go. HF_TOKEN is passed to kubectl create secret via stdin so it never lands in process argv.
  • Root Makefile gains gpu-e2e (full run, flags via GPU_E2E_ARGS) and gpu-e2e-check (cluster-free gate).

CI

  • New gpu-e2e-check job in .github/workflows/test.yml runs gofmt, go vet, an -tags=e2e compile-check, and the cluster-free unit tests on a plain ubuntu-latest runner, so the GPU-coupled suite cannot rot between out-of-band GPU runs.

Cluster-free logic extracted for unit testing

  • sched package (UnschedulableReason, PodScheduledMessage, PodInfo, GPUResource) and e2eutil helpers (parseChatResponse, InjectStorageClass) are pure, tag-free functions with table-driven tests — exercising the classifier, the chat-response parser, and the storage-class injector without a cluster.

Inference reachability fix

  • assertInference reaches the gateway through a kubectl port-forward to svc/inference-gateway-istio (e2eutil.PortForwardService) rather than the external LoadBalancer IP, so it works from machines whose egress to that IP is blocked by network policy. The port-forward uses a readiness poll instead of a fixed sleep and re-establishes itself via EnsureReady if the tunnel drops mid-window.

Cleanup of superseded test

  • Removes TestDynamoProviderE2E and its exclusive helpers from providers/dynamo/test/e2e/; its PVC / download-Job / DGD-ownership assertions are ported into the new dynamo case. The Dynamo mocker, multinode, and storage-validation tests are retained, and a stale Makefile comment is corrected.

Docs

  • docs/development.md gains a ## GPU End-to-End Testing section: the workflow, cluster preconditions (GPU nodes + NFD, an RWX-capable StorageClass, the inference gateway, image pull access), the run commands, the GPU_E2E_* environment knobs, and the PASS/FAIL/SKIP outcome semantics.
  • docs/gateway.md documents the shared-BBR restart race (kaito-project/airunway#334).

Testing

The full suite requires a pre-provisioned GPU cluster and runs out-of-band via scripts/gpu-e2e.sh (see docs/development.md). CI runs only the cluster-free gpu-e2e-check gate.

# All three providers, building+pushing images to your registry:
make gpu-e2e GPU_E2E_ARGS="--provider all --registry <your-registry>"

# A single provider:
make gpu-e2e GPU_E2E_ARGS="--provider vllm --registry <your-registry>"

# Re-test without rebuilding (requires an explicit, already-pushed tag):
make gpu-e2e GPU_E2E_ARGS="--provider dynamo --skip-build \
    --registry <your-registry> --img-tag <tag>"

# Run the Go suite directly against an already-deployed cluster (no rebuild):
go test -C test/e2e/gpu -tags=e2e -v -run 'TestGPUProviders/vllm' ./

# Cluster-free gate (what CI runs): gofmt + go vet + -tags=e2e compile + unit tests:
make gpu-e2e-check
  • Unit tests pass (bun run test) — N/A: this branch is a standalone Go module, not the web UI. The equivalent gate is make gpu-e2e-check, which passes locally (sched and e2eutil green, gofmt clean, -tags=e2e compile clean).
  • Manual testing performed — iterate-to-green against a live cluster; all three aggregated cases reach Running, GatewayReady, and serve inference. A SKIP (insufficient GPU capacity) does not fail the run; only a broken deployment, failed inference, or orphaned resource after delete is a FAIL.
  • Tested with a Kubernetes cluster — 4×A100 80GB AKS cluster (southcentralus).

Checklist

  • My code follows the project's style guidelines (gofmt clean, go vet clean)
  • I have run bun run lint — N/A for this Go module; go vet is wired into gpu-e2e-check instead.
  • I have added tests that prove my fix/feature works
  • New and existing unit tests pass locally (sched, e2eutil green)
  • I have updated documentation if needed (docs/development.md, docs/gateway.md)
  • My changes generate no new warnings

Additional Notes

Cluster preconditions (the harness installs none of these except a missing operator via setup-<p>):

  • GPU nodes with the NVIDIA GPU Operator and NFD, advertising nvidia.com/gpu and the nvidia.com/gpu.present=true label.
  • An RWX-capable StorageClass. The Dynamo model-cache PVC defaults to ReadWriteMany; Azure Disk classes are ReadWriteOnce and leave the PVC Pending. Default is azurefile-premium; override with --storage-class.
  • The inference gateway (Gateway API CRDs + GAIE + Istio + BBR + a Gateway named inference-gateway), present and Programmed. make -C providers/dynamo setup-dynamo installs it on a fresh cluster.
  • Pull access to the pushed images. The manager manifests carry no imagePullSecret, so images must be public or the nodes must have pull access — new registry repos often default to private.

Environment knobs (forwarded by the script; can also be set directly for go test):

Variable Meaning
GPU_E2E_STORAGE_CLASS RWX StorageClass injected into the Dynamo fixture and asserted on (default azurefile-premium). Set by --storage-class.
GPU_E2E_KEEP When true, leave ModelDeployments running after the test for inspection. Set by --keep.
GPU_E2E_RESULTS_DIR Override for where per-case result bundles are written (default test/e2e/gpu/gpu-e2e-results/<timestamp>/).
GPU_E2E_RUN_TS Optional fixed timestamp for the results directory name.
  • The new test/e2e/gpu module has zero external dependencies (stdlib only), so it has a go.mod but no go.sum; the CI cache key is test/e2e/gpu/go.mod.
  • KubeRay is not yet covered by the suite.

@surajssd surajssd requested a review from a team as a code owner June 23, 2026 22:48
Copilot AI review requested due to automatic review settings June 23, 2026 22:48

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a consolidated, GPU-cluster end-to-end test harness for the airunway inference providers (Dynamo, vLLM, KAITO). It deploys each provider through a real ModelDeployment, drives it to Running, and asserts inference actually serves through the inference gateway — closing the coverage gap where Dynamo's GPU e2e test was unused and vLLM/KAITO had none. The suite is a standalone, dependency-free Go module (build tag e2e) orchestrated by scripts/gpu-e2e.sh/make gpu-e2e, and is table-driven so future scenarios are data-only additions.

Changes:

  • New test/e2e/gpu/ Go module: table-driven TestGPUProviders running dynamo/agg, vllm/agg, kaito/agg through a uniform lifecycle (pre-delete → apply → upstream CR → schedule classification → Running → GatewayReady → provider checks → inference), with kubectl-shelling helpers, scheduling/teardown/results logic, and three fixtures.
  • New scripts/gpu-e2e.sh orchestration (build+push controller/provider images, gate operator install on health, deploy, run the suite), a gpu-e2e Makefile target, and a .gitignore entry for result bundles.
  • Removed the superseded TestDynamoProviderE2E and its exclusive helpers/constants from providers/dynamo/test/e2e/dynamo_e2e_test.go (its deep assertions were ported into the new dynamo/agg case).

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
test/e2e/gpu/gpu_e2e_test.go Table-driven lifecycle orchestration for each provider case.
test/e2e/gpu/cases_test.go Test matrix (provider, fixture, upstream CR, pod selector).
test/e2e/gpu/main_test.go TestMain GPU + gateway preconditions.
test/e2e/gpu/scheduling_test.go Phase-1 scheduling classification (PASS/FAIL/SKIP).
test/e2e/gpu/lifecycle_test.go Fixture apply/patch, pre-delete, and cleanup helpers.
test/e2e/gpu/teardown_test.go Owner-first force-cascade + debug collection.
test/e2e/gpu/results_test.go Per-case PASS/FAIL/SKIP artifact bundles.
test/e2e/gpu/dynamo_test.go Ported Dynamo deep assertions (PVC, Job, ownership, conditions).
test/e2e/gpu/e2eutil/e2eutil.go Dependency-free kubectl/HTTP helpers.
test/e2e/gpu/go.mod New module declaration (go 1.25.3, consistent with repo).
test/e2e/gpu/testdata/*.yaml Dynamo/vLLM/KAITO ModelDeployment fixtures.
scripts/gpu-e2e.sh Build/deploy/run orchestration harness.
providers/dynamo/test/e2e/dynamo_e2e_test.go Removes superseded TestDynamoProviderE2E and exclusive helpers.
Makefile Adds gpu-e2e target and help entry.
.gitignore Ignores gpu-e2e-results/.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread test/e2e/gpu/scheduling_test.go Outdated
Comment thread scripts/gpu-e2e.sh Outdated
Comment thread test/e2e/gpu/testdata/dynamo-modeldeployment.yaml Outdated
Comment thread test/e2e/gpu/results_test.go Outdated
Comment thread Makefile

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 19 changed files in this pull request and generated 2 comments.

Comment thread test/e2e/gpu/gpu_e2e_test.go Outdated
Comment thread test/e2e/gpu/scheduling_logic_test.go Outdated
Copilot AI review requested due to automatic review settings June 24, 2026 21:20
@surajssd surajssd force-pushed the self-hosted-runners branch from fbd6241 to 3695f1e Compare June 24, 2026 21:20

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 20 changed files in this pull request and generated 2 comments.

Comment thread test/e2e/gpu/gpu_e2e_test.go Outdated
Comment thread test/e2e/gpu/main_test.go
Copilot AI review requested due to automatic review settings June 24, 2026 22:28
@surajssd surajssd force-pushed the self-hosted-runners branch from 336ce7a to 10a1c17 Compare June 24, 2026 22:28

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 24 out of 25 changed files in this pull request and generated 2 comments.

Comment thread .github/workflows/test.yml Outdated
Comment thread test/e2e/gpu/gpu_e2e_test.go Outdated
surajssd added 5 commits June 25, 2026 13:30
Add a consolidated, GPU-cluster end-to-end test harness that deploys
each provider through a real `ModelDeployment` and asserts inference
serving via the inference gateway.

- add `test/e2e/gpu/` — a zero-dependency Go module with a table-driven
  suite (`TestGPUProviders`) running each `(provider × scenario)` case
  as a parallel subtest: apply fixture, wait for the upstream CR +
  `Running`, assert `GatewayReady`, post `/v1/chat/completions` through
  the gateway LB, then tear down. Includes a `TestMain` GPU/gateway
  precondition gate, two-phase scheduling classification
  (`PASS`/`FAIL`/`SKIP`), owner-first teardown force-cascade, per-case
  result artifacts, and three fixtures.
- add `scripts/gpu-e2e.sh` — thin harness that builds+pushes the four
  images in parallel, gates `setup-<p>` on operator health, deploys the
  controller and providers, then invokes the Go suite.
- add `gpu-e2e` target and help entry to the root `Makefile` (flags
  passed via `GPU_E2E_ARGS`).
- ignore `gpu-e2e-results/` per-run result bundles.
- remove the superseded `TestDynamoProviderE2E` and its exclusive
  helpers from `providers/dynamo/test/e2e/`; its deep assertions (PVC,
  download Job, DGD ownership, intermediate conditions) are ported into
  the new dynamo case. The dynamo mocker, multinode, and
  storage-validation tests are retained.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
…view

The phase-1 scheduling classifier treated a pod that was scheduled to a
node but still `Pending` while pulling a multi-GB image as
unschedulable, calling `t.Fatalf` after the 2-minute deadline.
Cold-cache runs (the common case in CI) would fail healthy cases. Treat
a pod as scheduled unless it carries an explicit `PodScheduled=False`
condition, leaving image-pull and startup latency to the 45-minute
`Running` wait.

- rewrite `classifyScheduling`/`unschedulableReason` so only
  `PodScheduled=False` counts as not-scheduled; add
  `scheduling_logic_test.go` covering all four decision branches.
- re-add a cascade/no-orphans assertion (`assertNoOrphans`): after a
  graceful MD delete, verify the upstream CR (and the Dynamo PVC +
  download `Job`) are garbage-collected, restoring a regression check
  lost when the old `TestDynamoProviderE2E` was removed.
- raise the `go test` global timeout from `45m` to `75m` so it cannot
  fire before a case completes its `t.Cleanup` and frees its GPU.
- require `docker` unconditionally in `require_tools`, since
  `preflight_pull` runs `docker manifest inspect` even under
  `--skip-build`.
- harden helpers: keep the first error in `WaitFor`, use
  `CombinedOutput` for `getNodes` so kubectl stderr surfaces, log
  swallowed `json.Unmarshal` errors, and fail `patchFixture` loudly if
  the storage-class literal is absent.
- add a `## GPU End-to-End Testing` section to `docs/development.md`
  documenting the workflow, cluster preconditions, and `GPU_E2E_*`
  knobs.
- declare `gpu-e2e` `.PHONY`; drop dead `pvcName`/`jobName` consts; fix
  stale comments referencing the removed test and the results-dir env
  var.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
The inference assertion posted to `status.gateway.endpoint` (the
gateway's external LoadBalancer IP), which is unreachable from machines
whose egress to that IP is blocked by network policy (e.g. an NSG that
denies Internet-sourced inbound) — so every case failed at
`InferenceServing` despite serving correctly. Reach the gateway through
a `kubectl port-forward` to the gateway Service instead, which tunnels
via the API server and works from any machine with kubectl access.

- add `e2eutil.PortForwardService`, a port-forward helper that exposes a
  cluster Service on a free local port and stops itself via `t.Cleanup`.
- `assertInference` now port-forwards `svc/inference-gateway-istio` and
  posts to the local address; `GatewayChatCompletion` takes a base URL
  instead of an endpoint IP. The model name is still read from
  `status.gateway.modelName`.
- fix `patchFixture` to only enforce the `storageClassName` literal for
  Dynamo fixtures that actually declare storage, so a storage-less
  fixture is no longer rejected.
- document the shared-BBR restart race (`kaito-project#334`) as
  a known gateway limitation in `docs/gateway.md`, and note in the case
  table why disaggregated Dynamo serving is excluded from the suite.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
**CI**
- Add a `gpu-e2e-check` make target and CI job that run gofmt, `go vet`,
  an `-tags=e2e` compile, and the cluster-free unit tests on a plain
  runner, so the GPU-coupled suite cannot rot between out-of-band GPU
  runs.

**Extract cluster-free logic for CI unit testing**
- Move the scheduling classifier (`UnschedulableReason`,
  `PodScheduledMessage`, `PodInfo`, `GPUResource`) into a new tag-free
  `sched` package; `main_test.go` and `scheduling_test.go` now consume
  it. Replaces the `e2e`-tagged `scheduling_logic_test.go`, which could
  not run in CI.
- Extract `parseChatResponse` and `InjectStorageClass` /
  `PinnedStorageClass` as pure functions in `e2eutil`, each with
  table-driven tests. `patchFixture` now delegates to
  `InjectStorageClass`.

**Fixes**
- Add `workloadSelector` to narrow the Dynamo scheduling check to the
  GPU worker. The graph-deployment selector also matches the GPU-less
  frontend, which schedules instantly and masked the capacity-SKIP path.
- Harden the gateway port-forward: replace the fixed `sleep 3` with a
  readiness poll, and re-establish the tunnel via `EnsureReady` when it
  drops mid-window.
- `cleanup` now force-cascades only on a delete timeout; other delete
  errors (RBAC, missing CRD) fail loudly instead of silently skipping
  the orphan check.
- `atoiQuantity` uses `strconv.Atoi`, rejecting trailing junk like `5x`
  that `fmt.Sscanf` accepted.
- Remove the dead `providerReadyTimeout` const.

**Security / ops**
- Pass `HF_TOKEN` to `kubectl create secret` via stdin
  (`--from-file=...=/dev/stdin`) so it never appears in process argv.
- Recognize an existing KAITO operator from either the Helm chart or the
  AKS AI-toolchain add-on (`kube-system`) before installing, mirroring
  `providers/kaito/upstream_health.go`.
- Fix a stale comment in `providers/dynamo/Makefile`
  (`TestDynamoProviderE2E` becomes `TestDynamoMultiNodeE2E`,
  `TestDynamoStorageValidationE2E`).

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
- `test.yml`: bump the `gpu-e2e-check` job's `actions/checkout` from
  `v6.0.3` to `v7.0.0`, matching the SHA every other job in the file
  already pins.
- `gpu_e2e_test.go`: fix the `runCase` doc comment that claimed teardown
  is registered first. `recordResult` is registered first (so it runs
  last under LIFO); reword the header to match the actual registration
  order.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Copilot AI review requested due to automatic review settings June 25, 2026 20:30
@surajssd surajssd force-pushed the self-hosted-runners branch from c16022e to c2842cb Compare June 25, 2026 20:30

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 24 out of 25 changed files in this pull request and generated 1 comment.

Comment thread test/e2e/gpu/e2eutil/e2eutil.go Outdated
surajssd and others added 2 commits June 26, 2026 13:21
The old comment claimed `EnsureReady` could "re-pick" the local port and
that a lost close→bind race was "not a hard failure". Neither is true —
`p.local` is fixed at construction and re-bound as-is, and a genuine
port steal makes `start()`'s readiness poll `t.Fatalf` at the 15s
deadline. Reword the comment to match the code (comment-only; no
behavior change).

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Copilot AI review requested due to automatic review settings June 29, 2026 23:46

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 24 out of 25 changed files in this pull request and generated no new comments.

@robert-cronin robert-cronin merged commit 5096797 into kaito-project:main Jun 30, 2026
16 checks passed
@surajssd surajssd deleted the self-hosted-runners branch June 30, 2026 00:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants