test(e2e): add GPU end-to-end suite for dynamo, vllm, kaito by surajssd · Pull Request #330 · kaito-project/airunway

surajssd · 2026-06-23T22:48:12Z

Description

Adds a consolidated, GPU-cluster end-to-end test suite (test/e2e/gpu/) that deploys each inference provider — dynamo, vllm, and kaito — through a real ModelDeployment, drives it to Running, and asserts that inference actually serves through the inference gateway. The suite is a zero-dependency Go module driven by a thin Bash orchestrator (scripts/gpu-e2e.sh), with its cluster-free decision logic carved into unit-testable packages that run in CI on a plain runner. It supersedes the old single-provider TestDynamoProviderE2E, porting its deep assertions into the new table-driven matrix. The workflow is documented in docs/development.md under GPU End-to-End Testing.

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to change)
📚 Documentation update
🎨 UI/UX improvement
♻️ Refactoring (no functional changes)
🧪 Test update
🔧 Build/CI configuration

Related Issues

Relates to kaito-project/airunway#334 — BBR builds its model registry only at startup, so the controller rolling-restarts the shared BBR Deployment once per new ModelDeployment (tracked by the airunway.ai/bbr-restarted annotation). The restart is not zero-downtime: during it, an in-flight request for an already-serving model can miss its X-Gateway-Model-Name header and mis-route. This is documented as a known gateway limitation in docs/gateway.md, and is why disaggregated Dynamo serving is excluded from the v1 matrix.

Changes Made

GPU end-to-end suite (test/e2e/gpu/)

New zero-dependency Go module gated by the e2e build tag. A single table-driven TestGPUProviders runs every (provider × scenario) case as a parallel subtest through a uniform lifecycle: apply fixture → wait for the rendered upstream CR → scheduling classification → Running + provider-name check → GatewayReady → provider-specific assertions → inference via the gateway → teardown.
v1 matrix covers three aggregated cases: vllm/agg (deployments.apps), kaito/agg (workspaces.kaito.sh), and dynamo/agg (dynamographdeployments.nvidia.com), each with a fixture under testdata/. Adding a new (provider × scenario) case is a data-only change to the cases table.
TestMain enforces cheap preconditions (≥1 allocatable nvidia.com/gpu, gateway Programmed) and fails fast.
Two-phase scheduling classifier produces three-state PASS/FAIL/SKIP outcomes: a static permanent-unschedulable check (per-pod GPU demand vs. largest node) plus a deadline-bounded poll that distinguishes "not scheduled" (PodScheduled=False) from "scheduled, pulling image."
Teardown runs as t.Cleanup so each parallel case frees its GPU as soon as it finishes; a graceful ModelDeployment delete is followed by assertNoOrphans (upstream CR, Dynamo PVC, and download Job are garbage-collected), with a timeout-only force-cascade fallback. Per-case logs and a result marker are written under the results directory.

Orchestration & build

scripts/gpu-e2e.sh builds and pushes the controller + provider images in parallel, gates setup-<provider> on operator health, deploys, then invokes the Go suite. It never creates or deletes the cluster. KAITO detection recognizes both the Helm chart and the AKS AI-toolchain add-on (kube-system), mirroring providers/kaito/upstream_health.go. HF_TOKEN is passed to kubectl create secret via stdin so it never lands in process argv.
Root Makefile gains gpu-e2e (full run, flags via GPU_E2E_ARGS) and gpu-e2e-check (cluster-free gate).

CI

New gpu-e2e-check job in .github/workflows/test.yml runs gofmt, go vet, an -tags=e2e compile-check, and the cluster-free unit tests on a plain ubuntu-latest runner, so the GPU-coupled suite cannot rot between out-of-band GPU runs.

Cluster-free logic extracted for unit testing

sched package (UnschedulableReason, PodScheduledMessage, PodInfo, GPUResource) and e2eutil helpers (parseChatResponse, InjectStorageClass) are pure, tag-free functions with table-driven tests — exercising the classifier, the chat-response parser, and the storage-class injector without a cluster.

Inference reachability fix

assertInference reaches the gateway through a kubectl port-forward to svc/inference-gateway-istio (e2eutil.PortForwardService) rather than the external LoadBalancer IP, so it works from machines whose egress to that IP is blocked by network policy. The port-forward uses a readiness poll instead of a fixed sleep and re-establishes itself via EnsureReady if the tunnel drops mid-window.

Cleanup of superseded test

Removes TestDynamoProviderE2E and its exclusive helpers from providers/dynamo/test/e2e/; its PVC / download-Job / DGD-ownership assertions are ported into the new dynamo case. The Dynamo mocker, multinode, and storage-validation tests are retained, and a stale Makefile comment is corrected.

Docs

docs/development.md gains a ## GPU End-to-End Testing section: the workflow, cluster preconditions (GPU nodes + NFD, an RWX-capable StorageClass, the inference gateway, image pull access), the run commands, the GPU_E2E_* environment knobs, and the PASS/FAIL/SKIP outcome semantics.
docs/gateway.md documents the shared-BBR restart race (kaito-project/airunway#334).

Testing

The full suite requires a pre-provisioned GPU cluster and runs out-of-band via scripts/gpu-e2e.sh (see docs/development.md). CI runs only the cluster-free gpu-e2e-check gate.

# All three providers, building+pushing images to your registry:
make gpu-e2e GPU_E2E_ARGS="--provider all --registry <your-registry>"

# A single provider:
make gpu-e2e GPU_E2E_ARGS="--provider vllm --registry <your-registry>"

# Re-test without rebuilding (requires an explicit, already-pushed tag):
make gpu-e2e GPU_E2E_ARGS="--provider dynamo --skip-build \
    --registry <your-registry> --img-tag <tag>"

# Run the Go suite directly against an already-deployed cluster (no rebuild):
go test -C test/e2e/gpu -tags=e2e -v -run 'TestGPUProviders/vllm' ./

# Cluster-free gate (what CI runs): gofmt + go vet + -tags=e2e compile + unit tests:
make gpu-e2e-check

Unit tests pass (bun run test) — N/A: this branch is a standalone Go module, not the web UI. The equivalent gate is make gpu-e2e-check, which passes locally (sched and e2eutil green, gofmt clean, -tags=e2e compile clean).
Manual testing performed — iterate-to-green against a live cluster; all three aggregated cases reach Running, GatewayReady, and serve inference. A SKIP (insufficient GPU capacity) does not fail the run; only a broken deployment, failed inference, or orphaned resource after delete is a FAIL.
Tested with a Kubernetes cluster — 4×A100 80GB AKS cluster (southcentralus).

Checklist

My code follows the project's style guidelines (gofmt clean, go vet clean)
I have run bun run lint — N/A for this Go module; go vet is wired into gpu-e2e-check instead.
I have added tests that prove my fix/feature works
New and existing unit tests pass locally (sched, e2eutil green)
I have updated documentation if needed (docs/development.md, docs/gateway.md)
My changes generate no new warnings

Additional Notes

Cluster preconditions (the harness installs none of these except a missing operator via setup-<p>):

GPU nodes with the NVIDIA GPU Operator and NFD, advertising nvidia.com/gpu and the nvidia.com/gpu.present=true label.
An RWX-capable StorageClass. The Dynamo model-cache PVC defaults to ReadWriteMany; Azure Disk classes are ReadWriteOnce and leave the PVC Pending. Default is azurefile-premium; override with --storage-class.
The inference gateway (Gateway API CRDs + GAIE + Istio + BBR + a Gateway named inference-gateway), present and Programmed. make -C providers/dynamo setup-dynamo installs it on a fresh cluster.
Pull access to the pushed images. The manager manifests carry no imagePullSecret, so images must be public or the nodes must have pull access — new registry repos often default to private.

Environment knobs (forwarded by the script; can also be set directly for go test):

Variable	Meaning
`GPU_E2E_STORAGE_CLASS`	RWX `StorageClass` injected into the Dynamo fixture and asserted on (default `azurefile-premium`). Set by `--storage-class`.
`GPU_E2E_KEEP`	When `true`, leave `ModelDeployment`s running after the test for inspection. Set by `--keep`.
`GPU_E2E_RESULTS_DIR`	Override for where per-case result bundles are written (default `test/e2e/gpu/gpu-e2e-results/<timestamp>/`).
`GPU_E2E_RUN_TS`	Optional fixed timestamp for the results directory name.

The new test/e2e/gpu module has zero external dependencies (stdlib only), so it has a go.mod but no go.sum; the CI cache key is test/e2e/gpu/go.mod.
KubeRay is not yet covered by the suite.

Copilot

Pull request overview

This PR adds a consolidated, GPU-cluster end-to-end test harness for the airunway inference providers (Dynamo, vLLM, KAITO). It deploys each provider through a real ModelDeployment, drives it to Running, and asserts inference actually serves through the inference gateway — closing the coverage gap where Dynamo's GPU e2e test was unused and vLLM/KAITO had none. The suite is a standalone, dependency-free Go module (build tag e2e) orchestrated by scripts/gpu-e2e.sh/make gpu-e2e, and is table-driven so future scenarios are data-only additions.

Changes:

New test/e2e/gpu/ Go module: table-driven TestGPUProviders running dynamo/agg, vllm/agg, kaito/agg through a uniform lifecycle (pre-delete → apply → upstream CR → schedule classification → Running → GatewayReady → provider checks → inference), with kubectl-shelling helpers, scheduling/teardown/results logic, and three fixtures.
New scripts/gpu-e2e.sh orchestration (build+push controller/provider images, gate operator install on health, deploy, run the suite), a gpu-e2e Makefile target, and a .gitignore entry for result bundles.
Removed the superseded TestDynamoProviderE2E and its exclusive helpers/constants from providers/dynamo/test/e2e/dynamo_e2e_test.go (its deep assertions were ported into the new dynamo/agg case).

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
test/e2e/gpu/gpu_e2e_test.go	Table-driven lifecycle orchestration for each provider case.
test/e2e/gpu/cases_test.go	Test matrix (provider, fixture, upstream CR, pod selector).
test/e2e/gpu/main_test.go	`TestMain` GPU + gateway preconditions.
test/e2e/gpu/scheduling_test.go	Phase-1 scheduling classification (PASS/FAIL/SKIP).
test/e2e/gpu/lifecycle_test.go	Fixture apply/patch, pre-delete, and cleanup helpers.
test/e2e/gpu/teardown_test.go	Owner-first force-cascade + debug collection.
test/e2e/gpu/results_test.go	Per-case PASS/FAIL/SKIP artifact bundles.
test/e2e/gpu/dynamo_test.go	Ported Dynamo deep assertions (PVC, Job, ownership, conditions).
test/e2e/gpu/e2eutil/e2eutil.go	Dependency-free kubectl/HTTP helpers.
test/e2e/gpu/go.mod	New module declaration (go 1.25.3, consistent with repo).
test/e2e/gpu/testdata/*.yaml	Dynamo/vLLM/KAITO ModelDeployment fixtures.
scripts/gpu-e2e.sh	Build/deploy/run orchestration harness.
providers/dynamo/test/e2e/dynamo_e2e_test.go	Removes superseded `TestDynamoProviderE2E` and exclusive helpers.
Makefile	Adds `gpu-e2e` target and help entry.
.gitignore	Ignores `gpu-e2e-results/`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Pull request overview

Copilot reviewed 18 out of 19 changed files in this pull request and generated 2 comments.

Copilot

Pull request overview

Copilot reviewed 19 out of 20 changed files in this pull request and generated 2 comments.

Copilot

Pull request overview

Copilot reviewed 24 out of 25 changed files in this pull request and generated 2 comments.

Add a consolidated, GPU-cluster end-to-end test harness that deploys each provider through a real `ModelDeployment` and asserts inference serving via the inference gateway. - add `test/e2e/gpu/` — a zero-dependency Go module with a table-driven suite (`TestGPUProviders`) running each `(provider × scenario)` case as a parallel subtest: apply fixture, wait for the upstream CR + `Running`, assert `GatewayReady`, post `/v1/chat/completions` through the gateway LB, then tear down. Includes a `TestMain` GPU/gateway precondition gate, two-phase scheduling classification (`PASS`/`FAIL`/`SKIP`), owner-first teardown force-cascade, per-case result artifacts, and three fixtures. - add `scripts/gpu-e2e.sh` — thin harness that builds+pushes the four images in parallel, gates `setup-<p>` on operator health, deploys the controller and providers, then invokes the Go suite. - add `gpu-e2e` target and help entry to the root `Makefile` (flags passed via `GPU_E2E_ARGS`). - ignore `gpu-e2e-results/` per-run result bundles. - remove the superseded `TestDynamoProviderE2E` and its exclusive helpers from `providers/dynamo/test/e2e/`; its deep assertions (PVC, download Job, DGD ownership, intermediate conditions) are ported into the new dynamo case. The dynamo mocker, multinode, and storage-validation tests are retained. Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>

…view The phase-1 scheduling classifier treated a pod that was scheduled to a node but still `Pending` while pulling a multi-GB image as unschedulable, calling `t.Fatalf` after the 2-minute deadline. Cold-cache runs (the common case in CI) would fail healthy cases. Treat a pod as scheduled unless it carries an explicit `PodScheduled=False` condition, leaving image-pull and startup latency to the 45-minute `Running` wait. - rewrite `classifyScheduling`/`unschedulableReason` so only `PodScheduled=False` counts as not-scheduled; add `scheduling_logic_test.go` covering all four decision branches. - re-add a cascade/no-orphans assertion (`assertNoOrphans`): after a graceful MD delete, verify the upstream CR (and the Dynamo PVC + download `Job`) are garbage-collected, restoring a regression check lost when the old `TestDynamoProviderE2E` was removed. - raise the `go test` global timeout from `45m` to `75m` so it cannot fire before a case completes its `t.Cleanup` and frees its GPU. - require `docker` unconditionally in `require_tools`, since `preflight_pull` runs `docker manifest inspect` even under `--skip-build`. - harden helpers: keep the first error in `WaitFor`, use `CombinedOutput` for `getNodes` so kubectl stderr surfaces, log swallowed `json.Unmarshal` errors, and fail `patchFixture` loudly if the storage-class literal is absent. - add a `## GPU End-to-End Testing` section to `docs/development.md` documenting the workflow, cluster preconditions, and `GPU_E2E_*` knobs. - declare `gpu-e2e` `.PHONY`; drop dead `pvcName`/`jobName` consts; fix stale comments referencing the removed test and the results-dir env var. Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>

The inference assertion posted to `status.gateway.endpoint` (the gateway's external LoadBalancer IP), which is unreachable from machines whose egress to that IP is blocked by network policy (e.g. an NSG that denies Internet-sourced inbound) — so every case failed at `InferenceServing` despite serving correctly. Reach the gateway through a `kubectl port-forward` to the gateway Service instead, which tunnels via the API server and works from any machine with kubectl access. - add `e2eutil.PortForwardService`, a port-forward helper that exposes a cluster Service on a free local port and stops itself via `t.Cleanup`. - `assertInference` now port-forwards `svc/inference-gateway-istio` and posts to the local address; `GatewayChatCompletion` takes a base URL instead of an endpoint IP. The model name is still read from `status.gateway.modelName`. - fix `patchFixture` to only enforce the `storageClassName` literal for Dynamo fixtures that actually declare storage, so a storage-less fixture is no longer rejected. - document the shared-BBR restart race (`kaito-project#334`) as a known gateway limitation in `docs/gateway.md`, and note in the case table why disaggregated Dynamo serving is excluded from the suite. Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>

**CI** - Add a `gpu-e2e-check` make target and CI job that run gofmt, `go vet`, an `-tags=e2e` compile, and the cluster-free unit tests on a plain runner, so the GPU-coupled suite cannot rot between out-of-band GPU runs. **Extract cluster-free logic for CI unit testing** - Move the scheduling classifier (`UnschedulableReason`, `PodScheduledMessage`, `PodInfo`, `GPUResource`) into a new tag-free `sched` package; `main_test.go` and `scheduling_test.go` now consume it. Replaces the `e2e`-tagged `scheduling_logic_test.go`, which could not run in CI. - Extract `parseChatResponse` and `InjectStorageClass` / `PinnedStorageClass` as pure functions in `e2eutil`, each with table-driven tests. `patchFixture` now delegates to `InjectStorageClass`. **Fixes** - Add `workloadSelector` to narrow the Dynamo scheduling check to the GPU worker. The graph-deployment selector also matches the GPU-less frontend, which schedules instantly and masked the capacity-SKIP path. - Harden the gateway port-forward: replace the fixed `sleep 3` with a readiness poll, and re-establish the tunnel via `EnsureReady` when it drops mid-window. - `cleanup` now force-cascades only on a delete timeout; other delete errors (RBAC, missing CRD) fail loudly instead of silently skipping the orphan check. - `atoiQuantity` uses `strconv.Atoi`, rejecting trailing junk like `5x` that `fmt.Sscanf` accepted. - Remove the dead `providerReadyTimeout` const. **Security / ops** - Pass `HF_TOKEN` to `kubectl create secret` via stdin (`--from-file=...=/dev/stdin`) so it never appears in process argv. - Recognize an existing KAITO operator from either the Helm chart or the AKS AI-toolchain add-on (`kube-system`) before installing, mirroring `providers/kaito/upstream_health.go`. - Fix a stale comment in `providers/dynamo/Makefile` (`TestDynamoProviderE2E` becomes `TestDynamoMultiNodeE2E`, `TestDynamoStorageValidationE2E`). Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>

- `test.yml`: bump the `gpu-e2e-check` job's `actions/checkout` from `v6.0.3` to `v7.0.0`, matching the SHA every other job in the file already pins. - `gpu_e2e_test.go`: fix the `runCase` doc comment that claimed teardown is registered first. `recordResult` is registered first (so it runs last under LIFO); reword the header to match the actual registration order. Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>

Copilot

Pull request overview

Copilot reviewed 24 out of 25 changed files in this pull request and generated 1 comment.

The old comment claimed `EnsureReady` could "re-pick" the local port and that a lost close→bind race was "not a hard failure". Neither is true — `p.local` is fixed at construction and re-bound as-is, and a genuine port steal makes `start()`'s readiness poll `t.Fatalf` at the 15s deadline. Reword the comment to match the code (comment-only; no behavior change). Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>

Copilot

Pull request overview

Copilot reviewed 24 out of 25 changed files in this pull request and generated no new comments.

surajssd requested a review from a team as a code owner June 23, 2026 22:48

Copilot AI review requested due to automatic review settings June 23, 2026 22:48

Copilot started reviewing on behalf of surajssd June 23, 2026 22:48 View session

Copilot AI reviewed Jun 23, 2026

View reviewed changes

Comment thread test/e2e/gpu/scheduling_test.go Outdated

Comment thread scripts/gpu-e2e.sh Outdated

Comment thread test/e2e/gpu/testdata/dynamo-modeldeployment.yaml Outdated

Comment thread test/e2e/gpu/results_test.go Outdated

Comment thread Makefile

surajssd force-pushed the self-hosted-runners branch from 98d00a9 to ef5f7d3 Compare June 24, 2026 00:00

surajssd requested a review from Copilot June 24, 2026 00:09

Copilot started reviewing on behalf of surajssd June 24, 2026 00:10 View session

Copilot AI reviewed Jun 24, 2026

View reviewed changes

Comment thread test/e2e/gpu/gpu_e2e_test.go Outdated

Comment thread test/e2e/gpu/scheduling_logic_test.go Outdated

Copilot AI review requested due to automatic review settings June 24, 2026 21:20

surajssd force-pushed the self-hosted-runners branch from fbd6241 to 3695f1e Compare June 24, 2026 21:20

Copilot started reviewing on behalf of surajssd June 24, 2026 21:21 View session

Copilot AI reviewed Jun 24, 2026

View reviewed changes

Comment thread test/e2e/gpu/gpu_e2e_test.go Outdated

Comment thread test/e2e/gpu/main_test.go

Copilot AI review requested due to automatic review settings June 24, 2026 22:28

surajssd force-pushed the self-hosted-runners branch from 336ce7a to 10a1c17 Compare June 24, 2026 22:28

Copilot started reviewing on behalf of surajssd June 24, 2026 22:38 View session

Copilot AI reviewed Jun 24, 2026

View reviewed changes

Comment thread .github/workflows/test.yml Outdated

Comment thread test/e2e/gpu/gpu_e2e_test.go Outdated

surajssd added 5 commits June 25, 2026 13:30

Copilot AI review requested due to automatic review settings June 25, 2026 20:30

surajssd force-pushed the self-hosted-runners branch from c16022e to c2842cb Compare June 25, 2026 20:30

Copilot started reviewing on behalf of surajssd June 25, 2026 20:30 View session

Copilot AI reviewed Jun 25, 2026

View reviewed changes

Comment thread test/e2e/gpu/e2eutil/e2eutil.go Outdated

surajssd and others added 2 commits June 26, 2026 13:21

Merge branch 'main' into self-hosted-runners

a183c50

Copilot AI review requested due to automatic review settings June 29, 2026 23:46

Copilot started reviewing on behalf of robert-cronin June 29, 2026 23:46 View session

Copilot AI reviewed Jun 29, 2026

View reviewed changes

robert-cronin approved these changes Jun 30, 2026

View reviewed changes

robert-cronin merged commit 5096797 into kaito-project:main Jun 30, 2026
16 checks passed

surajssd deleted the self-hosted-runners branch June 30, 2026 00:04

Uh oh!

Conversation

surajssd commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of Change

Related Issues

Changes Made

Testing

Checklist

Additional Notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

surajssd commented Jun 23, 2026 •

edited

Loading