test(e2e): add GPU end-to-end suite for dynamo, vllm, kaito #330
Pull request overview
This PR adds a consolidated, GPU-cluster end-to-end test harness for the airunway inference providers (Dynamo, vLLM, KAITO). It deploys each provider through a real `ModelDeployment`, drives it to `Running`, and asserts inference actually serves through the inference gateway, closing the coverage gap where Dynamo's GPU e2e test was unused and vLLM/KAITO had none. The suite is a standalone, dependency-free Go module (build tag `e2e`) orchestrated by `scripts/gpu-e2e.sh` / `make gpu-e2e`, and is table-driven so future scenarios are data-only additions.
Changes:
- New `test/e2e/gpu/` Go module: table-driven `TestGPUProviders` running `dynamo/agg`, `vllm/agg`, `kaito/agg` through a uniform lifecycle (pre-delete → apply → upstream CR → schedule classification → Running → GatewayReady → provider checks → inference), with `kubectl`-shelling helpers, scheduling/teardown/results logic, and three fixtures.
- New `scripts/gpu-e2e.sh` orchestration (build+push controller/provider images, gate operator install on health, deploy, run the suite), a `gpu-e2e` Makefile target, and a `.gitignore` entry for result bundles.
- Removed the superseded `TestDynamoProviderE2E` and its exclusive helpers/constants from `providers/dynamo/test/e2e/dynamo_e2e_test.go` (its deep assertions were ported into the new `dynamo/agg` case).
Reviewed changes
Copilot reviewed 16 out of 17 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| test/e2e/gpu/gpu_e2e_test.go | Table-driven lifecycle orchestration for each provider case. |
| test/e2e/gpu/cases_test.go | Test matrix (provider, fixture, upstream CR, pod selector). |
| test/e2e/gpu/main_test.go | TestMain GPU + gateway preconditions. |
| test/e2e/gpu/scheduling_test.go | Phase-1 scheduling classification (PASS/FAIL/SKIP). |
| test/e2e/gpu/lifecycle_test.go | Fixture apply/patch, pre-delete, and cleanup helpers. |
| test/e2e/gpu/teardown_test.go | Owner-first force-cascade + debug collection. |
| test/e2e/gpu/results_test.go | Per-case PASS/FAIL/SKIP artifact bundles. |
| test/e2e/gpu/dynamo_test.go | Ported Dynamo deep assertions (PVC, Job, ownership, conditions). |
| test/e2e/gpu/e2eutil/e2eutil.go | Dependency-free kubectl/HTTP helpers. |
| test/e2e/gpu/go.mod | New module declaration (go 1.25.3, consistent with repo). |
| test/e2e/gpu/testdata/*.yaml | Dynamo/vLLM/KAITO ModelDeployment fixtures. |
| scripts/gpu-e2e.sh | Build/deploy/run orchestration harness. |
| providers/dynamo/test/e2e/dynamo_e2e_test.go | Removes superseded TestDynamoProviderE2E and exclusive helpers. |
| Makefile | Adds gpu-e2e target and help entry. |
| .gitignore | Ignores gpu-e2e-results/. |
98d00a9 to ef5f7d3
fbd6241 to 3695f1e
336ce7a to 10a1c17
Add a consolidated, GPU-cluster end-to-end test harness that deploys each provider through a real `ModelDeployment` and asserts inference serving via the inference gateway.

- add `test/e2e/gpu/`, a zero-dependency Go module with a table-driven suite (`TestGPUProviders`) running each `(provider × scenario)` case as a parallel subtest: apply fixture, wait for the upstream CR + `Running`, assert `GatewayReady`, post `/v1/chat/completions` through the gateway LB, then tear down. Includes a `TestMain` GPU/gateway precondition gate, two-phase scheduling classification (`PASS`/`FAIL`/`SKIP`), owner-first teardown force-cascade, per-case result artifacts, and three fixtures.
- add `scripts/gpu-e2e.sh`, a thin harness that builds+pushes the four images in parallel, gates `setup-<p>` on operator health, deploys the controller and providers, then invokes the Go suite.
- add `gpu-e2e` target and help entry to the root `Makefile` (flags passed via `GPU_E2E_ARGS`).
- ignore `gpu-e2e-results/` per-run result bundles.
- remove the superseded `TestDynamoProviderE2E` and its exclusive helpers from `providers/dynamo/test/e2e/`; its deep assertions (PVC, download Job, DGD ownership, intermediate conditions) are ported into the new dynamo case. The dynamo mocker, multinode, and storage-validation tests are retained.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
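To illustrate what a table-driven, data-only case matrix might look like, here is a minimal Go sketch. The `gpuCase` type, its field names, the `upstreamFor` helper, and the fixture file names are assumptions made for illustration; only the case names and upstream CR resources are taken from the PR description.

```go
package main

import "fmt"

// gpuCase is a hypothetical row in the test matrix; the real suite's
// struct and field names may differ.
type gpuCase struct {
	Provider   string // e.g. "vllm"
	Scenario   string // e.g. "agg"
	Fixture    string // path under testdata/ (file names here are assumed)
	UpstreamCR string // resource the provider is expected to render
}

// Name returns the subtest name in "<provider>/<scenario>" form.
func (c gpuCase) Name() string { return c.Provider + "/" + c.Scenario }

// cases mirrors the three cases named in the PR description.
var cases = []gpuCase{
	{Provider: "vllm", Scenario: "agg", Fixture: "testdata/vllm-agg.yaml", UpstreamCR: "deployments.apps"},
	{Provider: "kaito", Scenario: "agg", Fixture: "testdata/kaito-agg.yaml", UpstreamCR: "workspaces.kaito.sh"},
	{Provider: "dynamo", Scenario: "agg", Fixture: "testdata/dynamo-agg.yaml", UpstreamCR: "dynamographdeployments.nvidia.com"},
}

// upstreamFor looks up the upstream CR resource for a case name.
func upstreamFor(name string) (string, bool) {
	for _, c := range cases {
		if c.Name() == name {
			return c.UpstreamCR, true
		}
	}
	return "", false
}

func main() {
	for _, c := range cases {
		fmt.Printf("%s -> %s (%s)\n", c.Name(), c.UpstreamCR, c.Fixture)
	}
}
```

Under this shape, a new scenario would be one more row in `cases` plus its fixture, with no change to the lifecycle code.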
…view

The phase-1 scheduling classifier treated a pod that was scheduled to a node but still `Pending` while pulling a multi-GB image as unschedulable, calling `t.Fatalf` after the 2-minute deadline. Cold-cache runs (the common case in CI) would fail healthy cases. Treat a pod as scheduled unless it carries an explicit `PodScheduled=False` condition, leaving image-pull and startup latency to the 45-minute `Running` wait.

- rewrite `classifyScheduling`/`unschedulableReason` so only `PodScheduled=False` counts as not-scheduled; add `scheduling_logic_test.go` covering all four decision branches.
- re-add a cascade/no-orphans assertion (`assertNoOrphans`): after a graceful MD delete, verify the upstream CR (and the Dynamo PVC + download `Job`) are garbage-collected, restoring a regression check lost when the old `TestDynamoProviderE2E` was removed.
- raise the `go test` global timeout from `45m` to `75m` so it cannot fire before a case completes its `t.Cleanup` and frees its GPU.
- require `docker` unconditionally in `require_tools`, since `preflight_pull` runs `docker manifest inspect` even under `--skip-build`.
- harden helpers: keep the first error in `WaitFor`, use `CombinedOutput` for `getNodes` so kubectl stderr surfaces, log swallowed `json.Unmarshal` errors, and fail `patchFixture` loudly if the storage-class literal is absent.
- add a `## GPU End-to-End Testing` section to `docs/development.md` documenting the workflow, cluster preconditions, and `GPU_E2E_*` knobs.
- declare `gpu-e2e` `.PHONY`; drop dead `pvcName`/`jobName` consts; fix stale comments referencing the removed test and the results-dir env var.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
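The rule described in this commit (only an explicit `PodScheduled=False` counts as not scheduled) can be sketched roughly as follows. The `podInfo` and `podCondition` types and the function signature are assumptions for illustration, not the suite's actual code.

```go
package main

import "fmt"

// podCondition is a trimmed-down stand-in for a Kubernetes pod condition.
type podCondition struct {
	Type    string
	Status  string
	Message string
}

// podInfo is a hypothetical view of the pod fields the classifier needs.
type podInfo struct {
	Name       string
	Phase      string
	Conditions []podCondition
}

// unschedulableReason returns a non-empty reason only when the pod carries
// an explicit PodScheduled=False condition. A Pending pod without that
// condition (for example, one still pulling its image) counts as scheduled.
func unschedulableReason(p podInfo) string {
	for _, c := range p.Conditions {
		if c.Type == "PodScheduled" && c.Status == "False" {
			if c.Message != "" {
				return c.Message
			}
			return "PodScheduled=False"
		}
	}
	return ""
}

func main() {
	pulling := podInfo{
		Name:       "worker-0",
		Phase:      "Pending",
		Conditions: []podCondition{{Type: "PodScheduled", Status: "True"}},
	}
	starved := podInfo{
		Name:  "worker-1",
		Phase: "Pending",
		Conditions: []podCondition{{
			Type:    "PodScheduled",
			Status:  "False",
			Message: "Insufficient nvidia.com/gpu",
		}},
	}
	fmt.Printf("%s: %q\n", pulling.Name, unschedulableReason(pulling))
	fmt.Printf("%s: %q\n", starved.Name, unschedulableReason(starved))
}
```

Keying on the condition rather than on the `Pending` phase is what lets a slow image pull fall through to the longer `Running` wait instead of failing at the scheduling deadline.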
The inference assertion posted to `status.gateway.endpoint` (the gateway's external LoadBalancer IP), which is unreachable from machines whose egress to that IP is blocked by network policy (e.g. an NSG that denies Internet-sourced inbound), so every case failed at `InferenceServing` despite serving correctly. Reach the gateway through a `kubectl port-forward` to the gateway Service instead, which tunnels via the API server and works from any machine with kubectl access.

- add `e2eutil.PortForwardService`, a port-forward helper that exposes a cluster Service on a free local port and stops itself via `t.Cleanup`.
- `assertInference` now port-forwards `svc/inference-gateway-istio` and posts to the local address; `GatewayChatCompletion` takes a base URL instead of an endpoint IP. The model name is still read from `status.gateway.modelName`.
- fix `patchFixture` to only enforce the `storageClassName` literal for Dynamo fixtures that actually declare storage, so a storage-less fixture is no longer rejected.
- document the shared-BBR restart race (`kaito-project#334`) as a known gateway limitation in `docs/gateway.md`, and note in the case table why disaggregated Dynamo serving is excluded from the suite.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
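As a rough sketch of the request side of this change, the snippet below builds a chat-completions URL from a base URL and extracts the reply from an OpenAI-style response body. The helper names (`chatCompletionsURL`, `buildChatBody`, `parseChatContent`) and the example model name are hypothetical, and the response shape assumes the gateway returns the usual OpenAI-compatible `choices[].message.content` layout.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

type chatResponse struct {
	Choices []struct {
		Message chatMessage `json:"message"`
	} `json:"choices"`
}

// chatCompletionsURL joins a base URL (for example, the local port-forward
// address) with the chat-completions path.
func chatCompletionsURL(baseURL string) string {
	return strings.TrimRight(baseURL, "/") + "/v1/chat/completions"
}

// buildChatBody encodes a single-message chat request for the given model.
func buildChatBody(model, prompt string) ([]byte, error) {
	return json.Marshal(chatRequest{
		Model:    model,
		Messages: []chatMessage{{Role: "user", Content: prompt}},
	})
}

// parseChatContent returns the first choice's message content, or an error
// when the body is malformed or carries no usable choice.
func parseChatContent(body []byte) (string, error) {
	var resp chatResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", fmt.Errorf("decode chat response: %w", err)
	}
	if len(resp.Choices) == 0 {
		return "", errors.New("chat response has no choices")
	}
	content := resp.Choices[0].Message.Content
	if content == "" {
		return "", errors.New("chat response has empty content")
	}
	return content, nil
}

func main() {
	body, err := buildChatBody("example-model", "ping")
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	fmt.Println("POST", chatCompletionsURL("http://127.0.0.1:8080/"))
	fmt.Println(string(body))
}
```

Taking a base URL rather than an IP keeps the same code usable against either a port-forwarded local address or a directly reachable endpoint.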
**CI**
- Add a `gpu-e2e-check` make target and CI job that run gofmt, `go vet`, a `-tags=e2e` compile, and the cluster-free unit tests on a plain runner, so the GPU-coupled suite cannot rot between out-of-band GPU runs.

**Extract cluster-free logic for CI unit testing**
- Move the scheduling classifier (`UnschedulableReason`, `PodScheduledMessage`, `PodInfo`, `GPUResource`) into a new tag-free `sched` package; `main_test.go` and `scheduling_test.go` now consume it. Replaces the `e2e`-tagged `scheduling_logic_test.go`, which could not run in CI.
- Extract `parseChatResponse` and `InjectStorageClass` / `PinnedStorageClass` as pure functions in `e2eutil`, each with table-driven tests. `patchFixture` now delegates to `InjectStorageClass`.

**Fixes**
- Add `workloadSelector` to narrow the Dynamo scheduling check to the GPU worker. The graph-deployment selector also matches the GPU-less frontend, which schedules instantly and masked the capacity-SKIP path.
- Harden the gateway port-forward: replace the fixed `sleep 3` with a readiness poll, and re-establish the tunnel via `EnsureReady` when it drops mid-window.
- `cleanup` now force-cascades only on a delete timeout; other delete errors (RBAC, missing CRD) fail loudly instead of silently skipping the orphan check.
- `atoiQuantity` uses `strconv.Atoi`, rejecting trailing junk like `5x` that `fmt.Sscanf` accepted.
- Remove the dead `providerReadyTimeout` const.

**Security / ops**
- Pass `HF_TOKEN` to `kubectl create secret` via stdin (`--from-file=...=/dev/stdin`) so it never appears in process argv.
- Recognize an existing KAITO operator from either the Helm chart or the AKS AI-toolchain add-on (`kube-system`) before installing, mirroring `providers/kaito/upstream_health.go`.
- Fix a stale comment in `providers/dynamo/Makefile` (`TestDynamoProviderE2E` becomes `TestDynamoMultiNodeE2E`, `TestDynamoStorageValidationE2E`).

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
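The `atoiQuantity` fix is easy to reproduce in isolation. The sketch below contrasts the two parsers on the `5x` input mentioned above; the function signatures are assumptions, and `sscanfQuantity` is a hypothetical name for the old behavior.

```go
package main

import (
	"fmt"
	"strconv"
)

// sscanfQuantity shows the lenient behavior: the %d verb stops at the first
// non-digit and Sscanf does not complain about leftover input.
func sscanfQuantity(s string) (int, error) {
	var n int
	_, err := fmt.Sscanf(s, "%d", &n)
	return n, err
}

// atoiQuantity is a sketch of the stricter parse: strconv.Atoi requires the
// whole string to be a valid integer.
func atoiQuantity(s string) (int, error) {
	n, err := strconv.Atoi(s)
	if err != nil {
		return 0, fmt.Errorf("invalid quantity %q: %w", s, err)
	}
	return n, nil
}

func main() {
	for _, in := range []string{"4", "5x"} {
		lenient, lerr := sscanfQuantity(in)
		strict, serr := atoiQuantity(in)
		fmt.Printf("%q: Sscanf=(%d, %v) Atoi=(%d, %v)\n", in, lenient, lerr, strict, serr)
	}
}
```

With the lenient parser a malformed GPU quantity would be silently read as a valid number, which could skew the capacity check; the strict parser surfaces it as an error instead.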
- `test.yml`: bump the `gpu-e2e-check` job's `actions/checkout` from `v6.0.3` to `v7.0.0`, matching the SHA every other job in the file already pins.
- `gpu_e2e_test.go`: fix the `runCase` doc comment that claimed teardown is registered first. `recordResult` is registered first (so it runs last under LIFO); reword the header to match the actual registration order.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
c16022e to c2842cb
The old comment claimed `EnsureReady` could "re-pick" the local port and that a lost close→bind race was "not a hard failure". Neither is true: `p.local` is fixed at construction and re-bound as-is, and a genuine port steal makes `start()`'s readiness poll `t.Fatalf` at the 15s deadline. Reword the comment to match the code (comment-only; no behavior change).

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
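For readers unfamiliar with the close→bind race, here is a minimal Go sketch of the general pattern: pick a free port by binding to port 0, release it, then poll for readiness with a deadline. It is not the suite's `PortForwardService`; `freeLocalPort` and `waitReady` are hypothetical helpers, and a plain local listener stands in for `kubectl port-forward`.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// freeLocalPort asks the kernel for an unused TCP port by binding to port 0
// and then closing the listener. Another process can grab the port between
// this close and a later re-bind; that window is the close→bind race.
func freeLocalPort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

// waitReady polls until a TCP connection to addr succeeds or the timeout
// elapses, replacing a fixed sleep with a bounded readiness check.
func waitReady(addr string, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		conn, err := net.DialTimeout("tcp", addr, interval)
		if err == nil {
			conn.Close()
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("%s not ready after %s: %w", addr, timeout, err)
		}
		time.Sleep(interval)
	}
}

func main() {
	port, err := freeLocalPort()
	if err != nil {
		fmt.Println("no free port:", err)
		return
	}
	addr := fmt.Sprintf("127.0.0.1:%d", port)

	// Stand-in for the tunnel process: re-bind the port chosen above.
	l, err := net.Listen("tcp", addr)
	if err != nil {
		fmt.Println("lost the close→bind race:", err)
		return
	}
	defer l.Close()

	if err := waitReady(addr, 15*time.Second, 100*time.Millisecond); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("ready on", addr)
}
```

Because the port is chosen once and then re-bound, a process that steals it in between shows up as a readiness timeout rather than being silently recovered, which matches the behavior the reworded comment describes.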
Description

Adds a consolidated, GPU-cluster end-to-end test suite (`test/e2e/gpu/`) that deploys each inference provider (`dynamo`, `vllm`, and `kaito`) through a real `ModelDeployment`, drives it to `Running`, and asserts that inference actually serves through the inference gateway. The suite is a zero-dependency Go module driven by a thin Bash orchestrator (`scripts/gpu-e2e.sh`), with its cluster-free decision logic carved into unit-testable packages that run in CI on a plain runner. It supersedes the old single-provider `TestDynamoProviderE2E`, porting its deep assertions into the new table-driven matrix. The workflow is documented in `docs/development.md` under GPU End-to-End Testing.

Type of Change

Related Issues

kaito-project/airunway#334: BBR builds its model registry only at startup, so the controller rolling-restarts the shared BBR Deployment once per new `ModelDeployment` (tracked by the `airunway.ai/bbr-restarted` annotation). The restart is not zero-downtime: during it, an in-flight request for an already-serving model can miss its `X-Gateway-Model-Name` header and mis-route. This is documented as a known gateway limitation in `docs/gateway.md`, and is why disaggregated Dynamo serving is excluded from the v1 matrix.

Changes Made
**GPU end-to-end suite (`test/e2e/gpu/`)**
- Zero-dependency Go module behind the `e2e` build tag. A single table-driven `TestGPUProviders` runs every `(provider × scenario)` case as a parallel subtest through a uniform lifecycle: apply fixture → wait for the rendered upstream CR → scheduling classification → `Running` + provider-name check → `GatewayReady` → provider-specific assertions → inference via the gateway → teardown.
- `vllm/agg` (`deployments.apps`), `kaito/agg` (`workspaces.kaito.sh`), and `dynamo/agg` (`dynamographdeployments.nvidia.com`), each with a fixture under `testdata/`. Adding a new `(provider × scenario)` case is a data-only change to the `cases` table.
- `TestMain` enforces cheap preconditions (≥1 allocatable `nvidia.com/gpu`, gateway `Programmed`) and fails fast.
- Two-phase scheduling classification with `PASS`/`FAIL`/`SKIP` outcomes: a static permanent-unschedulable check (per-pod GPU demand vs. largest node) plus a deadline-bounded poll that distinguishes "not scheduled" (`PodScheduled=False`) from "scheduled, pulling image."
- Teardown runs in `t.Cleanup` so each parallel case frees its GPU as soon as it finishes; a graceful `ModelDeployment` delete is followed by `assertNoOrphans` (upstream CR, Dynamo PVC, and download `Job` are garbage-collected), with a timeout-only force-cascade fallback. Per-case logs and a `result` marker are written under the results directory.

**Orchestration & build**
- `scripts/gpu-e2e.sh` builds and pushes the controller + provider images in parallel, gates `setup-<provider>` on operator health, deploys, then invokes the Go suite. It never creates or deletes the cluster. KAITO detection recognizes both the Helm chart and the AKS AI-toolchain add-on (`kube-system`), mirroring `providers/kaito/upstream_health.go`. `HF_TOKEN` is passed to `kubectl create secret` via stdin so it never lands in process argv.
- `Makefile` gains `gpu-e2e` (full run, flags via `GPU_E2E_ARGS`) and `gpu-e2e-check` (cluster-free gate).

**CI**
- `gpu-e2e-check` job in `.github/workflows/test.yml` runs `gofmt`, `go vet`, a `-tags=e2e` compile-check, and the cluster-free unit tests on a plain `ubuntu-latest` runner, so the GPU-coupled suite cannot rot between out-of-band GPU runs.

**Cluster-free logic extracted for unit testing**
- `sched` package (`UnschedulableReason`, `PodScheduledMessage`, `PodInfo`, `GPUResource`) and `e2eutil` helpers (`parseChatResponse`, `InjectStorageClass`) are pure, tag-free functions with table-driven tests, exercising the classifier, the chat-response parser, and the storage-class injector without a cluster.

**Inference reachability fix**
- `assertInference` reaches the gateway through a `kubectl port-forward` to `svc/inference-gateway-istio` (`e2eutil.PortForwardService`) rather than the external LoadBalancer IP, so it works from machines whose egress to that IP is blocked by network policy. The port-forward uses a readiness poll instead of a fixed sleep and re-establishes itself via `EnsureReady` if the tunnel drops mid-window.

**Cleanup of superseded test**
- Removes `TestDynamoProviderE2E` and its exclusive helpers from `providers/dynamo/test/e2e/`; its PVC / download-`Job` / DGD-ownership assertions are ported into the new `dynamo` case. The Dynamo mocker, multinode, and storage-validation tests are retained, and a stale `Makefile` comment is corrected.

**Docs**
- `docs/development.md` gains a `## GPU End-to-End Testing` section: the workflow, cluster preconditions (GPU nodes + NFD, an RWX-capable `StorageClass`, the inference gateway, image pull access), the run commands, the `GPU_E2E_*` environment knobs, and the PASS/FAIL/SKIP outcome semantics.
- `docs/gateway.md` documents the shared-BBR restart race (kaito-project/airunway#334).

Testing
The full suite requires a pre-provisioned GPU cluster and runs out-of-band via `scripts/gpu-e2e.sh` (see `docs/development.md`). CI runs only the cluster-free `gpu-e2e-check` gate.
- … (`bun run test`), N/A: this branch is a standalone Go module, not the web UI. The equivalent gate is `make gpu-e2e-check`, which passes locally (`sched` and `e2eutil` green, `gofmt` clean, `-tags=e2e` compile clean).
- … `Running`, `GatewayReady`, and serve inference. A `SKIP` (insufficient GPU capacity) does not fail the run; only a broken deployment, failed inference, or orphaned resource after delete is a `FAIL`.
- … `southcentralus`).

Checklist
- … (`gofmt` clean, `go vet` clean)
- … `bun run lint`, N/A for this Go module; `go vet` is wired into `gpu-e2e-check` instead.
- … (`sched`, `e2eutil` green)
- … (`docs/development.md`, `docs/gateway.md`)

Additional Notes
Cluster preconditions (the harness installs none of these except a missing operator via `setup-<p>`):
- GPU nodes with `nvidia.com/gpu` and the `nvidia.com/gpu.present=true` label.
- An RWX-capable `StorageClass`. The Dynamo model-cache PVC defaults to `ReadWriteMany`; Azure Disk classes are `ReadWriteOnce` and leave the PVC `Pending`. Default is `azurefile-premium`; override with `--storage-class`.
- The inference gateway (a `Gateway` named `inference-gateway`), present and `Programmed`. `make -C providers/dynamo setup-dynamo` installs it on a fresh cluster.
- Image pull access: … `imagePullSecret`, so images must be public or the nodes must have pull access; new registry repos often default to private.

Environment knobs (forwarded by the script; can also be set directly for `go test`):
- `GPU_E2E_STORAGE_CLASS`: `StorageClass` injected into the Dynamo fixture and asserted on (default `azurefile-premium`). Set by `--storage-class`.
- `GPU_E2E_KEEP`: if `true`, leave `ModelDeployment`s running after the test for inspection. Set by `--keep`.
- `GPU_E2E_RESULTS_DIR`: … `test/e2e/gpu/gpu-e2e-results/<timestamp>/`).
- `GPU_E2E_RUN_TS`: …

The `test/e2e/gpu` module has zero external dependencies (stdlib only), so it has a `go.mod` but no `go.sum`; the CI cache key is `test/e2e/gpu/go.mod`.
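As an illustration of how a test binary might consume the two knobs whose semantics are fully stated above, here is a small Go sketch. The `suiteConfig` type and `loadConfig` function are hypothetical; only the variable names, the `azurefile-premium` default, and the `true` value for keep come from the description.

```go
package main

import (
	"fmt"
	"os"
)

// suiteConfig holds the two knobs covered by this sketch.
type suiteConfig struct {
	StorageClass string
	Keep         bool
}

// loadConfig takes the lookup function as a parameter (os.LookupEnv in a
// real run) so it can be unit-tested without touching the process
// environment.
func loadConfig(lookup func(string) (string, bool)) suiteConfig {
	cfg := suiteConfig{StorageClass: "azurefile-premium"}
	if v, ok := lookup("GPU_E2E_STORAGE_CLASS"); ok && v != "" {
		cfg.StorageClass = v
	}
	if v, ok := lookup("GPU_E2E_KEEP"); ok {
		cfg.Keep = v == "true"
	}
	return cfg
}

func main() {
	cfg := loadConfig(os.LookupEnv)
	fmt.Printf("storageClass=%s keep=%t\n", cfg.StorageClass, cfg.Keep)
}
```

Injecting the lookup function follows the same idea as the cluster-free extraction described earlier: the decision logic stays pure and can run on a plain CI runner.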