This project evaluates a production inference stack built on top of existing OSS projects. The stack follows the llm-d reference architecture; credit goes to the llm-d contributors for that design and for several core components, such as the EPP. In this stack, KAITO provides the inference-engine layer, and the evaluation focuses on request-routing and autoscaling performance. We run the vLLM simulator so that the entire stack can be evaluated using CPUs only.
- Istio Gateway — Entry point for all inference requests. Routes client requests (e.g., `POST /v1/chat/completions`) through the stack.
- `llm-gateway-auth` — ext_authz API-key authorization filter. Validates the `Authorization: Bearer <token>` header against an `APIKey` custom resource resolved from the request's `Host` subdomain (`<namespace>.gw.example.com`) before any routing or model dispatch happens. Ships two components — `apikey-operator` (reconciles `APIKey` CRs into per-namespace Secrets) and `apikey-authz` (the ext_authz dataplane).
- Body-based routing (BBR) — Parses the request body to extract the model name and injects the `X-Gateway-Model-Name` header, enabling model-level routing.
- `llm-d-inference-scheduler` (EPP) — Per-model Endpoint Picker (image `mcr.microsoft.com/oss/v2/llm-d/llm-d-inference-scheduler`). Performs KV-cache-aware routing by injecting the `x-gateway-destination-endpoint` header, directing requests to the optimal inference pod.
- KAITO InferenceSet — Manages groups of vLLM inference pods. Multiple InferenceSets (e.g., Model-A, Model-B) can run different models simultaneously.
- vLLM Inference Pods (`llm-d-inference-sim`) — Serve model inference requests. On CPU-only E2E clusters, the real vLLM container is replaced by a shadow pod running `llm-d-inference-sim` (image `ghcr.io/llm-d/llm-d-inference-sim`), a lightweight vLLM-compatible simulator that exposes the same OpenAI API and `vllm:*` Prometheus metrics. See `pkg/gpu-node-mocker/README.md` for the original-pod ↔ shadow-pod mechanism.
- `keda-kaito-scaler` — Metric-based autoscaler built on KEDA that scales vLLM inference pods up and down based on workload metrics.
- Mocked GPU Nodes / CPU Nodes — Infrastructure layer providing compute resources for inference workloads. The `gpu-node-mocker` controller (E2E-only) fakes GPU nodes on CPU-only clusters and runs the `llm-d-inference-sim` shadow pods on real CPU nodes.
- Client → Istio Gateway. The client sends `POST /v1/chat/completions` to `<namespace>.gw.example.com` with a bearer token (an example request is sketched after this list).
- Gateway → ext-proc filters. `llm-gateway-auth` validates the token; BBR parses the body and injects `X-Gateway-Model-Name`.
- Gateway → EPP. The per-deployment `HTTPRoute` matches the model name and calls `llm-d-inference-scheduler`, which returns the target pod via `x-gateway-destination-endpoint`.
- Gateway → vLLM Pod. Envoy forwards the request directly to the chosen inference pod; the response streams back along the reverse path.
- Unmatched models. The namespace's `model-not-found-direct` `EnvoyFilter` (rendered by `charts/modelharness`) patches a catch-all `direct_response` onto the Gateway's HCM, returning an OpenAI-compatible `404 model_not_found` directly from Envoy with no backend Pod / Service / `ReferenceGrant`. The catch-all is also required to keep API-key ext_authz running on unknown-model requests (Istio's CUSTOM `AuthorizationPolicy` is gated on metadata written during route matching).
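For illustration, a request through the stack might look like the following. The hostname, namespace, and model names are placeholders; the bearer token is the value the `apikey-operator` stores in the namespace's `llm-api-key` Secret:

```bash
# Known model: routed via BBR + EPP to an inference pod.
curl -sS http://my-models.gw.example.com/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model": "model-a", "messages": [{"role": "user", "content": "Hello"}]}'

# Unknown model: never reaches a backend. Envoy's model-not-found-direct
# filter answers with an OpenAI-compatible 404 model_not_found body.
curl -sS http://my-models.gw.example.com/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model": "no-such-model", "messages": [{"role": "user", "content": "Hi"}]}'
```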
- vLLM pods → metrics. Each pod exposes `vllm:*` Prometheus metrics (queue depth, KV-cache utilisation, request rate); see the inspection sketch after this list.
- `keda-kaito-scaler` → KEDA. The external scaler aggregates per-`InferenceSet` pod metrics and returns a single summed metric value.
- KEDA → HPA → InferenceSet. KEDA exposes that value through the external metrics API; the HPA computes the desired replica count from it and patches the `InferenceSet`, and the KAITO controller adds or removes vLLM pods.
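To inspect the raw signals the scaler consumes, you can port-forward to an inference (or simulator) pod and scrape its metrics endpoint. A sketch, assuming the pod serves on port 8000 (vLLM's default; an assumption for the simulator as well), with placeholder pod and metric names:

```bash
# Placeholder namespace and pod name.
kubectl -n my-models port-forward pod/model-a-0 8000:8000 &
curl -s localhost:8000/metrics | grep '^vllm:'   # queue depth, KV-cache utilisation, ...

# The aggregated value the HPA acts on is served through the external
# metrics API; the metric name below is illustrative, not the scaler's
# actual name.
kubectl get --raw \
  "/apis/external.metrics.k8s.io/v1beta1/namespaces/my-models/inferenceset-load"
```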
Install the stack in three steps. Step 1 is one-time per cluster; steps 2 and 3 are repeated per workload namespace and per model, respectively.
Installed by `hack/e2e/scripts/install-components.sh` (or its production equivalent). These components live across multiple namespaces and are shared by every model deployment:
| Component | Namespace | Version (`versions.env`) | Install method | Role |
|---|---|---|---|---|
| KAITO workspace controller | `kaito-system` | latest chart, image `nightly-latest` | helm | Reconciles `InferenceSet` and provisions inference pods. |
| `gpu-node-mocker` (E2E-only) | `kaito-system` | repo HEAD (`SHADOW_CONTROLLER_IMAGE`) | helm | Creates fake GPU nodes + shadow pods on CPU-only clusters. |
| Gateway API CRDs | cluster-scoped | `GATEWAY_API_VERSION` (v1.2.0) | kubectl | Required for `Gateway`, `HTTPRoute`, `ReferenceGrant`. |
| Istio control plane (`istiod`) | `istio-system` | `ISTIO_VERSION` (1.29.2) | istioctl | Implements the Gateway dataplane (Envoy) and ext_proc filter chain. |
| GAIE CRDs | cluster-scoped | latest | kubectl | `InferencePool`, `InferenceObjective`. |
| BBR (Body-Based Router) | `istio-system` | `BBR_VERSION` (v1.3.1) | helm | Installed in Istio's `rootNamespace` so its `EnvoyFilter` applies cluster-wide; injects `X-Gateway-Model-Name`. |
| `llm-gateway-auth` (kaito-project/llm-gateway-auth) | `llm-gateway-auth` | `LLM_GATEWAY_AUTH_VERSION` (0.0.7-alpha) | helm | API-key ext_authz for the inference-gateway. Installs the `APIKey` CRD, the `apikey-operator` (reconciles `APIKey` → per-namespace Secret), and the `apikey-authz` ext_authz dataplane wired into Istio via `MeshConfig` + `AuthorizationPolicy`. |
| KEDA + KEDA Kaito Scaler (kaito-project/keda-kaito-scaler, optional) | `keda` (or `kube-system` when `E2E_PROVIDER=azure`) | `KEDA_VERSION` (v2.19.0), `KEDA_KAITO_SCALER_VERSION` (v0.5.1) | helm | Workload-metric autoscaling. With `E2E_PROVIDER=azure`, KEDA is provided by the AKS managed add-on in `kube-system` and the keda-kaito-scaler chart is installed alongside it. |
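For orientation, a minimal sketch of two of the equivalent manual steps; the Gateway API manifest URL follows the upstream release layout, and the `istioctl` flags are illustrative (the script pins everything via `versions.env` and covers the remaining helm releases):

```bash
source versions.env   # pins GATEWAY_API_VERSION, ISTIO_VERSION, ... (path may differ)

# Gateway API CRDs (kubectl): upstream standard-install manifest.
kubectl apply -f "https://github.com/kubernetes-sigs/gateway-api/releases/download/${GATEWAY_API_VERSION}/standard-install.yaml"

# Istio control plane (istioctl).
istioctl install -y

# The remaining components (KAITO, BBR, llm-gateway-auth, KEDA, ...) are helm
# releases; see hack/e2e/scripts/install-components.sh for the exact charts,
# repositories, and values used.
```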
Provisioned by the `charts/modelharness` Helm chart. One Helm release per workload namespace owns every per-namespace shared resource — the Istio Gateway that fronts the namespace, the catch-all `EnvoyFilter` (`model-not-found-direct`) that returns an OpenAI-compatible 404 for unknown models directly from Envoy, and — when enabled — the per-namespace `AuthorizationPolicy` + `APIKey` CR that wire that Gateway into the cluster-wide `apikey-ext-authz` CUSTOM provider, plus optional `NetworkPolicy` resources that lock down East-West ingress to inference workloads. A namespace may host one or more model deployments, all of which share its Gateway:
| Resource | Where | Version | Source | Role |
|---|---|---|---|---|
| `Gateway` (`gateway.networking.k8s.io/v1`) | Per namespace | API `v1` | `charts/modelharness` | Public entry point; `gatewayClassName: istio`, HTTP/80. |
| Catch-all `EnvoyFilter` `model-not-found-direct` | Per namespace | `networking.istio.io/v1alpha3` | `charts/modelharness` | Patches a `direct_response` onto the namespace Gateway's HCM, returning an OpenAI 404 for any path not matched by a deployment-specific `HTTPRoute`. Required to keep API-key ext_authz running on unknown-model requests. |
| `AuthorizationPolicy` `apikey-gateway-ext-authz` (auth-enabled) | Per namespace | `security.istio.io/v1` | `charts/modelharness` (`auth.enabled`) | Wires the per-namespace Gateway pod into the cluster-wide `apikey-ext-authz` CUSTOM provider (registered in `MeshConfig` by `llm-gateway-auth`). |
| `APIKey` `default` (auth-enabled) | Per namespace | `apikeys.kaito.sh/v1alpha1` | `charts/modelharness` (`auth.enabled`) | Triggers the `apikey-operator` to reconcile a Secret (`llm-api-key`) holding the bearer token clients send. |
| `NetworkPolicy` `default-deny-ingress` + `allow-inference-traffic` (network-policy enabled) | Per namespace | `networking.k8s.io/v1` | `charts/modelharness` (`networkPolicy.enabled`) | Denies all ingress to non-gateway pods, then re-permits intra-namespace ingress so EPP can reach vLLM/shadow pods. Cross-namespace ingress can be opened via `networkPolicy.allowedIngressNamespaces` (e.g. `keda` for the keda-kaito-scaler). |
In the E2E suite the chart is installed and uninstalled by `EnsureNamespace` / `DeleteNamespace` (called from `InstallCase` / `UninstallCase` in `cases.go`). The auth- and network-policy-related resources are skipped when `auth.enabled=false` / `networkPolicy.enabled=false`.
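Outside the E2E suite, a manual install of the harness chart might look like this; the release name, namespace, and allowed-namespace list are placeholders, while the value keys are the chart toggles named above:

```bash
helm install modelharness ./charts/modelharness \
  --namespace my-models --create-namespace \
  --set auth.enabled=true \
  --set networkPolicy.enabled=true \
  --set 'networkPolicy.allowedIngressNamespaces={keda}'
```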
Provisioned by the `charts/modeldeployment` Helm chart. One Helm release per model deployment, parented to the namespace's Gateway:
| Resource | Version (chart-rendered) | Install method | Role |
|---|---|---|---|
| `InferenceSet` (`kaito.sh/v1alpha1`) | `v1alpha1` | helm | Reconciled by KAITO; renders inference pods running vLLM. |
| `InferencePool` (`inference.networking.k8s.io/v1`) | `v1` | helm | Selects the inference pods backing this deployment. |
| EPP `Deployment` + `Service` + RBAC + `ConfigMap` | `apps/v1`, `v1`, `rbac/v1` | helm | Endpoint Picker (`llm-d-inference-scheduler`) for KV-cache-aware routing. |
| `HTTPRoute` (`gateway.networking.k8s.io/v1`) | `v1` | helm | Matches `X-Gateway-Model-Name == <name>` on the namespace's Gateway and forwards to the `InferencePool`. |
The chart's `name` value is the per-deployment routing key; `model` is the underlying KAITO preset. See the `charts/modeldeployment` chart README for the full value schema and install examples.
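A minimal install sketch; `name` and `model` are the two values called out above, everything else is left at chart defaults (consult the chart README for the real schema):

```bash
# "model-a" is the routing key clients put in the request body's "model"
# field; the preset is a placeholder for a real KAITO preset name.
helm install model-a ./charts/modeldeployment \
  --namespace my-models \
  --set name=model-a \
  --set model="<kaito-preset>"
```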
A flat index of the CRD-backed resources Production Stack creates, grouped by the controller / chart that owns them. Kubernetes native objects (`Deployment`, `Service`, `ConfigMap`, `ServiceAccount`, `Role` / `RoleBinding`, `Pod`, `Node`, …) are intentionally omitted — they are implementation details of the charts above.
| Resource (Kind) | Group / Version | Source | Purpose |
|---|---|---|---|
| `Workspace` | `kaito.sh/v1alpha1` | KAITO | Aggregates inference workloads (used indirectly via `InferenceSet`). |
| `InferenceSet` | `kaito.sh/v1alpha1` | KAITO | Declares one model deployment; KAITO renders inference pods. |
| `InferencePool` | `inference.networking.k8s.io/v1` | Gateway API Inference Extension (GAIE) | GAIE pool selecting the inference pods backing a deployment. |
| `InferenceObjective` | `inference.networking.k8s.io/v1` | Gateway API Inference Extension (GAIE) | API object defining objective contracts; CRD only — not authored by this stack. |
| `APIKey` | `apikeys.kaito.sh/v1alpha1` | kaito-project/llm-gateway-auth | Declares an API key for a gateway namespace; the `apikey-operator` reconciles it into a Secret (`llm-api-key` by default) consumed by the `apikey-authz` ext_authz filter. |
| `Gateway` | `gateway.networking.k8s.io/v1` | Kubernetes Gateway API | Per-namespace public entry point; `gatewayClassName: istio`, HTTP/80. |
| `HTTPRoute` | `gateway.networking.k8s.io/v1` | Kubernetes Gateway API | Per-deployment routes match `X-Gateway-Model-Name == <name>` and forward to the deployment's `InferencePool`. |
| `EnvoyFilter` | `networking.istio.io/v1alpha3` | Istio | BBR injects ext_proc into every Istio Gateway via `rootNamespace`. `charts/modelharness` also renders the per-namespace `model-not-found-direct` filter, which patches an OpenAI 404 `direct_response` onto the namespace Gateway's HCM as the catch-all for unknown models. |
| `AuthorizationPolicy` | `security.istio.io/v1` | Istio (rendered by `llm-gateway-auth` + `charts/modelharness`) | `llm-gateway-auth` targets the cluster-wide inference-gateway; `charts/modelharness` renders a per-namespace `apikey-gateway-ext-authz` policy that wires each workload namespace's Gateway pod into the `apikey-ext-authz` CUSTOM provider. |
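To make the routing contract concrete, here is a simplified sketch of the kind of `HTTPRoute` the `modeldeployment` chart renders. Normally the chart owns this object; the names, the parent Gateway reference, and the omitted fields are illustrative:

```bash
kubectl apply -n my-models -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-a
spec:
  parentRefs:
    - name: inference-gateway        # the namespace's shared Gateway (name illustrative)
  rules:
    - matches:
        - headers:
            - name: X-Gateway-Model-Name
              value: model-a         # the chart's `name` value, injected by BBR
      backendRefs:
        - group: inference.networking.k8s.io
          kind: InferencePool
          name: model-a              # the pool selecting this deployment's pods
EOF
```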
The E2E suite under `test/e2e/` exercises the full stack (Gateway → `llm-gateway-auth` (ext_authz) → BBR → EPP (`llm-d-inference-scheduler`) → vLLM shadow pod (`llm-d-inference-sim`)) against a live AKS cluster. Tests run as parallel `Ordered` Ginkgo `Describe`s, one per case namespace. See `test/e2e/README.md` for the full framework guide, helper API, and the "Adding a new e2e test" workflow.
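A sketch of a local run, assuming a kubeconfig pointing at the target AKS cluster; the Ginkgo invocation is illustrative, and the authoritative entry point and required environment are documented in `test/e2e/README.md`:

```bash
# -p runs the Ordered Describes in parallel, one case namespace per process.
go run github.com/onsi/ginkgo/v2/ginkgo -p ./test/e2e/
```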
Releases are driven by a single manual GitHub Actions workflow, "Create release (manually)", which chains two reusable workflows (`publish-image.yaml` and `publish-helm-chart.yaml`) into one synchronous run. A release publishes:

- a multi-arch container image at `ghcr.io/kaito-project/gpu-node-mocker:<X.Y.Z>` (no leading `v`);
- the three Helm charts under `charts/` — `gpu-node-mocker`, `modeldeployment`, and `modelharness` — to the chart repository hosted on this repo's `gh-pages` branch (https://kaito-project.github.io/production-stack/charts/kaito-project);
- a GitHub Release at the same `vX.Y.Z` tag with auto-generated changelog notes.
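Consumers can pull the published artifacts directly; the chart repository URL and image path below are the ones listed above, while the version is an example:

```bash
# Helm charts (gh-pages chart repository)
helm repo add kaito-production-stack https://kaito-project.github.io/production-stack/charts/kaito-project
helm repo update
helm search repo kaito-production-stack

# Container image; note there is no leading "v" on image tags.
docker pull ghcr.io/kaito-project/gpu-node-mocker:0.3.0
```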
To publish `vX.Y.Z`:

1. Open a PR against `main` that bumps the chart versions for any chart whose contents changed in this release. For each touched chart in `charts/`, update its `Chart.yaml` (`version` and `appVersion`); when `charts/gpu-node-mocker` ships a new mocker image, also update the `image.tag` in its `values.yaml`. A typical bump therefore touches each shipped chart's `Chart.yaml` and, when the mocker image changes, `charts/gpu-node-mocker/values.yaml`.
2. After the PR is merged, run Actions → "Create release (manually)" with `release_version=vX.Y.Z`.
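Step 2 can also be triggered from the command line; a sketch using the GitHub CLI (the workflow name and `release_version` input come from the steps above, the version is an example):

```bash
gh workflow run "Create release (manually)" -f release_version=v0.4.0
```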
Notes:

- Use the same `vX.Y.Z` value across all jobs. Git tags / Release names carry the leading `v`; container image tags do not (`X.Y.Z`).
- Each Helm chart is published with the version declared in its own `Chart.yaml`; the workflow does not rewrite chart versions. Always bump the charts you intend to ship in step 1.
- If a job fails part-way through (e.g. Trivy finds a new CVE), fix the underlying issue and rerun the workflow — `publish-image` and `create-gh-release` are idempotent (they skip tag/release creation when they already exist).
- When publishing patch releases on an older minor while `main` has moved on, cut a `release-vX.Y` branch (e.g. `release-v0.3`) and run "Create release (manually)" against that branch's tag.
Production Stack is licensed under the Apache License 2.0.
