Merged
1 change: 0 additions & 1 deletion README.md
@@ -54,7 +54,6 @@ namespaces and are shared by every model deployment:
| BBR (Body-Based Router) | `istio-system` | `BBR_VERSION` (v1.3.1) | helm | Installed in Istio's rootNamespace so its EnvoyFilter applies cluster-wide; injects `X-Gateway-Model-Name`. |
| `llm-gateway-auth` ([`kaito-project/llm-gateway-auth`](https://github.com/kaito-project/llm-gateway-auth)) | `llm-gateway-auth` | `LLM_GATEWAY_AUTH_VERSION` | helm | API-key ext_authz for the `inference-gateway`. Installs the `APIKey` CRD, the `apikey-operator` (reconciles `APIKey` → per-namespace Secret), and the `apikey-authz` ext_authz dataplane wired into Istio via `MeshConfig` + `AuthorizationPolicy`. |
| KEDA + KEDA Kaito Scaler ([`kaito-project/keda-kaito-scaler`](https://github.com/kaito-project/keda-kaito-scaler), optional) | `keda` | `KEDA_VERSION` (v2.19.0), `KEDA_KAITO_SCALER_VERSION` (v0.4.1) | helm | Workload-metric autoscaling. |
| `model-not-found` (Deployment + ConfigMap + Service) | `default` | repo `HEAD` ([`hack/e2e/manifests/model-not-found.yaml`](hack/e2e/manifests/model-not-found.yaml)) | kubectl | Cluster-shared nginx-backed Service that returns OpenAI-compatible `404 model_not_found` JSON. Referenced cross-namespace by every workload namespace's catch-all `HTTPRoute` (authorised via a `ReferenceGrant` rendered by `charts/modelharness`). |

### Step 2. modelharness (one-time per workload namespace)

62 changes: 62 additions & 0 deletions charts/modelharness/templates/envoyfilter-not-found.yaml
@@ -0,0 +1,62 @@
{{/*
Catch-all "model not found" responder, implemented as an Envoy
direct_response on the per-namespace Gateway. Replaces the previous
HTTPRoute → cluster-shared `model-not-found` Service design.

Why an EnvoyFilter direct_response instead of a backed HTTPRoute:
- Zero backend Pod / Service / cross-namespace ReferenceGrant.
- Response body is generated by Envoy itself (no extra hop).

Why a catch-all is REQUIRED (and not just a UX nicety):
Istio's CUSTOM AuthorizationPolicy is implemented as a paired
`envoy.filters.http.rbac` (shadow) + `envoy.filters.http.ext_authz`
filter — ext_authz is gated on metadata that the RBAC shadow filter
writes during decodeHeaders. When Envoy's router fails to match any
HTTPRoute it returns a local 404 BEFORE the RBAC shadow has finished
evaluating + writing that metadata, which means ext_authz is never
invoked and unknown-model requests SILENTLY BYPASS API-key auth.
Keeping a catch-all route that always matches preserves the full
filter-chain run and ensures auth runs on every request, regardless
of model name. Removing this template re-opens that bypass.

The patch is anchored to BBR's filter name as a `subFilter` so it
attaches to the same HCM that `install_bbr` injects BBR into. The
`workloadSelector` scopes it to this namespace's Gateway pod only.
*/}}
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
name: model-not-found-direct
namespace: {{ include "modelharness.namespace" . }}
labels:
{{- include "modelharness.labels" . | nindent 4 }}
spec:
workloadSelector:
labels:
gateway.networking.k8s.io/gateway-name: {{ include "modelharness.gatewayName" . | quote }}
configPatches:
- applyTo: VIRTUAL_HOST
match:
context: GATEWAY
routeConfiguration:
vhost:
name: ""
patch:
operation: MERGE
value:
routes:
# Appended last; deployment-specific HTTPRoute matches on
# X-Gateway-Model-Name win first, this rule catches the rest.
- name: model-not-found-fallback
match:
prefix: /
direct_response:
status: 404
body:
inline_string: |
{"error":{"message":"The model does not exist.","type":"invalid_request_error","param":"model","code":"model_not_found"}}
response_headers_to_add:
- header:
key: content-type
value: application/json
append_action: OVERWRITE_IF_EXISTS_OR_ADD
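As a quick local sanity check of the `inline_string` above (no cluster or gateway required), the body can be parsed to confirm it is valid JSON in the OpenAI-compatible error shape clients expect:

```python
import json

# The exact inline_string the EnvoyFilter's direct_response returns for
# unmatched models. Parsing it locally verifies it is well-formed JSON
# with the OpenAI-compatible error envelope.
body = '{"error":{"message":"The model does not exist.","type":"invalid_request_error","param":"model","code":"model_not_found"}}'

err = json.loads(body)["error"]
assert err["code"] == "model_not_found"
assert err["type"] == "invalid_request_error"
assert err["param"] == "model"
print("inline_string is valid OpenAI-compatible error JSON")
```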
34 changes: 0 additions & 34 deletions charts/modelharness/templates/httproute-not-found.yaml

This file was deleted.

24 changes: 0 additions & 24 deletions charts/modelharness/templates/referencegrant.yaml

This file was deleted.

16 changes: 7 additions & 9 deletions charts/modelharness/values.yaml
@@ -19,15 +19,13 @@ gatewayName: ""
# gatewayPort is the HTTP listener port on the Gateway.
gatewayPort: 80

# modelNotFound configures the cross-namespace reference to the
# cluster-shared model-not-found Service that the catch-all HTTPRoute
# forwards unmatched requests to. The Service itself is installed once
# per cluster in `modelNotFound.namespace` (typically `default`) by the
# E2E install script — this chart only renders the catch-all HTTPRoute
# and the ReferenceGrant authorising the cross-namespace backendRef.
modelNotFound:
namespace: "default"
serviceName: "model-not-found"
# Catch-all "model not found" responses are now produced by an Envoy
# direct_response patched onto the Gateway's HCM via the
# `model-not-found-direct` EnvoyFilter (see
# templates/envoyfilter-not-found.yaml). No backend Pod / Service /
# ReferenceGrant is required, so the previous `modelNotFound` config
# (which pointed at a cluster-shared `default/model-not-found` Service)
# has been removed.

# auth toggles the per-namespace API-key authentication artifacts. When
# enabled, the chart renders:
58 changes: 0 additions & 58 deletions hack/e2e/manifests/model-not-found.yaml

This file was deleted.

37 changes: 18 additions & 19 deletions hack/e2e/scripts/install-components.sh
@@ -21,9 +21,11 @@
# CRD is not yet served, so kubelet retries
# until KAITO finishes installing it)
# - BBR chart prefetch (git clone fork repo only)
# - Cluster-shared model-not-found Service in `default` (consumed by
# every workload namespace's catch-all HTTPRoute via a
# ReferenceGrant rendered by charts/modelharness).
#
# (Catch-all 404 handling is now provided by an EnvoyFilter
# direct_response rendered per-namespace by charts/modelharness — no
# cluster-shared Service is required, so install_model_not_found has
# been removed from this script.)
#
# Phase 2 (parallel, depends on Phase 1):
# - Istio (after Gateway API CRDs)
@@ -52,7 +54,6 @@
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
MANIFESTS_DIR="${SCRIPT_DIR}/../manifests"

# Validate required version variables are set.
: "${ISTIO_VERSION:?ISTIO_VERSION is not set. Source versions.env or export it before calling this script.}"
@@ -265,8 +266,19 @@ install_gateway_api_crds() {
}

install_gwie_crds() {
# Use server-side apply (--server-side --force-conflicts) instead of the
# default client-side apply. install_gwie_crds runs in parallel with
# install_kaito in phase1-base, and the KAITO chart bundles the same
# GWIE CRDs (inferencepools / inferenceobjectives in both
# inference.networking.k8s.io and inference.networking.x-k8s.io groups).
# Client-side apply does GET → CREATE-if-missing, which races with KAITO
# creating the CRD between the GET and the CREATE and fails with
# `AlreadyExists`. Server-side apply is a single atomic PATCH with a
# field manager: if the object already exists it is merged in place
# (with --force-conflicts taking ownership of any fields KAITO set).
echo "=== Installing GWIE CRDs ==="
kubectl apply -f "https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/latest/download/manifests.yaml"
kubectl apply --server-side --force-conflicts \
-f "https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/latest/download/manifests.yaml"
}
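The race described in the comment above can be illustrated outside Kubernetes. A minimal sketch (function names are illustrative, not kubectl's implementation) contrasting check-then-create with an idempotent upsert:

```python
def client_side_create(store, name, obj, between=lambda: None):
    """GET -> CREATE-if-missing; `between` stands in for the window in
    which a parallel installer (e.g. KAITO) may create the same object."""
    if name not in store:        # GET: object absent
        between()                # ...the other installer creates it here...
        if name in store:
            raise RuntimeError("AlreadyExists")  # CREATE conflicts
        store[name] = obj

def server_side_apply(store, name, obj):
    """Single idempotent upsert keyed by a field manager: an existing
    object is merged in place instead of failing."""
    store[name] = {**store.get(name, {}), **obj}

store = {}
kaito_wins = lambda: store.setdefault("inferencepools", {"owner": "kaito"})

try:
    client_side_create(store, "inferencepools", {"owner": "gwie"}, between=kaito_wins)
except RuntimeError as e:
    print("client-side apply:", e)   # AlreadyExists

server_side_apply(store, "inferencepools", {"owner": "gwie"})
print("server-side apply merged:", store["inferencepools"])
```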

install_keda() {
@@ -430,18 +442,6 @@ install_llm_gateway_auth() {
kubectl -n llm-gateway-auth rollout status deployment/apikey-authz --timeout=180s || true
}

install_model_not_found() {
# Cluster-shared catch-all 404 Service in `default`. Every workload
# namespace's modelharness release renders a catch-all HTTPRoute that
# forwards unmatched requests to this Service across namespaces,
# authorised by a per-namespace ReferenceGrant.
echo "=== Deploying cluster-shared model-not-found Service in default ==="
kubectl apply -f "${MANIFESTS_DIR}/model-not-found.yaml"

echo "⏳ Waiting for model-not-found service..."
kubectl -n default rollout status deployment/model-not-found --timeout=120s || true
}

# ── Phased execution ──────────────────────────────────────────────────────
#
# Per-namespace shared resources (Gateway, catch-all HTTPRoute,
@@ -456,8 +456,7 @@ run_phase phase1-base \
install_keda \
install_keda_kaito_scaler \
install_gpu_mocker \
prefetch_bbr_chart \
install_model_not_found
prefetch_bbr_chart

run_phase phase2-istio \
install_istio
19 changes: 3 additions & 16 deletions hack/e2e/scripts/validate-components.sh
@@ -79,22 +79,9 @@ fi
kubectl -n istio-system get pods -l app=body-based-router 2>/dev/null || true
echo ""

# ── Cluster-shared model-not-found backend ──────────────────────────────
# After the modelharness refactor, per-namespace Istio Gateways
# ("<namespace>-gw") are provisioned at test time by EnsureNamespace
# (charts/modelharness), so no `inference-gateway` Gateway pod exists in
# `default` to validate at install time. The only namespace-tier
# component install-components.sh still pre-installs is the
# cluster-shared 404 Service that every workload namespace's catch-all
# HTTPRoute references via a ReferenceGrant — validate that here.
echo "=== model-not-found (cluster-shared 404 backend) ==="
if kubectl -n default wait --for=condition=ready pod -l app=model-not-found --timeout="${TIMEOUT}" >/dev/null 2>&1; then
pass "model-not-found pod is Running"
else
fail "model-not-found pod is NOT Running"
fi
kubectl -n default get pods -l app=model-not-found 2>/dev/null || true
echo ""
# (Catch-all 404 handling is now produced by an EnvoyFilter
# direct_response rendered per-namespace by charts/modelharness — no
# cluster-shared Service exists to validate.)

# ── KEDA ─────────────────────────────────────────────────────────────────
echo "=== KEDA (namespace: ${KEDA_NAMESPACE}, provider: ${E2E_PROVIDER}) ==="
4 changes: 2 additions & 2 deletions test/e2e/README.md
@@ -15,7 +15,7 @@ Single source of truth: [`cases.go`](cases.go) → `CaseDeployments`. Each entry

`Name` is unique cluster-wide and is the value matched by `X-Gateway-Model-Name` (i.e. the `model` field clients send in OpenAI-compatible requests). `Model` is the KAITO preset only — multiple deployments may share a preset under different `Name`s.

Inference tests target the case's **`caseGatewayURL`**. Each case namespace gets its own Gateway, catch-all `model-not-found` route, and (when enabled) API-key auth artifacts via the [`charts/modelharness`](../../charts/modelharness) chart installed by `EnsureNamespace`.
Inference tests target the case's **`caseGatewayURL`**. Each case namespace gets its own Gateway, catch-all `model-not-found-direct` EnvoyFilter (Envoy `direct_response` 404), and (when enabled) API-key auth artifacts via the [`charts/modelharness`](../../charts/modelharness) chart installed by `EnsureNamespace`.

## Helpers

@@ -159,7 +159,7 @@ var GinkgoLabelMyFeature = ginkgo.Label("MyFeature")

### 5. Add per-namespace resources (rare)

If your case needs additional cluster-side resources beyond what the [`charts/modelharness`](../../charts/modelharness) chart already provisions (Gateway, catch-all `model-not-found` Service + HTTPRoute, optional `AuthorizationPolicy` + `APIKey`), add them as templates in `charts/modelharness` so every workload namespace picks them up consistently.
If your case needs additional cluster-side resources beyond what the [`charts/modelharness`](../../charts/modelharness) chart already provisions (Gateway, catch-all `model-not-found-direct` EnvoyFilter, optional `AuthorizationPolicy` + `APIKey`), add them as templates in `charts/modelharness` so every workload namespace picks them up consistently.

### 6. Validate

16 changes: 8 additions & 8 deletions test/e2e/gpu_mocker_test.go
@@ -330,14 +330,14 @@ var _ = Describe("GPU Mocker E2E", Ordered, func() {

Context("Non-existent model request", func() {
It("should return 404 with an OpenAI-compatible error for an unknown model", func() {
// The catch-all model-not-found HTTPRoute is provisioned
// per-namespace by the modelharness chart (installed via
// EnsureNamespace) and forwards unmatched requests across
// namespaces to the cluster-shared `default/model-not-found`
// Service (authorised by a ReferenceGrant). The gpu-mocker
// case has AuthAPIKeyEnabled=false, so no
// AuthorizationPolicy is rendered and the probe needs no
// bearer token.
// The catch-all `model-not-found-direct` EnvoyFilter is
// provisioned per-namespace by the modelharness chart
// (installed via EnsureNamespace) and patches an Envoy
// `direct_response` (status 404 + OpenAI-compatible JSON) onto
// the Gateway's virtual host as a catch-all route. No backend
// Pod / Service is involved. The gpu-mocker case has
// AuthAPIKeyEnabled=false, so no AuthorizationPolicy is
// rendered and the probe needs no bearer token.
resp, err := utils.SendChatCompletion(caseGatewayURL, "non-existent-model-xyz")
Expect(err).NotTo(HaveOccurred())
Expect(resp.StatusCode).To(Equal(http.StatusNotFound))