Merged
1 change: 0 additions & 1 deletion README.md
@@ -54,7 +54,6 @@ namespaces and are shared by every model deployment:
| BBR (Body-Based Router) | `istio-system` | `BBR_VERSION` (v1.3.1) | helm | Installed in Istio's rootNamespace so its EnvoyFilter applies cluster-wide; injects `X-Gateway-Model-Name`. |
| `llm-gateway-auth` ([`kaito-project/llm-gateway-auth`](https://github.com/kaito-project/llm-gateway-auth)) | `llm-gateway-auth` | `LLM_GATEWAY_AUTH_VERSION` | helm | API-key ext_authz for the `inference-gateway`. Installs the `APIKey` CRD, the `apikey-operator` (reconciles `APIKey` → per-namespace Secret), and the `apikey-authz` ext_authz dataplane wired into Istio via `MeshConfig` + `AuthorizationPolicy`. |
| KEDA + KEDA Kaito Scaler ([`kaito-project/keda-kaito-scaler`](https://github.com/kaito-project/keda-kaito-scaler), optional) | `keda` | `KEDA_VERSION` (v2.19.0), `KEDA_KAITO_SCALER_VERSION` (v0.4.1) | helm | Workload-metric autoscaling. |
| `model-not-found` (Deployment + ConfigMap + Service) | `default` | repo `HEAD` ([`hack/e2e/manifests/model-not-found.yaml`](hack/e2e/manifests/model-not-found.yaml)) | kubectl | Cluster-shared nginx-backed Service that returns OpenAI-compatible `404 model_not_found` JSON. Referenced cross-namespace by every workload namespace's catch-all `HTTPRoute` (authorised via a `ReferenceGrant` rendered by `charts/modelharness`). |

### Step 2. modelharness (one-time per workload namespace)

62 changes: 62 additions & 0 deletions charts/modelharness/templates/envoyfilter-not-found.yaml
@@ -0,0 +1,62 @@
{{/*
Catch-all "model not found" responder, implemented as an Envoy
direct_response on the per-namespace Gateway. Replaces the previous
HTTPRoute → cluster-shared `model-not-found` Service design.

Why an EnvoyFilter direct_response instead of a backed HTTPRoute:
- Zero backend Pod / Service / cross-namespace ReferenceGrant.
- Response body is generated by Envoy itself (no extra hop).

Why a catch-all is REQUIRED (and not just a UX nicety):
Istio's CUSTOM AuthorizationPolicy is implemented as a paired
`envoy.filters.http.rbac` (shadow) + `envoy.filters.http.ext_authz`
filter — ext_authz is gated on metadata that the RBAC shadow filter
writes during decodeHeaders. When Envoy's router fails to match any
HTTPRoute it returns a local 404 BEFORE the RBAC shadow has finished
evaluating + writing that metadata, which means ext_authz is never
invoked and unknown-model requests SILENTLY BYPASS API-key auth.
Keeping a catch-all route that always matches preserves the full
filter-chain run and ensures auth runs on every request, regardless
of model name. Removing this template re-opens that bypass.

The patch is anchored to BBR's filter name as a `subFilter` so it
attaches to the same HCM that `install_bbr` injects BBR into. The
`workloadSelector` scopes it to this namespace's Gateway pod only.
*/}}
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
name: model-not-found-direct
namespace: {{ include "modelharness.namespace" . }}
labels:
{{- include "modelharness.labels" . | nindent 4 }}
spec:
workloadSelector:
labels:
gateway.networking.k8s.io/gateway-name: {{ include "modelharness.gatewayName" . | quote }}
configPatches:
- applyTo: VIRTUAL_HOST
match:
context: GATEWAY
routeConfiguration:
vhost:
name: ""
patch:
operation: MERGE
value:
routes:
# Appended last; deployment-specific HTTPRoute matches on
# X-Gateway-Model-Name win first, this rule catches the rest.
- name: model-not-found-fallback
match:
prefix: /
direct_response:
status: 404
body:
inline_string: |
{"error":{"message":"The model does not exist.","type":"invalid_request_error","param":"model","code":"model_not_found"}}
response_headers_to_add:
- header:
key: content-type
value: application/json
append_action: OVERWRITE_IF_EXISTS_OR_ADD
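As a quick local sanity check of the `inline_string` above (no cluster or gateway required), the body can be parsed to confirm it is valid JSON in the OpenAI-compatible error shape clients expect:

```python
import json

# The exact inline_string the EnvoyFilter's direct_response returns for
# unmatched models. Parsing it locally verifies it is well-formed JSON
# with the OpenAI-compatible error envelope.
body = '{"error":{"message":"The model does not exist.","type":"invalid_request_error","param":"model","code":"model_not_found"}}'

err = json.loads(body)["error"]
assert err["code"] == "model_not_found"
assert err["type"] == "invalid_request_error"
assert err["param"] == "model"
print("inline_string is valid OpenAI-compatible error JSON")
```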
34 changes: 0 additions & 34 deletions charts/modelharness/templates/httproute-not-found.yaml

This file was deleted.

24 changes: 0 additions & 24 deletions charts/modelharness/templates/referencegrant.yaml

This file was deleted.

16 changes: 7 additions & 9 deletions charts/modelharness/values.yaml
@@ -19,15 +19,13 @@ gatewayName: ""
# gatewayPort is the HTTP listener port on the Gateway.
gatewayPort: 80

# modelNotFound configures the cross-namespace reference to the
# cluster-shared model-not-found Service that the catch-all HTTPRoute
# forwards unmatched requests to. The Service itself is installed once
# per cluster in `modelNotFound.namespace` (typically `default`) by the
# E2E install script — this chart only renders the catch-all HTTPRoute
# and the ReferenceGrant authorising the cross-namespace backendRef.
modelNotFound:
namespace: "default"
serviceName: "model-not-found"
# Catch-all "model not found" responses are now produced by an Envoy
# direct_response patched onto the Gateway's HCM via the
# `model-not-found-direct` EnvoyFilter (see
# templates/envoyfilter-not-found.yaml). No backend Pod / Service /
# ReferenceGrant is required, so the previous `modelNotFound` config
# (which pointed at a cluster-shared `default/model-not-found` Service)
# has been removed.

# auth toggles the per-namespace API-key authentication artifacts. When
# enabled, the chart renders:
58 changes: 0 additions & 58 deletions hack/e2e/manifests/model-not-found.yaml

This file was deleted.

37 changes: 18 additions & 19 deletions hack/e2e/scripts/install-components.sh
@@ -21,9 +21,11 @@
# CRD is not yet served, so kubelet retries
# until KAITO finishes installing it)
# - BBR chart prefetch (git clone fork repo only)
# - Cluster-shared model-not-found Service in `default` (consumed by
# every workload namespace's catch-all HTTPRoute via a
# ReferenceGrant rendered by charts/modelharness).
#
# (Catch-all 404 handling is now provided by an EnvoyFilter
# direct_response rendered per-namespace by charts/modelharness — no
# cluster-shared Service is required, so install_model_not_found has
# been removed from this script.)
#
# Phase 2 (parallel, depends on Phase 1):
# - Istio (after Gateway API CRDs)
@@ -52,7 +54,6 @@
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
MANIFESTS_DIR="${SCRIPT_DIR}/../manifests"

# Validate required version variables are set.
: "${ISTIO_VERSION:?ISTIO_VERSION is not set. Source versions.env or export it before calling this script.}"
@@ -265,8 +266,19 @@ install_gateway_api_crds() {
}

install_gwie_crds() {
# Use server-side apply (--server-side --force-conflicts) instead of the
# default client-side apply. install_gwie_crds runs in parallel with
# install_kaito in phase1-base, and the KAITO chart bundles the same
# GWIE CRDs (inferencepools / inferenceobjectives in both
# inference.networking.k8s.io and inference.networking.x-k8s.io groups).
# Client-side apply does GET → CREATE-if-missing, which races with KAITO
# creating the CRD between the GET and the CREATE and fails with
# `AlreadyExists`. Server-side apply is a single atomic PATCH with a
# field manager: if the object already exists it is merged in place
# (with --force-conflicts taking ownership of any fields KAITO set).
echo "=== Installing GWIE CRDs ==="
kubectl apply -f "https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/latest/download/manifests.yaml"
kubectl apply --server-side --force-conflicts \
-f "https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/latest/download/manifests.yaml"
}
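The race described in the comment above can be illustrated outside Kubernetes. A minimal sketch (function names are illustrative, not kubectl's implementation) contrasting check-then-create with an idempotent upsert:

```python
def client_side_create(store, name, obj, between=lambda: None):
    """GET -> CREATE-if-missing; `between` stands in for the window in
    which a parallel installer (e.g. KAITO) may create the same object."""
    if name not in store:        # GET: object absent
        between()                # ...the other installer creates it here...
        if name in store:
            raise RuntimeError("AlreadyExists")  # CREATE conflicts
        store[name] = obj

def server_side_apply(store, name, obj):
    """Single idempotent upsert keyed by a field manager: an existing
    object is merged in place instead of failing."""
    store[name] = {**store.get(name, {}), **obj}

store = {}
kaito_wins = lambda: store.setdefault("inferencepools", {"owner": "kaito"})

try:
    client_side_create(store, "inferencepools", {"owner": "gwie"}, between=kaito_wins)
except RuntimeError as e:
    print("client-side apply:", e)   # AlreadyExists

server_side_apply(store, "inferencepools", {"owner": "gwie"})
print("server-side apply merged:", store["inferencepools"])
```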

install_keda() {
@@ -430,18 +442,6 @@ install_llm_gateway_auth() {
kubectl -n llm-gateway-auth rollout status deployment/apikey-authz --timeout=180s || true
}

install_model_not_found() {
# Cluster-shared catch-all 404 Service in `default`. Every workload
# namespace's modelharness release renders a catch-all HTTPRoute that
# forwards unmatched requests to this Service across namespaces,
# authorised by a per-namespace ReferenceGrant.
echo "=== Deploying cluster-shared model-not-found Service in default ==="
kubectl apply -f "${MANIFESTS_DIR}/model-not-found.yaml"

echo "⏳ Waiting for model-not-found service..."
kubectl -n default rollout status deployment/model-not-found --timeout=120s || true
}

# ── Phased execution ──────────────────────────────────────────────────────
#
# Per-namespace shared resources (Gateway, catch-all HTTPRoute,
@@ -456,8 +456,7 @@ run_phase phase1-base \
install_keda \
install_keda_kaito_scaler \
install_gpu_mocker \
prefetch_bbr_chart \
install_model_not_found
prefetch_bbr_chart

run_phase phase2-istio \
install_istio
19 changes: 3 additions & 16 deletions hack/e2e/scripts/validate-components.sh
@@ -79,22 +79,9 @@ fi
kubectl -n istio-system get pods -l app=body-based-router 2>/dev/null || true
echo ""

# ── Cluster-shared model-not-found backend ──────────────────────────────
# After the modelharness refactor, per-namespace Istio Gateways
# ("<namespace>-gw") are provisioned at test time by EnsureNamespace
# (charts/modelharness), so no `inference-gateway` Gateway pod exists in
# `default` to validate at install time. The only namespace-tier
# component install-components.sh still pre-installs is the
# cluster-shared 404 Service that every workload namespace's catch-all
# HTTPRoute references via a ReferenceGrant — validate that here.
echo "=== model-not-found (cluster-shared 404 backend) ==="
if kubectl -n default wait --for=condition=ready pod -l app=model-not-found --timeout="${TIMEOUT}" >/dev/null 2>&1; then
pass "model-not-found pod is Running"
else
fail "model-not-found pod is NOT Running"
fi
kubectl -n default get pods -l app=model-not-found 2>/dev/null || true
echo ""
# (Catch-all 404 handling is now produced by an EnvoyFilter
# direct_response rendered per-namespace by charts/modelharness — no
# cluster-shared Service exists to validate.)

# ── KEDA ─────────────────────────────────────────────────────────────────
echo "=== KEDA (namespace: ${KEDA_NAMESPACE}, provider: ${E2E_PROVIDER}) ==="
4 changes: 2 additions & 2 deletions test/e2e/README.md
@@ -15,7 +15,7 @@ Single source of truth: [`cases.go`](cases.go) → `CaseDeployments`. Each entry

`Name` is unique cluster-wide and is the value matched by `X-Gateway-Model-Name` (i.e. the `model` field clients send in OpenAI-compatible requests). `Model` is the KAITO preset only — multiple deployments may share a preset under different `Name`s.

Inference tests target the case's **`caseGatewayURL`**. Each case namespace gets its own Gateway, catch-all `model-not-found` route, and (when enabled) API-key auth artifacts via the [`charts/modelharness`](../../charts/modelharness) chart installed by `EnsureNamespace`.
Inference tests target the case's **`caseGatewayURL`**. Each case namespace gets its own Gateway, catch-all `model-not-found-direct` EnvoyFilter (Envoy `direct_response` 404), and (when enabled) API-key auth artifacts via the [`charts/modelharness`](../../charts/modelharness) chart installed by `EnsureNamespace`.

## Helpers

@@ -159,7 +159,7 @@ var GinkgoLabelMyFeature = ginkgo.Label("MyFeature")

### 5. Add per-namespace resources (rare)

If your case needs additional cluster-side resources beyond what the [`charts/modelharness`](../../charts/modelharness) chart already provisions (Gateway, catch-all `model-not-found` Service + HTTPRoute, optional `AuthorizationPolicy` + `APIKey`), add them as templates in `charts/modelharness` so every workload namespace picks them up consistently.
If your case needs additional cluster-side resources beyond what the [`charts/modelharness`](../../charts/modelharness) chart already provisions (Gateway, catch-all `model-not-found-direct` EnvoyFilter, optional `AuthorizationPolicy` + `APIKey`), add them as templates in `charts/modelharness` so every workload namespace picks them up consistently.

### 6. Validate

16 changes: 8 additions & 8 deletions test/e2e/gpu_mocker_test.go
@@ -330,14 +330,14 @@ var _ = Describe("GPU Mocker E2E", Ordered, func() {

Context("Non-existent model request", func() {
It("should return 404 with an OpenAI-compatible error for an unknown model", func() {
// The catch-all model-not-found HTTPRoute is provisioned
// per-namespace by the modelharness chart (installed via
// EnsureNamespace) and forwards unmatched requests across
// namespaces to the cluster-shared `default/model-not-found`
// Service (authorised by a ReferenceGrant). The gpu-mocker
// case has AuthAPIKeyEnabled=false, so no
// AuthorizationPolicy is rendered and the probe needs no
// bearer token.
// The catch-all `model-not-found-direct` EnvoyFilter is
// provisioned per-namespace by the modelharness chart
// (installed via EnsureNamespace) and patches an Envoy
// `direct_response` (status 404 + OpenAI-compatible JSON) onto
// the Gateway's virtual host as a catch-all route. No backend
// Pod / Service is involved. The gpu-mocker case has
// AuthAPIKeyEnabled=false, so no AuthorizationPolicy is
// rendered and the probe needs no bearer token.
resp, err := utils.SendChatCompletion(caseGatewayURL, "non-existent-model-xyz")
Expect(err).NotTo(HaveOccurred())
Expect(resp.StatusCode).To(Equal(http.StatusNotFound))