
Improve networkpolicy harness #57

Open
tnsimon wants to merge 9 commits into kaito-project:main from tnsimon:improve-networkpolicy-harness

Conversation

@tnsimon (Collaborator) commented May 12, 2026

Reason for Change:
Enable network policy by default. Remove the NetworkPolicyEnabled struct member and control it via the deployment instead; do the same for allowedNamespaces.

Requirements

  • added unit tests and e2e tests (if applicable).

Issue Fixed:

Notes for Reviewers:

Simon Tien added 3 commits May 12, 2026 11:25
Turn on NetworkPolicyEnabled (with keda + kaito-system as allowed
ingress namespaces, matching the scaling case) in every test case, so
every e2e run exercises the same locked-down dataplane that
production users will deploy.
…ystem

Every workload using the modelharness chart needs to let
keda-kaito-scaler reach vLLM `num_requests_waiting` (for KEDA
autoscaling) and kaito-system reach shadow pods directly (for
gpu-node-mocker / kaito-workspace controller paths that bypass the
apiserver). Bake that into the chart default instead of forcing
every caller to repeat the same allowlist.

Callers that need stricter isolation (e.g. the network-policy e2e
cases that assert label-only cross-namespace spoofs are denied)
remain free to override with an empty list.
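
For reference, here is a minimal Go sketch of the NetworkPolicy shape that chart default implies, using the upstream k8s.io/api types. The function name, the policy object name, and the choice to key the allowlist on the automatic kubernetes.io/metadata.name namespace label are illustrative assumptions, not the chart's actual template.

```go
package harness

import (
	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// allowIngressFromNamespaces builds a policy that denies all ingress to pods
// in workloadNS except traffic from the allowlisted namespaces (the chart
// default described above: keda and kaito-system).
func allowIngressFromNamespaces(workloadNS string, allowed []string) *networkingv1.NetworkPolicy {
	peers := make([]networkingv1.NetworkPolicyPeer, 0, len(allowed))
	for _, ns := range allowed {
		peers = append(peers, networkingv1.NetworkPolicyPeer{
			// kubernetes.io/metadata.name is set automatically on every
			// namespace, so the allowlist can key on names, not custom labels.
			NamespaceSelector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"kubernetes.io/metadata.name": ns},
			},
		})
	}
	// An ingress rule with an empty From list allows ALL traffic, so the
	// empty-allowlist override (stricter isolation) must emit no rule at all,
	// leaving the policy as pure default-deny.
	ingress := []networkingv1.NetworkPolicyIngressRule{}
	if len(peers) > 0 {
		ingress = append(ingress, networkingv1.NetworkPolicyIngressRule{From: peers})
	}
	return &networkingv1.NetworkPolicy{
		ObjectMeta: metav1.ObjectMeta{Name: "allow-ingress-namespaces", Namespace: workloadNS},
		Spec: networkingv1.NetworkPolicySpec{
			PodSelector: metav1.LabelSelector{}, // empty selector: every pod in workloadNS
			PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeIngress},
			Ingress:     ingress,
		},
	}
}
```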
…e toggle

NetworkPolicy lockdown is now on by default in the modelharness chart,
mirroring the production posture every consumer of this repo needs. The
ModelDeploymentValues.NetworkPolicyEnabled bool and the
NetworkPolicyAllowedNamespaces []string are removed — the chart
default (currently keda + kaito-system) covers every workload, and
InstallModelHarness / EnsureNamespace lose the corresponding parameters.

Consumers that want to deviate (stricter or looser allowlist, disabled
policy) can still override via helm --set against the chart values
directly; no test case in this repo currently needs that.
Add kube-system (AKS managed Prometheus ama-metrics, Container Insights
ama-logs) and monitoring (user-deployed Prometheus) to the default
allowedIngressNamespaces so metric scrapers can reach EPP and shadow
pods through the default-deny-ingress NetworkPolicy.
busybox `nc -z` exits non-zero for both "blocked" (silent drop / TIMEOUT)
and "refused" (TCP RST: destination reachable, no listener), so the
deny tests would count a misrouted EPP Service as a pass. Switch to
agnhost — the same probe binary the upstream Kubernetes NetworkPolicy
conformance suite uses — whose documented exit codes distinguish the
two: 0=connected, 1=TIMEOUT, 2=REFUSED, 3=DNS_ERROR, 4=OTHER.

Deny tests now assert state=="blocked" via classifyResult() (sketched
after this commit message) instead of just `!connected`, so a regression
that silently routes the probe to an empty endpoint fails loudly with
state=="refused" rather than passing.

Also bump the per-probe TCP timeout from 3s to 5s to absorb Cilium
identity-propagation lag on 3-node AKS clusters where the source pod
and EPP land on different nodes — the destination node may not have
the source identity programmed into its policy map for several seconds
after the source pod reports Ready (the working hypothesis for the
CI-only "should DENY ingress from workload namespace A to workload
namespace B" failure on PR kaito-project#57).
The kube-system deny test contradicts the modelharness chart's
allowedIngressNamespaces default (which lists kube-system so kubelet
exec, konnectivity-agent, and kube-proxy probes can reach inference
pods). The agnhost connect probe correctly reported state="connected"
EXIT=0 — kube-system genuinely IS allowed, by design.

Flip the assertion to ALLOW and document the chart contract it locks
in. If anyone removes kube-system from the allowlist, this test
will fail with state="blocked" and direct them to values.yaml.

Also confirms the agnhost migration is doing its job: previously this
would have been hidden behind "EXIT!=0 → blocked", but now we see the
real signal.
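
A hypothetical sketch of the flipped test, reusing classifyResult from the sketch above; probeFromKubeSystem and the plain go-test style are illustrative stand-ins for the harness's real helpers.

```go
package harness

import "testing"

// probeFromKubeSystem stands in for the harness step that runs the agnhost
// connect probe from a kube-system pod and returns its exit code.
func probeFromKubeSystem() int { return 0 }

// TestKubeSystemIngressAllowed locks in the chart contract: kube-system is on
// the default allowedIngressNamespaces so kubelet exec, konnectivity-agent,
// and kube-proxy probes can reach inference pods.
func TestKubeSystemIngressAllowed(t *testing.T) {
	if state := classifyResult(probeFromKubeSystem()); state != "connected" {
		t.Fatalf("kube-system ingress should be ALLOWED by the chart default; got state=%q (check allowedIngressNamespaces in values.yaml)", state)
	}
}
```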
