Improve networkpolicy harness#57
Open
tnsimon wants to merge 9 commits into
Open
Conversation
added 3 commits
May 12, 2026 11:25
Turn on NetworkPolicyEnabled (with keda + kaito-system as allowed ingress namespaces, matching the scaling case) on every case so every e2e run exercises the same locked-down dataplane that production users will deploy.
…ystem Every workload using the modelharness chart needs to let keda-kaito-scaler reach vLLM `num_requests_waiting` (for KEDA autoscaling) and kaito-system reach shadow pods directly (for gpu-node-mocker / kaito-workspace controller paths that bypass the apiserver). Bake that into the chart default instead of forcing every caller to repeat the same allowlist. Callers that need stricter isolation (e.g. the network-policy e2e cases that assert label-only cross-namespace spoofs are denied) remain free to override with an empty list.
…e toggle NetworkPolicy lockdown is now on by default in the modelharness chart, mirroring the production posture every consumer of this repo needs. The ModelDeploymentValues.NetworkPolicyEnabled bool and the NetworkPolicyAllowedNamespaces []string are removed — the chart default (currently keda + kaito-system) covers every workload, and InstallModelHarness / EnsureNamespace lose the corresponding parameters. Consumers that want to deviate (stricter or looser allowlist, disabled policy) can still override via helm --set against the chart values directly; no test case in this repo currently needs that.
Add kube-system (AKS managed Prometheus ama-metrics, Container Insights ama-logs) and monitoring (user-deployed Prometheus) to the default allowedIngressNamespaces so metric scrapers can reach EPP and shadow pods through the default-deny-ingress NetworkPolicy.
busybox `nc -z` exits non-zero for both "blocked" (silent drop / TIMEOUT) and "refused" (TCP RST — destination reachable, no listener), so the deny tests treat a misrouted EPP Service as a passing test. Switch to agnhost — the same probe binary the upstream Kubernetes NetworkPolicy conformance suite uses — whose documented exit codes distinguish the two: 0=connected, 1=TIMEOUT, 2=REFUSED, 3=DNS_ERROR, 4=OTHER. Deny tests now assert state=="blocked" via classifyResult() instead of just `!connected`, so a regression that silently routes the probe to an empty endpoint fails loud with state=="refused" rather than passing. Also bump the per-probe TCP timeout from 3s to 5s to absorb Cilium identity-propagation lag on 3-node AKS clusters where the source pod and EPP land on different nodes — the destination node may not have the source identity programmed into its policy map for several seconds after the source pod reports Ready (the working hypothesis for the CI-only "should DENY ingress from workload namespace A to workload namespace B" failure on PR kaito-project#57).
The kube-system deny test contradicts the modelharness chart's allowedIngressNamespaces default (which lists kube-system so kubelet exec, konnectivity-agent, and kube-proxy probes can reach inference pods). The agnhost connect probe correctly reported state="connected" EXIT=0 — kube-system genuinely IS allowed, by design. Flip the assertion to ALLOW and document the chart contract it locks in. If anyone removes kube-system from the allowlist, this test will fail with state="blocked" and direct them to values.yaml. Also confirms the agnhost migration is doing its job: previously this would have been hidden behind "EXIT!=0 → blocked", but now we see the real signal.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Reason for Change:
Enable network policy by default. Remove the struct member and control it via the deployment. Repeat the same for allowedNamespaces
Requirements
Issue Fixed:
Notes for Reviewers: