
Improve networkpolicy harness #57

Open
tnsimon wants to merge 9 commits into kaito-project:main from tnsimon:improve-networkpolicy-harness

Conversation

@tnsimon (Collaborator) commented May 12, 2026

Reason for Change:
Enable network policy by default. Remove the NetworkPolicyEnabled struct member and control it via the deployment instead; do the same for allowedNamespaces.

Requirements

  • added unit tests and e2e tests (if applicable).

Issue Fixed:

Notes for Reviewers:

Simon Tien added 3 commits May 12, 2026 11:25
Turn on NetworkPolicyEnabled (with keda + kaito-system as allowed
ingress namespaces, matching the scaling case) in every test case, so
every e2e run exercises the same locked-down dataplane that
production users will deploy.
…ystem

Every workload using the modelharness chart needs to let
keda-kaito-scaler reach vLLM `num_requests_waiting` (for KEDA
autoscaling) and kaito-system reach shadow pods directly (for
gpu-node-mocker / kaito-workspace controller paths that bypass the
apiserver). Bake that into the chart default instead of forcing
every caller to repeat the same allowlist.

Callers that need stricter isolation (e.g. the network-policy e2e
cases that assert label-only cross-namespace spoofs are denied)
remain free to override with an empty list.
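
For reference, here is a minimal Go sketch of the NetworkPolicy shape that chart default implies, using the upstream k8s.io/api types. The function name, the policy object name, and the choice to key the allowlist on the automatic kubernetes.io/metadata.name namespace label are illustrative assumptions, not the chart's actual template.

```go
package harness

import (
	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// allowIngressFromNamespaces builds a policy that denies all ingress to pods
// in workloadNS except traffic from the allowlisted namespaces (the chart
// default described above: keda and kaito-system).
func allowIngressFromNamespaces(workloadNS string, allowed []string) *networkingv1.NetworkPolicy {
	peers := make([]networkingv1.NetworkPolicyPeer, 0, len(allowed))
	for _, ns := range allowed {
		peers = append(peers, networkingv1.NetworkPolicyPeer{
			// kubernetes.io/metadata.name is set automatically on every
			// namespace, so the allowlist can key on names, not custom labels.
			NamespaceSelector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"kubernetes.io/metadata.name": ns},
			},
		})
	}
	// An ingress rule with an empty From list allows ALL traffic, so the
	// empty-allowlist override (stricter isolation) must emit no rule at all,
	// leaving the policy as pure default-deny.
	ingress := []networkingv1.NetworkPolicyIngressRule{}
	if len(peers) > 0 {
		ingress = append(ingress, networkingv1.NetworkPolicyIngressRule{From: peers})
	}
	return &networkingv1.NetworkPolicy{
		ObjectMeta: metav1.ObjectMeta{Name: "allow-ingress-namespaces", Namespace: workloadNS},
		Spec: networkingv1.NetworkPolicySpec{
			PodSelector: metav1.LabelSelector{}, // empty selector: every pod in workloadNS
			PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeIngress},
			Ingress:     ingress,
		},
	}
}
```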
…e toggle

NetworkPolicy lockdown is now on by default in the modelharness chart,
mirroring the production posture every consumer of this repo needs. The
ModelDeploymentValues.NetworkPolicyEnabled bool and the
NetworkPolicyAllowedNamespaces []string are removed — the chart
default (currently keda + kaito-system) covers every workload, and
InstallModelHarness / EnsureNamespace lose the corresponding parameters.

Consumers that want to deviate (stricter or looser allowlist, disabled
policy) can still override via helm --set against the chart values
directly; no test case in this repo currently needs that.
Add kube-system (AKS managed Prometheus ama-metrics, Container Insights
ama-logs) and monitoring (user-deployed Prometheus) to the default
allowedIngressNamespaces so metric scrapers can reach EPP and shadow
pods through the default-deny-ingress NetworkPolicy.
busybox `nc -z` exits non-zero for both "blocked" (silent drop / TIMEOUT)
and "refused" (TCP RST: destination reachable, no listener), so the
deny tests would count a misrouted EPP Service as a pass. Switch to
agnhost — the same probe binary the upstream Kubernetes NetworkPolicy
conformance suite uses — whose documented exit codes distinguish the
two: 0=connected, 1=TIMEOUT, 2=REFUSED, 3=DNS_ERROR, 4=OTHER.

Deny tests now assert state=="blocked" via classifyResult() (sketched
after this commit message) instead of just `!connected`, so a regression
that silently routes the probe to an empty endpoint fails loudly with
state=="refused" rather than passing.

Also bump the per-probe TCP timeout from 3s to 5s to absorb Cilium
identity-propagation lag on 3-node AKS clusters where the source pod
and EPP land on different nodes — the destination node may not have
the source identity programmed into its policy map for several seconds
after the source pod reports Ready (the working hypothesis for the
CI-only "should DENY ingress from workload namespace A to workload
namespace B" failure on PR kaito-project#57).
The kube-system deny test contradicts the modelharness chart's
allowedIngressNamespaces default (which lists kube-system so kubelet
exec, konnectivity-agent, and kube-proxy probes can reach inference
pods). The agnhost connect probe correctly reported state="connected"
EXIT=0 — kube-system genuinely IS allowed, by design.

Flip the assertion to ALLOW and document the chart contract it locks
in. If anyone removes kube-system from the allowlist, this test
will fail with state="blocked" and direct them to values.yaml.

Also confirms the agnhost migration is doing its job: previously this
would have been hidden behind "EXIT!=0 → blocked", but now we see the
real signal.
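
A hypothetical sketch of the flipped test, reusing classifyResult from the sketch above; probeFromKubeSystem and the plain go-test style are illustrative stand-ins for the harness's real helpers.

```go
package harness

import "testing"

// probeFromKubeSystem stands in for the harness step that runs the agnhost
// connect probe from a kube-system pod and returns its exit code.
func probeFromKubeSystem() int { return 0 }

// TestKubeSystemIngressAllowed locks in the chart contract: kube-system is on
// the default allowedIngressNamespaces so kubelet exec, konnectivity-agent,
// and kube-proxy probes can reach inference pods.
func TestKubeSystemIngressAllowed(t *testing.T) {
	if state := classifyResult(probeFromKubeSystem()); state != "connected" {
		t.Fatalf("kube-system ingress should be ALLOWED by the chart default; got state=%q (check allowedIngressNamespaces in values.yaml)", state)
	}
}
```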
