
fix(eks): stabilize UDP NetworkPolicy e2e coverage #2666

Merged
chance-coleman merged 18 commits into main from chance/fix-udp-tests
May 18, 2026

chance-coleman (Contributor) commented May 12, 2026

Description

Problem

The UDP NetworkPolicy E2E test was flaky on EKS for two independent reasons:

  1. Cross-node UDP traffic was not actually allowed by the EKS node security group configuration.
    The EKS test infra only allowed node-to-node UDP/53 by default, and the earlier SG change used the cluster security group as the source instead of a self-referencing node security group rule. That left cross-node UDP/5000 traffic unreliable.

  2. The test itself depended on a one-shot UDP listener started via execInPod.
    The old test launched nc -u -l -p 5000 at assertion time and immediately raced client sends against listener startup and exec WebSocket timing. That made the test flaky even when networking was healthy.
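
To make the first root cause concrete, here is a minimal sketch of how a source-group ingress rule is matched. It is a simplified model, not the AWS API: the group IDs, the port range of the earlier cluster-sourced rule, and the assumption that worker nodes are members of the node security group only are all hypothetical. The model is also deliberately binary, whereas the real symptom described above was unreliable delivery.

```typescript
// Simplified model of security-group ingress matching. NOT the AWS API.
// All IDs and the shape of the "earlier" rule are hypothetical.
type Protocol = "tcp" | "udp";

interface IngressRule {
  protocol: Protocol;
  fromPort: number;
  toPort: number;
  sourceSecurityGroupId: string; // sender must be a member of this group
}

interface Packet {
  protocol: Protocol;
  port: number;
  senderSecurityGroupIds: string[]; // groups attached to the sending node
}

// A packet is allowed when at least one rule matches protocol, port, and source group.
function isAllowed(rules: IngressRule[], packet: Packet): boolean {
  return rules.some(
    (rule) =>
      rule.protocol === packet.protocol &&
      packet.port >= rule.fromPort &&
      packet.port <= rule.toPort &&
      packet.senderSecurityGroupIds.some((id) => id === rule.sourceSecurityGroupId),
  );
}

const NODE_SG = "sg-node"; // hypothetical node security group
const CLUSTER_SG = "sg-cluster"; // hypothetical cluster security group

// Assumption: the sending worker node belongs to the node group only.
const crossNodeUdp: Packet = {
  protocol: "udp",
  port: 5000,
  senderSecurityGroupIds: [NODE_SG],
};

const dnsOnly: IngressRule = {
  protocol: "udp", fromPort: 53, toPort: 53, sourceSecurityGroupId: NODE_SG,
};
const clusterSourced: IngressRule = {
  protocol: "udp", fromPort: 5000, toPort: 5000, sourceSecurityGroupId: CLUSTER_SG,
};
const selfReferencing: IngressRule = {
  protocol: "udp", fromPort: 5000, toPort: 5000, sourceSecurityGroupId: NODE_SG,
};

// Earlier configuration: no rule matches UDP/5000 sent from a node-group member.
const beforeFix = isAllowed([dnsOnly, clusterSourced], crossNodeUdp);
// Self-referencing rule: the source group is the node group itself, so it matches.
const afterFix = isAllowed([dnsOnly, selfReferencing], crossNodeUdp);
```

Under these assumptions the cluster-sourced rule never matches node-to-node traffic, which is why the source group matters more than the port range.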

Fix

This PR fixes both issues with the smallest set of real changes:

  • adds a self-referencing UDP ingress rule to the EKS node security group so cross-node pod-to-pod UDP is allowed
  • changes the UDP server test workload to run a persistent nc loop inside the container
  • updates the UDP E2E test to:
    • fetch pod names in parallel in beforeAll() to avoid EKS hook timeout
    • assert delivery by polling the server log instead of relying on listener stdout
    • send UDP directly to the server pod IP so the test stays focused on NetworkPolicy behavior instead of kube-proxy UDP ClusterIP routing behavior
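
As an illustration of the log-polling idea, here is a minimal sketch. It is not the code from network.spec.ts: `pollLog`, `LogReader`, and `PollOptions` are hypothetical names, and the reader is injected so the sketch runs without a cluster.

```typescript
// Hypothetical helper; names and signature are illustrative only.
type LogReader = () => Promise<string>;

interface PollOptions {
  timeoutMs: number;
  intervalMs: number;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Reads the log repeatedly until `done` accepts it or the timeout elapses,
// then returns the last content that was read.
async function pollLog(
  readLog: LogReader,
  done: (log: string) => boolean,
  { timeoutMs, intervalMs }: PollOptions,
): Promise<string> {
  const deadline = Date.now() + timeoutMs;
  let log = await readLog();
  while (!done(log) && Date.now() < deadline) {
    await sleep(intervalMs);
    log = await readLog();
  }
  return log;
}
```

For an allowed path the predicate would accept a log that contains the expected payload. For a denied path the predicate can always return false, so the helper waits out the full window and the test then asserts that the returned log is still empty.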

Why this approach

This keeps the test focused on what UDS Core owns:

  • UDP allow/deny NetworkPolicy behavior between pods

It avoids unrelated sources of flake:

  • incorrect EKS cross-node UDP security group rules
  • execInPod listener startup timing
  • EKS UDP ClusterIP/DNAT service-path instability

Validation

Validated successfully across CI flavors after these changes.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Other (security config, docs update, etc)

Checklist before merging

chance-coleman self-assigned this May 12, 2026
chance-coleman marked this pull request as ready for review May 14, 2026 14:33
chance-coleman requested a review from a team as a code owner May 14, 2026 14:33
greptile-apps Bot (Contributor) commented May 14, 2026

Greptile Summary

This PR stabilises the UDP NetworkPolicy E2E test on EKS by fixing two independent root causes: missing cross-node UDP SG rules and a flaky one-shot listener pattern. The approach replaces one-shot nc with a persistent log-appending loop, switches assertions to poll the log file, and parallelises beforeAll pod-name fetches to avoid hook timeout on slow EKS API servers.

  • EKS infra (cluster.tf): adds a self-referencing node SG rule for UDP, though the port range (0-65535) is broader than the single test port (5000) that motivated the fix.
  • Test workload (app-curl.yaml): UDP server container now runs a persistent nc loop, eliminating the listen-before-exec race.
  • E2E spec (network.spec.ts): beforeAll pod fetches run in parallel via Promise.all, assertions now read the server log file instead of relying on nc stdout, and direct pod IP is used to avoid kube-proxy DNAT timing issues.
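
The parallel lookup pattern mentioned for beforeAll can be sketched as follows. This is an assumption-laden illustration rather than the spec's actual code: `fetchPodNames`, `PodTarget`, and the injected `PodNameFetcher` are hypothetical stand-ins for whatever Kubernetes lookup the test uses.

```typescript
// Hypothetical stand-in for a per-pod Kubernetes lookup.
type PodNameFetcher = (namespace: string, selector: string) => Promise<string>;

interface PodTarget {
  key: string; // label used by the test to refer to the pod
  namespace: string;
  selector: string;
}

// Starts every lookup immediately and waits for all of them together, so the
// total wait is roughly the slowest lookup instead of the sum of all lookups.
async function fetchPodNames(
  targets: PodTarget[],
  fetchName: PodNameFetcher,
): Promise<Record<string, string>> {
  const pairs = await Promise.all(
    targets.map(async (t) => [t.key, await fetchName(t.namespace, t.selector)] as const),
  );
  const names: Record<string, string> = {};
  for (const [key, name] of pairs) {
    names[key] = name;
  }
  return names;
}
```

Note that Promise.all rejects as soon as any single lookup fails, which is usually the desired behavior in a setup hook.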

Confidence Score: 5/5

Safe to merge — all three changes are scoped to test infrastructure and do not touch production paths

All changes are confined to EKS test infra, a test-only workload manifest, and the E2E spec file. The logic changes are well-reasoned and validated in CI. Two minor observations exist — an overly broad SG port range and an early-exit condition in waitForUdpLog that is never reached — but neither causes test failures or behavioral regressions.

.github/test-infra/aws/eks/cluster.tf (UDP SG port range) and test/vitest/network.spec.ts (waitForUdpLog early-exit logic)

Important Files Changed

Filename | Overview
.github/test-infra/aws/eks/cluster.tf | Adds a self-referencing UDP ingress rule to the node security group; opens ports 0-65535 when only port 5000 is needed for the test
src/test/app-curl.yaml | Replaces the idle sleep command with a persistent nc loop that appends received UDP data to /tmp/udp.log, enabling log-based assertion
test/vitest/network.spec.ts | Parallelises beforeAll pod-name fetches, rewrites the UDP test to poll the server log file via pod IP, and adds clearUdpLog/readUdpLog/waitForUdpLog helpers; waitForUdpLog early-exit condition is dead code for the allowed path

Sequence Diagram

sequenceDiagram
    participant Test as E2E Test
    participant Client as udp-echo-client pod
    participant Server as udp-echo-server pod
    participant Log as /tmp/udp.log

    Note over Server,Log: Container runs persistent nc loop
    Server->>Log: nc -u -l -p 5000 -w 1 loop

    Test->>Server: clearUdpLog truncate log
    Test->>Client: execInPod send 3 pings to serverIP port 5000
    Client->>Server: UDP ping x3
    Server->>Log: append ping per nc session

    loop poll every 250ms up to 5s
        Test->>Log: readUdpLog
        Log-->>Test: log content
    end

    Test->>Test: expectUdpPingLog all lines equal ping

    Note over Test: Denied path
    Test->>Server: clearUdpLog
    Test->>Client: execInPod from deny-all namespace
    Client--xServer: blocked by NetworkPolicy

    loop poll every 250ms up to 2s
        Test->>Log: readUdpLog
        Log-->>Test: empty string
    end

    Test->>Test: expect deniedLog toBe empty
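
The two final checks in the diagram (every line equals ping on the allowed path, empty log on the denied path) could be expressed with small pure helpers like the ones below. This is a sketch under assumptions: `udpLogLines`, `isAllPings`, and `isEmptyLog` are hypothetical names, and the log is assumed to hold one payload per line.

```typescript
// Hypothetical pure helpers; the real spec may structure its assertions differently.

// Splits raw log content into trimmed, non-empty lines.
function udpLogLines(log: string): string[] {
  return log
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Allowed path: at least one line arrived and every line is exactly "ping".
function isAllPings(log: string): boolean {
  const lines = udpLogLines(log);
  return lines.length > 0 && lines.every((line) => line === "ping");
}

// Denied path: nothing was delivered at all.
function isEmptyLog(log: string): boolean {
  return udpLogLines(log).length === 0;
}
```

Requiring at least one line keeps an empty log from passing the allowed-path check vacuously.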

Reviews (2): Last reviewed commit: "final cleanup"

Comment thread test/vitest/network.spec.ts Outdated
Comment thread test/vitest/network.spec.ts Outdated
chance-coleman changed the title from "fix(tests): eliminate UDP NetworkPolicy test race condition on EKS" to "fix(eks): stabilize UDP NetworkPolicy e2e coverage" May 15, 2026
chance-coleman (Contributor, Author) commented:

@greptileai review this PR

briantwatson previously approved these changes May 18, 2026
joelmccoy previously approved these changes May 18, 2026
Comment thread test/vitest/network.spec.ts Outdated
Comment thread .github/test-infra/aws/eks/cluster.tf Outdated

jasonwashburn (Contributor) left a comment:

One nit and a copyright update, otherwise everything looks good to me!

chance-coleman dismissed stale reviews from joelmccoy and briantwatson via e50b377 May 18, 2026 14:31

jasonwashburn (Contributor) left a comment:

Looks good (pending CI)

chance-coleman merged commit 3d45af4 into main May 18, 2026
34 of 41 checks passed
chance-coleman deleted the chance/fix-udp-tests branch May 18, 2026 15:38
jasonwashburn pushed a commit that referenced this pull request May 27, 2026
🤖 I have created a release *beep* *boop*
---


## [1.5.0](v1.4.0...v1.5.0) (2026-05-26)


### Bug Fixes

* avoid virtual threads in Keycloak ([#2686](#2686)) ([e07ddb2](e07ddb2))
* broken grafana tests ([#2696](#2696)) ([202c8ac](202c8ac))
* **eks:** stabilize UDP NetworkPolicy e2e coverage ([#2666](#2666)) ([3d45af4](3d45af4))


### Miscellaneous

* add 1.5.0 release notes ([#2700](#2700)) ([197dc46](197dc46))
* **ci:** add test to verify loki able to flush to s3 ([#2673](#2673)) ([4783ffb](4783ffb))
* **deps:** migrate unicorn flavor images from RapidFort to Chainguard ([#2650](#2650)) ([b0d4c87](b0d4c87))
* **deps:** update grafana ([#2584](#2584)) ([f07a6a7](f07a6a7))
* **deps:** update grafana to v2.7.3 ([#2691](#2691)) ([0aaf351](0aaf351))
* **deps:** update iac support dependencies to v2.0.1 ([#2677](#2677)) ([40cf6a6](40cf6a6))
* **deps:** update iac-support-deps ([#2670](#2670)) ([ab1b90d](ab1b90d))
* **deps:** update loki ([#2586](#2586)) ([396bb53](396bb53))
* **deps:** update loki to v2.7.3 ([#2690](#2690)) ([6b773ed](6b773ed))
* **deps:** update prometheus-stack ([#2644](#2644)) ([1bfbfaf](1bfbfaf))
* **deps:** update prometheus-stack ([#2684](#2684)) ([1fae685](1fae685))
* **deps:** update prometheus-stack ([#2687](#2687)) ([ceab924](ceab924))
* **deps:** update support-deps ([#2683](#2683)) ([f725d10](f725d10))
* **deps:** update support-deps ([#2689](#2689)) ([83622c3](83622c3))
* **deps:** update velero ([#2678](#2678)) ([70f0106](70f0106))
* **docs:** add legacy upgrade notes and local demo deploy warning ([#2667](#2667)) ([ded7c08](ded7c08))
* updating cert bundle ([#2675](#2675)) ([7da8b6c](7da8b6c))


### Documentation

* add time-sync prereqs callout in docs ([#2679](#2679)) ([3d45a2c](3d45a2c))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

4 participants