ci: minimal CI changes for faster tests #13031

howardjohn · 2025-12-02T22:32:20Z

Description

This PR breaks out the improvements in #12993 to focus only on the (hopefully) non-controversial parts, which should have no impact on the test execution at all.

Use a faster helm status check
Avoid -vet during tests; it is slow and already runs elsewhere
Add CGO_ENABLED=0 to avoid recompiling everything
Remove an expensive disk-cleanup step that we don't need (we have plenty of room)
Skip building rustformations and doing expensive apt operations for the agentgateway test
Deploy kind and build images in parallel

For local runs...

Allow skipping test dump (no change to default behavior)
Fix how we detect which cluster to run against, defaulting to the current kubectl context

Comparison (sample size=1)

Test Name	Old Time	New Time	Change	% Change
cluster-multi-install	13m52s	10m25s	-3m27s	-24.9%
cluster-metrics	16m6s	11m39s	-4m27s	-27.6%
api-validation	17m57s	9m59s	-7m58s	-44.4%
cluster-four	20m23s	16m48s	-3m35s	-17.6%
agent-gateway-cluster	21m58s	12m38s	-9m20s	-42.5%
cluster-two	22m59s	20m23s	-2m36s	-11.3%
cluster-five	23m16s	18m20s	-4m56s	-21.2%
cluster-one	23m55s	20m5s	-3m50s	-16.0%
cluster-three	24m31s	19m9s	-5m22s	-21.9%

Change Type

/kind cleanup

Changelog

NONE

Additional Notes

Copilot

Pull request overview

This PR implements several optimizations to reduce CI test execution time by introducing parallel operations, conditional builds, and test configuration options.

Adds a SKIP_DUMP environment variable to disable expensive test state dumping when not needed
Introduces parallel execution in the setup-kind.sh script to run cluster setup concurrently with docker builds
Adds an agentgateway-specific build path that skips the expensive Envoy build for tests that don't require it

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

File	Description
test/testutils/env.go	Adds `SkipDump` constant for new environment variable to control test dumping behavior
test/helpers/kube_dump.go	Implements skip logic for test dumps when `SKIP_DUMP=true`
test/e2e/testutils/cluster/kind.go	Removes fallback kubeCtx logic (empty context now defaults to current kubeconfig context)
test/e2e/tests/base/base_suite.go	Reduces polling interval from 2s to 1s for faster test execution
test/e2e/test.go	Replaces runtime.SetFinalizer with t.Cleanup() for better test cleanup; adds custom releaseExists implementation that queries Kubernetes secrets directly instead of using Helm CLI
hack/kind/setup-kind.sh	Parallelizes cluster setup with docker builds; adds conditional build path for agentgateway tests
cmd/kgateway/Dockerfile.agentgateway	New minimal Dockerfile for agentgateway that excludes Envoy dependencies
Makefile	Removes `-race` flag from e2e tests; removes `-cpu=4` flag; adds build targets for agentgateway image
.github/workflows/e2e.yaml	Adds `agentgateway: 'true'` to agent-gateway-cluster test configuration
.github/actions/setup-kind-cluster/action.yaml	Adds `agentgateway` input parameter to control build path selection
.github/actions/prep-go-runner/action.yaml	Removes expensive apt-get package removal steps; adds timing to mod-download
.github/actions/kubernetes-e2e-tests/action.yaml	Sets `CGO_ENABLED=0` to enable Go cache reuse across test runs

Copilot · 2025-12-02T22:42:30Z

test/e2e/tests/base/base_suite.go

 			LabelSelector: fmt.Sprintf("%s=%s", defaults.WellKnownAppLabel, name),
 			// Provide a longer timeout as the pod needs to be pulled and pass HCs
-		}, time.Second*60, time.Second*2)
+		}, time.Second*60, time.Second)


[nitpick] The polling interval has been reduced from 2 seconds to 1 second. While this may speed up tests that pass quickly, it doubles the polling frequency which could:

Increase load on the Kubernetes API server during tests

Lead to more flaky tests if the system is under load

Potentially hit rate limits in constrained environments

Consider whether the performance gain (potentially saving ~1 second per assertion) justifies the increased API load and potential for flakiness.

Suggested change

-		}, time.Second*60, time.Second)
+		}, time.Second*60, 2*time.Second)

Copilot · 2025-12-02T22:42:31Z

hack/kind/setup-kind.sh

+create_and_setup &
+KIND_PID=$!

 if [[ $SKIP_DOCKER == 'true' ]]; then
  # TODO(tim): refactor the Makefile & CI scripts so we're loading local
  # charts to real helm repos, and then we can remove this block.
  echo "SKIP_DOCKER=true, not building images or chart"
 else
  # 2. Make all the docker images and load them to the kind cluster
-  VERSION=$VERSION CLUSTER_NAME=$CLUSTER_NAME make kind-build-and-load
+  if [[ $AGENTGATEWAY == 'true' ]]; then
+    # Skip expensive envoy build
+    VERSION=$VERSION CLUSTER_NAME=$CLUSTER_NAME make kind-build-and-load-kgateway-agentgateway
+  else
+    VERSION=$VERSION CLUSTER_NAME=$CLUSTER_NAME make kind-build-and-load
+  fi

  # 3. Build the test helm chart, ensuring we have a chart in the `_test` folder
  VERSION=$VERSION make package-kgateway-charts
-
 fi

-# 5. Apply the Kubernetes Gateway API CRDs
-kubectl apply --server-side -f "https://github.com/kubernetes-sigs/gateway-api/releases/download/$CONFORMANCE_VERSION/$CONFORMANCE_CHANNEL-install.yaml"
-
-# 6. Apply the Kubernetes Gateway API Inference Extension CRDs
-kubectl apply --kustomize "https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=$GIE_CRD_VERSION"
-
-# 7. Conformance test setup
-if [[ $CONFORMANCE == "true" ]]; then
-  echo "Running conformance test setup"
-  . $SCRIPT_DIR/setup-metalllb-on-kind.sh
-fi
+wait "$KIND_PID"


The create_and_setup function is run in the background (line 73) but the script has -e flag set. If the background process fails, the script won't immediately exit due to set -e. While wait "$KIND_PID" on line 93 will eventually check the exit status, the docker build steps (lines 82-90) will continue running even if cluster setup fails. This could lead to confusing error messages where docker build succeeds but fails to load images into a non-existent cluster. Consider adding error handling or checking the cluster is ready before starting the builds.

Copilot · 2025-12-02T22:42:31Z

test/e2e/test.go

+func (i *TestInstallation) releaseExists(ctx context.Context, releaseName, namespace string) bool {
+	l := &corev1.SecretList{}
+	if err := i.ClusterContext.Client.List(ctx, l, &client.ListOptions{
+		Namespace: namespace,
+		LabelSelector: labels.SelectorFromSet(map[string]string{
+			"owner": "helm",
+			"name":  releaseName,
+		}),
+	}); err != nil {
+		return false
+	}
+	return len(l.Items) > 0
+}


The new releaseExists implementation queries Kubernetes secrets directly instead of using the Helm CLI. While this may be faster, it's less robust than the original ReleaseExists method because:

It silently returns false on any error (line 325-326), which could mask real cluster connection issues

The label selector assumes Helm v3's storage format, which could break if Helm changes its internal implementation

The original method properly handles namespace scoping and validates the release is actually functional

Consider adding error logging at minimum, or evaluate if the performance gain justifies replacing the official Helm API.

puertomontt · 2025-12-03T17:09:38Z

was removing curl pod controversial?

howardjohn · 2025-12-03T18:21:57Z

> was removing curl pod controversial?

Seems like no after meeting with some folks, but it's at least changing the test behavior. This PR doesn't change any behavior, just makes it faster.

Signed-off-by: John Howard <[email protected]>

jenshu · 2025-12-04T15:21:47Z

test/e2e/test.go

 	// Check if we should skip installation if the release already exists (PERSIST_INSTALL or FAIL_FAST_AND_PERSIST mode)
 	if testutils.ShouldPersistInstall() || testutils.ShouldFailFastAndPersist() {
-		if i.Actions.Helm().ReleaseExists(ctx, helmutils.ChartName, i.Metadata.InstallNamespace) {
+		if i.releaseExists(ctx, helmutils.ChartName, i.Metadata.InstallNamespace) {


can we just update Actions.Helm().ReleaseExists to use the new impl?

The reason I didn't do this is the Helm() action is for running the helm command so it doesn't have access to the cluster client at all. Not sure we want to do that broader change?

hmm, I think we can handle that in a follow-up if needed

jenshu · 2025-12-04T15:22:57Z

.github/actions/setup-kind-cluster/action.yaml

+  agentgateway:
+    required: false
+    default: "false"
+    description: Whether this test only needs agentgateway


"only needs agentgateway" meaning envoy is disabled?

jenshu · 2025-12-04T15:25:24Z

hack/kind/setup-kind.sh

+  # 6. Apply the Kubernetes Gateway API Inference Extension CRDs
+  kubectl apply --kustomize "https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=$GIE_CRD_VERSION"
+
+  . $SCRIPT_DIR/setup-metalllb-on-kind.sh


should this be only if CONFORMANCE=true, like before?

metallb is used for all tests, which all set CONFORMANCE=true. I think it's a legacy thing that is now obsolete.

in that case it would be good to clean up the CONFORMANCE var if it's not used anymore (can be a follow-up)

jenshu · 2025-12-04T15:26:00Z

hack/kind/setup-kind.sh

+  # 5. Apply the Kubernetes Gateway API CRDs
+  kubectl apply --server-side -f "https://github.com/kubernetes-sigs/gateway-api/releases/download/$CONFORMANCE_VERSION/$CONFORMANCE_CHANNEL-install.yaml"
+
+  # 6. Apply the Kubernetes Gateway API Inference Extension CRDs
+  kubectl apply --kustomize "https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=$GIE_CRD_VERSION"


nit: remove the 5. / 6. numbering

Signed-off-by: John Howard <[email protected]>

Copilot AI review requested due to automatic review settings December 2, 2025 22:32

Copilot started reviewing on behalf of howardjohn December 2, 2025 22:32

Copilot finished reviewing on behalf of howardjohn December 2, 2025 22:37

howardjohn force-pushed the agw/fast-e2e-tmp-flat branch from 457863e to 2626c40 Compare December 2, 2025 22:41

Copilot AI reviewed Dec 2, 2025

howardjohn force-pushed the agw/fast-e2e-tmp-flat branch 2 times, most recently from a5b134a to 7d63ce6 Compare December 3, 2025 00:17

howardjohn force-pushed the agw/fast-e2e-tmp-flat branch from 7d63ce6 to 6be97df Compare December 3, 2025 17:20

howardjohn added 3 commits December 3, 2025 10:21

ci: minimal CI changes for faster tests

4981324

Signed-off-by: John Howard <[email protected]>

stop slow post-job

ff85c90

Signed-off-by: John Howard <[email protected]>

fix matrix

c052f25

Signed-off-by: John Howard <[email protected]>

howardjohn assigned timflannagan Dec 3, 2025

howardjohn force-pushed the agw/fast-e2e-tmp-flat branch from 6be97df to c052f25 Compare December 3, 2025 18:23

jenshu reviewed Dec 4, 2025

jenshu approved these changes Dec 4, 2025

howardjohn added this pull request to the merge queue Dec 4, 2025

Merged via the queue into kgateway-dev:main with commit 91bc630 Dec 4, 2025
30 checks passed

howardjohn deleted the agw/fast-e2e-tmp-flat branch December 4, 2025 16:32

howardjohn added a commit to howardjohn/kgateway that referenced this pull request Dec 5, 2025

ci: minimal CI changes for faster tests (kgateway-dev#13031)

ccac7d5

Signed-off-by: John Howard <[email protected]>
