Skip to content

Conversation

@howardjohn
Copy link
Contributor

@howardjohn howardjohn commented Dec 2, 2025

Description

This PR breaks out the improvements in #12993 to focus only on the (hopefully) non-controversial parts, which should have no impact on the test execution at all.

  • Use a faster helm status check
  • Avoid -vet which is slow and already runs
  • Add CGO_ENABLED=0 to avoid recompiling everything
  • Remove an expensive disk-cleanup step that we don't need (we have plenty of room)
  • Skip building rustformations and doing expensive apt operations for agentgateway test
  • Deploy kind and build images in parallel

For local runs...

  • Allow skipping test dump (no change to default behavior)
  • Fix how we detect which cluster to run to default to the current kubectl context

Comparison (sample size=1)

Test Name Old Time New Time Change % Change
cluster-multi-install 13m52s 10m25s -3m27s -24.9%
cluster-metrics 16m6s 11m39s -4m27s -27.6%
api-validation 17m57s 9m59s -7m58s -44.4%
cluster-four 20m23s 16m48s -3m35s -17.6%
agent-gateway-cluster 21m58s 12m38s -9m20s -56.8%
cluster-two 22m59s 20m23s -2m36s -11.3%
cluster-five 23m16s 18m20s -4m56s -21.2%
cluster-one 23m55s 20m5s -3m50s -16.0%
cluster-three 24m31s 19m9s -5m22s -21.9%

Change Type

/kind cleanup

Changelog

NONE

Additional Notes

Copilot AI review requested due to automatic review settings December 2, 2025 22:32
@gateway-bot gateway-bot added do-not-merge/description-invalid do-not-merge/kind-invalid Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/release-note-invalid Indicates that a PR should not merge because it's missing one of the release note labels. and removed do-not-merge/description-invalid labels Dec 2, 2025
@howardjohn howardjohn force-pushed the agw/fast-e2e-tmp-flat branch from 457863e to 2626c40 Compare December 2, 2025 22:41
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements several optimizations to reduce CI test execution time by introducing parallel operations, conditional builds, and test configuration options.

  • Adds a SKIP_DUMP environment variable to disable expensive test state dumping when not needed
  • Introduces parallel execution in the setup-kind.sh script to run cluster setup concurrently with docker builds
  • Adds an agentgateway-specific build path that skips the expensive Envoy build for tests that don't require it

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
test/testutils/env.go Adds SkipDump constant for new environment variable to control test dumping behavior
test/helpers/kube_dump.go Implements skip logic for test dumps when SKIP_DUMP=true
test/e2e/testutils/cluster/kind.go Removes fallback kubeCtx logic (empty context now defaults to current kubeconfig context)
test/e2e/tests/base/base_suite.go Reduces polling interval from 2s to 1s for faster test execution
test/e2e/test.go Replaces runtime.SetFinalizer with t.Cleanup() for better test cleanup; adds custom releaseExists implementation that queries Kubernetes secrets directly instead of using Helm CLI
hack/kind/setup-kind.sh Parallelizes cluster setup with docker builds; adds conditional build path for agentgateway tests
cmd/kgateway/Dockerfile.agentgateway New minimal Dockerfile for agentgateway that excludes Envoy dependencies
Makefile Removes -race flag from e2e tests; removes -cpu=4 flag; adds build targets for agentgateway image
.github/workflows/e2e.yaml Adds agentgateway: 'true' to agent-gateway-cluster test configuration
.github/actions/setup-kind-cluster/action.yaml Adds agentgateway input parameter to control build path selection
.github/actions/prep-go-runner/action.yaml Removes expensive apt-get package removal steps; adds timing to mod-download
.github/actions/kubernetes-e2e-tests/action.yaml Sets CGO_ENABLED=0 to enable Go cache reuse across test runs

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

LabelSelector: fmt.Sprintf("%s=%s", defaults.WellKnownAppLabel, name),
// Provide a longer timeout as the pod needs to be pulled and pass HCs
}, time.Second*60, time.Second*2)
}, time.Second*60, time.Second)
Copy link

Copilot AI Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The polling interval has been reduced from 2 seconds to 1 second. While this may speed up tests that pass quickly, it doubles the polling frequency which could:

  1. Increase load on the Kubernetes API server during tests
  2. Lead to more flaky tests if the system is under load
  3. Potentially hit rate limits in constrained environments

Consider whether the performance gain (potentially saving ~1 second per assertion) justifies the increased API load and potential for flakiness.

Suggested change
}, time.Second*60, time.Second)
}, time.Second*60, 2*time.Second)

Copilot uses AI. Check for mistakes.
Comment on lines +73 to +93
create_and_setup &
KIND_PID=$!

if [[ $SKIP_DOCKER == 'true' ]]; then
# TODO(tim): refactor the Makefile & CI scripts so we're loading local
# charts to real helm repos, and then we can remove this block.
echo "SKIP_DOCKER=true, not building images or chart"
else
# 2. Make all the docker images and load them to the kind cluster
VERSION=$VERSION CLUSTER_NAME=$CLUSTER_NAME make kind-build-and-load
if [[ $AGENTGATEWAY == 'true' ]]; then
# Skip expensive envoy build
VERSION=$VERSION CLUSTER_NAME=$CLUSTER_NAME make kind-build-and-load-kgateway-agentgateway
else
VERSION=$VERSION CLUSTER_NAME=$CLUSTER_NAME make kind-build-and-load
fi

# 3. Build the test helm chart, ensuring we have a chart in the `_test` folder
VERSION=$VERSION make package-kgateway-charts

fi

# 5. Apply the Kubernetes Gateway API CRDs
kubectl apply --server-side -f "https://github.com/kubernetes-sigs/gateway-api/releases/download/$CONFORMANCE_VERSION/$CONFORMANCE_CHANNEL-install.yaml"

# 6. Apply the Kubernetes Gateway API Inference Extension CRDs
kubectl apply --kustomize "https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=$GIE_CRD_VERSION"

# 7. Conformance test setup
if [[ $CONFORMANCE == "true" ]]; then
echo "Running conformance test setup"
. $SCRIPT_DIR/setup-metalllb-on-kind.sh
fi
wait "$KIND_PID"
Copy link

Copilot AI Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The create_and_setup function is run in the background (line 73) but the script has -e flag set. If the background process fails, the script won't immediately exit due to set -e. While wait "$KIND_PID" on line 93 will eventually check the exit status, the docker build steps (lines 82-90) will continue running even if cluster setup fails. This could lead to confusing error messages where docker build succeeds but fails to load images into a non-existent cluster. Consider adding error handling or checking the cluster is ready before starting the builds.

Copilot uses AI. Check for mistakes.
Comment on lines +317 to +329
func (i *TestInstallation) releaseExists(ctx context.Context, releaseName, namespace string) bool {
l := &corev1.SecretList{}
if err := i.ClusterContext.Client.List(ctx, l, &client.ListOptions{
Namespace: namespace,
LabelSelector: labels.SelectorFromSet(map[string]string{
"owner": "helm",
"name": releaseName,
}),
}); err != nil {
return false
}
return len(l.Items) > 0
}
Copy link

Copilot AI Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new releaseExists implementation queries Kubernetes secrets directly instead of using the Helm CLI. While this may be faster, it's less robust than the original ReleaseExists method because:

  1. It silently returns false on any error (line 325-326), which could mask real cluster connection issues
  2. The label selector assumes Helm v3's storage format, which could break if Helm changes its internal implementation
  3. The original method properly handles namespace scoping and validates the release is actually functional

Consider adding error logging at minimum, or evaluate if the performance gain justifies replacing the official Helm API.

Copilot uses AI. Check for mistakes.
@gateway-bot gateway-bot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. release-note-none and removed do-not-merge/kind-invalid Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/release-note-invalid Indicates that a PR should not merge because it's missing one of the release note labels. labels Dec 2, 2025
@howardjohn howardjohn force-pushed the agw/fast-e2e-tmp-flat branch 2 times, most recently from a5b134a to 7d63ce6 Compare December 3, 2025 00:17
@puertomontt
Copy link
Contributor

was removing curl pod controversial?

@howardjohn howardjohn force-pushed the agw/fast-e2e-tmp-flat branch from 7d63ce6 to 6be97df Compare December 3, 2025 17:20
@howardjohn
Copy link
Contributor Author

was removing curl pod controversial?

Seems like no after meeting with some folks, but its at least changing the test behavior. This PR doesn't change any behavior, just makes it faster

Signed-off-by: John Howard <[email protected]>
Signed-off-by: John Howard <[email protected]>
// Check if we should skip installation if the release already exists (PERSIST_INSTALL or FAIL_FAST_AND_PERSIST mode)
if testutils.ShouldPersistInstall() || testutils.ShouldFailFastAndPersist() {
if i.Actions.Helm().ReleaseExists(ctx, helmutils.ChartName, i.Metadata.InstallNamespace) {
if i.releaseExists(ctx, helmutils.ChartName, i.Metadata.InstallNamespace) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just update Actions.Helm().ReleaseExists to use the new impl?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I didn't do this is the Helm() action is for running the helm command so it doesn't have access to the cluster client at all. Not sure we want to do that broader change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I think we can handle that in a follow-up if needed

agentgateway:
required: false
default: "false"
description: Whether this test only needs agentgateway
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"only needs agentgateway" meaning envoy is disabled?

# 6. Apply the Kubernetes Gateway API Inference Extension CRDs
kubectl apply --kustomize "https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=$GIE_CRD_VERSION"

. $SCRIPT_DIR/setup-metalllb-on-kind.sh
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be only if CONFORMANCE=true, like before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

metallb is used for all tests, which all set CONFORMANCE=true. I think its a legacy thing that is now obsolete

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in that case it would be good to clean up the CONFORMANCE var if it's not used anymore (can be a follow-up)

Comment on lines +62 to +66
# 5. Apply the Kubernetes Gateway API CRDs
kubectl apply --server-side -f "https://github.com/kubernetes-sigs/gateway-api/releases/download/$CONFORMANCE_VERSION/$CONFORMANCE_CHANNEL-install.yaml"

# 6. Apply the Kubernetes Gateway API Inference Extension CRDs
kubectl apply --kustomize "https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=$GIE_CRD_VERSION"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove the 5. / 6. numbering

@howardjohn howardjohn added this pull request to the merge queue Dec 4, 2025
Merged via the queue into kgateway-dev:main with commit 91bc630 Dec 4, 2025
30 checks passed
@howardjohn howardjohn deleted the agw/fast-e2e-tmp-flat branch December 4, 2025 16:32
howardjohn added a commit to howardjohn/kgateway that referenced this pull request Dec 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. release-note-none

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants