-
Notifications
You must be signed in to change notification settings - Fork 618
ci: minimal CI changes for faster tests #13031
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ci: minimal CI changes for faster tests #13031
Conversation
457863e to
2626c40
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements several optimizations to reduce CI test execution time by introducing parallel operations, conditional builds, and test configuration options.
- Adds a
SKIP_DUMPenvironment variable to disable expensive test state dumping when not needed - Introduces parallel execution in the setup-kind.sh script to run cluster setup concurrently with docker builds
- Adds an agentgateway-specific build path that skips the expensive Envoy build for tests that don't require it
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| test/testutils/env.go | Adds SkipDump constant for new environment variable to control test dumping behavior |
| test/helpers/kube_dump.go | Implements skip logic for test dumps when SKIP_DUMP=true |
| test/e2e/testutils/cluster/kind.go | Removes fallback kubeCtx logic (empty context now defaults to current kubeconfig context) |
| test/e2e/tests/base/base_suite.go | Reduces polling interval from 2s to 1s for faster test execution |
| test/e2e/test.go | Replaces runtime.SetFinalizer with t.Cleanup() for better test cleanup; adds custom releaseExists implementation that queries Kubernetes secrets directly instead of using Helm CLI |
| hack/kind/setup-kind.sh | Parallelizes cluster setup with docker builds; adds conditional build path for agentgateway tests |
| cmd/kgateway/Dockerfile.agentgateway | New minimal Dockerfile for agentgateway that excludes Envoy dependencies |
| Makefile | Removes -race flag from e2e tests; removes -cpu=4 flag; adds build targets for agentgateway image |
| .github/workflows/e2e.yaml | Adds agentgateway: 'true' to agent-gateway-cluster test configuration |
| .github/actions/setup-kind-cluster/action.yaml | Adds agentgateway input parameter to control build path selection |
| .github/actions/prep-go-runner/action.yaml | Removes expensive apt-get package removal steps; adds timing to mod-download |
| .github/actions/kubernetes-e2e-tests/action.yaml | Sets CGO_ENABLED=0 to enable Go cache reuse across test runs |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| LabelSelector: fmt.Sprintf("%s=%s", defaults.WellKnownAppLabel, name), | ||
| // Provide a longer timeout as the pod needs to be pulled and pass HCs | ||
| }, time.Second*60, time.Second*2) | ||
| }, time.Second*60, time.Second) |
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The polling interval has been reduced from 2 seconds to 1 second. While this may speed up tests that pass quickly, it doubles the polling frequency which could:
- Increase load on the Kubernetes API server during tests
- Lead to more flaky tests if the system is under load
- Potentially hit rate limits in constrained environments
Consider whether the performance gain (potentially saving ~1 second per assertion) justifies the increased API load and potential for flakiness.
| }, time.Second*60, time.Second) | |
| }, time.Second*60, 2*time.Second) |
| create_and_setup & | ||
| KIND_PID=$! | ||
|
|
||
| if [[ $SKIP_DOCKER == 'true' ]]; then | ||
| # TODO(tim): refactor the Makefile & CI scripts so we're loading local | ||
| # charts to real helm repos, and then we can remove this block. | ||
| echo "SKIP_DOCKER=true, not building images or chart" | ||
| else | ||
| # 2. Make all the docker images and load them to the kind cluster | ||
| VERSION=$VERSION CLUSTER_NAME=$CLUSTER_NAME make kind-build-and-load | ||
| if [[ $AGENTGATEWAY == 'true' ]]; then | ||
| # Skip expensive envoy build | ||
| VERSION=$VERSION CLUSTER_NAME=$CLUSTER_NAME make kind-build-and-load-kgateway-agentgateway | ||
| else | ||
| VERSION=$VERSION CLUSTER_NAME=$CLUSTER_NAME make kind-build-and-load | ||
| fi | ||
|
|
||
| # 3. Build the test helm chart, ensuring we have a chart in the `_test` folder | ||
| VERSION=$VERSION make package-kgateway-charts | ||
|
|
||
| fi | ||
|
|
||
| # 5. Apply the Kubernetes Gateway API CRDs | ||
| kubectl apply --server-side -f "https://github.com/kubernetes-sigs/gateway-api/releases/download/$CONFORMANCE_VERSION/$CONFORMANCE_CHANNEL-install.yaml" | ||
|
|
||
| # 6. Apply the Kubernetes Gateway API Inference Extension CRDs | ||
| kubectl apply --kustomize "https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=$GIE_CRD_VERSION" | ||
|
|
||
| # 7. Conformance test setup | ||
| if [[ $CONFORMANCE == "true" ]]; then | ||
| echo "Running conformance test setup" | ||
| . $SCRIPT_DIR/setup-metalllb-on-kind.sh | ||
| fi | ||
| wait "$KIND_PID" |
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The create_and_setup function is run in the background (line 73) but the script has -e flag set. If the background process fails, the script won't immediately exit due to set -e. While wait "$KIND_PID" on line 93 will eventually check the exit status, the docker build steps (lines 82-90) will continue running even if cluster setup fails. This could lead to confusing error messages where docker build succeeds but fails to load images into a non-existent cluster. Consider adding error handling or checking the cluster is ready before starting the builds.
| func (i *TestInstallation) releaseExists(ctx context.Context, releaseName, namespace string) bool { | ||
| l := &corev1.SecretList{} | ||
| if err := i.ClusterContext.Client.List(ctx, l, &client.ListOptions{ | ||
| Namespace: namespace, | ||
| LabelSelector: labels.SelectorFromSet(map[string]string{ | ||
| "owner": "helm", | ||
| "name": releaseName, | ||
| }), | ||
| }); err != nil { | ||
| return false | ||
| } | ||
| return len(l.Items) > 0 | ||
| } |
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new releaseExists implementation queries Kubernetes secrets directly instead of using the Helm CLI. While this may be faster, it's less robust than the original ReleaseExists method because:
- It silently returns
falseon any error (line 325-326), which could mask real cluster connection issues - The label selector assumes Helm v3's storage format, which could break if Helm changes its internal implementation
- The original method properly handles namespace scoping and validates the release is actually functional
Consider adding error logging at minimum, or evaluate if the performance gain justifies replacing the official Helm API.
a5b134a to
7d63ce6
Compare
|
was removing curl pod controversial? |
7d63ce6 to
6be97df
Compare
Seems like no after meeting with some folks, but its at least changing the test behavior. This PR doesn't change any behavior, just makes it faster |
Signed-off-by: John Howard <[email protected]>
Signed-off-by: John Howard <[email protected]>
Signed-off-by: John Howard <[email protected]>
6be97df to
c052f25
Compare
| // Check if we should skip installation if the release already exists (PERSIST_INSTALL or FAIL_FAST_AND_PERSIST mode) | ||
| if testutils.ShouldPersistInstall() || testutils.ShouldFailFastAndPersist() { | ||
| if i.Actions.Helm().ReleaseExists(ctx, helmutils.ChartName, i.Metadata.InstallNamespace) { | ||
| if i.releaseExists(ctx, helmutils.ChartName, i.Metadata.InstallNamespace) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we just update Actions.Helm().ReleaseExists to use the new impl?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I didn't do this is the Helm() action is for running the helm command so it doesn't have access to the cluster client at all. Not sure we want to do that broader change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I think we can handle that in a follow-up if needed
| agentgateway: | ||
| required: false | ||
| default: "false" | ||
| description: Whether this test only needs agentgateway |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"only needs agentgateway" meaning envoy is disabled?
| # 6. Apply the Kubernetes Gateway API Inference Extension CRDs | ||
| kubectl apply --kustomize "https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=$GIE_CRD_VERSION" | ||
|
|
||
| . $SCRIPT_DIR/setup-metalllb-on-kind.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be only if CONFORMANCE=true, like before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
metallb is used for all tests, which all set CONFORMANCE=true. I think its a legacy thing that is now obsolete
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in that case it would be good to clean up the CONFORMANCE var if it's not used anymore (can be a follow-up)
| # 5. Apply the Kubernetes Gateway API CRDs | ||
| kubectl apply --server-side -f "https://github.com/kubernetes-sigs/gateway-api/releases/download/$CONFORMANCE_VERSION/$CONFORMANCE_CHANNEL-install.yaml" | ||
|
|
||
| # 6. Apply the Kubernetes Gateway API Inference Extension CRDs | ||
| kubectl apply --kustomize "https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=$GIE_CRD_VERSION" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove the 5. / 6. numbering
Signed-off-by: John Howard <[email protected]>
Description
This PR breaks out the improvements in #12993 to focus only on the (hopefully) non-controversial parts, which should have no impact on the test execution at all.
aptoperations for agentgateway testFor local runs...
Comparison (sample size=1)
Change Type
/kind cleanup
Changelog
Additional Notes