@howardjohn howardjohn commented Nov 26, 2025

Description

Read #12981 for background.

This PR optimizes two e2e tests. I picked these at random as samples; there is nothing special about them, and the same approach could be applied to most or all of our tests.

Before this PR:

--- PASS: TestAgentgatewayIntegration (35.97s)
    --- PASS: TestAgentgatewayIntegration/ApiKeyAuth (17.31s)
        --- PASS: TestAgentgatewayIntegration/ApiKeyAuth/TestGatewayPolicy (2.76s)
        --- PASS: TestAgentgatewayIntegration/ApiKeyAuth/TestRoutePolicy (0.84s)
    --- PASS: TestAgentgatewayIntegration/BasicAuth (18.53s)
        --- PASS: TestAgentgatewayIntegration/BasicAuth/TestGatewayPolicy (3.27s)
        --- PASS: TestAgentgatewayIntegration/BasicAuth/TestRoutePolicy (1.28s)
ok      github.com/kgateway-dev/kgateway/v2/test/e2e/tests      35.985s

After this PR:

2025-11-26T09:04:29.425069 --- PASS: TestAgentgatewayIntegration (0.22s)
2025-11-26T09:04:29.425072     --- PASS: TestAgentgatewayIntegration/BasicAuth (0.09s)
2025-11-26T09:04:29.425075         --- PASS: TestAgentgatewayIntegration/BasicAuth/TestGatewayPolicy (0.03s)
2025-11-26T09:04:29.425078         --- PASS: TestAgentgatewayIntegration/BasicAuth/TestRoutePolicy (0.03s)
2025-11-26T09:04:29.425081     --- PASS: TestAgentgatewayIntegration/ApiKeyAuth (0.10s)
2025-11-26T09:04:29.425084         --- PASS: TestAgentgatewayIntegration/ApiKeyAuth/TestGatewayPolicy (0.02s)
2025-11-26T09:04:29.425086         --- PASS: TestAgentgatewayIntegration/ApiKeyAuth/TestRoutePolicy (0.04s)

Changes I applied:

  • Deploy a single Gateway and test backend that are shared and reused across tests (the biggest change, by far). Replace the curl pod with native Go code. Drops 35s -> 3s.
  • Replace kubectl apply with Istio apply (cuts 200ms off each test).
  • Rebase on b96e132 (now merged) (cuts 100ms off each test).
  • Replace kubectl delete with Istio delete: cuts 200ms.
  • Use Istio apply for the base setup as well, instead of just for each test: cuts 300ms off the suite.
  • Parallel apply within each file (instead of across files): cuts 50ms.
  • Async delete: drops 20ms (note: at this point the entire test takes 50ms, so this is a notable saving as well). I had to disable this, though, since it needs more fixes.
  • Replace helm status with native code: cuts 150ms off the test suite.
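
As a rough illustration of the "replace the curl pod with native Go code" item, a request can be sent straight from the test process using only the standard library. This is a minimal sketch, not the helper added in this PR: `getWithHost` and `demo` are hypothetical names, and an `httptest` server stands in for the shared Gateway.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// getWithHost sends a GET to addr while presenting host as the HTTP Host
// header, similar to what `curl -H "Host: ..."` does from a curl pod.
// Hypothetical helper for illustration only.
func getWithHost(client *http.Client, addr, host string) (int, string, error) {
	req, err := http.NewRequest(http.MethodGet, addr, nil)
	if err != nil {
		return 0, "", err
	}
	req.Host = host // overrides the Host header without changing the dial target
	resp, err := client.Do(req)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return resp.StatusCode, string(body), err
}

// demo starts a local stand-in for the gateway that echoes the Host it saw.
func demo() (int, string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.Host)
	}))
	defer srv.Close()
	client := &http.Client{Timeout: 5 * time.Second}
	return getWithHost(client, srv.URL, "example.com")
}

func main() {
	status, body, err := demo()
	fmt.Println(status, body, err)
}
```

Skipping the pod exec round trip is presumably where most of the per-request saving comes from; the trade-off is that the test process needs network reachability to the Gateway.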

edit: did some more.
--- PASS: TestAgentgatewayIntegration (1.15s)
    --- PASS: TestAgentgatewayIntegration/ApiKeyAuth (0.16s)
        --- PASS: TestAgentgatewayIntegration/ApiKeyAuth/TestGatewayPolicy (0.05s)
        --- PASS: TestAgentgatewayIntegration/ApiKeyAuth/TestRoutePolicy (0.07s)
    --- PASS: TestAgentgatewayIntegration/JwtAuth (0.21s)
        --- PASS: TestAgentgatewayIntegration/JwtAuth/TestGatewayPolicy (0.04s)
        --- PASS: TestAgentgatewayIntegration/JwtAuth/TestGatewayPolicyWithRbac (0.04s)
        --- PASS: TestAgentgatewayIntegration/JwtAuth/TestRoutePolicy (0.04s)
        --- PASS: TestAgentgatewayIntegration/JwtAuth/TestRoutePolicyWithRbac (0.04s)
    --- PASS: TestAgentgatewayIntegration/CSRF (0.10s)
        --- PASS: TestAgentgatewayIntegration/CSRF/TestGatewayLevelCSRF (0.06s)
    --- PASS: TestAgentgatewayIntegration/RBAC (0.08s)
        --- PASS: TestAgentgatewayIntegration/RBAC/TestRBACHeaderAuthorization (0.01s)
    --- PASS: TestAgentgatewayIntegration/Transformation (0.11s)
        --- PASS: TestAgentgatewayIntegration/Transformation/TestGatewayWithTransformedRoute (0.01s)
    --- PASS: TestAgentgatewayIntegration/BackendTLSPolicy (0.27s)
        --- PASS: TestAgentgatewayIntegration/BackendTLSPolicy/TestBackendTLSPolicyAndStatus (0.12s)
    --- PASS: TestAgentgatewayIntegration/BasicAuth (0.20s)
        --- PASS: TestAgentgatewayIntegration/BasicAuth/TestGatewayPolicy (0.04s)
        --- PASS: TestAgentgatewayIntegration/BasicAuth/TestRoutePolicy (0.12s)

End-to-end full GHA flow: 8 minutes
Actual time of testing: 23.956s

Change Type

/kind cleanup

Changelog

NONE

Additional Notes

@gateway-bot added the kind/cleanup and release-note-none labels Nov 26, 2025
// Install kgateway
testInstallation.InstallKgatewayFromLocalChart(ctx)

common.SetupBaseConfig(ctx, t, testInstallation, filepath.Join("manifests", "agent-gateway-base.yaml"))

Want to make you aware of this PR which added nightly e2e tests with different Gateway API versions. It mostly works by tagging tests with minimum versions, but it also introduced a mechanism to use different manifests for suite setup. The main use case is to switch between a gateway that defines "allowedListeners" and one that doesn't based on the API version.

I don't think it's relevant for agentgateway at the moment, but it may be in the future, and it will be if we extend these improvements to the kgateway tests.

@howardjohn howardjohn marked this pull request as ready for review November 27, 2025 00:34
Copilot AI review requested due to automatic review settings November 27, 2025 00:34
Copilot AI left a comment

Pull request overview

This PR optimizes e2e test execution by implementing several performance improvements, reducing test time from ~36s to ~0.2s for sample tests. The main optimization is deploying a single shared Gateway and backend that is reused across tests, replacing individual per-test deployments. Additional improvements include replacing kubectl with Istio's native apply/delete, using native Go HTTP requests instead of curl pods, and optimizing helper functions.

Key changes:

  • Shared test infrastructure with reusable Gateway/backend in agentgateway-base namespace
  • Native Go HTTP client implementation replacing curl pods
  • Istio client integration for faster resource operations
  • Test manifest updates to reference shared resources

Reviewed changes

Copilot reviewed 89 out of 92 changed files in this pull request and generated 8 comments.

Show a summary per file
| File | Description |
| --- | --- |
| test/e2e/common/base.go | New shared gateway setup and native HTTP client |
| test/e2e/tests/base/base_suite.go | Replaced kubectl with Istio apply/delete |
| test/e2e/testutils/cluster/kind.go | Added IstioClient integration |
| pkg/utils/requestutils/curl/native_request.go | New native Go HTTP request implementation |
| test/e2e/tests/manifests/agent-gateway-base.yaml | New shared base resources |
| Various test manifests | Updated to use shared agentgateway-base namespace |
| pkg/agentgateway/translator/ | TLS/mTLS support improvements |
| go.mod | Local replace directive and dependency updates |

Comment on lines 403 to 411
//for _, manifest := range testCase.Manifests {
// log.Errorf("howardjohn: apply..")
err := s.TestInstallation.ClusterContext.IstioClient.ApplyYAMLFiles("", testCase.Manifests...)
gomega.Expect(err).NotTo(gomega.HaveOccurred())
//gomega.Eventually(func() error {
// err := s.TestInstallation.Actions.Kubectl().ApplyFile(s.Ctx, manifest)
// return err
//}, 10*time.Second, 1*time.Second).Should(gomega.Succeed(), "can apply "+manifest)
//}

Copilot AI Nov 27, 2025

Remove commented-out debug code before merging. The commented code includes a debug log statement and the old implementation, which should be cleaned up for production code.

Comment on lines 456 to 463
// TODO: this is a bottleneck. Doing it async is great but we need barriers to make sure if we do
// delete(foo); add(foo) we end up in the right state.
// We also need to make sure we finish the deletion before exiting the process
//go func() {
err := s.TestInstallation.ClusterContext.IstioClient.DeleteYAMLFiles("", testCase.Manifests...)
gomega.Expect(err).NotTo(gomega.HaveOccurred())
//}()
return

Copilot AI Nov 27, 2025

Address the TODO comment about async deletion barriers. The commented async implementation and early return suggest incomplete work that needs proper synchronization before enabling.
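
One possible shape for the barriers that TODO describes is sketched below. It is an assumption-laden illustration, not code from this PR: `asyncDeleter`, its methods, and the in-memory `cluster` map are all hypothetical, and the sketch assumes a single goroutine drives `Delete` and `Wait`. Deletions run in the background, a per-key `Wait` blocks a re-apply until the matching delete has finished, and `Flush` drains everything before the process exits.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// asyncDeleter runs deletions in the background but lets the caller wait for
// them: per key before re-applying, and globally before exiting.
// Assumes Delete and Wait are called from a single goroutine.
type asyncDeleter struct {
	mu      sync.Mutex
	pending map[string]chan struct{} // key -> closed once that delete is done
	errs    []error
	all     sync.WaitGroup
}

func newAsyncDeleter() *asyncDeleter {
	return &asyncDeleter{pending: map[string]chan struct{}{}}
}

// Delete starts del in the background, after any earlier delete of key.
func (d *asyncDeleter) Delete(key string, del func() error) {
	d.Wait(key)
	done := make(chan struct{})
	d.mu.Lock()
	d.pending[key] = done
	d.mu.Unlock()
	d.all.Add(1)
	go func() {
		defer d.all.Done()
		defer close(done)
		if err := del(); err != nil {
			d.mu.Lock()
			d.errs = append(d.errs, err)
			d.mu.Unlock()
		}
	}()
}

// Wait blocks until the most recent delete of key (if any) has finished.
func (d *asyncDeleter) Wait(key string) {
	d.mu.Lock()
	done := d.pending[key]
	d.mu.Unlock()
	if done != nil {
		<-done
	}
}

// Flush waits for every outstanding delete and returns their joined errors.
func (d *asyncDeleter) Flush() error {
	d.all.Wait()
	d.mu.Lock()
	defer d.mu.Unlock()
	return errors.Join(d.errs...)
}

// demo performs delete(foo); add(foo) against an in-memory stand-in cluster
// and reports whether foo exists at the end.
func demo() (bool, error) {
	var mu sync.Mutex
	cluster := map[string]bool{"foo": true}
	d := newAsyncDeleter()
	d.Delete("foo", func() error {
		time.Sleep(20 * time.Millisecond) // simulate a slow API call
		mu.Lock()
		defer mu.Unlock()
		delete(cluster, "foo")
		return nil
	})
	d.Wait("foo") // barrier: do not re-apply until the delete has landed
	mu.Lock()
	cluster["foo"] = true
	mu.Unlock()
	err := d.Flush() // barrier: finish all deletes before exiting
	mu.Lock()
	defer mu.Unlock()
	return cluster["foo"], err
}

func main() {
	present, err := demo()
	fmt.Println(present, err)
}
```

Note that this only orders API calls; it does not by itself guarantee that the data plane has observed the deletion.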

Comment on lines 471 to 476
//for _, manifest := range testCase.Manifests {
//gomega.Eventually(func() error {
// err := s.TestInstallation.Actions.Kubectl().DeleteFileSafe(s.Ctx, manifest)
// return err
//}, 10*time.Second, 1*time.Second).Should(gomega.Succeed(), "can delete "+manifest)
//}

Copilot AI Nov 27, 2025

Remove commented-out old delete implementation code. This dead code should be cleaned up before merging.

go.mod Outdated
sigs.k8s.io/kind
)

replace github.com/agentgateway/agentgateway => /home/john/solo/agentgateway

Copilot AI Nov 27, 2025

Remove local replace directive that points to a developer's local filesystem path. This will break builds for anyone else and should not be committed.

Suggested change
replace github.com/agentgateway/agentgateway => /home/john/solo/agentgateway

Comment on lines +19 to +20
//agentgatewaySuiteRunner.Register("A2A", a2a.NewTestingSuite)
//agentgatewaySuiteRunner.Register("BasicRouting", agentgateway.NewTestingSuite)

Copilot AI Nov 27, 2025

Multiple test suites are commented out. If these are intentionally disabled for this draft PR, add a comment explaining why or create a tracking issue to re-enable them with the new optimizations.

Comment on lines +235 to +239
// Handle SNI with custom host resolution
// TODO
if c.sni != "" {
panic("sni is not implemented")
}

Copilot AI Nov 27, 2025

SNI support is not implemented and will panic at runtime if used. Either implement this feature or remove SNI as an option until it can be properly supported.
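
For reference, SNI with custom host resolution can be expressed with the standard library alone: set `ServerName` on the client's `tls.Config` and override `DialContext` so every connection goes to a fixed address. The sketch below is an assumption about how this could look, not the implementation intended for `native_request.go`; `newSNIClient`, `demo`, and the `gateway.invalid` URL are hypothetical, and an `httptest` TLS server (whose built-in certificate covers `example.com`) stands in for the gateway.

```go
package main

import (
	"context"
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// newSNIClient returns a client that connects to dialAddr for every request
// (similar to curl --resolve) and sends sni as the TLS server name.
// Hypothetical helper for illustration only.
func newSNIClient(sni, dialAddr string, roots *x509.CertPool) *http.Client {
	dialer := &net.Dialer{Timeout: 5 * time.Second}
	return &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{ServerName: sni, RootCAs: roots},
			// Ignore the address derived from the URL and dial the fixed target.
			DialContext: func(ctx context.Context, network, _ string) (net.Conn, error) {
				return dialer.DialContext(ctx, network, dialAddr)
			},
		},
	}
}

// demo runs a local TLS server that records the SNI it receives and returns
// that SNI together with the response status.
func demo() (string, int, error) {
	var mu sync.Mutex
	var seen string
	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	srv.TLS = &tls.Config{
		GetConfigForClient: func(hello *tls.ClientHelloInfo) (*tls.Config, error) {
			mu.Lock()
			seen = hello.ServerName
			mu.Unlock()
			return nil, nil // keep using the server's default config
		},
	}
	srv.StartTLS()
	defer srv.Close()

	roots := x509.NewCertPool()
	roots.AddCert(srv.Certificate())
	client := newSNIClient("example.com", srv.Listener.Addr().String(), roots)
	defer client.CloseIdleConnections()

	resp, err := client.Get("https://gateway.invalid/")
	if err != nil {
		return "", 0, err
	}
	resp.Body.Close()
	mu.Lock()
	defer mu.Unlock()
	return seen, resp.StatusCode, nil
}

func main() {
	sni, status, err := demo()
	fmt.Println(sni, status, err)
}
```

Because `ServerName` is set explicitly, certificate verification runs against the SNI value rather than the URL host, which is usually what a gateway test wants.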

Comment on lines 33 to 34
kubeCtx := os.Getenv(testutils.KubeCtx)
if len(kubeCtx) == 0 {
kubeCtx = fmt.Sprintf("kind-%s", clusterName)
}

restCfg, err := kubeutils.GetRestConfigWithKubeContext(kubeCtx)

Copilot AI Nov 27, 2025

Removed the default context name fallback logic. If KubeCtx environment variable is empty, this will attempt to use an empty context which may fail. Consider restoring the default logic or documenting the requirement to set KubeCtx.

case *gatewayx.XListenerSet:
return wellknown.XListenerSetGVK
default:
panic("Uknown GVK")

Copilot AI Nov 27, 2025

Corrected spelling of 'Uknown' to 'Unknown'.

Suggested change
panic("Uknown GVK")
panic("Unknown GVK")

yuval-k commented Dec 1, 2025

how do you make sure there's no test pollution? i.e. that a test passes and/or flakes because some policy from previous test didn't propagate yet for whatever reason (maybe async delete took its time)?
i know we had that problem with gloo-edge-v1

for example, when testing proxy protocol, no other test will pass until the proxy protocol is fully removed from the data plane

@howardjohn
Contributor Author

> how do you make sure there's no test pollution? i.e. that a test passes and/or flakes because some policy from previous test didn't propagate yet for whatever reason (maybe async delete took its time)? i know we had that problem with gloo-edge-v1

FWIW I disabled async delete; I don't think the benefit is worth it. However, you can still have pollution due to eventual consistency (as you could before this change, though it was much less likely).

In my experience in Istio we would occasionally have this issue. Usually it shows up pretty quickly as flakes that are then diagnosed as "test was really broken and incorrectly passed sometimes". In kgw I think we also randomize test order which would make the chance of it correctly failing much much higher.

If we need to, we can add some additional barriers to check the config is cleaned up end to end (e.g. send requests that should fail?). Or, we can also make the tests use non-overlapping configs like distinct hostnames. Given 90% of the tests are just "apply a policy, send some traffic" we could also just have the framework entirely manage an assigned hostname too.
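
The framework-assigned hostname idea could look roughly like the sketch below. It is only an illustration under stated assumptions: `hostnameFor` and the `example.com` base domain are hypothetical, and nothing like this exists in the PR. The name is derived from the full test name, so each subtest gets a stable, DNS-safe, non-overlapping hostname to use in its routes and requests.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// baseDomain is a hypothetical suffix; a real suite would pick its own.
const baseDomain = "example.com"

// hostnameFor derives a stable, DNS-safe hostname from a test name such as
// t.Name(). A short hash suffix keeps names distinct even after sanitizing
// or truncating the readable part.
func hostnameFor(testName string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(testName) {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') {
			b.WriteRune(r)
		} else {
			b.WriteByte('-') // slashes, underscores, etc. become hyphens
		}
	}
	label := strings.Trim(b.String(), "-")
	if label == "" {
		label = "t"
	}
	sum := sha256.Sum256([]byte(testName))
	suffix := hex.EncodeToString(sum[:4]) // 8 hex characters

	// A DNS label is at most 63 characters; leave room for "-" + suffix.
	const maxLabel = 63
	if limit := maxLabel - len(suffix) - 1; len(label) > limit {
		label = strings.TrimRight(label[:limit], "-")
	}
	return label + "-" + suffix + "." + baseDomain
}

func main() {
	fmt.Println(hostnameFor("TestAgentgatewayIntegration/BasicAuth/TestRoutePolicy"))
	fmt.Println(hostnameFor("TestAgentgatewayIntegration/ApiKeyAuth/TestRoutePolicy"))
}
```

With distinct hostnames per test, a policy left behind by a previous test would attach to a route that the next test never sends traffic to, which sidesteps most of the eventual-consistency pollution discussed above.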

Signed-off-by: John Howard <[email protected]>