Optimize the cloud profiler E2E test suite by parallelizing the profi…#1400
Conversation
|
Skipping CI for Draft Pull Request. |
|
@yaozile123: The label(s) DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Code Review
This pull request updates the E2E test suites. In the cloud profiler test suite, tests are skipped when Zonal Buckets are enabled, and the profile verification checks are parallelized using goroutines. In the Istio test suite, the pod is configured to exclude port 443 from Istio redirection when Zonal Buckets are enabled. The review feedback points out an inefficiency where the Cloud Profiler client is repeatedly initialized inside a polling loop, which is now executed concurrently; it is recommended to initialize the client once and pass it as a parameter.
| g.Expect(err).NotTo(gomega.HaveOccurred(), fmt.Sprintf("failed to check gcsfuse profile for version %s", expectedVersion)) | ||
| g.Expect(gcsfuseOk).To(gomega.BeTrue(), fmt.Sprintf("gcsfuse profile does not exist yet for version %s", expectedVersion)) | ||
| // Check Sidecar Profile | ||
| sidecarOk, err := checkIfProfileExistForServiceAndVersion(ctx, sidecarServiceName, expectedVersion) |
There was a problem hiding this comment.
In checkIfProfileExistForServiceAndVersion, a new Cloud Profiler service client is created on every single invocation via cloudprofiler.NewService(ctx, ...). Since this helper is called inside gomega.Eventually (which polls every 10 seconds for up to 10 minutes), and now runs concurrently in two goroutines, this can result in up to 120 client allocations per test run. This is highly inefficient, leaks resources (connections/goroutines), and can lead to rate-limiting or transient authentication failures.
Consider refactoring checkIfProfileExistForServiceAndVersion to accept the pre-initialized *cloudprofiler.Service client as a parameter, and initialize it once at the start of testCaseCloudProfilerProfileCreation.
…les existence check and fix flakey istio test in zonal bucket
9790996 to
16a5cbc
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: amacaskill, yaozile123 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
What type of PR is this?
/kind failing-test
/kind flake
What this PR does / why we need it:
This PR resolves the test hang in Istio Zonal Bucket E2E tests, and introduces execution improvements for the Cloud Profiler test suite.
1. Fix Infinite Hangs in Istio Tests with Zonal Buckets
443and leads to the following errorCrucially, the FUSE mount system call itself succeeds at the OS level, allowing the test Pod to transition to the
Runningstatus. However, when the workload actually attempts any read/write operations, the I/O hangs indefinitely. The test execution inside the Pod never fails or exits, causing the Ginkgo E2E test suite to hang silently until it hits the hard 4-hour suite timeout.traffic.sidecar.istio.io/excludeOutboundPorts: "443"when Zonal Buckets are enabled. This bypasses Envoy proxying for all port 443 outbound traffic, letting the DirectPath and GFE gRPC connections establish directly and stably.2. Cloud Profiler Test Suite Improvements & Cleanups
cloud_profilertests when Zonal Buckets are enabledWhich issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
I have tested on non managed driver, gke version 1.36, GCSFuse version
3.9.1I tested istio 10 times to make sure no flakiness and all the test cases passed
I have verified that the cloud profiler test will be skipped in ZB enabled tests
profiler test in non managed driver:
I also run the ZB enabled e2e test, all the test has been passed except the one in
workload-identity-federation. This test has been keep failing in the non managed testgrid and will be fixed in another PR.Does this PR introduce a user-facing change?: