
Optimize the cloud profiler E2E test suite by parallelizing the profi…#1400

Merged
yaozile123 merged 1 commit into GoogleCloudPlatform:main from yaozile123:fix-cloudprofiler-e2e
May 29, 2026

@yaozile123 yaozile123 commented May 28, 2026

What type of PR is this?
/kind failing-test
/kind flake

What this PR does / why we need it:
This PR resolves the test hang in Istio Zonal Bucket E2E tests, and introduces execution improvements for the Cloud Profiler test suite.

1. Fix Infinite Hangs in Istio Tests with Zonal Buckets

  • The Problem: Zonal Buckets exclusively use gRPC and cannot fall back to HTTP. When Istio is injected, Envoy intercepts GCS FUSE outbound traffic on port 443, which leads to the following error:
Retrying for the error: rpc error: code = Unavailable desc = name resolver error: [xDS node id: C2P-1716144904150986009]: rpc error: code = Unavailable desc = listener resource error: [xDS node id: C2P-1716144904150986009]: [xDS node id: C2P-1716144904150986009]: xds: error received from xDS stream: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: read tcp 10.60.4.54:41720->64.233.181.95:443: read: connection reset by peer"

Crucially, the FUSE mount system call itself succeeds at the OS level, allowing the test Pod to transition to the Running status. However, when the workload actually attempts any read/write operations, the I/O hangs indefinitely. The test execution inside the Pod never fails or exits, causing the Ginkgo E2E test suite to hang silently until it hits the hard 4-hour suite timeout.

  • The Fix: Added the annotation traffic.sidecar.istio.io/excludeOutboundPorts: "443" when Zonal Buckets are enabled. This bypasses Envoy proxying for all port 443 outbound traffic, letting the DirectPath and GFE gRPC connections establish directly and stably.
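
As a rough illustration of the fix described above, the annotation can be added conditionally when building the test Pod's metadata. This is a minimal sketch, not the code from this PR: the helper name `istioPodAnnotations` and its boolean parameter are hypothetical, and only the annotation key and the value "443" come from the description.

```go
package main

import "fmt"

// Annotation key taken from the PR description.
const istioExcludeOutboundPortsAnnotation = "traffic.sidecar.istio.io/excludeOutboundPorts"

// istioPodAnnotations is a hypothetical helper that returns the extra Pod
// annotations for an Istio test Pod. When Zonal Buckets are enabled, port 443
// is excluded from Envoy redirection so gRPC connections are made directly.
func istioPodAnnotations(zonalBucketsEnabled bool) map[string]string {
	annotations := map[string]string{}
	if zonalBucketsEnabled {
		annotations[istioExcludeOutboundPortsAnnotation] = "443"
	}
	return annotations
}

func main() {
	fmt.Println(istioPodAnnotations(true))
	fmt.Println(istioPodAnnotations(false))
}
```

Keeping the annotation behind the Zonal Bucket check leaves the non-ZB Istio tests proxied through Envoy as before.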

2. Cloud Profiler Test Suite Improvements & Cleanups

  • Disabled on ZB: Explicitly skips the cloud_profiler tests when Zonal Buckets are enabled.
  • Parallelized Profile Queries: Parallelized the verification queries for GCSFuse and sidecar mounter profiles using concurrent goroutines. This significantly improves profile retrieval efficiency and prevents the tests from hitting transient timeouts.
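
The parallel verification can be pictured roughly as follows. This is a sketch under stated assumptions, using only the standard library: `profileCheck`, `checkProfilesConcurrently`, and `runDemo` are hypothetical names, the checks are stubs rather than real Cloud Profiler queries, and the actual test wraps its checks in Gomega assertions instead of returning errors.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// profileCheck is a hypothetical stand-in for one profile existence query.
type profileCheck func(ctx context.Context) (bool, error)

// checkProfilesConcurrently runs every check in its own goroutine, waits for
// all of them, and returns a joined error covering each check that failed or
// reported that its profile does not exist yet. It returns nil on success.
func checkProfilesConcurrently(ctx context.Context, checks map[string]profileCheck) error {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for name, check := range checks {
		wg.Add(1)
		go func(name string, check profileCheck) {
			defer wg.Done()
			ok, err := check(ctx)
			if err == nil && !ok {
				err = errors.New("profile does not exist yet")
			}
			if err != nil {
				mu.Lock() // errs is shared between goroutines
				errs = append(errs, fmt.Errorf("%s: %w", name, err))
				mu.Unlock()
			}
		}(name, check)
	}
	wg.Wait()
	return errors.Join(errs...)
}

// runDemo wires two stub checks (placeholder service names) into the helper.
func runDemo(gcsfuseFound, sidecarFound bool) error {
	static := func(found bool) profileCheck {
		return func(context.Context) (bool, error) { return found, nil }
	}
	return checkProfilesConcurrently(context.Background(), map[string]profileCheck{
		"gcsfuse": static(gcsfuseFound),
		"sidecar": static(sidecarFound),
	})
}

func main() {
	fmt.Println(runDemo(true, true))
	fmt.Println(runDemo(true, false))
}
```

With both queries in flight at once, the wall-clock time of one verification round is bounded by the slower query rather than the sum of the two.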

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:
I tested on the non-managed driver with GKE version 1.36 and GCSFuse version 3.9.1.
I ran the Istio tests 10 times to check for flakiness, and all test cases passed every time.

export STAGINGVERSION=prow-gob-internal-boskos-yaozile-zb-dev
export REGISTRY=us-central1-docker.pkg.dev/yaozile-gke-dev/gcs-fuse-csi-driver
export ENABLE_ZB=true
export E2E_TEST_FOCUS=istio
export E2E_TEST_USE_GKE_MANAGED_DRIVER=false

make build-image-and-push-multi-arch
make e2e-test

  Ran 8 of 464 Specs in 47.921 seconds
  SUCCESS! -- 8 Passed | 0 Failed | 0 Pending | 456 Skipped


  Ginkgo ran 1 suite in 53.557267177s
  Test Suite Passed

I have verified that the cloud profiler tests are skipped in ZB-enabled runs:

S [SKIPPED] [1.008 seconds]
  E2E Test Suite [Driver: gcsfuse.csi.storage.gke.io] [Testpattern: CSI Ephemeral-volume (default fs)] cloudProfiler [It] cloud_profiler should only create profiles for sidecar when gcsfuse profiler is disabled
  /usr/local/google/home/yaozile/oss/gcs-fuse-csi-driver/test/e2e/testsuites/cloud_profiler.go:187

S [SKIPPED] [1.000 seconds]
E2E Test Suite [Driver: gcsfuse.csi.storage.gke.io] [Testpattern: CSI Ephemeral-volume (default fs)] cloudProfiler [It] cloud_profiler should create profiles for sidecar and gcsfuse
  /usr/local/google/home/yaozile/oss/gcs-fuse-csi-driver/test/e2e/testsuites/cloud_profiler.go:183

Profiler test on the non-managed driver:

Ran 2 of 590 Specs in 1583.837 seconds
  SUCCESS! -- 2 Passed | 0 Failed | 0 Pending | 588 Skipped


  Ginkgo ran 1 suite in 26m38.371854753s
  Test Suite Passed

I also ran the ZB-enabled E2E tests. All tests passed except one in workload-identity-federation. That test has been failing consistently in the non-managed testgrid and will be fixed in another PR.

  Summarizing 1 Failure:
    [FAIL] E2E Test Suite [Driver: gcsfuse.csi.storage.gke.io] [Testpattern: CSI Ephemeral-volume (default fs)] workload-identity-federation [It] should fail GCS access after workload identity federation principal permissions are removed while pod is running
    /usr/local/google/home/yaozile/oss/gcs-fuse-csi-driver/test/e2e/testsuites/workload_identity_federation.go:334

  Ran 333 of 464 Specs in 3439.089 seconds
  FAIL -- 332 Passed | 1 Failed | 0 Pending | 131 Skipped


  Ginkgo ran 1 suite in 57m20.792985038s

Does this PR introduce a user-facing change?:

NONE

@google-oss-prow

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@google-oss-prow

@yaozile123: The label(s) kind/failing-test, kind/flake cannot be applied, because the repository doesn't have them.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the E2E test suites. In the cloud profiler test suite, tests are skipped when Zonal Buckets are enabled, and the profile verification checks are parallelized using goroutines. In the Istio test suite, the pod is configured to exclude port 443 from Istio redirection when Zonal Buckets are enabled. The review feedback points out an inefficiency where the Cloud Profiler client is repeatedly initialized inside a polling loop, which is now executed concurrently; it is recommended to initialize the client once and pass it as a parameter.

Comment thread test/e2e/testsuites/cloud_profiler.go Outdated
g.Expect(err).NotTo(gomega.HaveOccurred(), fmt.Sprintf("failed to check gcsfuse profile for version %s", expectedVersion))
g.Expect(gcsfuseOk).To(gomega.BeTrue(), fmt.Sprintf("gcsfuse profile does not exist yet for version %s", expectedVersion))
// Check Sidecar Profile
sidecarOk, err := checkIfProfileExistForServiceAndVersion(ctx, sidecarServiceName, expectedVersion)

medium

In checkIfProfileExistForServiceAndVersion, a new Cloud Profiler service client is created on every single invocation via cloudprofiler.NewService(ctx, ...). Since this helper is called inside gomega.Eventually (which polls every 10 seconds for up to 10 minutes), and now runs concurrently in two goroutines, this can result in up to 120 client allocations per test run. This is highly inefficient, leaks resources (connections/goroutines), and can lead to rate-limiting or transient authentication failures.

Consider refactoring checkIfProfileExistForServiceAndVersion to accept the pre-initialized *cloudprofiler.Service client as a parameter, and initialize it once at the start of testCaseCloudProfilerProfileCreation.
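
One way to picture the suggested refactor is sketched below. It is illustrative only: `profileClient`, `fakeProfileClient`, `checkProfileExists`, and `demoClientReuse` are hypothetical names, the fake client stands in for the real `*cloudprofiler.Service`, and the service and version strings are placeholders. The loop bound of 60 reflects the polling described above (10 minutes at one poll every 10 seconds), which across two services gives the 120 calls mentioned.

```go
package main

import (
	"context"
	"fmt"
)

// profileClient is a hypothetical interface standing in for the Cloud
// Profiler service client; it is not the real client API.
type profileClient interface {
	ProfileExists(ctx context.Context, service, version string) (bool, error)
}

// clientsCreated counts constructions so the demo can show client reuse.
var clientsCreated int

// fakeProfileClient is an in-memory stub keyed by "service/version".
type fakeProfileClient struct {
	profiles map[string]bool
}

func newFakeProfileClient() *fakeProfileClient {
	clientsCreated++
	return &fakeProfileClient{profiles: map[string]bool{}}
}

func (c *fakeProfileClient) ProfileExists(_ context.Context, service, version string) (bool, error) {
	return c.profiles[service+"/"+version], nil
}

// checkProfileExists receives an already-initialized client instead of
// constructing a new one on every call.
func checkProfileExists(ctx context.Context, client profileClient, service, version string) (bool, error) {
	return client.ProfileExists(ctx, service, version)
}

// demoClientReuse simulates the worst case (no profile ever appears) and
// returns how many clients were constructed along the way.
func demoClientReuse() int {
	clientsCreated = 0
	ctx := context.Background()
	client := newFakeProfileClient() // built once, before polling starts
	const maxPolls = 60              // 10 minutes at one poll every 10 seconds
	for _, service := range []string{"gcsfuse", "sidecar"} {
		for i := 0; i < maxPolls; i++ {
			if ok, _ := checkProfileExists(ctx, client, service, "v1"); ok {
				break
			}
		}
	}
	return clientsCreated
}

func main() {
	fmt.Println("clients created:", demoClientReuse())
}
```

Passing the client in as a parameter also makes the helper easy to exercise with a stub, as the fake client here does.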

Collaborator Author


Done

…les existence check and fix flakey istio test in zonal bucket
@yaozile123 yaozile123 force-pushed the fix-cloudprofiler-e2e branch from 9790996 to 16a5cbc Compare May 29, 2026 02:05
@yaozile123 yaozile123 marked this pull request as ready for review May 29, 2026 02:43
@yaozile123 yaozile123 requested a review from amacaskill May 29, 2026 03:08
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: amacaskill, yaozile123

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yaozile123 yaozile123 merged commit 8f4431f into GoogleCloudPlatform:main May 29, 2026
12 of 14 checks passed