Conversation

@thardeck thardeck commented Nov 25, 2025

This PR addresses several intermittent test failures in CI by fixing race conditions, improving timeout handling, and adding proper health checks to test infrastructure.

Key Changes

Fixed Webhook Test Race Condition

The webhook test had a race condition where content was pushed before the GitRepo resource was created. When the git push triggered the webhook, the GitRepo didn't exist yet, causing the webhook to be silently ignored. With polling disabled (24h interval), the GitRepo never synced and tests timed out.

Solution: Reordered test setup to create the GitRepo resource first, then push content. This ensures the webhook handler finds the GitRepo when triggered.

Improved Test Infrastructure Reliability

  • Added startup and readiness probes to git-server nginx deployment (160s max startup time)
  • Increased PodReadyTimeout to 180s to accommodate the startup probe timing
  • Added wait for initial GitRepo sync before checking webhook deployments
  • Fixed webhook test git URL initialization (missing gitServerPort and gitProtocol)

Standardized Timeout Configuration

Introduced shared timeout constants across test suites for consistency:

  • testenv.LongTimeout (10 min) for multi-cluster operations
  • testenv.PodReadyTimeout (180s) for pod readiness
  • benchmarks.ShortTimeout (5 min) for simple operations
  • benchmarks.MediumTimeout (10 min) for moderate operations
  • benchmarks.LongTimeout (15 min) for complex operations

Signal Handling Test Improvements

  • Added deployment readiness checks before sending SIGTERM
  • Extracted parseTerminatedState() helper function to gracefully handle empty JSON output
  • Reduces redundancy by consolidating common logic

Testing

  • Webhook test now passes consistently (verified with multiple consecutive runs)
  • I had to use internal runners for nightly builds, because on the GitHub runners the git-server never came up, while it worked locally and on the internal runners:
Timeout duration: 10m0s
  kubectl -n default apply --wait -f /home/runner/work/fleet/fleet/e2e/assets/gitrepo/nginx_deployment.yaml
  kubectl -n default apply --wait -f /home/runner/work/fleet/fleet/e2e/assets/gitrepo/nginx_service.yaml
  kubectl -n default wait --for=condition=Ready pod --timeout=120s -l app=git-server
  result:error: timed out waiting for the condition on pods/git-server-565987d65d-m8pdj
   err:exit status 1
waitForPodReady (appName: git-server): error: timed out waiting for the condition on pods/git-server-565987d65d-m8pdj
, error: exit status 1error: exit status 1
  kubectl -n default wait --for=condition=Ready pod --timeout=120s -l app=git-server
  result:error: timed out waiting for the condition on pods/git-server-565987d65d-m8pdj
   err:exit status 1
waitForPodReady (appName: git-server): error: timed out waiting for the condition on pods/git-server-565987d65d-m8pdj
, error: exit status 1error: exit status 1
  kubectl -n default wait --for=condition=Ready pod --timeout=120s -l app=git-server
  result:error: timed out waiting for the condition on pods/git-server-565987d65d-m8pdj
   err:exit status 1
   [...]

Refers to #4246

@thardeck thardeck self-assigned this Nov 25, 2025
@thardeck thardeck force-pushed the improve_stability_of_tests branch 18 times, most recently from 2cc2c1b to 4600e1e Compare November 27, 2025 14:43
Add deployment readiness checks before sending SIGTERM in signal
handling tests to prevent 'deployment not found' errors.

Introduce shared timeout and polling interval constants in testenv
and benchmarks packages for consistent timeout configuration across
all tests.

Use shared timeout constants throughout e2e and benchmark tests:
- testenv.LongTimeout (10 min) for multi-cluster resource cloning
- testenv.PodReadyTimeout (180s) for pod readiness in infra setup
- benchmarks.ShortTimeout (5 min) for simple operations
- benchmarks.MediumTimeout (10 min) for moderate operations
- benchmarks.LongTimeout (15 min) for complex operations

Fix signal handling tests with parseTerminatedState() helper function
to handle empty JSON output gracefully when pods haven't terminated
yet, preventing JSON parsing errors. Reduces redundancy by extracting
common logic into a single shared function.

Configure polling intervals to balance responsiveness and resource
usage in CI environments.
Add TCP health probes to git-server test infrastructure pod:
- startupProbe: 160s max startup time (10s + 30*5s)
- readinessProbe: standard 5s period checks

The webhook test was failing because it checked for the deployed pod
immediately after creating the GitRepo, but with polling disabled (24h),
the GitRepo never performed an initial sync. This caused the test to
timeout waiting for a deployment that was never created.

Added a wait for the GitRepo status to show a commit, ensuring the
initial sync completes before checking for the deployed resources.

The webhook test was missing initialization of gitServerPort and
gitProtocol variables, causing an 'invalid auth method' error when
trying to clone the repository. This fix adds the missing initialization
to match the pattern used in other tests.
@thardeck thardeck force-pushed the improve_stability_of_tests branch from 4600e1e to 851607d Compare November 27, 2025 15:35
because on the GitHub runners the test was failing all the time, while
it worked locally.

The webhook test was creating the GitRepo resource after pushing content,
which caused a race condition. The git push would trigger the post-receive
webhook, but the GitRepo resource didn't exist yet, so the webhook was
ignored. With polling disabled (24h), the GitRepo never synced.

Fixed by creating the GitRepo resource first, then pushing content. This
ensures the webhook can find and update the GitRepo when triggered.
@thardeck thardeck force-pushed the improve_stability_of_tests branch from 851607d to 0c2ac31 Compare November 27, 2025 15:37
@thardeck thardeck marked this pull request as ready for review November 27, 2025 16:27
@thardeck thardeck requested a review from a team as a code owner November 27, 2025 16:27
Copilot AI review requested due to automatic review settings November 27, 2025 16:27
@thardeck thardeck added this to Fleet Nov 27, 2025
@thardeck thardeck moved this to 👀 In review in Fleet Nov 27, 2025
@thardeck thardeck changed the title from "test: Improve test stability in CI environments" to "Improve test stability in CI environments" Nov 27, 2025
Copilot finished reviewing on behalf of thardeck November 27, 2025 16:31
Copilot AI left a comment

Pull request overview

This PR improves test stability in CI environments by fixing race conditions, improving timeout handling, and enhancing test infrastructure reliability.

Key Changes

  • Fixed webhook test race condition by reordering GitRepo resource creation before content push
  • Added health probes to git-server nginx deployment and increased pod ready timeout to 180s
  • Standardized timeout configuration across test suites with new shared constants

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Summary per file:

  • e2e/testenv/env.go: Added new timeout and polling interval constants (LongTimeout, VeryLongTimeout, PollingInterval, LongPollingInterval, PodReadyTimeout) for consistent test configuration
  • e2e/testenv/infra/cmd/setup.go: Updated waitForPodReady to use testenv.PodReadyTimeout instead of a hardcoded 30s timeout
  • e2e/single-cluster/signals_test.go: Extracted parseTerminatedState() and waitForPodReadyAndKill() helper functions to reduce code duplication and improve signal handling test reliability
  • e2e/single-cluster/gitrepo_test.go: Fixed the webhook test race condition by initializing gitServerPort/gitProtocol, creating the GitRepo before pushing content, and adding a wait for the initial sync
  • e2e/multi-cluster/downstream_clone_objects_test.go: Added LongTimeout and LongPollingInterval to Eventually assertions for multi-cluster operations
  • e2e/assets/gitrepo/nginx_deployment.yaml: Added startup and readiness probes to the nginx container and marked the secret as non-optional
  • benchmarks/suite.go: Defined benchmark-specific timeout constants (ShortTimeout, MediumTimeout, LongTimeout) and polling intervals
  • benchmarks/targeting.go: Updated Eventually calls to use the new timeout constants
  • benchmarks/gitrepo_bundle.go: Updated Eventually calls to use the new timeout constants
  • benchmarks/deploy.go: Updated Eventually calls to use the new timeout constants
  • .github/workflows/e2e-nightly-ci.yml: Changed the runner from ubuntu-latest to a custom runner specification for better resource allocation

Comment on lines 17 to +33
Timeout = 5 * time.Minute
ShortTimeout = "5s"
MediumTimeout = "120s"

// LongTimeout is an extended timeout for slower CI environments
LongTimeout = 10 * time.Minute
// VeryLongTimeout is a very long timeout for slower CI environments
VeryLongTimeout = 15 * time.Minute

// PollingInterval is the polling interval for Eventually assertions
PollingInterval = 2 * time.Second
// LongPollingInterval is a longer polling interval for Eventually assertions
LongPollingInterval = 5 * time.Second

// PodReadyTimeout is the timeout for waiting for pods to become ready in infrastructure setup
// Set to 180s to accommodate startup probe (160s max: 30 failures × 5s period + 10s initial delay)
PodReadyTimeout = "180s"
Copilot AI Nov 27, 2025

Inconsistent timeout constant types: ShortTimeout and MediumTimeout are defined as strings ("5s", "120s"), while the newly added LongTimeout, VeryLongTimeout, PollingInterval, and LongPollingInterval are defined as time.Duration. This creates inconsistency in the codebase. Consider converting the existing string constants to time.Duration for consistency, or use string type for all timeout constants.

LongPollingInterval = 5 * time.Second

// PodReadyTimeout is the timeout for waiting for pods to become ready in infrastructure setup
// Set to 180s to accommodate startup probe (160s max: 30 failures × 5s period + 10s initial delay)
Copilot AI Nov 27, 2025

[nitpick] The calculation in the comment is incorrect. With failureThreshold: 30, periodSeconds: 5, and initialDelaySeconds: 10, the maximum time is: 10s (initial delay) + (30 failures × 5s) = 10s + 150s = 160s. The comment states "30 failures × 5s period + 10s initial delay" which equals 160s, but this should actually be written as "10s initial delay + (30 failures × 5s period)" for clarity. However, the 180s timeout is still appropriate as it provides a 20s buffer.

Suggested change
// Set to 180s to accommodate startup probe (160s max: 30 failures × 5s period + 10s initial delay)
// Set to 180s to accommodate startup probe (160s max: 10s initial delay + (30 failures × 5s period))

Comment on lines +40 to +47
port: 80
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 30
readinessProbe:
tcpSocket:
port: 80
Copilot AI Nov 27, 2025

The startupProbe is configured to check port 80, but the container only exposes ports 4343 and 8080 (lines 32-33). This probe will likely always fail. Consider changing the port to 8080 or 4343 to match the exposed container ports.

Suggested change
port: 80
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 30
readinessProbe:
tcpSocket:
port: 80
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 30
readinessProbe:
tcpSocket:
port: 8080

Comment on lines +40 to +47
port: 80
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 30
readinessProbe:
tcpSocket:
port: 80
Copilot AI Nov 27, 2025

The readinessProbe is configured to check port 80, but the container only exposes ports 4343 and 8080 (lines 32-33). This probe will likely always fail. Consider changing the port to 8080 or 4343 to match the exposed container ports.

Suggested change
port: 80
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 30
readinessProbe:
tcpSocket:
port: 80
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 30
readinessProbe:
tcpSocket:
port: 8080
