Conversation

@thardeck thardeck commented Nov 25, 2025

This PR addresses several intermittent test failures in CI by fixing race conditions, improving timeout handling, and adding proper health checks to test infrastructure.

Key Changes

Fixed Webhook Test Race Condition

The webhook test had a race condition where content was pushed before the GitRepo resource was created. When the git push triggered the webhook, the GitRepo didn't exist yet, causing the webhook to be silently ignored. With polling disabled (24h interval), the GitRepo never synced and tests timed out.

Solution: Reordered test setup to create the GitRepo resource first, then push content. This ensures the webhook handler finds the GitRepo when triggered.

Improved Test Infrastructure Reliability

  • Added startup and readiness probes to git-server nginx deployment (160s max startup time)
  • Increased PodReadyTimeout to 180s to accommodate the startup probe timing
  • Added wait for initial GitRepo sync before checking webhook deployments
  • Fixed webhook test git URL initialization (missing gitServerPort and gitProtocol)

Standardized Timeout Configuration

Introduced shared timeout constants across test suites for consistency:

  • testenv.LongTimeout (10 min) for multi-cluster operations
  • testenv.PodReadyTimeout (180s) for pod readiness
  • benchmarks.ShortTimeout (5 min) for simple operations
  • benchmarks.MediumTimeout (10 min) for moderate operations
  • benchmarks.LongTimeout (15 min) for complex operations

Signal Handling Test Improvements

  • Added deployment readiness checks before sending SIGTERM
  • Extracted parseTerminatedState() helper function to gracefully handle empty JSON output
  • Reduces redundancy by consolidating common logic

Testing

  • Webhook test now passes consistently (verified with multiple consecutive runs)
  • I had to use internal runners for nightly builds, because on the GitHub runners the git-server never came up, while it worked locally and on the internal runners:
Timeout duration: 10m0s
  kubectl -n default apply --wait -f /home/runner/work/fleet/fleet/e2e/assets/gitrepo/nginx_deployment.yaml
  kubectl -n default apply --wait -f /home/runner/work/fleet/fleet/e2e/assets/gitrepo/nginx_service.yaml
  kubectl -n default wait --for=condition=Ready pod --timeout=120s -l app=git-server
  result:error: timed out waiting for the condition on pods/git-server-565987d65d-m8pdj
   err:exit status 1
waitForPodReady (appName: git-server): error: timed out waiting for the condition on pods/git-server-565987d65d-m8pdj
, error: exit status 1error: exit status 1
  kubectl -n default wait --for=condition=Ready pod --timeout=120s -l app=git-server
  result:error: timed out waiting for the condition on pods/git-server-565987d65d-m8pdj
   err:exit status 1
waitForPodReady (appName: git-server): error: timed out waiting for the condition on pods/git-server-565987d65d-m8pdj
, error: exit status 1error: exit status 1
  kubectl -n default wait --for=condition=Ready pod --timeout=120s -l app=git-server
  result:error: timed out waiting for the condition on pods/git-server-565987d65d-m8pdj
   err:exit status 1
   [...]

Refers to #4246

@thardeck thardeck self-assigned this Nov 25, 2025
@thardeck thardeck force-pushed the improve_stability_of_tests branch 18 times, most recently from 2cc2c1b to 4600e1e Compare November 27, 2025 14:43
Add deployment readiness checks before sending SIGTERM in signal
handling tests to prevent 'deployment not found' errors.

Introduce shared timeout and polling interval constants in testenv
and benchmarks packages for consistent timeout configuration across
all tests.

Use shared timeout constants throughout e2e and benchmark tests:
- testenv.LongTimeout (10 min) for multi-cluster resource cloning
- testenv.PodReadyTimeout (180s) for pod readiness in infra setup
- benchmarks.ShortTimeout (5 min) for simple operations
- benchmarks.MediumTimeout (10 min) for moderate operations
- benchmarks.LongTimeout (15 min) for complex operations

Fix signal handling tests with parseTerminatedState() helper function
to handle empty JSON output gracefully when pods haven't terminated
yet, preventing JSON parsing errors. Reduces redundancy by extracting
common logic into a single shared function.

Configure polling intervals to balance responsiveness and resource
usage in CI environments.
Add TCP health probes to git-server test infrastructure pod:
- startupProbe: 160s max startup time (10s + 30*5s)
- readinessProbe: standard 5s period checks

The webhook test was failing because it checked for the deployed pod
immediately after creating the GitRepo, but with polling disabled (24h),
the GitRepo never performed an initial sync. This caused the test to
timeout waiting for a deployment that was never created.

Added a wait for the GitRepo status to show a commit, ensuring the
initial sync completes before checking for the deployed resources.

The webhook test was missing initialization of gitServerPort and
gitProtocol variables, causing an 'invalid auth method' error when
trying to clone the repository. This fix adds the missing initialization
to match the pattern used in other tests.
@thardeck thardeck force-pushed the improve_stability_of_tests branch from 4600e1e to 851607d Compare November 27, 2025 15:35
because on the GitHub runners the test was failing all the time, while
it worked locally.

The webhook test was creating the GitRepo resource after pushing content,
which caused a race condition. The git push would trigger the post-receive
webhook, but the GitRepo resource didn't exist yet, so the webhook was
ignored. With polling disabled (24h), the GitRepo never synced.

Fixed by creating the GitRepo resource first, then pushing content. This
ensures the webhook can find and update the GitRepo when triggered.
@thardeck thardeck force-pushed the improve_stability_of_tests branch from 851607d to 0c2ac31 Compare November 27, 2025 15:37
@thardeck thardeck marked this pull request as ready for review November 27, 2025 16:27
@thardeck thardeck requested a review from a team as a code owner November 27, 2025 16:27
Copilot AI review requested due to automatic review settings November 27, 2025 16:27
@thardeck thardeck added this to Fleet Nov 27, 2025
@thardeck thardeck moved this to 👀 In review in Fleet Nov 27, 2025
@thardeck thardeck changed the title from "test: Improve test stability in CI environments" to "Improve test stability in CI environments" Nov 27, 2025
Copilot finished reviewing on behalf of thardeck November 27, 2025 16:31
Copilot AI left a comment

Pull request overview

This PR improves test stability in CI environments by fixing race conditions, improving timeout handling, and enhancing test infrastructure reliability.

Key Changes

  • Fixed webhook test race condition by reordering GitRepo resource creation before content push
  • Added health probes to git-server nginx deployment and increased pod ready timeout to 180s
  • Standardized timeout configuration across test suites with new shared constants

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Summary per file:

  • e2e/testenv/env.go: Added new timeout and polling interval constants (LongTimeout, VeryLongTimeout, PollingInterval, LongPollingInterval, PodReadyTimeout) for consistent test configuration
  • e2e/testenv/infra/cmd/setup.go: Updated waitForPodReady to use testenv.PodReadyTimeout instead of a hardcoded 30s timeout
  • e2e/single-cluster/signals_test.go: Extracted parseTerminatedState() and waitForPodReadyAndKill() helper functions to reduce code duplication and improve signal handling test reliability
  • e2e/single-cluster/gitrepo_test.go: Fixed the webhook test race condition by initializing gitServerPort/gitProtocol, creating the GitRepo before pushing content, and adding a wait for the initial sync
  • e2e/multi-cluster/downstream_clone_objects_test.go: Added LongTimeout and LongPollingInterval to Eventually assertions for multi-cluster operations
  • e2e/assets/gitrepo/nginx_deployment.yaml: Added startup and readiness probes to the nginx container and marked the secret as non-optional
  • benchmarks/suite.go: Defined benchmark-specific timeout constants (ShortTimeout, MediumTimeout, LongTimeout) and polling intervals
  • benchmarks/targeting.go: Updated Eventually calls to use the new timeout constants
  • benchmarks/gitrepo_bundle.go: Updated Eventually calls to use the new timeout constants
  • benchmarks/deploy.go: Updated Eventually calls to use the new timeout constants
  • .github/workflows/e2e-nightly-ci.yml: Changed the runner from ubuntu-latest to a custom runner specification for better resource allocation

Comment on lines 17 to +33
Timeout = 5 * time.Minute
ShortTimeout = "5s"
MediumTimeout = "120s"

// LongTimeout is an extended timeout for slower CI environments
LongTimeout = 10 * time.Minute
// VeryLongTimeout is a very long timeout for slower CI environments
VeryLongTimeout = 15 * time.Minute

// PollingInterval is the polling interval for Eventually assertions
PollingInterval = 2 * time.Second
// LongPollingInterval is a longer polling interval for Eventually assertions
LongPollingInterval = 5 * time.Second

// PodReadyTimeout is the timeout for waiting for pods to become ready in infrastructure setup
// Set to 180s to accommodate startup probe (160s max: 30 failures × 5s period + 10s initial delay)
PodReadyTimeout = "180s"
Copilot AI Nov 27, 2025

Inconsistent timeout constant types: ShortTimeout and MediumTimeout are defined as strings ("5s", "120s"), while the newly added LongTimeout, VeryLongTimeout, PollingInterval, and LongPollingInterval are defined as time.Duration. This creates inconsistency in the codebase. Consider converting the existing string constants to time.Duration for consistency, or use string type for all timeout constants.

LongPollingInterval = 5 * time.Second

// PodReadyTimeout is the timeout for waiting for pods to become ready in infrastructure setup
// Set to 180s to accommodate startup probe (160s max: 30 failures × 5s period + 10s initial delay)
Copilot AI Nov 27, 2025

[nitpick] The calculation in the comment is incorrect. With failureThreshold: 30, periodSeconds: 5, and initialDelaySeconds: 10, the maximum time is: 10s (initial delay) + (30 failures × 5s) = 10s + 150s = 160s. The comment states "30 failures × 5s period + 10s initial delay" which equals 160s, but this should actually be written as "10s initial delay + (30 failures × 5s period)" for clarity. However, the 180s timeout is still appropriate as it provides a 20s buffer.

Suggested change
// Set to 180s to accommodate startup probe (160s max: 30 failures × 5s period + 10s initial delay)
// Set to 180s to accommodate startup probe (160s max: 10s initial delay + (30 failures × 5s period))

Comment on lines +40 to +47
port: 80
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 30
readinessProbe:
tcpSocket:
port: 80
Copilot AI Nov 27, 2025

The startupProbe is configured to check port 80, but the container only exposes ports 4343 and 8080 (lines 32-33). This probe will likely always fail. Consider changing the port to 8080 or 4343 to match the exposed container ports.

Suggested change
port: 80
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 30
readinessProbe:
tcpSocket:
port: 80
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 30
readinessProbe:
tcpSocket:
port: 8080

Comment on lines +40 to +47
port: 80
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 30
readinessProbe:
tcpSocket:
port: 80
Copilot AI Nov 27, 2025

The readinessProbe is configured to check port 80, but the container only exposes ports 4343 and 8080 (lines 32-33). This probe will likely always fail. Consider changing the port to 8080 or 4343 to match the exposed container ports.

Suggested change
port: 80
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 30
readinessProbe:
tcpSocket:
port: 80
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 30
readinessProbe:
tcpSocket:
port: 8080
