-
Notifications
You must be signed in to change notification settings - Fork 253
Improve test stability in CI environments #4365
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
2cc2c1b to
4600e1e
Compare
Add deployment readiness checks before sending SIGTERM in signal handling tests to prevent 'deployment not found' errors. Introduce shared timeout and polling interval constants in testenv and benchmarks packages for consistent timeout configuration across all tests. Use shared timeout constants throughout e2e and benchmark tests: - testenv.LongTimeout (10 min) for multi-cluster resource cloning - testenv.PodReadyTimeout (180s) for pod readiness in infra setup - benchmarks.ShortTimeout (5 min) for simple operations - benchmarks.MediumTimeout (10 min) for moderate operations - benchmarks.LongTimeout (15 min) for complex operations Fix signal handling tests with parseTerminatedState() helper function to handle empty JSON output gracefully when pods haven't terminated yet, preventing JSON parsing errors. Reduces redundancy by extracting common logic into a single shared function. Configure polling intervals to balance responsiveness and resource usage in CI environments.
Add TCP health probes to git-server test infrastructure pod: - startupProbe: 160s max startup time (10s + 30*5s) - readinessProbe: standard 5s period checks
The webhook test was failing because it checked for the deployed pod immediately after creating the GitRepo, but with polling disabled (24h), the GitRepo never performed an initial sync. This caused the test to timeout waiting for a deployment that was never created. Added a wait for the GitRepo status to show a commit, ensuring the initial sync completes before checking for the deployed resources.
The webhook test was missing initialization of gitServerPort and gitProtocol variables, causing an 'invalid auth method' error when trying to clone the repository. This fix adds the missing initialization to match the pattern used in other tests.
4600e1e to
851607d
Compare
because on the Github ones test was failing all the time, while it worked locally.
The webhook test was creating the GitRepo resource after pushing content, which caused a race condition. The git push would trigger the post-receive webhook, but the GitRepo resource didn't exist yet, so the webhook was ignored. With polling disabled (24h), the GitRepo never synced. Fixed by creating the GitRepo resource first, then pushing content. This ensures the webhook can find and update the GitRepo when triggered.
851607d to
0c2ac31
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR improves test stability in CI environments by fixing race conditions, improving timeout handling, and enhancing test infrastructure reliability.
Key Changes
- Fixed webhook test race condition by reordering GitRepo resource creation before content push
- Added health probes to git-server nginx deployment and increased pod ready timeout to 180s
- Standardized timeout configuration across test suites with new shared constants
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| e2e/testenv/env.go | Added new timeout and polling interval constants (LongTimeout, VeryLongTimeout, PollingInterval, LongPollingInterval, PodReadyTimeout) for consistent test configuration |
| e2e/testenv/infra/cmd/setup.go | Updated waitForPodReady to use testenv.PodReadyTimeout instead of hardcoded 30s timeout |
| e2e/single-cluster/signals_test.go | Extracted parseTerminatedState() and waitForPodReadyAndKill() helper functions to reduce code duplication and improve signal handling test reliability |
| e2e/single-cluster/gitrepo_test.go | Fixed webhook test race condition by initializing gitServerPort/gitProtocol, creating GitRepo before pushing content, and adding wait for initial sync |
| e2e/multi-cluster/downstream_clone_objects_test.go | Added LongTimeout and LongPollingInterval to Eventually assertions for multi-cluster operations |
| e2e/assets/gitrepo/nginx_deployment.yaml | Added startup and readiness probes to nginx container and marked secret as non-optional |
| benchmarks/suite.go | Defined benchmark-specific timeout constants (ShortTimeout, MediumTimeout, LongTimeout) and polling intervals |
| benchmarks/targeting.go | Updated Eventually calls to use new timeout constants |
| benchmarks/gitrepo_bundle.go | Updated Eventually calls to use new timeout constants |
| benchmarks/deploy.go | Updated Eventually calls to use new timeout constants |
| .github/workflows/e2e-nightly-ci.yml | Changed runner from ubuntu-latest to custom runner specification for better resource allocation |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Timeout = 5 * time.Minute | ||
| ShortTimeout = "5s" | ||
| MediumTimeout = "120s" | ||
|
|
||
| // LongTimeout is an extended timeout for slower CI environments | ||
| LongTimeout = 10 * time.Minute | ||
| // VeryLongTimeout is a very long timeout for slower CI environments | ||
| VeryLongTimeout = 15 * time.Minute | ||
|
|
||
| // PollingInterval is the polling interval for Eventually assertions | ||
| PollingInterval = 2 * time.Second | ||
| // LongPollingInterval is a longer polling interval for Eventually assertions | ||
| LongPollingInterval = 5 * time.Second | ||
|
|
||
| // PodReadyTimeout is the timeout for waiting for pods to become ready in infrastructure setup | ||
| // Set to 180s to accommodate startup probe (160s max: 30 failures × 5s period + 10s initial delay) | ||
| PodReadyTimeout = "180s" |
Copilot
AI
Nov 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent timeout constant types: ShortTimeout and MediumTimeout are defined as strings ("5s", "120s"), while the newly added LongTimeout, VeryLongTimeout, PollingInterval, and LongPollingInterval are defined as time.Duration. This creates inconsistency in the codebase. Consider converting the existing string constants to time.Duration for consistency, or use string type for all timeout constants.
| LongPollingInterval = 5 * time.Second | ||
|
|
||
| // PodReadyTimeout is the timeout for waiting for pods to become ready in infrastructure setup | ||
| // Set to 180s to accommodate startup probe (160s max: 30 failures × 5s period + 10s initial delay) |
Copilot
AI
Nov 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The calculation in the comment is incorrect. With failureThreshold: 30, periodSeconds: 5, and initialDelaySeconds: 10, the maximum time is: 10s (initial delay) + (30 failures × 5s) = 10s + 150s = 160s. The comment states "30 failures × 5s period + 10s initial delay" which equals 160s, but this should actually be written as "10s initial delay + (30 failures × 5s period)" for clarity. However, the 180s timeout is still appropriate as it provides a 20s buffer.
| // Set to 180s to accommodate startup probe (160s max: 30 failures × 5s period + 10s initial delay) | |
| // Set to 180s to accommodate startup probe (160s max: 10s initial delay + (30 failures × 5s period)) |
| port: 80 | ||
| initialDelaySeconds: 10 | ||
| periodSeconds: 5 | ||
| timeoutSeconds: 3 | ||
| failureThreshold: 30 | ||
| readinessProbe: | ||
| tcpSocket: | ||
| port: 80 |
Copilot
AI
Nov 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The startupProbe is configured to check port 80, but the container only exposes ports 4343 and 8080 (lines 32-33). This probe will likely always fail. Consider changing the port to 8080 or 4343 to match the exposed container ports.
| port: 80 | |
| initialDelaySeconds: 10 | |
| periodSeconds: 5 | |
| timeoutSeconds: 3 | |
| failureThreshold: 30 | |
| readinessProbe: | |
| tcpSocket: | |
| port: 80 | |
| port: 8080 | |
| initialDelaySeconds: 10 | |
| periodSeconds: 5 | |
| timeoutSeconds: 3 | |
| failureThreshold: 30 | |
| readinessProbe: | |
| tcpSocket: | |
| port: 8080 |
| port: 80 | ||
| initialDelaySeconds: 10 | ||
| periodSeconds: 5 | ||
| timeoutSeconds: 3 | ||
| failureThreshold: 30 | ||
| readinessProbe: | ||
| tcpSocket: | ||
| port: 80 |
Copilot
AI
Nov 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The readinessProbe is configured to check port 80, but the container only exposes ports 4343 and 8080 (lines 32-33). This probe will likely always fail. Consider changing the port to 8080 or 4343 to match the exposed container ports.
| port: 80 | |
| initialDelaySeconds: 10 | |
| periodSeconds: 5 | |
| timeoutSeconds: 3 | |
| failureThreshold: 30 | |
| readinessProbe: | |
| tcpSocket: | |
| port: 80 | |
| port: 8080 | |
| initialDelaySeconds: 10 | |
| periodSeconds: 5 | |
| timeoutSeconds: 3 | |
| failureThreshold: 30 | |
| readinessProbe: | |
| tcpSocket: | |
| port: 8080 |
This PR addresses several intermittent test failures in CI by fixing race conditions, improving timeout handling, and adding proper health checks to test infrastructure.
Key Changes
Fixed Webhook Test Race Condition
The webhook test had a race condition where content was pushed before the GitRepo resource was created. When the git push triggered the webhook, the GitRepo didn't exist yet, causing the webhook to be silently ignored. With polling disabled (24h interval), the GitRepo never synced and tests timed out.
Solution: Reordered test setup to create the GitRepo resource first, then push content. This ensures the webhook handler finds the GitRepo when triggered.
Improved Test Infrastructure Reliability
PodReadyTimeoutto 180s to accommodate the startup probe timinggitServerPortandgitProtocol)Standardized Timeout Configuration
Introduced shared timeout constants across test suites for consistency:
testenv.LongTimeout(10 min) for multi-cluster operationstestenv.PodReadyTimeout(180s) for pod readinessbenchmarks.ShortTimeout(5 min) for simple operationsbenchmarks.MediumTimeout(10 min) for moderate operationsbenchmarks.LongTimeout(15 min) for complex operationsSignal Handling Test Improvements
parseTerminatedState()helper function to gracefully handle empty JSON outputTesting
Refers to #4246