-
Notifications
You must be signed in to change notification settings - Fork 2.4k
Fix flaky tests in SplitIndexIT #20265
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…c on both source and target.After restoring rebalancing in finally, wait for the cluster to settle and for those tasks to drain only if the test body succeeded, so it won't hide real failures. Signed-off-by: Joe Liu <[email protected]>
📝 WalkthroughWalkthroughA test file for split index operations is modified to add retention lease and global checkpoint sync interval settings with longer intervals (1 hour). A new helper method verifies no in-flight recovery or retention lease sync tasks exist. Test logic is enhanced with success flags to conditionally verify these settings are properly applied during split and resize operations. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes
Suggested labels
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
server/src/internalClusterTest/java/org/opensearch/action/admin/indices/create/SplitIndexIT.java (1)
574-575: Minor: Inconsistent use of fully qualified class names.These lines use the fully qualified
org.opensearch.index.IndexServicewhiletestCreateSplitIndexuses justIndexService(which is already imported at line 64). Consider using the short form for consistency.- .put(org.opensearch.index.IndexService.RETENTION_LEASE_SYNC_INTERVAL_SETTING.getKey(), longSyncInterval) - .put(org.opensearch.index.IndexService.GLOBAL_CHECKPOINT_SYNC_INTERVAL_SETTING.getKey(), longSyncInterval) + .put(IndexService.RETENTION_LEASE_SYNC_INTERVAL_SETTING.getKey(), longSyncInterval) + .put(IndexService.GLOBAL_CHECKPOINT_SYNC_INTERVAL_SETTING.getKey(), longSyncInterval)The same applies to lines 594-595 and 629-630.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/internalClusterTest/java/org/opensearch/action/admin/indices/create/SplitIndexIT.java(6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: gradle-check
- GitHub Check: Analyze (java)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: precommit (25, macos-15)
- GitHub Check: detect-breaking-change
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: precommit (21, ubuntu-latest)
🔇 Additional comments (5)
server/src/internalClusterTest/java/org/opensearch/action/admin/indices/create/SplitIndexIT.java (5)
397-411: Well-implemented helper for detecting in-flight tasks.The helper correctly uses
assertBusy()to poll for task completion and filters for the specific task types mentioned in the issue (retention_lease_sync and recovery tasks). The error message helpfully includes the detected tasks for debugging.
415-443: Long sync intervals correctly applied to source index.The 1-hour sync interval for retention lease and global checkpoint sync prevents background tasks from triggering during test execution, directly addressing the flaky test root cause.
455-472: Success flag pattern and target settings correctly applied.The success flag ensures verification only runs on successful test completion, and the sync intervals are consistently applied to the target index during resize.
542-551: Cleanup logic correctly implemented.The success flag gates verification to avoid false failures. Checking index existence before
ensureGreenprevents errors if the test failed before index creation. The verification runs before re-enabling rebalancing, which is correct since re-enabling could trigger new tasks.
559-666: Test restructuring correctly implements the fix pattern.The test now:
- Disables rebalancing at the start
- Applies long sync intervals consistently to source and target indices
- Uses the success flag pattern for conditional verification
- Properly orders cleanup: verification before re-enabling rebalancing
The restructuring preserves the original test logic while adding the flaky test fix.
|
❌ Gradle check result for 836615c: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Description
Fix 2 flaky tests
org.opensearch.action.admin.indices.create.SplitIndexIT.testCreateSplitIndexandorg.opensearch.action.admin.indices.create.SplitIndexIT.testCreateSplitWithIndexSortIn the tests, the teardown failure is consistent with background retention-lease / recovery transport tasks still running when the test returns, we can see retention_lease_sync and internal:index/shard/recovery/start_recovery in the “pending tasks” dump. In this test, the triggers are: the test enables/disables routing rebalancing, and the indices keep a short sync interval (often inherited from indexSettings()), so tasks keep getting scheduled, and the test exits immediately after re-enabling rebalancing without waiting for the cluster to quiesce.
Related Issues
Resolves #19341
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.