@liuguoqingfz (Contributor) commented Dec 16, 2025

Description

Fixes two flaky tests: org.opensearch.action.admin.indices.create.SplitIndexIT.testCreateSplitIndex and org.opensearch.action.admin.indices.create.SplitIndexIT.testCreateSplitWithIndexSort.
The teardown failure in these tests is consistent with background retention-lease and recovery transport tasks still running when the test returns: retention_lease_sync and internal:index/shard/recovery/start_recovery both appear in the "pending tasks" dump. Three things in the test trigger this. First, the test disables and then re-enables routing rebalancing. Second, the indices keep a short sync interval (often inherited from indexSettings()), so sync tasks keep getting scheduled. Third, the test exits immediately after re-enabling rebalancing, without waiting for the cluster to quiesce.

Related Issues

Resolves #19341

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…c on both source and target. After restoring rebalancing in finally, wait for the cluster to settle and for those tasks to drain, but only if the test body succeeded, so the wait does not hide real failures.

Signed-off-by: Joe Liu <[email protected]>
@liuguoqingfz liuguoqingfz requested a review from a team as a code owner December 16, 2025 20:36
@github-actions bot added the labels >test-failure (Test failure from CI, local build, etc.), autocut, and flaky-test (Random test failure that succeeds on second run) Dec 16, 2025
@coderabbitai bot (Contributor) commented Dec 16, 2025

📝 Walkthrough

A test file for split index operations is modified to add retention lease and global checkpoint sync interval settings with longer intervals (1 hour). A new helper method verifies no in-flight recovery or retention lease sync tasks exist. Test logic is enhanced with success flags to conditionally verify these settings are properly applied during split and resize operations.

Changes

Cohort / File(s) | Change Summary
Split Index Test Stabilization (server/src/internalClusterTest/java/org/opensearch/action/admin/indices/create/SplitIndexIT.java) | Adds helper method assertNoInFlightRecoveryOrRetentionLeaseSync to verify no in-flight retention lease sync/recovery tasks. Introduces longSyncInterval setting ("1h") applied to IndexService.RETENTION_LEASE_SYNC_INTERVAL_SETTING and IndexService.GLOBAL_CHECKPOINT_SYNC_INTERVAL_SETTING across test paths. Implements success flag guards in cleanup logic. Extends testCreateSplitIndex and testCreateSplitWithIndexSort to validate sync interval settings are correctly applied to target indices after split and to verify segment ordering.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Single test file with localized changes to multiple test methods
  • Helper method addition is straightforward assertion logic
  • Settings propagation follows consistent pattern across test paths
  • Focus review on verifying sync interval values are correct and helper method covers necessary verification scenarios

Suggested labels

flaky-test, Indexing, bug

Suggested reviewers

  • sachinpkale
  • dbwiddis
  • mch2
  • msfroh
  • gbbafna
  • cwperks
  • kotwanikunal

Poem

🐰 A flaky test we now have tamed,
With sync intervals long-delayed and named,
Retention leases rest without a care,
As assertions guard each split repair,
No more tests stumble in the midnight air! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
Check name | Status | Explanation
Linked Issues check | ✅ Passed | The code changes directly address the flaky tests by introducing longer sync intervals and verification steps to ensure cluster quiescence before teardown, resolving the test failures reported in issue #19341.
Out of Scope Changes check | ✅ Passed | All changes are confined to internal test logic in SplitIndexIT.java, consisting of helper methods and sync interval configurations needed to stabilize the flaky tests without modifying public APIs or production code.
Title check | ✅ Passed | The title clearly and directly summarizes the main change: fixing flaky tests in SplitIndexIT. It is concise, specific, and accurately reflects the primary purpose of the changeset.
Description check | ✅ Passed | The description provides comprehensive detail: identifies the specific flaky tests, explains the root cause (retention-lease/recovery tasks), describes the fix approach (longer sync intervals and quiescence verification), and references issue #19341. All required template sections are addressed.
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
server/src/internalClusterTest/java/org/opensearch/action/admin/indices/create/SplitIndexIT.java (1)

574-575: Minor: Inconsistent use of fully qualified class names.

These lines use the fully qualified org.opensearch.index.IndexService while testCreateSplitIndex uses just IndexService (which is already imported at line 64). Consider using the short form for consistency.

-                    .put(org.opensearch.index.IndexService.RETENTION_LEASE_SYNC_INTERVAL_SETTING.getKey(), longSyncInterval)
-                    .put(org.opensearch.index.IndexService.GLOBAL_CHECKPOINT_SYNC_INTERVAL_SETTING.getKey(), longSyncInterval)
+                    .put(IndexService.RETENTION_LEASE_SYNC_INTERVAL_SETTING.getKey(), longSyncInterval)
+                    .put(IndexService.GLOBAL_CHECKPOINT_SYNC_INTERVAL_SETTING.getKey(), longSyncInterval)

The same applies to lines 594-595 and 629-630.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 799fb9b and 836615c.

📒 Files selected for processing (1)
  • server/src/internalClusterTest/java/org/opensearch/action/admin/indices/create/SplitIndexIT.java (6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: gradle-check
  • GitHub Check: Analyze (java)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: detect-breaking-change
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (21, ubuntu-latest)
🔇 Additional comments (5)
server/src/internalClusterTest/java/org/opensearch/action/admin/indices/create/SplitIndexIT.java (5)

397-411: Well-implemented helper for detecting in-flight tasks.

The helper correctly uses assertBusy() to poll for task completion and filters for the specific task types mentioned in the issue (retention_lease_sync and recovery tasks). The error message helpfully includes the detected tasks for debugging.
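
The polling behavior described above can be sketched in plain Java. This is a minimal illustration and not the PR's actual helper: the class name InFlightTaskCheck, the listActions supplier, and the timeout parameter are all hypothetical stand-ins, and the real test relies on the OpenSearch test framework's assertBusy() and task listing rather than this hand-rolled loop.

```java
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;

/** Hypothetical stand-in for the PR's helper; not the actual SplitIndexIT code. */
final class InFlightTaskCheck {

    /** True for the two task families named in the pending-tasks dump. */
    static boolean isRecoveryOrLeaseSync(String action) {
        return action.contains("retention_lease_sync")
            || action.startsWith("internal:index/shard/recovery/");
    }

    /**
     * Polls listActions (an assumed source of running task action names)
     * until no matching task remains, or fails once timeoutMillis elapses.
     */
    static void assertNoInFlight(Supplier<List<String>> listActions, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        long sleepMillis = 1;
        while (true) {
            List<String> offending = listActions.get().stream()
                .filter(InFlightTaskCheck::isRecoveryOrLeaseSync)
                .collect(Collectors.toList());
            if (offending.isEmpty()) {
                return; // nothing in flight, the cluster looks quiescent
            }
            if (System.nanoTime() - deadline >= 0) {
                // Include the offending tasks so a failure is easy to debug.
                throw new AssertionError("in-flight tasks still running: " + offending);
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting for tasks to drain", e);
            }
            sleepMillis = Math.min(sleepMillis * 2, 50); // back off, capped at 50 ms
        }
    }
}
```

Retrying with a short backoff, instead of checking once, matters here because a task that was scheduled just before the check may still need a moment to finish.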


415-443: Long sync intervals correctly applied to source index.

The 1-hour sync interval for retention lease and global checkpoint sync prevents background tasks from triggering during test execution, directly addressing the flaky test root cause.
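
As a rough analogy for why a long interval helps, the sketch below uses a plain JDK scheduler, not OpenSearch's own scheduling code. It counts how often a periodic task fires inside a short window; the class name SyncIntervalDemo and both parameters are hypothetical, and the window stands in for the duration of a test.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** JDK-only analogy; OpenSearch schedules its sync tasks differently. */
final class SyncIntervalDemo {

    /** Counts firings of a periodic task with the given interval during windowMillis. */
    static int firingsWithin(long intervalMillis, long windowMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger firings = new AtomicInteger();
        try {
            // The first run happens one full interval after scheduling.
            scheduler.scheduleWithFixedDelay(
                firings::incrementAndGet, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
            Thread.sleep(windowMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            scheduler.shutdownNow(); // stop the periodic task and its thread
        }
        return firings.get();
    }
}
```

With a 10 ms interval the task fires repeatedly inside a 300 ms window, whereas with a one-hour interval it never fires in that window, which is the effect the "1h" setting aims for during a test run.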


455-472: Success flag pattern and target settings correctly applied.

The success flag ensures verification only runs on successful test completion, and the sync intervals are consistently applied to the target index during resize.


542-551: Cleanup logic correctly implemented.

The success flag gates verification to avoid false failures. Checking index existence before ensureGreen prevents errors if the test failed before index creation. The verification runs before re-enabling rebalancing, which is correct since re-enabling could trigger new tasks.
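
The success-flag pattern and the cleanup ordering described in this comment can be sketched as follows. This is an assumption-laden illustration and not the code from SplitIndexIT: the class SuccessFlagPattern, the steps list, and the step labels are hypothetical, and the real test calls cluster APIs where this sketch only records strings.

```java
import java.util.List;

/** Hypothetical illustration of the success-flag pattern; names are not from SplitIndexIT. */
final class SuccessFlagPattern {

    /** Runs testBody and appends each setup/cleanup step to steps, in order. */
    static void run(List<String> steps, Runnable testBody) {
        boolean success = false;
        steps.add("disable rebalancing");
        try {
            testBody.run();
            success = true; // reached only if the body threw nothing
        } finally {
            if (success) {
                // Verify only after a clean run, so a verification failure
                // cannot mask the original test failure.
                steps.add("verify no in-flight tasks");
            }
            // Always restore the cluster setting, even when the body failed.
            steps.add("re-enable rebalancing");
        }
    }
}
```

When the body throws, the exception still propagates to the caller; the finally block skips verification and only restores rebalancing.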


559-666: Test restructuring correctly implements the fix pattern.

The test now:

  1. Disables rebalancing at the start
  2. Applies long sync intervals consistently to source and target indices
  3. Uses the success flag pattern for conditional verification
  4. Properly orders cleanup: verification before re-enabling rebalancing

The restructuring preserves the original test logic while adding the flaky test fix.

@github-actions (Contributor)

❌ Gradle check result for 836615c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@sandeshkr419 changed the title from "Fix flaky tests for issue 19341" to "Fix flaky tests in SplitIndexIT" Dec 24, 2025

Labels

autocut, flaky-test (Random test failure that succeeds on second run), skip-changelog, >test-failure (Test failure from CI, local build, etc.)

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for SplitIndexIT

2 participants