@liuguoqingfz
Contributor

@liuguoqingfz liuguoqingfz commented Dec 16, 2025

Description

Fixes a flaky test caused by a race: the test stops a data node and immediately calls indexRandom(). Until the master publishes a new cluster state, the primary is reassigned, and the stopped node is fully removed from routing, some operations inside indexRandom() (e.g. bulk, refresh, flush) can still target the old node ID, producing a NoNodeAvailableException and therefore a shard failure. The fix waits for a stable health state (yellow, with no initializing or relocating shards) before calling indexRandom().
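
The race can be illustrated with a toy model. This is a minimal sketch, not OpenSearch code: the class `StaleRoutingSketch`, its methods, and the node names are all hypothetical, and the "publish" step is a stand-in for the master publishing a new cluster state. It only shows why a write issued between the node stop and the next published state can hit a node that no longer exists, and why waiting for a stable state first avoids that.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of routing that lags behind a node stop. Not OpenSearch code. */
class StaleRoutingSketch {
    /** Nodes that are actually running. */
    private final Set<String> liveNodes = new HashSet<>();
    /** Node that the last published routing table names as primary. */
    private String routedPrimary;

    StaleRoutingSketch(String primary, String replica) {
        liveNodes.add(primary);
        liveNodes.add(replica);
        routedPrimary = primary;
    }

    /** Stops a node. The routing table is deliberately left untouched. */
    void stopNode(String node) {
        liveNodes.remove(node);
    }

    /** Stands in for the master publishing a new state that promotes a surviving copy. */
    void publishNewState() {
        if (!liveNodes.contains(routedPrimary) && !liveNodes.isEmpty()) {
            routedPrimary = liveNodes.iterator().next();
        }
    }

    /** True once routing only references live nodes; the analogue of the health wait. */
    boolean isStable() {
        return liveNodes.contains(routedPrimary);
    }

    /** Routes a write to the primary; fails if routing still names a stopped node. */
    String index() {
        if (!liveNodes.contains(routedPrimary)) {
            throw new IllegalStateException("no node available: " + routedPrimary);
        }
        return routedPrimary;
    }

    public static void main(String[] args) {
        StaleRoutingSketch cluster = new StaleRoutingSketch("node-a", "node-b");
        cluster.stopNode("node-a");
        // Calling index() at this point would throw: routing still points at node-a.
        while (!cluster.isStable()) {
            cluster.publishNewState(); // in the real test the master does this asynchronously
        }
        System.out.println("indexed on " + cluster.index());
    }
}
```

In the real test the publish step is asynchronous, so the wait has to be an explicit health request rather than a loop that drives the state change itself.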

Related Issues

Resolves #19784

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

  • Tests
    • Enhanced stability checks in replica shard allocation tests by ensuring the cluster reaches a stable state before proceeding with operations. This includes verifying cluster health status and confirming no active shard movements are in progress.

…stable health state, yellow is expected with 1 replica and only 1 data node remaining, with no initializing/relocating shards, then do the indexRandom

Signed-off-by: Joe Liu <[email protected]>
@liuguoqingfz liuguoqingfz requested a review from a team as a code owner December 16, 2025 14:44
@github-actions github-actions bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run labels Dec 16, 2025
@coderabbitai
Contributor

coderabbitai bot commented Dec 16, 2025

Walkthrough

Added cluster health stabilization check in the testPreferCopyWithHighestMatchingOperations test method. After stopping a higher-matching node, the test now waits for yellow status and ensures no shards are initializing or relocating before proceeding with indexing assertions.

Changes

Cohort / File(s) | Change Summary
Cluster Health Stabilization in Test (server/src/internalClusterTest/java/org/opensearch/gateway/ReplicaShardAllocatorIT.java) | Added import for ClusterHealthResponse; introduced a cluster health wait in testPreferCopyWithHighestMatchingOperations so the cluster stabilizes (yellow status, no initializing or relocating shards) before the test assertions run

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

  • Single file modification in test code
  • Straightforward addition of a standard cluster stability check
  • Logic is simple and follows established test patterns

Suggested labels

flaky-test, bug

Suggested reviewers

  • sachinpkale
  • kotwanikunal
  • cwperks
  • dbwiddis
  • msfroh
  • andrross

Poem

🐰 A flaky test once danced about,
But now it waits without a doubt.
With yellow health and shards at rest,
Stability reigns—this test's the best! 🌟

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title accurately and concisely describes the main objective (fixing a flaky test related to issue 19784) and is fully aligned with the changeset modifications.
Description check | ✅ Passed | The PR description is complete, providing a clear explanation of the root cause (race condition), the problem (NoNodeAvailableException), the solution applied (cluster health wait), and linking to the related issue #19784.
Linked Issues check | ✅ Passed | The changes in the PR directly address the flaky test reported in issue #19784 by adding stabilization logic (cluster health wait) to prevent race conditions in testPreferCopyWithHighestMatchingOperations.
Out of Scope Changes check | ✅ Passed | All changes are scoped to fixing the flaky test; only ReplicaShardAllocatorIT.java is modified with the addition of a cluster health wait after stopping a node, directly addressing the race condition issue.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
server/src/internalClusterTest/java/org/opensearch/gateway/ReplicaShardAllocatorIT.java (1)

374-382: Excellent fix for the race condition.

The cluster health stabilization wait correctly addresses the flaky test issue. By waiting for yellow status and ensuring no shards are initializing or relocating after stopping the node, the test now guarantees that the cluster has fully processed the node departure and reassigned the primary before calling indexRandom(). This prevents operations from targeting the stopped node, eliminating the NoNodeAvailableException.

The timeout assertion with a clear message is also well-placed.

Consider adding an explicit timeout for clarity, though the default timeout via .get() is consistent with other tests in this file:

 ClusterHealthResponse postStopHealth = client().admin()
     .cluster()
     .prepareHealth(indexName)
     .setWaitForYellowStatus()
     .setWaitForNoInitializingShards(true)
     .setWaitForNoRelocatingShards(true)
+    .setTimeout(TimeValue.timeValueSeconds(30))
     .get();
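
The suggestion above bounds the wait with an explicit timeout. As a rough, library-independent sketch of that pattern (the class `BoundedWaitSketch` and the method `waitTimedOut` are hypothetical, and this is not how OpenSearch implements its health wait), a bounded poll can report the timeout as a boolean that the caller then asserts on, similar in spirit to the timeout assertion mentioned in this review:

```java
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper; not part of OpenSearch. */
class BoundedWaitSketch {
    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns true if the wait timed out, false if the condition was met.
     */
    static boolean waitTimedOut(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return true; // gave up: the caller should fail with a clear message
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // The condition becomes true roughly 50 ms after start.
        boolean timedOut = waitTimedOut(() -> System.currentTimeMillis() - start >= 50, 2_000, 10);
        if (timedOut) {
            throw new AssertionError("condition did not hold within 2s");
        }
        System.out.println("condition met");
    }
}
```

Returning a flag instead of throwing keeps the failure message in the test, where it can say which precondition was never reached.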
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between e798353 and 01d5e66.

📒 Files selected for processing (1)
  • server/src/internalClusterTest/java/org/opensearch/gateway/ReplicaShardAllocatorIT.java (2 hunks)
🔇 Additional comments (1)
server/src/internalClusterTest/java/org/opensearch/gateway/ReplicaShardAllocatorIT.java (1)

35-35: LGTM: Necessary import for the fix.

The import is required for the cluster health stabilization logic added below.

@github-actions
Contributor

❌ Gradle check result for 01d5e66: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Labels

autocut flaky-test Random test failure that succeeds on second run skip-changelog >test-failure Test failure from CI, local build, etc.

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for ReplicaShardAllocatorIT

2 participants