Fixed a flaky test for issue 19784 #20258

liuguoqingfz · 2025-12-16T14:44:01Z

Description

Fixed a flaky test caused by a race: the test stops a data node and immediately calls indexRandom(). Until the master publishes a new cluster state, the primary is reassigned, and the stopped node is fully removed from routing, some operations inside indexRandom() (e.g. bulk, refresh, flush) can still target the old node ID, producing a NoNodeAvailableException and therefore a shard failure.
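To make the race concrete, here is a toy model in plain Java with no OpenSearch dependencies. Every name in it (`ToyCluster`, `stopNode`, `publishNewState`, `index`) is hypothetical and only stands in for the real components; it is a sketch of the ordering problem, not of how OpenSearch routes requests.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: hypothetical classes, not OpenSearch APIs.
final class ToyCluster {
    private final Set<String> liveNodes = new HashSet<>();
    // Routing as last published to clients: shard id -> node id holding the primary.
    private final Map<Integer, String> publishedRouting = new HashMap<>();

    void addNode(String nodeId) {
        liveNodes.add(nodeId);
    }

    void assignPrimary(int shard, String nodeId) {
        publishedRouting.put(shard, nodeId);
    }

    // Stopping a node takes it offline at once; the published routing is not touched.
    void stopNode(String nodeId) {
        liveNodes.remove(nodeId);
    }

    // Stands in for the master publishing a new cluster state:
    // primaries that point at a dead node move to some live node.
    // Assumes at least one live node remains.
    void publishNewState() {
        for (Map.Entry<Integer, String> entry : publishedRouting.entrySet()) {
            if (!liveNodes.contains(entry.getValue())) {
                entry.setValue(liveNodes.iterator().next());
            }
        }
    }

    // Stands in for a bulk/refresh/flush that is routed by the last published state.
    String index(int shard) {
        String target = publishedRouting.get(shard);
        if (!liveNodes.contains(target)) {
            throw new IllegalStateException("no node available: " + target);
        }
        return target;
    }
}
```

In this model, calling `index(0)` between `stopNode(...)` and `publishNewState()` fails, while the same call after `publishNewState()` succeeds. The real fix achieves that ordering by waiting on cluster health rather than by calling anything like `publishNewState()` directly.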

Related Issues

Resolves #19784

Check List

Functionality includes testing.
API changes companion pull request created, if applicable.
Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

Tests
- Enhanced stability checks in replica shard allocation tests by ensuring the cluster reaches a stable state before proceeding with operations. This includes verifying cluster health status and confirming no active shard movements are in progress.

…stable health state, yellow is expected with 1 replica and only 1 data node remaining, with no initializing/relocating shards, then do the indexRandom Signed-off-by: Joe Liu <[email protected]>

coderabbitai · 2025-12-16T14:44:26Z

Walkthrough

Added cluster health stabilization check in the testPreferCopyWithHighestMatchingOperations test method. After stopping a higher-matching node, the test now waits for yellow status and ensures no shards are initializing or relocating before proceeding with indexing assertions.
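The condition the test waits for can be written as a single predicate. The sketch below uses hypothetical stand-in types (`Health`, `HealthSnapshot`), not the OpenSearch `ClusterHealthResponse` API, and assumes that waiting for yellow status is also satisfied by green.

```java
// Hypothetical stand-ins for illustration; not the OpenSearch health API.
enum Health { GREEN, YELLOW, RED }

final class HealthSnapshot {
    final Health status;
    final int initializingShards;
    final int relocatingShards;

    HealthSnapshot(Health status, int initializingShards, int relocatingShards) {
        this.status = status;
        this.initializingShards = initializingShards;
        this.relocatingShards = relocatingShards;
    }

    // Stable means: yellow or better, and no shard is still initializing or relocating.
    // With 1 replica and a single remaining data node the replica cannot be assigned,
    // so yellow (not green) is the state the test expects to settle in.
    boolean isStableAtYellowOrBetter() {
        return status != Health.RED
            && initializingShards == 0
            && relocatingShards == 0;
    }
}
```

A snapshot such as `new HealthSnapshot(Health.YELLOW, 1, 0)` is not yet stable under this definition, because one shard is still initializing even though the status is already yellow.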

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| Cluster Health Stabilization in Test `server/src/internalClusterTest/java/org/opensearch/gateway/ReplicaShardAllocatorIT.java` | Added import for `ClusterHealthResponse`; introduced cluster health wait in `testPreferCopyWithHighestMatchingOperations` to ensure cluster stabilizes (yellow status, no initializing/relocating shards) before proceeding with test assertions |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Single file modification in test code
Straightforward addition of a standard cluster stability check
Logic is simple and follows established test patterns

Suggested labels

flaky-test, bug

Suggested reviewers

sachinpkale
kotwanikunal
cwperks
dbwiddis
msfroh
andrross

Poem

🐰 A flaky test once danced about,
But now it waits without a doubt.
With yellow health and shards at rest,
Stability reigns—this test's the best! 🌟

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately and concisely describes the main objective—fixing a flaky test related to issue 19784—and is fully aligned with the changeset modifications. |
| Description check | ✅ Passed | The PR description is complete, providing a clear explanation of the root cause (race condition), the problem (NoNodeAvailableException), the solution applied (cluster health wait), and linking to the related issue #19784. |
| Linked Issues check | ✅ Passed | The changes in the PR directly address the flaky test reported in issue #19784 by adding stabilization logic (cluster health wait) to prevent race conditions in testPreferCopyWithHighestMatchingOperations. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to fixing the flaky test; only ReplicaShardAllocatorIT.java is modified with the addition of a cluster health wait after stopping a node, directly addressing the race condition issue. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

server/src/internalClusterTest/java/org/opensearch/gateway/ReplicaShardAllocatorIT.java (1)
374-382: Excellent fix for the race condition.

The cluster health stabilization wait correctly addresses the flaky test issue. By waiting for yellow status and ensuring no shards are initializing or relocating after stopping the node, the test now guarantees that the cluster has fully processed the node departure and reassigned the primary before calling indexRandom(). This prevents operations from targeting the stopped node, eliminating the NoNodeAvailableException.

The timeout assertion with a clear message is also well-placed.

Consider adding an explicit timeout for clarity, though the default timeout via .get() is consistent with other tests in this file:
 ClusterHealthResponse postStopHealth = client().admin()
     .cluster()
     .prepareHealth(indexName)
     .setWaitForYellowStatus()
     .setWaitForNoInitializingShards(true)
     .setWaitForNoRelocatingShards(true)
+    .setTimeout(TimeValue.timeValueSeconds(30))
     .get();
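What an explicit timeout buys is a bounded wait that reports whether it gave up, so the test can assert on that flag with a clear message. A minimal plain-Java sketch of that shape follows; `TimedWait` and `timedOutWaitingFor` are hypothetical names and the polling loop is an assumption made for illustration, not a description of how the health request waits internally.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration; not part of OpenSearch or its test framework.
final class TimedWait {
    // Polls the condition until it holds or the timeout elapses.
    // Returns true if the wait timed out (or was interrupted), false if the condition held.
    static boolean timedOutWaitingFor(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return true; // gave up: the caller should fail with a descriptive message
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return true;
            }
        }
        return false;
    }
}
```

A caller would check the returned flag right away and fail the test if it is true, which mirrors asserting that the health response did not time out before calling indexRandom().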

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e798353 and 01d5e66.

📒 Files selected for processing (1)

server/src/internalClusterTest/java/org/opensearch/gateway/ReplicaShardAllocatorIT.java (2 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)

GitHub Check: gradle-check
GitHub Check: precommit (21, windows-latest)
GitHub Check: precommit (25, windows-latest)
GitHub Check: precommit (21, ubuntu-24.04-arm)
GitHub Check: precommit (21, ubuntu-latest)
GitHub Check: precommit (25, macos-15-intel)
GitHub Check: precommit (25, macos-15)
GitHub Check: precommit (21, windows-2025, true)
GitHub Check: precommit (25, ubuntu-24.04-arm)
GitHub Check: precommit (21, macos-15)
GitHub Check: precommit (21, macos-15-intel)
GitHub Check: precommit (25, ubuntu-latest)
GitHub Check: assemble (21, ubuntu-latest)
GitHub Check: detect-breaking-change
GitHub Check: assemble (25, ubuntu-latest)
GitHub Check: assemble (21, windows-latest)
GitHub Check: assemble (25, windows-latest)
GitHub Check: assemble (25, ubuntu-24.04-arm)
GitHub Check: assemble (21, ubuntu-24.04-arm)
GitHub Check: Analyze (java)

🔇 Additional comments (1)

server/src/internalClusterTest/java/org/opensearch/gateway/ReplicaShardAllocatorIT.java (1)

35-35: LGTM: Necessary import for the fix.

The import is required for the cluster health stabilization logic added below.

github-actions · 2025-12-16T15:55:42Z

❌ Gradle check result for 01d5e66: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-12-17T22:29:44Z

❌ Gradle check result for 01d5e66: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

after stopping nodeWithHigherMatching, wait for the index to reach a stable health state, yellow is expected with 1 replica and only 1 data node remaining, with no initializing/relocating shards, then do the indexRandom

Signed-off-by: Joe Liu <[email protected]>

01d5e66

liuguoqingfz requested a review from a team as a code owner December 16, 2025 14:44

github-actions bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run labels Dec 16, 2025

coderabbitai bot reviewed Dec 16, 2025

View reviewed changes

andrross added the skip-changelog label Dec 17, 2025
