Fixed a flaky test for issue 19784 #20258
base: main
Conversation
…stable health state, yellow is expected with 1 replica and only 1 data node remaining, with no initializing/relocating shards, then do the indexRandom
Signed-off-by: Joe Liu <[email protected]>
Walkthrough: Added a cluster health stabilization check in ReplicaShardAllocatorIT.java.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~5 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (5 passed)
Actionable comments posted: 0
🧹 Nitpick comments (1)
server/src/internalClusterTest/java/org/opensearch/gateway/ReplicaShardAllocatorIT.java (1)
374-382: Excellent fix for the race condition. The cluster health stabilization wait correctly addresses the flaky test issue. By waiting for yellow status and ensuring no shards are initializing or relocating after stopping the node, the test now guarantees that the cluster has fully processed the node departure and reassigned the primary before calling indexRandom(). This prevents operations from targeting the stopped node, eliminating the NoNodeAvailableException. The timeout assertion with a clear message is also well-placed.
Consider adding an explicit timeout for clarity, though the default timeout via .get() is consistent with other tests in this file:

```diff
 ClusterHealthResponse postStopHealth = client().admin()
     .cluster()
     .prepareHealth(indexName)
     .setWaitForYellowStatus()
     .setWaitForNoInitializingShards(true)
     .setWaitForNoRelocatingShards(true)
+    .setTimeout(TimeValue.timeValueSeconds(30))
     .get();
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/internalClusterTest/java/org/opensearch/gateway/ReplicaShardAllocatorIT.java (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: gradle-check
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: detect-breaking-change
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: Analyze (java)
🔇 Additional comments (1)
server/src/internalClusterTest/java/org/opensearch/gateway/ReplicaShardAllocatorIT.java (1)
35-35: LGTM: Necessary import for the fix. The import is required for the cluster health stabilization logic added below.
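For reference, a sketch of the import presumably added at line 35; only the ClusterHealthResponse type is confirmed by the review snippet, and the exact package path is an assumption based on OpenSearch's standard layout:

```java
// Assumed import backing the new health wait; the package path follows
// OpenSearch's usual action/admin layout and is not confirmed by the diff text.
import org.opensearch.action.admin.cluster.health.ClusterHealthResponse;
```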
❌ Gradle check result for 01d5e66: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Description
Fixed a flaky test that has a race: the test stops a data node and immediately calls indexRandom(). Until the cluster manager publishes a new cluster state, the primary is re-assigned, and the stopped node is fully removed from routing, some operations inside indexRandom() (e.g. bulk/refresh/flush) can still target the old node ID, producing a NoNodeAvailableException and therefore a shard failure.
Related Issues
Resolves #19784
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.