-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Fixed a flaky test that is order dependent #19762
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
c3707dc to
c964095
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #19762 +/- ##
============================================
- Coverage 73.19% 73.17% -0.03%
+ Complexity 70946 70924 -22
============================================
Files 5735 5735
Lines 324654 324654
Branches 46962 46962
============================================
- Hits 237643 237556 -87
- Misses 67875 67906 +31
- Partials 19136 19192 +56 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| // Ensure test1 primaries are placed before adding other indices (prevents starvation) | ||
| assertBusy(() -> { | ||
| ClusterState s = client().admin().cluster().prepareState().get().getState(); | ||
| int primariesStarted = 0, unassigned = 0; | ||
| for (IndexRoutingTable irt : s.getRoutingTable()) { | ||
| if (irt.getIndex().getName().equals("test1")) { | ||
| for (IndexShardRoutingTable isrt : irt) { | ||
| for (ShardRouting sr : isrt) { | ||
| if (sr.primary() && sr.started()) primariesStarted++; | ||
| if (sr.unassigned()) unassigned++; | ||
| } | ||
| } | ||
| } | ||
| } | ||
| assertEquals(3, primariesStarted); // 3 primaries started | ||
| assertEquals(3, unassigned); // 3 unassigned (the replicas) | ||
| }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you replace this with a call to the ensureYellow("test1") helper method in the parent test class? The index should be red until all primaries are assigned, and will be yellow if replicas are unassigned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, better to reuse existing logic, instead of writing custom
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replaced with ensureYellow("test1"), please take a look again.
| client().admin() | ||
| .cluster() | ||
| .prepareUpdateSettings() | ||
| .setTransientSettings(Settings.builder().put("cluster.routing.allocation.disk.threshold_enabled", false).build()) | ||
| .get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this change if shard limit of 6 is the reason for unassigned primary shards?
| // Ensure test1 primaries are placed before adding other indices (prevents starvation) | ||
| assertBusy(() -> { | ||
| ClusterState s = client().admin().cluster().prepareState().get().getState(); | ||
| int primariesStarted = 0, unassigned = 0; | ||
| for (IndexRoutingTable irt : s.getRoutingTable()) { | ||
| if (irt.getIndex().getName().equals("test1")) { | ||
| for (IndexShardRoutingTable isrt : irt) { | ||
| for (ShardRouting sr : isrt) { | ||
| if (sr.primary() && sr.started()) primariesStarted++; | ||
| if (sr.unassigned()) unassigned++; | ||
| } | ||
| } | ||
| } | ||
| } | ||
| assertEquals(3, primariesStarted); // 3 primaries started | ||
| assertEquals(3, unassigned); // 3 unassigned (the replicas) | ||
| }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, better to reuse existing logic, instead of writing custom
|
This PR is stalled because it has been open for 30 days with no activity. |
WalkthroughTwo test files are updated to address test stability and improve file-handling logic. One adds explicit synchronization to ensure primary shard assignment in a flaky test; the other refactors directory copying from Stream-based to FileVisitor pattern with enhanced error handling. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
0542007 to
b281285
Compare
|
❌ Gradle check result for b281285: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Fixed formatting issue with extra empty line Signed-off-by: Joe Liu <[email protected]> removed unnecessary calls and replaced with ensureYellow()
b281285 to
2662a61
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
server/src/internalClusterTest/java/org/opensearch/index/shard/IndexShardIT.java (1)
326-370:walkFileTree+SimpleFileVisitorgives a clearer and safer data_path copyThe new
Files.walkFileTree(indexDataPath, new SimpleFileVisitor<>() { ... })implementation is a good improvement over the previous stream-based walk:
preVisitDirectoryensures the target directory structure undernewIndexDataPathis created before any copies.visitFilepreserves the intent to skip.lockfiles and otherwise copy everything, and it treats aNoSuchFileExceptionduring copy as a hard failure viafail(), which is reasonable given that refresh/flush are disabled just above.visitFileFailedignoringNoSuchFileExceptionwhile terminating on other IO errors is a sensible balance between robustness and not hiding serious problems.One small nit: the existing comment still refers to
Files.walkeven though we now useFiles.walkFileTree. If you want to keep the comment accurate, you could tweak it as follows:- // race condition: async flush may cause translog file deletion resulting in an inconsistent stream from - // Files.walk below during copy phase + // race condition: async flush may cause translog file deletion resulting in an inconsistent file-tree walk + // during the copy phase
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
server/src/internalClusterTest/java/org/opensearch/cluster/routing/allocation/decider/ShardsLimitAllocationDeciderIT.java(1 hunks)server/src/internalClusterTest/java/org/opensearch/index/shard/IndexShardIT.java(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: gradle-check
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: detect-breaking-change
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: Analyze (java)
🔇 Additional comments (2)
server/src/internalClusterTest/java/org/opensearch/cluster/routing/allocation/decider/ShardsLimitAllocationDeciderIT.java (1)
250-252: Synchronizing ontest1health correctly removes the order-dependent flakinessCalling
ensureYellow("test1")here guarantees all three primaries fortest1are allocated (replicas may remain unassigned) beforetest2/test3are created, so the final expectations of 17 assigned shards and 3 unassigned shards ontest1no longer depend on inter-index allocation ordering under the per-node shard limits. This is a targeted and sufficient fix for the described flakiness.server/src/internalClusterTest/java/org/opensearch/index/shard/IndexShardIT.java (1)
102-111: New NIO imports correctly support the FileVisitor-based copy logicThe added
FileVisitResult,NoSuchFileException,SimpleFileVisitor, andBasicFileAttributesimports align with the newwalkFileTreeimplementation below and look consistent—no issues here.
Description
Allocation is concurrent and order-dependent. Sometimes test2/test3 fill up node capacity (the 6-shards-per-node cap) before all three test1 primaries get a slot. Then one test1 primary stays unassigned too, and you see 16 (or even 14) assigned instead of 17.
Related Issues
Resolves #19726
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.