@liuguoqingfz liuguoqingfz commented Oct 24, 2025

Description

Shard allocation is concurrent and order-dependent. Sometimes the test2/test3 shards fill up node capacity (the cap of 6 shards per node) before all three test1 primaries get a slot. When that happens, a test1 primary stays unassigned as well, and the test sees 16 (or even 14) assigned shards instead of the expected 17.
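As a rough illustration of the order dependence described above, here is a toy model in Java. It is not the OpenSearch allocator: the first-fit placement policy, the 15-shard "other" index standing in for test2/test3, and the per-index limit of 1 shard per node for test1 are all assumptions made for the sketch, and the counts it prints belong to the toy model rather than to the real test.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only (hypothetical numbers and policy), not the OpenSearch allocator. */
final class AllocationOrderSketch {
    static final int NODES = 3;
    static final int NODE_CAP = 6; // cap on total shards per node

    // Assumed per-index, per-node limits: test1 may place one shard per node.
    static final Map<String, Integer> LIMITS = Map.of("test1", 1, "other", NODE_CAP);

    /** Builds the request order: 3 test1 primaries and 15 "other" shards. */
    static List<String> order(boolean test1First) {
        List<String> test1 = Collections.nCopies(3, "test1");
        List<String> other = Collections.nCopies(15, "other");
        List<String> result = new ArrayList<>(test1First ? test1 : other);
        result.addAll(test1First ? other : test1);
        return result;
    }

    /** First-fit placement; returns the number of assigned shards per index. */
    static Map<String, Integer> assign(List<String> order, Map<String, Integer> limits) {
        int[] nodeTotal = new int[NODES];
        List<Map<String, Integer>> perNode = new ArrayList<>();
        for (int n = 0; n < NODES; n++) perNode.add(new HashMap<>());
        Map<String, Integer> assigned = new HashMap<>();
        for (String index : order) {
            for (int n = 0; n < NODES; n++) {
                int onNode = perNode.get(n).getOrDefault(index, 0);
                if (nodeTotal[n] < NODE_CAP && onNode < limits.get(index)) {
                    nodeTotal[n]++;
                    perNode.get(n).merge(index, 1, Integer::sum);
                    assigned.merge(index, 1, Integer::sum);
                    break; // shard placed; move on to the next one
                }
            }
        }
        return assigned;
    }

    public static void main(String[] args) {
        System.out.println("test1 first:  " + assign(order(true), LIMITS));
        System.out.println("others first: " + assign(order(false), LIMITS));
    }
}
```

In this model every test1 primary finds a node when test1 is placed first, but when the other shards go first they fill two nodes to the cap and test1 can only use the remaining node once.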

Related Issues

Resolves #19726

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

  • Tests
    • Enhanced test synchronization for shard allocation scenarios to ensure proper primary assignment.
    • Improved test resilience for directory operations with robust error handling for file system edge cases.


@liuguoqingfz liuguoqingfz requested a review from a team as a code owner October 24, 2025 14:01
@github-actions github-actions bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run labels Oct 24, 2025
@github-actions

✅ Gradle check result for c964095: SUCCESS

codecov bot commented Oct 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.17%. Comparing base (0c89456) to head (c964095).
⚠️ Report is 128 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #19762      +/-   ##
============================================
- Coverage     73.19%   73.17%   -0.03%     
+ Complexity    70946    70924      -22     
============================================
  Files          5735     5735              
  Lines        324654   324654              
  Branches      46962    46962              
============================================
- Hits         237643   237556      -87     
- Misses        67875    67906      +31     
- Partials      19136    19192      +56     

Comment on lines 259 to 275
// Ensure test1 primaries are placed before adding other indices (prevents starvation)
assertBusy(() -> {
    ClusterState s = client().admin().cluster().prepareState().get().getState();
    int primariesStarted = 0, unassigned = 0;
    for (IndexRoutingTable irt : s.getRoutingTable()) {
        if (irt.getIndex().getName().equals("test1")) {
            for (IndexShardRoutingTable isrt : irt) {
                for (ShardRouting sr : isrt) {
                    if (sr.primary() && sr.started()) primariesStarted++;
                    if (sr.unassigned()) unassigned++;
                }
            }
        }
    }
    assertEquals(3, primariesStarted); // 3 primaries started
    assertEquals(3, unassigned); // 3 unassigned (the replicas)
});
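The assertBusy block above re-runs its assertions until they stop failing. As a rough illustration of that retry pattern, and not the OpenSearch test framework's actual implementation, a self-contained stand-in might look like the following. The name `awaitAssertion`, the timeout, and the backoff values are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a poll-until-the-assertions-pass helper. */
final class BusyWaitSketch {
    /** Re-runs the check until it stops throwing AssertionError or the timeout elapses. */
    static void awaitAssertion(Runnable check, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        long sleepMillis = 1;
        while (true) {
            try {
                check.run();
                return; // all assertions passed
            } catch (AssertionError e) {
                if (System.nanoTime() >= deadline) {
                    throw e; // give up and surface the last failure
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", ie);
            }
            sleepMillis = Math.min(sleepMillis * 2, 100); // capped exponential backoff
        }
    }

    public static void main(String[] args) {
        AtomicInteger polls = new AtomicInteger();
        // Simulated condition: the check only passes on the third poll.
        awaitAssertion(() -> {
            if (polls.incrementAndGet() < 3) {
                throw new AssertionError("primaries not started yet");
            }
        }, 5_000);
        System.out.println("passed after " + polls.get() + " polls");
    }
}
```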
Member

Can you replace this with a call to the ensureYellow("test1") helper method in the parent test class? The index should be red until all primaries are assigned, and will be yellow if replicas are unassigned.
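The red/yellow distinction this comment relies on can be sketched as a small rule. The class and method names below are hypothetical and the logic is a simplification for illustration, not the OpenSearch health computation itself.

```java
/** Simplified sketch of index health colors; names are hypothetical. */
final class IndexHealthSketch {
    enum Health { RED, YELLOW, GREEN }

    /** RED if any primary is inactive, YELLOW if only replicas are, GREEN otherwise. */
    static Health health(int primaries, int activePrimaries, int replicas, int activeReplicas) {
        if (activePrimaries < primaries) {
            return Health.RED;
        }
        if (activeReplicas < replicas) {
            return Health.YELLOW;
        }
        return Health.GREEN;
    }

    public static void main(String[] args) {
        // test1 in this PR: 3 primaries, 3 replicas that are expected to stay unassigned.
        System.out.println(health(3, 2, 3, 0)); // one primary still missing
        System.out.println(health(3, 3, 3, 0)); // all primaries active, replicas unassigned
    }
}
```

Waiting for at least yellow on test1 is therefore equivalent to waiting until all three primaries are active, which is the condition the custom assertBusy loop was checking.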

Contributor

+1, better to reuse the existing logic instead of writing custom code.

Contributor Author

Replaced with ensureYellow("test1"); please take another look.

Comment on lines 241 to 245
client().admin()
    .cluster()
    .prepareUpdateSettings()
    .setTransientSettings(Settings.builder().put("cluster.routing.allocation.disk.threshold_enabled", false).build())
    .get();
Contributor

Why do we need this change if the shard limit of 6 is the reason for the unassigned primary shards?

@opensearch-trigger-bot

This PR is stalled because it has been open for 30 days with no activity.

@liuguoqingfz liuguoqingfz requested a review from sohami as a code owner December 15, 2025 10:12
coderabbitai bot commented Dec 15, 2025

Walkthrough

Two test files are updated to address test stability and improve file-handling logic. One adds explicit synchronization to ensure primary shard assignment in a flaky test; the other refactors directory copying from Stream-based to FileVisitor pattern with enhanced error handling.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Test synchronization enhancement: server/src/internalClusterTest/java/org/opensearch/cluster/routing/allocation/decider/ShardsLimitAllocationDeciderIT.java | Adds ensureYellow("test1") call after creating test1 index to ensure primary shard assignment and improve test stability. |
| File copying refactoring: server/src/internalClusterTest/java/org/opensearch/index/shard/IndexShardIT.java | Replaces Stream-based Files.walk with Files.walkFileTree and SimpleFileVisitor for directory copying. Adds logic to create target directories, skip .lock files, and handle NoSuchFileException with logging. Implements method overrides for preVisitDirectory, visitFile, and visitFileFailed. Updates imports for FileVisitResult, NoSuchFileException, SimpleFileVisitor, and BasicFileAttributes. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

  • ShardsLimitAllocationDeciderIT.java: Single-line test synchronization change addressing documented test flakiness—straightforward to validate.
  • IndexShardIT.java: Refactoring of file-copying logic requires careful review of the FileVisitor implementation and exception-handling semantics to ensure correctness and behavior parity with the previous Stream-based approach.

Suggested reviewers

  • msfroh
  • mch2
  • dbwiddis
  • shwetathareja
  • sachinpkale
  • kotwanikunal
  • cwperks
  • ashking94
  • owaiskazi19

Poem

🐰 A sync call steadies the wavering test,
While FileVisitor walks the tree with zest,
.lock files skip and folders align,
Our test suite shines—more stable, more fine! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Fixed a flaky test that is order dependent' directly addresses the main change—fixing a test flakiness issue caused by order-dependent shard allocation. |
| Description check | ✅ Passed | The description adequately explains the root cause (concurrent allocation filling node capacity before test1 primaries are assigned) and its symptom (16 or 14 shards instead of 17), with a linked issue reference, though the description could be more detailed. |
| Linked Issues check | ✅ Passed | The PR addresses the flaky test in ShardsLimitAllocationDeciderIT [#19726] by adding an ensureYellow('test1') synchronization step to ensure test1 primaries are allocated before other tests run, directly mitigating the order-dependent test failure. |
| Out of Scope Changes check | ✅ Passed | The changes to ShardsLimitAllocationDeciderIT.java and IndexShardIT.java are both within scope: the first fixes the flaky test [#19726], and the second improves file copying robustness in test infrastructure. |

@github-actions

❌ Gradle check result for b281285: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Fixed formatting issue with extra empty line

Signed-off-by: Joe Liu <[email protected]>

removed unnecessary calls and replaced with ensureYellow()
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
server/src/internalClusterTest/java/org/opensearch/index/shard/IndexShardIT.java (1)

326-370: walkFileTree + SimpleFileVisitor gives a clearer and safer data_path copy

The new Files.walkFileTree(indexDataPath, new SimpleFileVisitor<>() { ... }) implementation is a good improvement over the previous stream-based walk:

  • preVisitDirectory ensures the target directory structure under newIndexDataPath is created before any copies.
  • visitFile preserves the intent to skip .lock files and otherwise copy everything, and it treats a NoSuchFileException during copy as a hard failure via fail(), which is reasonable given that refresh/flush are disabled just above.
  • visitFileFailed ignoring NoSuchFileException while terminating on other IO errors is a sensible balance between robustness and not hiding serious problems.
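For readers unfamiliar with the pattern, the three overrides listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the PR: the class name `CopyTreeSketch`, the suffix-based `.lock` check, and letting copy failures propagate as IOException (where the PR reportedly calls fail()) are all choices made for the sketch.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

/** Hypothetical sketch of a FileVisitor-based directory copy. */
final class CopyTreeSketch {
    static void copyTree(Path source, Path target) throws IOException {
        Files.walkFileTree(source, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
                // Create the matching target directory before its files are visited.
                Files.createDirectories(target.resolve(source.relativize(dir)));
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                // Assumption: lock files are recognized by their ".lock" suffix.
                if (!file.getFileName().toString().endsWith(".lock")) {
                    Files.copy(file, target.resolve(source.relativize(file)));
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) throws IOException {
                if (exc instanceof NoSuchFileException) {
                    return FileVisitResult.CONTINUE; // file vanished during the walk; skip it
                }
                throw exc; // any other I/O error aborts the walk
            }
        });
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempDirectory("copy-src");
        Files.writeString(src.resolve("segment.dat"), "data");
        Files.writeString(src.resolve("write.lock"), "");
        Path dst = Files.createTempDirectory("copy-dst").resolve("copy");
        copyTree(src, dst);
        System.out.println("copied segment.dat: " + Files.exists(dst.resolve("segment.dat")));
        System.out.println("copied write.lock:  " + Files.exists(dst.resolve("write.lock")));
    }
}
```

Unlike a stream from Files.walk, the visitor gets a per-file callback when an entry cannot be read, which is what makes it possible to skip a file that disappeared without aborting the whole copy.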

One small nit: the existing comment still refers to Files.walk even though we now use Files.walkFileTree. If you want to keep the comment accurate, you could tweak it as follows:

-        // race condition: async flush may cause translog file deletion resulting in an inconsistent stream from
-        // Files.walk below during copy phase
+        // race condition: async flush may cause translog file deletion resulting in an inconsistent file-tree walk
+        // during the copy phase
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee59cb0 and 2662a61.

📒 Files selected for processing (2)
  • server/src/internalClusterTest/java/org/opensearch/cluster/routing/allocation/decider/ShardsLimitAllocationDeciderIT.java (1 hunks)
  • server/src/internalClusterTest/java/org/opensearch/index/shard/IndexShardIT.java (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: gradle-check
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: detect-breaking-change
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: Analyze (java)
🔇 Additional comments (2)
server/src/internalClusterTest/java/org/opensearch/cluster/routing/allocation/decider/ShardsLimitAllocationDeciderIT.java (1)

250-252: Synchronizing on test1 health correctly removes the order-dependent flakiness

Calling ensureYellow("test1") here guarantees all three primaries for test1 are allocated (replicas may remain unassigned) before test2/test3 are created, so the final expectations of 17 assigned shards and 3 unassigned shards on test1 no longer depend on inter-index allocation ordering under the per-node shard limits. This is a targeted and sufficient fix for the described flakiness.

server/src/internalClusterTest/java/org/opensearch/index/shard/IndexShardIT.java (1)

102-111: New NIO imports correctly support the FileVisitor-based copy logic

The added FileVisitResult, NoSuchFileException, SimpleFileVisitor, and BasicFileAttributes imports align with the new walkFileTree implementation below and look consistent—no issues here.


Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for ShardsLimitAllocationDeciderIT

3 participants