Skip to content

Conversation

@cwperks
Copy link
Member

@cwperks cwperks commented Jan 6, 2026

Description

This PR is a similar attempt to #19762 to fix flaky tests in ShardsLimitAllocationDeciderIT

See example failure here: https://build.ci.opensearch.org/job/gradle-check/69672/testReport/junit/org.opensearch.cluster.routing.allocation.decider/ShardsLimitAllocationDeciderIT/testCombinedClusterAndIndexSpecificShardLimits__p0___opensearch_experimental_feature_writable_warm_index_enabled___true___/

>  ./gradlew ':server:internalClusterTest' --tests 'org.opensearch.cluster.routing.allocation.decider.ShardsLimitAllocationDeciderIT.testCombinedClusterAndIndexSpecificShardLimits' -Dtests.iters=5

java.lang.AssertionError: Total assigned shards should be 17 expected:<17> but was:<16>
	at __randomizedtesting.SeedInfo.seed([228B86D5A534EC39:A0832C873E69A8FB]:0)
	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.failNotEquals(Assert.java:835)
	at org.junit.Assert.assertEquals(Assert.java:647)
	at org.opensearch.cluster.routing.allocation.decider.ShardsLimitAllocationDeciderIT.lambda$testCombinedClusterAndIndexSpecificShardLimits$0(ShardsLimitAllocationDeciderIT.java:303)
	at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1193)
	at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1166)
	at org.opensearch.cluster.routing.allocation.decider.ShardsLimitAllocationDeciderIT.testCombinedClusterAndIndexSpecificShardLimits(ShardsLimitAllocationDeciderIT.java:263)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:565)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at org.opensearch.test.OpenSearchTestClusterRule$1.evaluate(OpenSearchTestClusterRule.java:369)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)

Related Issues

Resolves #19726

Stable with 100x runs. Fails a few runs consistently in a batch of 100 iters without this fix.

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

  • Bug Fixes
    • Improved reliability of cluster shard-allocation tests by waiting for cluster stabilization, forcing a reroute after index creation, tightening shard-distribution checks, and adding a 60s timeout to assertions to reduce flaky failures and improve test stability.

✏️ Tip: You can customize this high-level summary in your review settings.

@cwperks cwperks requested a review from a team as a code owner January 6, 2026 19:54
@github-actions github-actions bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run labels Jan 6, 2026
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 6, 2026

📝 Walkthrough

Walkthrough

The PR adds a changelog entry and updates a test: ShardsLimitAllocationDeciderIT.testCombinedClusterAndIndexSpecificShardLimits now waits for cluster stabilization, forces a reroute after index creation, adjusts shard-distribution assertions to count only nodes that actually own shards, and wraps final assertions with a 60s timeout.

Changes

Cohort / File(s) Summary
Test Stabilization & Assertions
server/src/internalClusterTest/java/org/opensearch/cluster/routing/allocation/decider/ShardsLimitAllocationDeciderIT.java
Wait for cluster stabilization and call a forced reroute after creating indices; change shard-distribution check to count only nodes that own shards (expecting three nodes with shard counts 6, 6, and 5); wrap final assertions in a 60s timeout.
Changelog
CHANGELOG.md
Added a Fixed entry noting flaky test failures in ShardsLimitAllocationDeciderIT with associated PR link.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • andrross
  • reta
  • dbwiddis
  • kotwanikunal
  • msfroh

Poem

🐰 I waited, I rerouted, I counted each shard,
Three nodes now balanced, their burdens not hard.
A timeout to guard the test's patient art,
I hop with a grin — the cluster's in part. 🥕✨

Pre-merge checks

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely describes the main fix being applied: resolving flaky test failures in ShardsLimitAllocationDeciderIT.
Description check ✅ Passed The PR description comprehensively covers the problem, provides a failure example, references the related issue, and demonstrates the fix stability. All required template sections are present.
Linked Issues check ✅ Passed The PR directly addresses the flaky test failures documented in issue #19726 by implementing fixes to the testCombinedClusterAndIndexSpecificShardLimits test method.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the flaky test: modifications to ShardsLimitAllocationDeciderIT test logic and a CHANGELOG entry documenting the fix.

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e8fba2b and bc221a8.

📒 Files selected for processing (1)
  • CHANGELOG.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • CHANGELOG.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (21)
  • GitHub Check: gradle-check
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: detect-breaking-change
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: Analyze (java)
  • GitHub Check: Mend Security Check

Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: Craig Perkins <[email protected]>
@github-actions
Copy link
Contributor

github-actions bot commented Jan 6, 2026

❌ Gradle check result for e8fba2b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

github-actions bot commented Jan 6, 2026

❌ Gradle check result for e8fba2b: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

github-actions bot commented Jan 7, 2026

❕ Gradle check result for bc221a8: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@codecov
Copy link

codecov bot commented Jan 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.20%. Comparing base (d866be8) to head (bc221a8).
⚠️ Report is 8 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20375      +/-   ##
============================================
- Coverage     73.30%   73.20%   -0.10%     
+ Complexity    71777    71707      -70     
============================================
  Files          5784     5784              
  Lines        328141   328153      +12     
  Branches      47269    47270       +1     
============================================
- Hits         240531   240214     -317     
- Misses        68329    68652     +323     
- Partials      19281    19287       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

autocut flaky-test Random test failure that succeeds on second run >test-failure Test failure from CI, local build, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for ShardsLimitAllocationDeciderIT

1 participant