Skip to content

Conversation

@liuguoqingfz
Copy link
Contributor

@liuguoqingfz liuguoqingfz commented Dec 16, 2025

Description

Fixed the test since this suite-timeout is almost certainly coming from the test hanging because diskCacheAliases is a plain ArrayList that it mutates from multiple threads, which is not thread-safe and can corrupt the list or leave it shorter than expected. Both countDownLatch.await() calls have no timeout, so if any worker thread throws before calling countDown(), the test will wait until the suite timeout kills it. The code also doesn’t capture worker exceptions, so failures can silently prevent latches from reaching zero.

Related Issues

Resolves #19722

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

  • Tests
    • Enhanced concurrent cache creation and cleanup testing with improved error detection and verification mechanisms to ensure robust cache lifecycle operations under stress conditions.

✏️ Tip: You can customize this high-level summary in your review settings.

…lly.Captures any worker exception and fails the test.Adds bounded await() timeouts so it fails fast instead of hanging.

Signed-off-by: Joe Liu <[email protected]>
@liuguoqingfz liuguoqingfz requested a review from a team as a code owner December 16, 2025 17:48
@github-actions github-actions bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run labels Dec 16, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 16, 2025

Walkthrough

This change refactors a concurrent test in EhcacheDiskCacheManagerTests to address flakiness. It replaces a simpler concurrency test with a multi-phase approach using Phaser and CountDownLatch for synchronization, introduces per-thread error capture, improves cache creation and closure verification, and uses structured thread pools for better control.

Changes

Cohort / File(s) Summary
Test robustness improvement
plugins/cache-ehcache/src/test/java/org/opensearch/cache/store/disk/EhcacheDiskCacheManagerTests.java
Replaced sequential concurrency test with multi-phase concurrent scenario; added Phaser and CountDownLatch synchronization for Phase 1 (concurrent cache creation) and Phase 2 (concurrent cache closure); introduced per-thread error capture via AtomicReference; changed path handling to compute per-test paths; replaced ArrayList with array-based storage; added structured thread pools (cache-create-, cache-close-); updated assertions to verify cache creation via getCacheManagerMap() and cleanup via null checks and doesCacheManagerExist() validation.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Focus on verifying the Phaser and CountDownLatch synchronization logic is correct and covers all edge cases
  • Ensure timeout values are appropriate and error propagation from worker threads is complete
  • Review the updated assertions and path computation for correctness

Suggested labels

flaky-test, bug

Suggested reviewers

  • dbwiddis
  • cwperks
  • msfroh
  • sachinpkale
  • shwetathareja

Poem

🐰 Flaky tests were stumbling 'round,
With timing issues all unbound,
Phase by phase, now synchronized tight,
Phasers and Latches make it right,
Concurrent critters race with care,
No more chaos in the air!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Fix a flaky test for issue 19722' directly relates to the primary change—making the concurrent test robust by fixing thread-safety and timeout issues.
Description check ✅ Passed The description provides a clear explanation of the issue (ArrayList thread-safety, missing timeouts, unhandled exceptions) and follows the template with all required sections completed.
Linked Issues check ✅ Passed The PR fully addresses issue #19722 by fixing the concurrent test flakiness through thread-safe structures, timeout protection, and proper exception handling.
Out of Scope Changes check ✅ Passed All changes are focused on fixing the flaky test in EhcacheDiskCacheManagerTests; no unrelated modifications are present.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
plugins/cache-ehcache/src/test/java/org/opensearch/cache/store/disk/EhcacheDiskCacheManagerTests.java (2)

49-50: Import AtomicReference for consistency.

AtomicReference uses a fully qualified name while CountDownLatch (line 19) and Phaser (line 20) are imported. For consistency, add the import.

Add this import at the top:

+import java.util.concurrent.atomic.AtomicReference;

Then simplify the declarations:

-final java.util.concurrent.atomic.AtomicReference<Throwable> createFailure =
-    new java.util.concurrent.atomic.AtomicReference<>();
+final AtomicReference<Throwable> createFailure = new AtomicReference<>();
-final java.util.concurrent.atomic.AtomicReference<Throwable> closeFailure = new java.util.concurrent.atomic.AtomicReference<>();
+final AtomicReference<Throwable> closeFailure = new AtomicReference<>();

Also applies to: 83-83


41-41: Path construction style differs from other tests in this file.

Line 41 uses resolve("request_cache") while line 124 uses string concatenation (+ "/request_cache"). The resolve() approach is more correct and portable, but the inconsistency within the file is worth noting.

Consider updating the other test methods (lines 124, 145) to use the same resolve() pattern for consistency, though this can be done in a separate refactoring PR.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 799fb9b and 01a377b.

📒 Files selected for processing (1)
  • plugins/cache-ehcache/src/test/java/org/opensearch/cache/store/disk/EhcacheDiskCacheManagerTests.java (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: gradle-check
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: Analyze (java)
  • GitHub Check: detect-breaking-change
🔇 Additional comments (1)
plugins/cache-ehcache/src/test/java/org/opensearch/cache/store/disk/EhcacheDiskCacheManagerTests.java (1)

38-106: Excellent fix for the flaky test!

The refactored test correctly addresses all three issues from the PR description:

  1. ✓ Thread-safety: array-based storage (line 44) eliminates concurrent modification issues
  2. ✓ Timeouts: 60-second timeouts on await calls (lines 73, 99) prevent indefinite hangs
  3. ✓ Error propagation: AtomicReference captures worker exceptions (lines 64, 91) and fails the test explicitly (lines 74-76, 100-102)

The two-phase concurrency model with Phaser synchronization ensures threads start simultaneously, making the test more effective at detecting race conditions in EhcacheDiskCacheManager.

@github-actions
Copy link
Contributor

✅ Gradle check result for 01a377b: SUCCESS

@codecov
Copy link

codecov bot commented Dec 16, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.23%. Comparing base (930ae74) to head (01a377b).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20261      +/-   ##
============================================
+ Coverage     73.20%   73.23%   +0.02%     
- Complexity    71745    71797      +52     
============================================
  Files          5795     5795              
  Lines        328304   328304              
  Branches      47283    47283              
============================================
+ Hits         240334   240427      +93     
+ Misses        68663    68614      -49     
+ Partials      19307    19263      -44     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

autocut flaky-test Random test failure that succeeds on second run >test-failure Test failure from CI, local build, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for EhcacheDiskCacheManagerTests

1 participant