Skip to content

Comments

Flaky test fix#1575

Open
bowenlan-amzn wants to merge 2 commits intoopensearch-project:mainfrom
bowenlan-amzn:flaky2
Open

Flaky test fix#1575
bowenlan-amzn wants to merge 2 commits intoopensearch-project:mainfrom
bowenlan-amzn:flaky2

Conversation

@bowenlan-amzn
Copy link
Member

@bowenlan-amzn bowenlan-amzn commented Jan 26, 2026

Description

Fix flaky multi-node ISM tests by waiting for cluster GREEN before index cleanup

Summary

  • Wait for cluster health GREEN before deleting indices in multi-node test cleanup
  • Prevents ClosedByInterruptException that can permanently corrupt node lock
  • Fixes intermittent CI failures causing 1h40m runs instead of normal 25m

Test plan

  • Existing multi-node ISM integration tests pass
  • Monitor CI for reduced flaky test failures in multi-node-test-workflow.yml

Description

Problem

Multi-node ISM integration tests occasionally fail with cascade errors after a ClosedByInterruptException corrupts a node's file lock. Once corrupted, the affected node cannot perform any shard operations, causing all subsequent tests to fail.

Root cause timeline:

  1. Test cleanup calls wipeAllIndices() which deletes .opendistro-ism-config
  2. The index deletion occurs while shard recovery is still in progress (cluster YELLOW)
  3. OpenSearch cancels the recovery via CancellableThreads.cancel() → Thread.interrupt()
  4. The interrupt hits during FileChannel.size() in node lock validation
  5. Java NIO closes the channel on interrupt (ClosedByInterruptException)
  6. The node lock becomes permanently invalid - node is "brain dead"

Evidence from logs:
19:27:43,350 - .opendistro-ism-config shard allocated
19:27:43,538 - Index deletion started (only 188ms later!)
19:27:43,593 - ClosedByInterruptException - node lock invalidated
19:27:43,602 - "FileLock invalidated by an external force"

Solution

Wait for cluster health to be GREEN before deleting indices in multi-node tests. When cluster is GREEN, all shards are allocated with no recoveries in progress, making index deletion safe.

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
@bowenlan-amzn bowenlan-amzn marked this pull request as ready for review January 26, 2026 03:27
@codecov
Copy link

codecov bot commented Jan 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.49%. Comparing base (c21cad5) to head (6673f39).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1575      +/-   ##
==========================================
- Coverage   76.51%   76.49%   -0.03%     
==========================================
  Files         381      381              
  Lines       18069    18069              
  Branches     2503     2503              
==========================================
- Hits        13826    13821       -5     
- Misses       2975     2983       +8     
+ Partials     1268     1265       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant