[CELEBORN-2362] Fix Flaky CI/CD by afterincomparableyum · Pull Request #3737 · apache/celeborn

afterincomparableyum · 2026-06-14T05:44:58Z

What changes were proposed in this pull request?

Address several independent root causes of CI/CD flakiness, spanning the
build tooling, JaCoCo instrumentation, the test mini-cluster, and the
process-wide shuffle client.

Build & coverage:

- build/mvn: default to archive.apache.org instead of the closer.lua
  mirror redirector, validate each download is a real tarball (closer.lua
  intermittently returns an HTML mirror page with HTTP 200), and retry up
  to 3 times.
- pom.xml: exclude io/netty/** from JaCoCo instrumentation. JaCoCo's
  agent rejects Netty 4.2's already-enhanced JFR event classes, tearing
  down channels mid-write and flaking the Flink integration tests.
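
The download validation described for build/mvn lives in a shell script; the same idea can be sketched in Java for illustration. This is a minimal sketch, assuming that checking the two-byte gzip magic number is an acceptable stand-in for "is a real tarball". The names `DownloadCheck`, `looksLikeGzip`, and `fetchWithRetry` are hypothetical and do not come from the PR.

```java
import java.util.function.Supplier;

final class DownloadCheck {
    // A gzip stream starts with the magic bytes 0x1f 0x8b; an HTML mirror page does not.
    static boolean looksLikeGzip(byte[] body) {
        return body != null
            && body.length >= 2
            && (body[0] & 0xff) == 0x1f
            && (body[1] & 0xff) == 0x8b;
    }

    // Calls the (hypothetical) downloader up to maxAttempts times and returns the
    // first body that passes validation, or null if every attempt was rejected.
    static byte[] fetchWithRetry(Supplier<byte[]> downloader, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            byte[] body = downloader.get();
            if (looksLikeGzip(body)) {
                return body;
            }
        }
        return null;
    }
}
```

Validating the body rather than the status code matters here because, as noted above, the redirector can answer HTTP 200 with an HTML page.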

Client correctness:

- ShuffleClient: make the process-wide singleton app-aware. Track the
  appUniqueId of the live instance and rebuild it when a later app/suite
  requests a different id (and clear it in reset()), so a straggler from a
  previous app can no longer bind to the wrong LifecycleManager through the
  shared static client.
- ShuffleClientImpl: guard the revive path against a null oldLoc before
  removing it from pushExcludedWorkers, avoiding an NPE when no previous
  location is mapped.
- ApplicationHeartbeater: start the heartbeat timer after one interval
  instead of at delay 0, so the first heartbeat does not fire before the
  application is fully registered.
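
The app-aware singleton described above can be sketched roughly as follows. This is a simplified illustration, not the PR's code: `AppAwareSingleton`, its nested `Client` type, and the factory parameter are hypothetical stand-ins for the real ShuffleClient API.

```java
import java.util.Objects;
import java.util.function.Function;

final class AppAwareSingleton {
    // Hypothetical stand-in for the real client type.
    static final class Client {
        final String appUniqueId;
        Client(String appUniqueId) { this.appUniqueId = appUniqueId; }
    }

    private static Client instance;
    private static String instanceAppId;

    // Returns the shared client, rebuilding it when a different app id is requested.
    static synchronized Client get(String appUniqueId, Function<String, Client> factory) {
        if (instance == null || !Objects.equals(instanceAppId, appUniqueId)) {
            instance = factory.apply(appUniqueId);
            instanceAppId = appUniqueId;
        }
        return instance;
    }

    // Clears both the instance and the tracked id.
    static synchronized void reset() {
        instance = null;
        instanceAppId = null;
    }
}
```

Repeated calls with the same id return the same object, while a call with a new id yields a fresh client bound to that id.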

Mini-cluster & test harness:

- MiniClusterFeature: cap randomly selected ports below the ephemeral
  floor (32768) to avoid a TOCTOU race with OS-assigned/TIME_WAIT ports
  that look free at selection time but fail to bind. Also rework worker
  startup: recreate the worker on each retry, handle InterruptedException
  explicitly (stop + interrupt + rethrow), and back off exponentially
  between attempts so a transient port collision retries cleanly.
- RatisMasterStatusSystemSuiteJ: re-point each server to a fresh storage
  directory on every start attempt, since Ratis releases the directory
  lock asynchronously and a retry would otherwise hit "directory is
  already locked".
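
The port cap and the exponential backoff mentioned above can be illustrated with a small sketch. The lower bound of 1024 is the one quoted in the review for the old `selectRandomInt(1024, 65535)` call and the 32768 floor is from the description; the class name, method names, and backoff parameters are assumptions made for this example.

```java
import java.util.Random;

final class PortAndBackoff {
    static final int MIN_PORT = 1024;
    static final int EPHEMERAL_FLOOR = 32768; // value quoted in the PR description

    // Picks a port in [MIN_PORT, EPHEMERAL_FLOOR), so never in the OS-assigned range.
    static int pickPort(Random random) {
        return MIN_PORT + random.nextInt(EPHEMERAL_FLOOR - MIN_PORT);
    }

    // Exponential backoff: baseMillis, 2x, 4x, ... capped at maxMillis.
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        int shift = Math.min(Math.max(attempt, 0), 20); // keep the shift small and non-negative
        return Math.min(maxMillis, baseMillis << shift);
    }
}
```

Staying below the ephemeral floor does not remove the check-then-bind race entirely; it only avoids competing with ports the OS hands out on its own, which is why the retry with backoff is still needed.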

Spark/Flink integration suites:

- SparkTestBase / CelebornHashCheckDiskSuite: stop any active
  SparkSession and reset the static ShuffleClient between suites/tests, so
  a leaked SparkContext can no longer keep an old LifecycleManager alive
  and corrupt a later suite's shuffle 0.
- spark-it suites: reduce worker count from 5 to 3 (the MiniCluster
  default) to cut the CPU/thread footprint of the serial single-JVM
  suites that otherwise starves RPC/fetch handlers past the 240s timeout.
- LifecycleManagerUnregisterShuffleSuite: shorten the expired-check
  interval to 5s so the shuffle unregister runs within the eventually()
  window with retry margin.
- HybridShuffleWordCountTest: wait (eventually) for the async LocalFlusher
  to drain before asserting on-disk file length equals the logical
  length, instead of reading the file mid-flush.
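
The "wait, then assert" pattern used for the flusher can be sketched as a small polling helper. The suites themselves use the test framework's eventually(); the `Eventually` class below is a hypothetical plain-Java equivalent written only to show the idea.

```java
import java.util.function.BooleanSupplier;

final class Eventually {
    // Polls the condition until it holds or timeoutMillis elapses.
    // Returns true if the condition held, false on timeout or interruption.
    static boolean eventually(long timeoutMillis, long intervalMillis, BooleanSupplier condition) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

A test would pass a condition such as "on-disk length equals logical length" and assert only after the helper returns true, instead of reading the file once while a flush may still be in progress.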

Also fix the flaky JVMQuakeSuite hang by driving the GC deficit bucket deterministically via an extracted JVMQuake.checkAndDump, instead of inducing real GC pressure.
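
The general idea of extracting the check so a test can feed it numbers directly might look like the sketch below. Everything here is hypothetical: the PR does not show the signature of JVMQuake.checkAndDump, and the `DeficitBucket` class and its accounting rule (GC time fills the bucket, run time drains it) are a simplified assumption used only to show why an extracted step is easy to test.

```java
final class DeficitBucket {
    private final long thresholdMillis;
    private long bucketMillis;

    DeficitBucket(long thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    // One accounting step: GC time fills the bucket, run time drains it (never below zero).
    // Returns true when the bucket exceeds the threshold, i.e. when action would be taken.
    boolean check(long gcMillis, long runMillis) {
        bucketMillis = Math.max(0L, bucketMillis + gcMillis - runMillis);
        return bucketMillis > thresholdMillis;
    }
}
```

Because the step takes its inputs as arguments, a test can call it with chosen values and never has to allocate memory to provoke real garbage collection.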

Why are the changes needed?

CI/CD currently fails on almost every run because of the flaky build steps and tests listed above. With these changes, CI/CD rarely fails.

Does this PR resolve a correctness bug?

Yes

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

CI/CD

codecov · 2026-06-14T18:10:37Z

Codecov Report

❌ Patch coverage is 0% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.38%. Comparing base (b4cb5a0) to head (b4d6dc4).
⚠️ Report is 69 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...java/org/apache/celeborn/client/ShuffleClient.java | 0.00% | 11 Missing ⚠️ |
| .../org/apache/celeborn/client/ShuffleClientImpl.java | 0.00% | 2 Missing ⚠️ |
| ...pache/celeborn/client/ApplicationHeartbeater.scala | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #3737      +/-   ##
============================================
- Coverage     66.91%   57.38%   -9.53%     
- Complexity        0      214     +214     
============================================
  Files           358      395      +37     
  Lines         21986    27822    +5836     
  Branches       1946     2712     +766     
============================================
+ Hits          14710    15962    +1252     
- Misses         6262    10706    +4444     
- Partials       1014     1154     +140


SteNicholas · 2026-06-16T13:05:13Z

@afterincomparableyum, is it ready? This fix is helpful for CI running.

afterincomparableyum · 2026-06-16T13:48:43Z

@SteNicholas it is ready now. This should resolve most of the CI/CD flakes. An occasional test may still fail, but this is an improvement over before.

Copilot

Pull request overview

This PR targets CI/CD flakiness across the build toolchain, coverage instrumentation, and integration test stability (Spark/Flink) by reducing resource contention, eliminating known race conditions, and making startup/cleanup more deterministic.

Changes:

Harden build tooling and coverage: make Maven download deterministic + validated/retried; exclude io/netty/** from JaCoCo instrumentation to avoid Netty JFR class instrumentation failures.
Stabilize mini-cluster + integration tests: avoid ephemeral-port collisions, improve worker startup retry behavior, and enforce Spark/ShuffleClient cleanup between suites.
Improve client/runtime correctness: adjust heartbeat scheduling, guard revive logic, and make the process-wide ShuffleClient singleton app-aware.
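
The heartbeat scheduling adjustment in the last item can be illustrated with a short sketch, assuming a `ScheduledExecutorService`-based timer. The PR text does not show how ApplicationHeartbeater actually schedules its task, so the class and method names below are hypothetical.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

final class DelayedHeartbeat {
    // Schedules the task so that the first run happens one full interval after start,
    // rather than immediately (which an initial delay of 0 would do).
    static ScheduledFuture<?> start(
            ScheduledExecutorService scheduler, Runnable task, long intervalMillis) {
        return scheduler.scheduleAtFixedRate(
            task, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }
}
```

The only difference from an immediate start is the second argument: passing the interval as the initial delay gives registration one interval of head start before the first heartbeat.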

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| worker/src/test/scala/org/apache/celeborn/service/deploy/MiniClusterFeature.scala | Bounds random port selection below ephemeral range; improves worker startup retry/cleanup behavior. |
| tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/SparkTestBase.scala | Reduces worker count; adds SparkSession + ShuffleClient cleanup between suites. |
| tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/ShuffleFallbackSuite.scala | Reduces worker count used by the suite to lower contention. |
| tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/memory/MemorySparkTestBase.scala | Reduces worker count for memory-storage suites to reduce CPU/thread pressure. |
| tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/CelebornHashCheckDiskSuite.scala | Adds per-test SparkSession + ShuffleClient reset to prevent cross-test interference. |
| tests/spark-it/src/test/scala/org/apache/celeborn/tests/client/LifecycleManagerUnregisterShuffleSuite.scala | Shortens shuffle expired-check interval to fit within `eventually()` window reliably. |
| tests/flink-it/src/test/scala/org/apache/celeborn/tests/flink/HybridShuffleWordCountTest.scala | Waits for async local flushing to complete before asserting on-disk file sizes. |
| pom.xml | Excludes Netty classes from JaCoCo agent instrumentation to prevent runtime channel failures. |
| master/src/test/java/org/apache/celeborn/service/deploy/master/clustermeta/ha/RatisMasterStatusSystemSuiteJ.java | Reconfigures Ratis storage dirs on retries to avoid async lock-release flakiness. |
| client/src/main/scala/org/apache/celeborn/client/ApplicationHeartbeater.scala | Delays first heartbeat by one interval to avoid firing before full registration. |
| client/src/main/java/org/apache/celeborn/client/ShuffleClientImpl.java | Guards revive path against null old locations to avoid NPEs. |
| client/src/main/java/org/apache/celeborn/client/ShuffleClient.java | Makes static ShuffleClient singleton app-aware (rebuild on app id change, clear on reset). |
| build/mvn | Makes Maven download deterministic (archive), validates tarball bodies, and retries downloads. |

afterincomparableyum · 2026-06-16T14:28:30Z

I addressed the Copilot comments @SteNicholas

afterincomparableyum · 2026-06-16T15:33:42Z

I'll revert the Copilot suggestions that broke CI in a few hours and ping when it is ready again.

afterincomparableyum · 2026-06-17T03:13:40Z

Ping @SteNicholas PR is ready for review again.

SteNicholas

Code review — 6 findings (2 bugs, 2 cleanup, 2 minor).

Top issues:

Resource leak — the new appUniqueId-mismatch branch in ShuffleClient.get() replaces _instance without calling shutdown() on the old one, leaking transport connections, thread pools, and RPC references.
NPE risk — the outer guard calls appUniqueId.equals(…) which will NPE if the parameter is null.

Also noted (not in diff, no inline comment):

MasterClusterFeature.selectRandomPort() (line 53) and RatisMasterStatusSystemSuiteJ (line 134) still use selectRandomInt(1024, 65535) — the ephemeral-port collision fix applied to MiniClusterFeature was not propagated to these files.

See inline comments for details.
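
The two top issues raised in this review can be illustrated with a short sketch of what the reviewer's suggestions would amount to. This is not the PR's code: `ReviewFindings`, `Resource`, `sameApp`, and `replace` are hypothetical names, and a later commit in this PR notes that the author instead added a comment explaining why shutdown() is not needed in ShuffleClient.

```java
import java.util.Objects;

final class ReviewFindings {
    // Hypothetical resource that records whether it has been shut down.
    static final class Resource implements AutoCloseable {
        boolean closed;
        @Override
        public void close() {
            closed = true;
        }
    }

    // Null-safe id comparison: Objects.equals never throws,
    // unlike requestedId.equals(currentId) when requestedId is null.
    static boolean sameApp(String requestedId, String currentId) {
        return Objects.equals(requestedId, currentId);
    }

    // Shuts down the previous resource before handing back its replacement.
    static Resource replace(Resource previous, Resource next) {
        if (previous != null && previous != next) {
            previous.close();
        }
        return next;
    }
}
```

Whether closing the old instance is appropriate depends on who else may still hold a reference to it, which is the trade-off the review thread discusses.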

SteNicholas · 2026-06-17T21:22:10Z

@afterincomparableyum, could you please take a look at above comments?

afterincomparableyum · 2026-06-18T04:26:08Z

Ping @SteNicholas I've addressed comments.

github-actions Bot added module:client module:worker module:spark labels Jun 14, 2026

afterincomparableyum force-pushed the flaky-cicd branch from 141ae29 to 8197efa Compare June 14, 2026 16:06

github-actions Bot added the module:tests label Jun 14, 2026

github-actions Bot added module:spark and removed module:spark labels Jun 14, 2026

afterincomparableyum force-pushed the flaky-cicd branch 4 times, most recently from 9937f86 to dfa6397 Compare June 15, 2026 04:45

github-actions Bot removed the module:spark label Jun 15, 2026

afterincomparableyum changed the title ~~[CELEBORN-XXXX][WIP] fix flaky CI/CD~~ [CELEBORN-2362] Fix Flaky CI/CD Jun 15, 2026

afterincomparableyum force-pushed the flaky-cicd branch 5 times, most recently from 6855de2 to 35123c3 Compare June 15, 2026 13:39

github-actions Bot added kind:build module:master labels Jun 15, 2026

afterincomparableyum force-pushed the flaky-cicd branch from 3d7bb21 to 3d8b984 Compare June 16, 2026 13:47

afterincomparableyum marked this pull request as ready for review June 16, 2026 13:47

SteNicholas requested a review from Copilot June 16, 2026 13:56

Copilot started reviewing on behalf of SteNicholas June 16, 2026 14:02

Copilot AI reviewed Jun 16, 2026

Comment thread client/src/main/java/org/apache/celeborn/client/ShuffleClient.java Outdated

Comment thread worker/src/test/scala/org/apache/celeborn/service/deploy/MiniClusterFeature.scala Outdated

Comment thread build/mvn Outdated

Comment thread build/mvn

address copilot comments

4283c1c

afterincomparableyum force-pushed the flaky-cicd branch from 62529b0 to 4283c1c Compare June 17, 2026 02:37

SteNicholas reviewed Jun 17, 2026

afterincomparableyum added 2 commits June 17, 2026 21:18

add comment explaining why we do not need instance shutdown() in shuf…

23ad790

…fleclient, adjust mvn build script, and also remove duplicate code in disk suite.

stay below port limits

b4d6dc4

Fix flaky JVMQuakeSuite hang by driving the GC deficit bucket determi…

e82023b

…nistically via an extracted JVMQuake.checkAndDump instead of inducing real GC pressure

afterincomparableyum force-pushed the flaky-cicd branch from bd8a3e4 to e82023b Compare June 19, 2026 00:07
