[CELEBORN-2362] Fix Flaky CI/CD #3737
Codecov Report ❌ Patch coverage is …
Coverage diff, main vs. #3737:
| Metric | main | #3737 | +/- |
|---|---|---|---|
| Coverage | 66.91% | 57.38% | -9.53% |
| Complexity | 0 | 214 | +214 |
| Files | 358 | 395 | +37 |
| Lines | 21986 | 27822 | +5836 |
| Branches | 1946 | 2712 | +766 |
| Hits | 14710 | 15962 | +1252 |
| Misses | 6262 | 10706 | +4444 |
| Partials | 1014 | 1154 | +140 |
@afterincomparableyum, is it ready? This fix would help keep CI running reliably.
Address several independent root causes of CI/CD flakiness:
- build/mvn: default to archive.apache.org instead of the closer.lua
mirror redirector, validate each download is a real tarball (closer.lua
intermittently returns an HTML mirror page with HTTP 200), and retry up
to 3 times.
- pom.xml: exclude io/netty/** from JaCoCo instrumentation. JaCoCo's
agent rejects Netty 4.2's already-enhanced JFR event classes, tearing
down channels mid-write and flaking the Flink integration tests.
- MiniClusterFeature: cap randomly selected ports below the ephemeral
floor (32768) to avoid a TOCTOU race with OS-assigned/TIME_WAIT ports
that look free at selection time but fail to bind.
- RatisMasterStatusSystemSuiteJ: re-point each server to a fresh storage
directory on every start attempt, since Ratis releases the directory
lock asynchronously and a retry would otherwise hit "directory is
already locked".
- spark-it suites: reduce worker count from 5 to 3 (the MiniCluster
default) to cut the CPU/thread footprint of the serial single-JVM
suites, which otherwise starves RPC/fetch handlers past the 240s timeout.
- LifecycleManagerUnregisterShuffleSuite: shorten the expired-check
interval to 5s so the shuffle unregister runs within the eventually()
window with retry margin.
@SteNicholas it is ready now. This should resolve a lot of the CI/CD flakes. One job may still fail every now and then, but this is an improvement over before.
Pull request overview
This PR targets CI/CD flakiness across the build toolchain, coverage instrumentation, and integration test stability (Spark/Flink) by reducing resource contention, eliminating known race conditions, and making startup/cleanup more deterministic.
Changes:
- Harden build tooling and coverage: make Maven download deterministic + validated/retried; exclude `io/netty/**` from JaCoCo instrumentation to avoid Netty JFR class instrumentation failures.
- Stabilize mini-cluster + integration tests: avoid ephemeral-port collisions, improve worker startup retry behavior, and enforce Spark/ShuffleClient cleanup between suites.
- Improve client/runtime correctness: adjust heartbeat scheduling, guard revive logic, and make the process-wide ShuffleClient singleton app-aware.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| worker/src/test/scala/org/apache/celeborn/service/deploy/MiniClusterFeature.scala | Bounds random port selection below ephemeral range; improves worker startup retry/cleanup behavior. |
| tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/SparkTestBase.scala | Reduces worker count; adds SparkSession + ShuffleClient cleanup between suites. |
| tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/ShuffleFallbackSuite.scala | Reduces worker count used by the suite to lower contention. |
| tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/memory/MemorySparkTestBase.scala | Reduces worker count for memory-storage suites to reduce CPU/thread pressure. |
| tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/CelebornHashCheckDiskSuite.scala | Adds per-test SparkSession + ShuffleClient reset to prevent cross-test interference. |
| tests/spark-it/src/test/scala/org/apache/celeborn/tests/client/LifecycleManagerUnregisterShuffleSuite.scala | Shortens shuffle expired-check interval to fit within eventually() window reliably. |
| tests/flink-it/src/test/scala/org/apache/celeborn/tests/flink/HybridShuffleWordCountTest.scala | Waits for async local flushing to complete before asserting on-disk file sizes. |
| pom.xml | Excludes Netty classes from JaCoCo agent instrumentation to prevent runtime channel failures. |
| master/src/test/java/org/apache/celeborn/service/deploy/master/clustermeta/ha/RatisMasterStatusSystemSuiteJ.java | Reconfigures Ratis storage dirs on retries to avoid async lock-release flakiness. |
| client/src/main/scala/org/apache/celeborn/client/ApplicationHeartbeater.scala | Delays first heartbeat by one interval to avoid firing before full registration. |
| client/src/main/java/org/apache/celeborn/client/ShuffleClientImpl.java | Guards revive path against null old locations to avoid NPEs. |
| client/src/main/java/org/apache/celeborn/client/ShuffleClient.java | Makes static ShuffleClient singleton app-aware (rebuild on app id change, clear on reset). |
| build/mvn | Makes Maven download deterministic (archive), validates tarball bodies, and retries downloads. |
@SteNicholas I addressed the Copilot comments.
I'll revert the Copilot suggestions that broke CI in a few hours and ping again when it is ready.
Ping @SteNicholas, the PR is ready for review again.
SteNicholas left a comment:
Code review: 6 findings (2 bugs, 2 cleanup, 2 minor).
Top issues:
- Resource leak: the new `appUniqueId`-mismatch branch in `ShuffleClient.get()` replaces `_instance` without calling `shutdown()` on the old one, leaking transport connections, thread pools, and RPC references.
- NPE risk: the outer guard calls `appUniqueId.equals(…)`, which will throw an NPE if the parameter is null.
Also noted (not in diff, no inline comment):
- `MasterClusterFeature.selectRandomPort()` (line 53) and `RatisMasterStatusSystemSuiteJ` (line 134) still use `selectRandomInt(1024, 65535)`; the ephemeral-port collision fix applied to `MiniClusterFeature` was not propagated to these files.
See inline comments for details.
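The two top issues can be illustrated with a minimal sketch. This is not the code from the diff: `SketchClient` and `SketchClientHolder` are hypothetical stand-ins, and the real `ShuffleClient.get()` takes more parameters. The sketch only shows one way to shut down the old instance before replacing it and to compare ids null-safely.

```java
import java.util.Objects;

// Hypothetical stand-in for the real client; only the lifecycle matters here.
final class SketchClient {
    final String appUniqueId;
    private boolean shutDown = false;

    SketchClient(String appUniqueId) {
        this.appUniqueId = appUniqueId;
    }

    // The real client would release connections, thread pools, and RPC references here.
    void shutdown() {
        shutDown = true;
    }

    boolean isShutDown() {
        return shutDown;
    }
}

// Hypothetical holder for a process-wide, app-aware singleton.
final class SketchClientHolder {
    private static SketchClient instance;

    private SketchClientHolder() {}

    // Returns the cached client, rebuilding it when the requested app id differs.
    static synchronized SketchClient get(String appUniqueId) {
        // Objects.equals is null-safe, unlike appUniqueId.equals(...).
        if (instance == null || !Objects.equals(instance.appUniqueId, appUniqueId)) {
            if (instance != null) {
                instance.shutdown(); // release the old instance before replacing it
            }
            instance = new SketchClient(appUniqueId);
        }
        return instance;
    }

    // Shuts down and clears the cached client, if any.
    static synchronized void reset() {
        if (instance != null) {
            instance.shutdown();
            instance = null;
        }
    }
}
```

With this shape, a request for a different app id never leaves the previous client running, and a null id is handled without an exception.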
@afterincomparableyum, could you please take a look at the above comments?
…fleclient, adjust mvn build script, and also remove duplicate code in disk suite.
Ping @SteNicholas, I've addressed the comments.
…nistically via an extracted JVMQuake.checkAndDump instead of inducing real GC pressure
What changes were proposed in this pull request?
Address several independent root causes of CI/CD flakiness, spanning the
build tooling, JaCoCo instrumentation, the test mini-cluster, and the
process-wide shuffle client.
Build & coverage:
- build/mvn: default to archive.apache.org instead of the closer.lua
mirror redirector, validate each download is a real tarball (closer.lua
intermittently returns an HTML mirror page with HTTP 200), and retry up
to 3 times.
- pom.xml: exclude io/netty/** from JaCoCo instrumentation. JaCoCo's
agent rejects Netty 4.2's already-enhanced JFR event classes, tearing
down channels mid-write and flaking the Flink integration tests.
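As an illustration of the validate-and-retry idea for downloads, here is a minimal Java sketch. The real logic lives in the build/mvn shell script and is not shown in this conversation; `DownloadSketch`, `looksLikeGzip`, and `fetchWithRetry` are hypothetical names, and checking the gzip magic bytes is an assumed validation method, not necessarily the one the script uses.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper: validate a downloaded body and retry a bounded number of times.
final class DownloadSketch {
    private DownloadSketch() {}

    // A gzip stream starts with the magic bytes 0x1f 0x8b; an HTML mirror page does not.
    static boolean looksLikeGzip(byte[] body) {
        return body != null
            && body.length >= 2
            && (body[0] & 0xff) == 0x1f
            && (body[1] & 0xff) == 0x8b;
    }

    // Calls fetch up to maxAttempts times and returns the first body that passes validation.
    static Optional<byte[]> fetchWithRetry(Supplier<byte[]> fetch, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            byte[] body = fetch.get();
            if (looksLikeGzip(body)) {
                return Optional.of(body);
            }
        }
        return Optional.empty(); // every attempt returned something that is not a gzip archive
    }
}
```

The point of checking the body rather than the status code is that an HTTP 200 response carrying an HTML page would otherwise be treated as a successful download.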
Client correctness:
- ShuffleClient: make the process-wide singleton app-aware. Track the
appUniqueId of the live instance and rebuild it when a later app/suite
requests a different id (and clear it in reset()), so a straggler from a
previous app can no longer bind to the wrong LifecycleManager through the
shared static client.
- ShuffleClientImpl: guard the revive path against a null oldLoc before
removing it from pushExcludedWorkers, avoiding an NPE when no previous
location is mapped.
- ApplicationHeartbeater: start the heartbeat timer after one interval
instead of at delay 0, so the first heartbeat does not fire before the
application is fully registered.
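The heartbeat timing change can be sketched with the standard `ScheduledExecutorService` API, where the initial delay is a separate argument from the period. This is a hedged illustration only: `HeartbeatSketch` and `start` are hypothetical names, and whether ApplicationHeartbeater uses `scheduleWithFixedDelay` or a different scheduling call is an assumption.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical helper showing an initial delay of one interval instead of 0.
final class HeartbeatSketch {
    private HeartbeatSketch() {}

    // Schedules heartbeat so that the first run happens one full interval after start.
    static ScheduledFuture<?> start(
            ScheduledExecutorService timer, Runnable heartbeat, long intervalMs) {
        // Second argument is the initial delay; passing 0 here would fire immediately.
        return timer.scheduleWithFixedDelay(
            heartbeat, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
    }
}
```

Delaying the first run by one interval gives registration time to complete before any heartbeat is sent.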
Mini-cluster & test harness:
- MiniClusterFeature: cap randomly selected ports below the ephemeral
floor (32768) to avoid a TOCTOU race with OS-assigned/TIME_WAIT ports
that look free at selection time but fail to bind. Also rework worker
startup: recreate the worker on each retry, handle InterruptedException
explicitly (stop + interrupt + rethrow), and back off exponentially
between attempts so a transient port collision retries cleanly.
- RatisMasterStatusSystemSuiteJ: re-point each server to a fresh storage
directory on every start attempt, since Ratis releases the directory
lock asynchronously and a retry would otherwise hit "directory is
already locked".
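The port cap and the exponential backoff described above can be sketched as follows. The actual MiniClusterFeature is Scala and its code is not shown here; `PortSketch`, `selectRandomPort`, and `backoffMs` are hypothetical names, and the base delay and cap are illustrative values rather than the ones used in the PR.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: pick ports below the ephemeral range and compute retry delays.
final class PortSketch {
    static final int MIN_PORT = 1024;
    // Lower bound of the default Linux ephemeral port range.
    static final int EPHEMERAL_FLOOR = 32768;

    private PortSketch() {}

    // Picks a port in [1024, 32768), so it cannot collide with OS-assigned ephemeral ports.
    static int selectRandomPort() {
        return ThreadLocalRandom.current().nextInt(MIN_PORT, EPHEMERAL_FLOOR);
    }

    // Exponential backoff: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
    static long backoffMs(int attempt, long baseMs, long maxMs) {
        int shift = Math.min(Math.max(attempt, 0), 20); // bound the shift to avoid overflow
        return Math.min(baseMs << shift, maxMs);
    }
}
```

A caller would select a port, try to bind, and on failure sleep for `backoffMs(attempt, ...)` before recreating the worker and trying again.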
Spark/Flink integration suites:
- SparkTestBase / CelebornHashCheckDiskSuite: stop any active
SparkSession and reset the static ShuffleClient between suites/tests, so
a leaked SparkContext can no longer keep an old LifecycleManager alive
and corrupt a later suite's shuffle 0.
- spark-it suites: reduce worker count from 5 to 3 (the MiniCluster
default) to cut the CPU/thread footprint of the serial single-JVM
suites, which otherwise starves RPC/fetch handlers past the 240s timeout.
- LifecycleManagerUnregisterShuffleSuite: shorten the expired-check
interval to 5s so the shuffle unregister runs within the eventually()
window with retry margin.
- HybridShuffleWordCountTest: wait (eventually) for the async LocalFlusher
to drain before asserting on-disk file length equals the logical
length, instead of reading the file mid-flush.
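The "wait, then assert" pattern used for the async flush can be sketched as a small polling helper. The suites themselves rely on the test framework's own eventually(); `EventuallySketch` below is a hypothetical plain-Java equivalent, included only to show the shape of the technique.

```java
import java.util.function.BooleanSupplier;

// Hypothetical polling helper, similar in spirit to a test framework's eventually().
final class EventuallySketch {
    private EventuallySketch() {}

    // Polls condition until it is true or timeoutMs elapses; returns whether it became true.
    static boolean eventually(BooleanSupplier condition, long timeoutMs, long pollMs) {
        long start = System.nanoTime();
        long timeoutNanos = timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - start >= timeoutNanos) {
                return false;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

A test would pass a condition such as "on-disk length equals logical length" and assert on the returned boolean, instead of reading the file once while a flush may still be in progress.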
Also fix the flaky JVMQuakeSuite hang by driving the GC deficit bucket deterministically via an extracted JVMQuake.checkAndDump instead of inducing real GC pressure.
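To show why extracting the check makes the test deterministic, here is a hedged sketch of a deficit bucket that is fed explicit GC and run times rather than measuring a live JVM. `DeficitBucketSketch` is a hypothetical class; the accrual formula (GC time added, run time multiplied by a weight and subtracted, floor at zero) is an assumption and not the verified behavior or signature of JVMQuake.checkAndDump.

```java
// Hypothetical deficit bucket: GC time fills it, run time (weighted) drains it.
final class DeficitBucketSketch {
    private final long thresholdMs;
    private final double runtimeWeight;
    private long deficitMs = 0;

    DeficitBucketSketch(long thresholdMs, double runtimeWeight) {
        this.thresholdMs = thresholdMs;
        this.runtimeWeight = runtimeWeight;
    }

    // Feeds one observation; returns true when the accumulated deficit exceeds the threshold.
    boolean check(long gcMs, long runMs) {
        long credit = (long) (runMs * runtimeWeight);
        deficitMs = Math.max(0L, deficitMs + gcMs - credit); // the bucket never goes negative
        return deficitMs > thresholdMs;
    }

    long currentDeficitMs() {
        return deficitMs;
    }
}
```

Because the inputs are plain numbers, a test can push the bucket over the threshold in a couple of calls without allocating memory or waiting for real collections.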
Why are the changes needed?
CI/CD currently fails constantly because of the flaky behavior described above. With these changes, CI/CD rarely fails.
Does this PR resolve a correctness bug?
Does this PR introduce any user-facing change?
How was this patch tested?
CI/CD