[CELEBORN-1896] delete data from failed to fetch shuffles by CodingCat · Pull Request #3109 · apache/celeborn

CodingCat · 2025-02-19T00:42:54Z

What changes were proposed in this pull request?

This is joint work with @YutingWang98.

Currently we have to wait for Spark shuffle object GC to clean up the disk space occupied by Celeborn shuffles.

As a result, if a shuffle fails to be fetched and is retried, the disk space occupied by the failed attempt cannot be cleaned up. We hit this issue internally when dealing with shuffles at the 100s-of-TB level in a single Spark application: any hiccup in the servers can double or even triple the disk usage.

This PR implements a mechanism to delete files from failed-to-fetch shuffles.

The main idea is simple: LifecycleManager triggers cleanup when it applies for a new Celeborn shuffle id for a retried shuffle write stage.

The tricky part is avoiding deletion of shuffle files that are still referred to by multiple downstream stages: the PR introduces RunningStageManager to track the dependency between stages.
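
The reference-tracking idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `ShuffleReferenceTracker`, `addReferringStage`, and `canDelete` are hypothetical names, and the set of running stage ids is passed in by the caller as a stand-in for whatever RunningStageManager reports.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: tracks which stages read each Celeborn shuffle.
final class ShuffleReferenceTracker {
  // celebornShuffleId -> ids of the stages that asked to read it
  private final Map<Integer, Set<Integer>> referringStages = new HashMap<>();

  // Record that `stageId` reads `celebornShuffleId`.
  synchronized void addReferringStage(int celebornShuffleId, int stageId) {
    referringStages
        .computeIfAbsent(celebornShuffleId, id -> new HashSet<>())
        .add(stageId);
  }

  // True if no running stage other than `retryingStageId` still refers to the shuffle.
  synchronized boolean canDelete(
      int celebornShuffleId, int retryingStageId, Set<Integer> runningStageIds) {
    Set<Integer> stages =
        referringStages.getOrDefault(celebornShuffleId, Collections.<Integer>emptySet());
    for (int stageId : stages) {
      if (stageId != retryingStageId && runningStageIds.contains(stageId)) {
        return false; // another running stage still needs these files
      }
    }
    return true;
  }
}
```

Both methods take the same lock, so the check in `canDelete` cannot interleave with a concurrent `addReferringStage`. The actual PR also considers whether the shuffle was committed successfully, which this sketch leaves out.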

Why are the changes needed?

saving disk space

Does this PR introduce any user-facing change?

a new config

How was this patch tested?

We manually deleted some files.

From the above screenshot we can see that originally we have shuffles 0 and 1. After shuffle 1 hit a chunk fetch failure, it triggered a retry of shuffle 0 (as shuffle 2), but at this moment shuffle 0 has already been deleted from the workers.

In the logs, we can see that in the middle of the application, the unregister shuffle request was sent for shuffle 0.

codecov · 2025-03-09T20:13:47Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 63.62%. Comparing base (4bacd1f) to head (eed6ba5).
Report is 62 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #3109       +/-   ##
===========================================
+ Coverage   32.63%   63.62%   +30.99%     
===========================================
  Files         341      343        +2     
  Lines       20422    20819      +397     
  Branches     1820     1835       +15     
===========================================
+ Hits         6663    13243     +6580     
+ Misses      13387     6617     -6770     
- Partials      372      959      +587


FMX · 2025-03-19T08:34:55Z

+      appShuffleIdentifier: String): Unit = {
+    val Array(appShuffleId, stageId, _) = appShuffleIdentifier.split('-')
+    lifecycleManager.get().getShuffleIdMapping.get(appShuffleId.toInt).foreach {
+      case (pastAppShuffleIdentifier, (celebornShuffleId, _)) => {


You have skipped the input parameter of celebornShuffleId.

FMX · 2025-03-19T08:38:41Z

+    lifecycleManager.compareAndSet(null, ref)
+  }
+
+  private def noRunningDownstreamStage(shuffleId: Int): Boolean = {


The input parameter should be celebornShuffleId.

oh, yes, fixed

FMX · 2025-03-19T08:38:59Z

+          || onlyCurrentStageReferred(celebornShuffleId, stageId.toInt)
+          || noRunningDownstreamStage(celebornShuffleId)
+          || !committedSuccessfully(celebornShuffleId)) {
+          val Array(_, stageId, attemptId) = pastAppShuffleIdentifier.split('-')


Unused definition.

FMX · 2025-03-19T08:40:47Z

+    ret
+  }
+
+  private val cleanerThread = new Thread() {


This can be replaced by newDaemonSingleThreadScheduledExecutor and scheduleWithFixedDelay.

There could also be a parameter to configure the interval at which failed shuffles are cleaned.
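
A rough sketch of that suggestion using only JDK classes, since the exact signature of Celeborn's ThreadUtils helper is not shown here. `PeriodicCleaner`, `cleanOnce`, and `intervalMillis` are hypothetical names; the interval would come from whatever config the PR ends up exposing.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: runs a cleaning pass repeatedly on one daemon thread.
final class PeriodicCleaner {
  private final ScheduledExecutorService pool;

  PeriodicCleaner(Runnable cleanOnce, long intervalMillis) {
    // Single daemon thread, so the cleaner never keeps the JVM alive.
    pool = Executors.newSingleThreadScheduledExecutor(runnable -> {
      Thread thread = new Thread(runnable, "failed-shuffle-cleaner");
      thread.setDaemon(true);
      return thread;
    });
    Runnable guarded = () -> {
      try {
        cleanOnce.run();
      } catch (RuntimeException e) {
        // An uncaught exception would cancel all later runs, so report and continue.
        System.err.println("clean pass failed: " + e);
      }
    };
    // The delay is measured from the end of one pass to the start of the next.
    pool.scheduleWithFixedDelay(guarded, 0L, intervalMillis, TimeUnit.MILLISECONDS);
  }

  void stop() {
    pool.shutdownNow();
  }
}
```

Compared with a hand-written thread loop, the executor handles sleeping and shutdown, and the fixed delay means a slow pass cannot cause passes to pile up.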

FMX · 2025-03-19T08:48:38Z

  }

+  // expecting celeborn shuffle id and application shuffle identifier
+  @volatile private var getShuffleIdForWriterCallback: Option[BiConsumer[Integer, String]] = None


Suggested change

-  @volatile private var getShuffleIdForWriterCallback: Option[BiConsumer[Integer, String]] = None
+  @volatile private var validateCelebornShuffleIdForClean: Option[BiConsumer[Integer, String]] = None

FMX · 2025-03-19T08:49:10Z

+    getShuffleIdForWriterCallback = Some(callback)
+  }
+  // expecting celeborn shuffle id and application shuffle identifier
+  @volatile private var getShuffleIdForReaderCallback: Option[BiConsumer[Integer, String]] = None


Suggested change

-  @volatile private var getShuffleIdForReaderCallback: Option[BiConsumer[Integer, String]] = None
+  @volatile private var recordShuffleIdReference: Option[BiConsumer[Integer, String]] = None

RexXiong · 2025-04-14T02:52:57Z

+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.scheduler


Is the package org.apache.spark.scheduler truly necessary in Celeborn?

yes, we need to access runningStages in DAGScheduler which is a private[scheduler] variable

RexXiong · 2025-04-14T02:58:57Z

+
+  def addShuffleIdReferringStage(celebornShuffleId: Int, appShuffleIdentifier: String): Unit = {
+    // this is only implemented/tested with Spark for now
+    val Array(_, stageId, _) = appShuffleIdentifier.split('-')


Can we put this together with SparkUtils.appShuffleIdentifier?

you mean make it as a method of SparkUtils?

we may not be able to do that since SparkUtils is a spark version specific class, but for the reason mentioned above, FailedShuffleCleaner has to be in spark-common

Could we try to move to SparkCommonUtils?

sure, moved
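
For context, the snippets in this review split the identifier on '-' into an application shuffle id, a stage id, and an attempt id. A shared helper along these lines is one way to keep that parsing in a single place; `AppShuffleIdentifier` and `parse` are hypothetical names, not the method that was actually moved into SparkCommonUtils.

```java
// Hypothetical sketch: parses "appShuffleId-stageId-attemptId" in one place.
final class AppShuffleIdentifier {
  final int appShuffleId;
  final int stageId;
  final int stageAttemptId;

  private AppShuffleIdentifier(int appShuffleId, int stageId, int stageAttemptId) {
    this.appShuffleId = appShuffleId;
    this.stageId = stageId;
    this.stageAttemptId = stageAttemptId;
  }

  // Splits on '-' and rejects anything that does not have exactly three parts.
  static AppShuffleIdentifier parse(String identifier) {
    String[] parts = identifier.split("-");
    if (parts.length != 3) {
      throw new IllegalArgumentException(
          "expected appShuffleId-stageId-attemptId, got: " + identifier);
    }
    return new AppShuffleIdentifier(
        Integer.parseInt(parts[0]), Integer.parseInt(parts[1]), Integer.parseInt(parts[2]));
  }
}
```

Callers that only need the stage id would then read `AppShuffleIdentifier.parse(id).stageId` instead of repeating the split at each call site.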

RexXiong · 2025-04-14T03:24:42Z

+    FailedShuffleCleaner.addShuffleIdToBeCleaned(appShuffleIdentifier);
+  }
+
+  public static void addShuffleIdRefCount(


Better rename addShuffleIdRefCount to addShuffleIdReferringStage?

RexXiong · 2025-04-14T03:26:45Z

+  def addShuffleIdReferringStage(celebornShuffleId: Int, appShuffleIdentifier: String): Unit = {
+    // this is only implemented/tested with Spark for now
+    val Array(_, stageId, _) = appShuffleIdentifier.split('-')
+    celebornShuffleIdToReferringStages.putIfAbsent(celebornShuffleId, new mutable.HashSet[Int]())


should be ConcurrentSet?

ah, as mentioned at #3109 (comment), I found I still need an explicit lock over this HashSet, so ConcurrentHashSet may not help here

FMX · 2025-04-14T03:10:23Z

+  }
+
+  public static void addShuffleIdRefCount(
+      LifecycleManager lifecycleManager, int celebornShuffeId, String appShuffleIdentifier) {


Here is a typo. celebornShuffeId --> celebornShuffleId

oops, fixed

FMX · 2025-04-14T03:13:13Z

            lifecycleManager.registerInvalidatedBroadcastCallback(
                shuffleId -> SparkUtils.invalidateSerializedGetReducerFileGroupResponse(shuffleId));
          }
+          if (lifecycleManager.conf().clientFetchCleanFailedShuffle()) {


These lines are duplicates of lines 159~172.

oops, some merge error, fixed

CodingCat · 2025-04-18T21:29:22Z

@FMX @RexXiong thank you for the review, I addressed the comments, would you please take another look?

RexXiong · 2025-04-21T03:04:25Z

+class RunningStageManagerImpl extends RunningStageManager {
+  private def dagScheduler = SparkContext.getActive.get.dagScheduler
+  override def isRunningStage(stageId: Int): Boolean = {
+    dagScheduler.runningStages.map(_.id).contains(stageId)


How about using reflection to get the value of runningStages? IMO we had better not declare a package that belongs to another project. See SparkCommonUtils.

sounds reasonable, changed
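
The general technique looks roughly like this with plain `java.lang.reflect`. `FakeScheduler` and its `runningStageIds` field are stand-ins invented for the example so that it runs without Spark on the classpath; `PrivateFieldReader` is likewise a hypothetical helper, not code from this PR.

```java
import java.lang.reflect.Field;
import java.util.HashSet;
import java.util.Set;

// Stand-in for a class whose state is not publicly accessible.
final class FakeScheduler {
  private final Set<Integer> runningStageIds = new HashSet<>();

  FakeScheduler(int... ids) {
    for (int id : ids) {
      runningStageIds.add(id);
    }
  }
}

// Hypothetical helper: reads a non-public field without sharing its package.
final class PrivateFieldReader {
  static Object read(Object target, String fieldName) throws ReflectiveOperationException {
    // getDeclaredField only sees fields declared on this exact class, not on superclasses.
    Field field = target.getClass().getDeclaredField(fieldName);
    field.setAccessible(true);
    return field.get(target);
  }
}
```

The trade-off is that a renamed or removed field only fails at run time, so code using this approach usually needs a fallback for when the lookup throws.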

RexXiong · 2025-04-21T03:37:10Z

+
+  def addShuffleIdReferringStage(celebornShuffleId: Int, appShuffleIdentifier: String): Unit = {
+    // this is only implemented/tested with Spark for now
+    val Array(_, stageId, _) = appShuffleIdentifier.split('-')


Could we try to move to SparkCommonUtils?

CodingCat · 2025-04-21T21:39:05Z

@RexXiong I addressed all comments, please let me know about any further suggestions

RexXiong · 2025-04-25T09:00:32Z

Thanks @CodingCat I left some comments, please take a look when you have time.

RexXiong

LGTM, thanks!

turboFei · 2025-04-28T21:52:55Z

+      throws ClassNotFoundException, NoSuchFieldException, IllegalAccessException {
+    Class<?> stageClass = Class.forName("org.apache.spark.scheduler.Stage");
+    idField = stageClass.getDeclaredField("id");
+    idField.setAccessible(true);


Could you use DynFields, like this:

celeborn/client-spark/spark-2/src/main/java/org/apache/spark/shuffle/celeborn/SparkUtils.java, lines 213 to 214 in a211056:

private static final DynFields.UnboundField shuffleIdToMapStage_FIELD =
    DynFields.builder().hiddenImpl(DAGScheduler.class, "shuffleIdToMapStage").build();

And it can be static.

turboFei · 2025-04-28T21:54:54Z

+    try {
+      DAGScheduler dagScheduler = SparkContext$.MODULE$.getActive().get().dagScheduler();
+      Class<?> dagSchedulerClz = SparkContext$.MODULE$.getActive().get().dagScheduler().getClass();
+      Field runningStagesField = dagSchedulerClz.getDeclaredField("runningStages");


DynFields

turboFei · 2025-04-28T23:55:13Z

+  }
+
+  private var cleanerThreadPool = ThreadUtils.newDaemonSingleThreadScheduledExecutor(
+    "failedShuffleCleanerThreadPool")


It seems this would always launch the pool, even if the feature is not enabled.
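
One way to address this is to create the pool lazily and only when the feature flag is on. The sketch below uses JDK executors; `LazyCleanerPool` and `featureEnabled` are hypothetical names, and how the flag is read from the Celeborn config is left out.

```java
import java.util.Optional;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

// Hypothetical sketch: the thread pool is created on first use, and never if disabled.
final class LazyCleanerPool {
  private final boolean featureEnabled;
  private ScheduledExecutorService pool; // stays null until first needed

  LazyCleanerPool(boolean featureEnabled) {
    this.featureEnabled = featureEnabled;
  }

  // Returns the pool, creating it on the first call; empty when the feature is off.
  synchronized Optional<ScheduledExecutorService> get() {
    if (!featureEnabled) {
      return Optional.empty();
    }
    if (pool == null) {
      pool = Executors.newSingleThreadScheduledExecutor(runnable -> {
        Thread thread = new Thread(runnable, "failed-shuffle-cleaner");
        thread.setDaemon(true);
        return thread;
      });
    }
    return Optional.of(pool);
  }

  synchronized boolean isStarted() {
    return pool != null;
  }

  synchronized void stop() {
    if (pool != null) {
      pool.shutdownNow();
      pool = null;
    }
  }
}
```

With the flag off, no thread is ever created, so applications that do not use the feature pay nothing for it.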

turboFei · 2025-04-28T23:56:31Z


            lifecycleManager.registerShuffleTrackerCallback(
                shuffleId -> SparkUtils.unregisterAllMapOutput(mapOutputTracker, shuffleId));
-


nit: unnecessary change

CodingCat · 2025-04-29T14:08:33Z

@YutingWang98 taking a look for the upstream version?

YutingWang98 · 2025-04-29T17:42:42Z

@YutingWang98 taking a look for the upstream version?

Thanks! lgtm

CodingCat · 2025-04-29T23:13:32Z

@turboFei thanks for the review! anything else I need to address?

turboFei · 2025-04-29T23:32:36Z

Left some comments in CodingCat#1

turboFei · 2025-04-29T23:32:58Z

gentle ping @FMX

CodingCat · 2025-05-17T01:12:51Z

I removed some tests from the PR as they are pretty flaky in GitHub CI, even though they have been running internally for months without issues, and I cannot reproduce the failures on my laptop.

The current failures are not really related to my PR.

turboFei

Overall LGTM, thanks for the efforts.
Left some comments


turboFei · 2025-05-20T15:36:46Z

-        .config(updateSparkConf(sparkConf, ShuffleMode.HASH))
-        .config("spark.sql.shuffle.partitions", 2)
-        .config("spark.celeborn.shuffle.forceFallback.partition.enabled", false)
-        .config("spark.celeborn.client.spark.stageRerun.enabled", "false")


The UT is broken. @CodingCat

Some UTs have special configs; for this UT, it is spark.celeborn.client.spark.stageRerun.enabled=false.

I recommend reverting the refactor of the test module to prevent mistakes.


turboFei · 2025-05-20T16:17:15Z

        baseBuilder.config("spark.celeborn.client.spark.fetch.cleanFailedShuffle", "true")
      } else {
-        baseBuilder
+        baseBuilder.config("spark.celeborn.client.spark.stageRerun.enabled", "false")


it is not generic

In fact, this config is usually true.

turboFei · 2025-05-20T16:19:13Z

@@ -86,13 +76,7 @@ class CelebornFetchFailureSuite extends AnyFunSuite

  test("celeborn spark integration test - unregister shuffle with throwsFetchFailure disabled") {


It is special here, please see the UT name,

test("celeborn spark integration test - unregister shuffle with throwsFetchFailure disabled")

The current abstraction of createSparkSession is not generic and hard to extend.


CodingCat · 2025-05-20T19:52:01Z

@turboFei I think the code is now in better shape, and the tests also seem to have stabilized.

FMX

LGTM. Thanks. Merged into main(v0.6.0).

CodingCat changed the title from "[WIP] clean failed shuffle disk" to "[CELEBORN-1896] delete data from failed to fetch shuffles" on Mar 7, 2025

CodingCat force-pushed the delete_fi branch from 401e023 to d834d64 on March 7, 2025 04:36

FMX reviewed Mar 19, 2025

Comment thread client-spark/common/src/main/scala/org/apache/celeborn/spark/FailedShuffleCleaner.scala (outdated)

FMX reviewed Mar 19, 2025

Comment thread client-spark/common/src/main/scala/org/apache/celeborn/spark/FailedShuffleCleaner.scala (outdated)

github-actions bot added the module:client, module:spark, kind:documentation, module:common, and module:tests labels on Apr 4, 2025

CodingCat force-pushed the delete_fi branch 2 times, most recently from 4263659 to 7cbe1c8, on April 7, 2025 01:54

RexXiong reviewed Apr 14, 2025

FMX reviewed Apr 14, 2025

RexXiong reviewed Apr 21, 2025

RexXiong approved these changes Apr 28, 2025

turboFei reviewed Apr 28, 2025

Comment thread common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala (outdated)

turboFei reviewed Apr 28, 2025

Comment thread common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala (outdated)

turboFei reviewed Apr 28, 2025

YutingWang98 approved these changes Apr 29, 2025

CodingCat added 4 commits May 16, 2025 14:56

10000

518a680

10 mins

5fcb5da

20 mins

c6af16d

delete most of tests

23539b4

turboFei reviewed May 18, 2025

Comment thread ...k-it/src/test/scala/org/apache/celeborn/tests/spark/fetch/failure/FetchFailureTestBase.scala (outdated)

turboFei reviewed May 18, 2025

Comment thread tests/spark-it/src/test/scala/org/apache/spark/SparkContextHelper.scala (outdated)

turboFei approved these changes May 18, 2025

turboFei and others added 6 commits May 19, 2025 19:39

Merge branch 'main' into delete_fi

b0302ef

Update tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/…CelebornFetchFailureSuite.scala

3e1bd1a

Update tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/…CelebornFetchFailureSuite.scala

2fa6907

Update tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/…CelebornFetchFailureSuite.scala

a2b23f9

NIT

757fb1a

style

9f9c2aa

turboFei reviewed May 20, 2025


turboFei and others added 3 commits May 20, 2025 08:52

Revert tests module refactor

bc133c9

Revert "Revert tests module refactor"

3679598

This reverts commit bc133c9.

fix ut

a13b3f0

turboFei reviewed May 20, 2025


CodingCat added 3 commits May 20, 2025 09:26

Revert "fix ut"

9264c88

This reverts commit a13b3f0.

Revert "Revert "Revert tests module refactor""

37365bc

This reverts commit 3679598.

further clean up

4f513d6

FMX approved these changes May 21, 2025


FMX closed this in 0b5a09a May 21, 2025

SteNicholas mentioned this pull request Jul 9, 2025

[GLUTEN-10151][CELEBORN] Bump Celeborn version to 0.6.0 apache/gluten#10152

Merged

turboFei mentioned this pull request Dec 12, 2025

Pinterest open source: early shuffle deletion #3564

Closed

