[CELEBORN-1896] delete data from failed to fetch shuffles #3109
Closed
Changes from 115 commits
Commits
8668754 compilable (CodingCat)
1bf53bd fetch failure suite (CodingCat)
4f6acca add disk clean suite (CodingCat)
935f11b fix compilation error (CodingCat)
36af3f3 continue fixing compilation error (CodingCat)
3599e47 fix compilation error (CodingCat)
23a76fa param doc (CodingCat)
8932e4b change param ver (CodingCat)
4db2d88 add debug info (CodingCat)
422784e lint (CodingCat)
29d2094 try less number of workers (CodingCat)
136df5e ignore fetchfailure test for now to see whether it is concurrency issue (CodingCat)
7c4ff32 lint (CodingCat)
8714dce try 1 worker (CodingCat)
ee6dcaa resume celeborn fetch failure suite (CodingCat)
446bd71 make it only available for spark 3 (CodingCat)
f30b5a0 Revert "make it only available for spark 3" (CodingCat)
eac8e4b compatible with 2.11 (CodingCat)
35f6db0 fix rebase errors (CodingCat)
59ea64f more time to finish test (CodingCat)
1e471f8 Revert "more time to finish test" (CodingCat)
fbf36b4 add more msg got storage (CodingCat)
2ed92b2 remove first few tests and test what happened (CodingCat)
4d9139d test (CodingCat)
5b557b0 more test (CodingCat)
b6f7285 add back one more test (CodingCat)
23b230b one more test (CodingCat)
72a170c more debugging info (CodingCat)
fa80ed3 add back one more test (CodingCat)
e40c4c1 handle empty message (CodingCat)
413dbc6 rm useless println (CodingCat)
7339267 allow more time in the suspicious test (CodingCat)
3702259 more (CodingCat)
257e649 try to separate test and see whether it works (CodingCat)
574bc51 check more frequently (CodingCat)
7b9bcb4 override shutdown minicluster in expensive suite (CodingCat)
0a1dc80 try persist (CodingCat)
9fb22d9 move back test and see (CodingCat)
14c53a5 Revert "move back test and see" (CodingCat)
633fc2a addr comments1 (CodingCat)
c539533 addr comments 2 (CodingCat)
142966f addr comments 3 (CodingCat)
ed8ebff fix compilation (CodingCat)
80c397f use runnable to be compatible with spark 2 (CodingCat)
2476f3a update param doc (CodingCat)
9d705c3 fix NPE (CodingCat)
0f45cb8 fix tests (CodingCat)
d48e6a8 add debugging info2 (CodingCat)
d5344da remove flaky test (CodingCat)
b76fbfb addr comments (CodingCat)
61bae50 fix compile (CodingCat)
8a5692b fix spark 2 compile (CodingCat)
bbe9638 fix build (CodingCat)
e35f996 refactor encode/decode app identifier and remove runningstagemanagers (CodingCat)
c121ece stylistic fixes (CodingCat)
33b4145 addr comments (CodingCat)
c6e2f81 license (CodingCat)
5adfe0b addr comments (CodingCat)
a948f92 update param doc (CodingCat)
fc5142a addr comments (CodingCat)
e658641 comments (turboFei)
f407158 ensure type safe (CodingCat)
dbb1423 make it compilable with spark 2 (CodingCat)
d8ed331 add unit test to guard runningstagemanagerimpl (CodingCat)
c896f47 add unit test (CodingCat)
6ab91a7 add header (CodingCat)
e0c2ee7 update test (CodingCat)
b724959 RunningStageManager UT (turboFei)
ec21555 avoid using property (CodingCat)
5d8ade0 merge (CodingCat)
ef55def param fix (CodingCat)
b0d9bb2 handle indeterminstic case (CodingCat)
2395e26 resume tests (CodingCat)
b896120 lint (CodingCat)
0b39821 fix typos (CodingCat)
fb5a84e fix spark 2 (CodingCat)
4e1aa67 change debugging string (CodingCat)
7151992 simplify code (CodingCat)
018c75e addr comments (CodingCat)
43e50c6 4 mins (CodingCat)
3025a39 avoid driver oom (CodingCat)
c8ed30a 16g? (CodingCat)
68ed11e change (CodingCat)
6f1a7fe smaller test data? (CodingCat)
5e2b507 recover test data size to ensure enough partitions (CodingCat)
61ef8fc code cleanup (CodingCat)
6a10ef8 use more cores (CodingCat)
76e92e4 add back original test (CodingCat)
f63c81e stylistic fixes (CodingCat)
6a60db6 less data (CodingCat)
ba7b882 further reduce memory overhead (CodingCat)
e6f87fd addr comments (CodingCat)
00914c3 doc update (CodingCat)
c5a6495 10g (CodingCat)
802431f further reduce test data (CodingCat)
7b09c43 enlength timeout (CodingCat)
d8f0a27 recover to 240 (CodingCat)
7916059 rm one expensive test (CodingCat)
c5dbaf5 check faster (CodingCat)
8b99b7a check per sec (CodingCat)
e18a6d9 addr comments (CodingCat)
6ae7fe0 test param (CodingCat)
f595e57 data (CodingCat)
10f2c18 4g (CodingCat)
c03adcd addr comments2 (CodingCat)
518a680 10000 (CodingCat)
5fcb5da 10 mins (CodingCat)
c6af16d 20 mins (CodingCat)
23539b4 delete most of tests (CodingCat)
b0302ef Merge branch 'main' into delete_fi (turboFei)
3e1bd1a Update tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/… (turboFei)
2fa6907 Update tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/… (turboFei)
a2b23f9 Update tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/… (turboFei)
757fb1a NIT (turboFei)
9f9c2aa style (turboFei)
bc133c9 Revert tests module refactor (turboFei)
3679598 Revert "Revert tests module refactor" (CodingCat)
a13b3f0 fix ut (CodingCat)
9264c88 Revert "fix ut" (CodingCat)
37365bc Revert "Revert "Revert tests module refactor"" (CodingCat)
4f513d6 further clean up (CodingCat)
93 changes: 93 additions & 0 deletions
client-spark/common/src/main/scala/org/apache/celeborn/spark/FailedShuffleCleaner.scala
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,93 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one or more | ||
| * contributor license agreements. See the NOTICE file distributed with | ||
| * this work for additional information regarding copyright ownership. | ||
| * The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| * (the "License"); you may not use this file except in compliance with | ||
| * the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
| package org.apache.celeborn.spark | ||
| | ||
| import java.util | ||
| import java.util.concurrent.{LinkedBlockingQueue, ScheduledExecutorService, TimeUnit} | ||
| | ||
| import scala.collection.JavaConverters._ | ||
| import scala.collection.mutable | ||
| | ||
| import org.apache.spark.shuffle.celeborn.SparkCommonUtils | ||
| | ||
| import org.apache.celeborn.client.LifecycleManager | ||
| import org.apache.celeborn.common.internal.Logging | ||
| import org.apache.celeborn.common.util.ThreadUtils | ||
| | ||
| private[celeborn] class FailedShuffleCleaner(lifecycleManager: LifecycleManager) extends Logging { | ||
| | ||
| // in celeborn ids | ||
| private val shufflesToBeCleaned = new LinkedBlockingQueue[Int]() | ||
| private val cleanedShuffleIds = new mutable.HashSet[Int] | ||
| | ||
| private lazy val cleanInterval = | ||
| lifecycleManager.conf.clientFetchCleanFailedShuffleIntervalMS | ||
| | ||
| // for test | ||
| def reset(): Unit = { | ||
| shufflesToBeCleaned.clear() | ||
| cleanedShuffleIds.clear() | ||
| if (cleanerThreadPool != null) { | ||
| cleanerThreadPool.shutdownNow() | ||
| cleanerThreadPool = null | ||
| } | ||
| } | ||
| | ||
| def addShuffleIdToBeCleaned(appShuffleIdentifier: String): Unit = { | ||
| val Array(appShuffleId, _, _) = SparkCommonUtils.decodeAppShuffleIdentifier( | ||
| appShuffleIdentifier) | ||
| lifecycleManager.getShuffleIdMapping.get(appShuffleId.toInt).foreach { | ||
| case (_, (celebornShuffleId, _)) => shufflesToBeCleaned.put(celebornShuffleId) | ||
| } | ||
| } | ||
| | ||
| def init(): Unit = { | ||
| cleanerThreadPool = ThreadUtils.newDaemonSingleThreadScheduledExecutor( | ||
| "failedShuffleCleanerThreadPool") | ||
| cleanerThreadPool.scheduleWithFixedDelay( | ||
| new Runnable { | ||
| override def run(): Unit = { | ||
| try { | ||
| val allShuffleIds = new util.ArrayList[Int] | ||
| shufflesToBeCleaned.drainTo(allShuffleIds) | ||
| allShuffleIds.asScala.foreach { shuffleId => | ||
| if (!cleanedShuffleIds.contains(shuffleId)) { | ||
| lifecycleManager.unregisterShuffle(shuffleId) | ||
| logInfo( | ||
| s"sent unregister shuffle request for shuffle $shuffleId (celeborn shuffle id)") | ||
| cleanedShuffleIds += shuffleId | ||
| } | ||
| } | ||
| } catch { | ||
| case e: Exception => | ||
| logError("unexpected exception in cleaner thread", e) | ||
| } | ||
| } | ||
| }, | ||
| cleanInterval, | ||
| cleanInterval, | ||
| TimeUnit.MILLISECONDS) | ||
| } | ||
| | ||
| init() | ||
| | ||
| def removeCleanedShuffleId(celebornShuffleId: Int): Unit = { | ||
| cleanedShuffleIds.remove(celebornShuffleId) | ||
| } | ||
| | ||
| private var cleanerThreadPool: ScheduledExecutorService = _ | ||
| } | ||
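
For readers who want to try the drain-and-deduplicate pattern from `FailedShuffleCleaner` in isolation, here is a minimal sketch with no Celeborn dependency. It is not the PR's code: `CleanerSketch`, `enqueue`, `cleanOnce`, and `forget` are hypothetical names, the `unregister` callback stands in for `lifecycleManager.unregisterShuffle`, and the scheduled executor is left out so that one pass can be invoked directly. The sketch also assumes a concurrent set in place of `mutable.HashSet`, since the set may be touched from more than one thread.

```scala
import java.util
import java.util.concurrent.{ConcurrentHashMap, LinkedBlockingQueue}

import scala.collection.JavaConverters._

// Hypothetical stand-in for FailedShuffleCleaner. `unregister` replaces
// lifecycleManager.unregisterShuffle so the sketch has no Celeborn dependency.
class CleanerSketch(unregister: Int => Unit) {
  // Shuffle ids waiting to be cleaned, in arrival order.
  private val pending = new LinkedBlockingQueue[Int]()
  // Ids already cleaned. A concurrent set is an assumption of this sketch,
  // not the choice made in the diff above.
  private val cleaned = ConcurrentHashMap.newKeySet[Int]()

  def enqueue(shuffleId: Int): Unit = pending.put(shuffleId)

  // One pass of the periodic task: drain everything queued so far and
  // unregister each id at most once.
  def cleanOnce(): Unit = {
    val batch = new util.ArrayList[Int]()
    pending.drainTo(batch)
    batch.asScala.foreach { id =>
      // add returns true only when the id was not already in the set
      if (cleaned.add(id)) unregister(id)
    }
  }

  // Mirrors removeCleanedShuffleId: allow the id to be cleaned again later.
  def forget(shuffleId: Int): Unit = {
    cleaned.remove(shuffleId)
    ()
  }
}
```

Calling `enqueue` twice with the same id followed by one `cleanOnce()` invokes `unregister` a single time for that id; after `forget(id)`, a later `enqueue` and `cleanOnce()` will invoke it again.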