[CELEBORN-1855] LifecycleManager return appshuffleId for non barrier stage when fetch fail has been reported#3090
[CELEBORN-1855] LifecycleManager return appshuffleId for non barrier stage when fetch fail has been reported#3090buska88 wants to merge 8 commits into
Conversation
…stage when fetch fail has been reported
No.We mainly use branch-0.5.Would you please consider that merging these two pr into branch-0.5-rc?Those are important features and many users using branch-0.5 may need them. |
As for this pr, it occurs to me that when a task of a stage throws fetchFail exception, in the small duration between fetchFail exception reported and the stage is aborted, following read tasks cans still get appId and finish shuffle reading.Then when stage retrying, these tasks may not re compute, which save resources.If following tasks fail due to failing to get appId, then they will be recomputed in the next stage-retry inevitablely. |
This PR can be helpful in branch-0.5 because branch-0.5 doesn't have PR #2921. Drafting a new RC for branch-0.5 can be a good idea. |
Does this PR #2921 also need to be backported to the branch-0.5 branch? @FMX |
There are too many conflicts between PR#2921 and branch-0.5. So the PR is not backported. |
|
@buska88 It would be better to check the validity of a shuffle Id after you get it instead of using a shuffle ID that marked as invalid. You can add the check here |
Sorry, I don't understand the purpose of checking shuffle id here. If the shuffle id is invalid, what should we do next? |
Maybe we can also throw FetchFailureException and wait task retry |
OKay.I choose to throw FetchFailException in CelebornShuffleReader(shuffleclientImpl doesn't have spark dependency).And you can have a look at it when you have free time cc @FMX @RexXiong |
|
This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days. |
RexXiong
left a comment
There was a problem hiding this comment.
Very Sorry for the late reply, I missed the message
| val shuffleId = SparkUtils.celebornShuffleId(shuffleClient, handle, context, false) | ||
| var shuffleId = handle.shuffleId | ||
| try { | ||
| shuffleId = SparkUtils.celebornShuffleId(shuffleClient, handle, context, false) |
There was a problem hiding this comment.
In this scenario, Celeborn would throw a CelebornException, so catching a CelebornIOException is incorrect. Moreover, I believe we should not catch the CelebornException itself, as we cannot identify the outcome when encountering it. I propose adding a bool success field to the PbGetShuffleIdResponse to indicate whether the shuffleId is available or not. This way, the client can throw a FetchFailureException if necessary, thereby improving the handling and expression of the situation.
There was a problem hiding this comment.
Thanks for reviewing and i fix my code according to your views. cc @RexXiong
Find a new case for this pr.Task 29935 launch on executor 5432.When executor 5432 lost, SparkUtils cannot get TaskSetManager by taskid, because taskSet.removeRunningTask(tid) has been called.So this pr seems useful for this case.The task can throw a fetchfail exception and re run the stage |
| } | ||
|
|
||
| public static int celebornShuffleId( | ||
| public static Tuple2<Integer, Boolean> celebornShuffleId( |
There was a problem hiding this comment.
nit: Should we consider throwing a CelebornRuntimeException when an unavailable celebornShuffleId is encountered? This can allows to catch the exception within the CelebornShuffleReader while maintaining the existing method signature.
| var shuffleId = handle.shuffleId | ||
| try { | ||
| shuffleId = SparkUtils.celebornShuffleId(shuffleClient, handle, context, false) |
There was a problem hiding this comment.
| var shuffleId = handle.shuffleId | |
| try { | |
| shuffleId = SparkUtils.celebornShuffleId(shuffleClient, handle, context, false) | |
| val shuffleId = try { | |
| SparkUtils.celebornShuffleId(shuffleClient, handle, context, false) | |
| } catch { |
|
Thanks @buska88 Merge to main(v0.6.0) and also backport to branch-0.5(v0.5.5) |
What changes were proposed in this pull request?
for non barrier shuffle read stage, LifecycleManager#handleGetShuffleIdForApp always return appshuffleId whether fetch status is true or not.
Why are the changes needed?
As described in jira, If LifecycleManager only returns appshuffleId whose fetch status is success, the task will fail directly to "there is no finished map stage associated with", but previous fetch fail event reported may not be fatal.So just give it a chance
Does this PR introduce any user-facing change?
How was this patch tested?