Skip to content

[CELEBORN-1948] Fix the issue where replica may lose data when HARD_SPLIT occurs during handlePushMergeData.#3185

Closed
leixm wants to merge 5 commits into
apache:mainfrom
leixm:CELEBORN-1948
Closed

[CELEBORN-1948] Fix the issue where replica may lose data when HARD_SPLIT occurs during handlePushMergeData.#3185
leixm wants to merge 5 commits into
apache:mainfrom
leixm:CELEBORN-1948

Conversation

@leixm

@leixm leixm commented Mar 31, 2025

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Fix data lost when push merged data of replica and hard split happen.

Why are the changes needed?

There is a problem with replicate rpc callback. The code should satisfy the following conditions: when comparing the status returned by primary and replica data, the status on the left should be used as the final status for the client, FAILURE > HARD_SPLIT > CONGESTION > SUCCESS. The status on the right cannot cover the status on the left.

There are two problems with the code now

  1. HARD_SPLIT can cover FAILURE, which will affect the exclude worker logic, and there may be some problems
  2. When processing a pushMergedData request, some partitionLocations are committed, PushDataHandler cannot stop pushing replicas as long as there are any partitions that have not been committed, otherwise data loss will occur.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs.

@leixm

leixm commented Mar 31, 2025

Copy link
Copy Markdown
Contributor Author

cc @RexXiong @AngersZhuuuu

@turboFei turboFei added the bug Something isn't working label Mar 31, 2025

@turboFei turboFei left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks

@turboFei turboFei requested review from FMX and RexXiong April 1, 2025 18:55
@turboFei

turboFei commented Apr 7, 2025

Copy link
Copy Markdown
Member

ping @FMX and @RexXiong

@turboFei

Copy link
Copy Markdown
Member

cc @FMX and @RexXiong

@turboFei turboFei added the 0.6.0 label Apr 16, 2025
@RexXiong

Copy link
Copy Markdown
Contributor

cc @FMX and @RexXiong

We have discussed offline with @leixm . this issue has not been resolved yet. I will reproduce it again in the future for further investigation.

@turboFei

turboFei commented Apr 18, 2025

Copy link
Copy Markdown
Member

We have discussed offline with @leixm . this issue has not been resolved yet. I will reproduce it again in the future for further investigation.

We should resolve it before 0.6.0 release? @RexXiong

@RexXiong RexXiong left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, After testing, now this PR has no correctness issues. cc @turboFei

@turboFei

Copy link
Copy Markdown
Member

LGTM, After testing, now this PR has no correctness issues. cc @turboFei

Great work, thanks

@AngersZhuuuu AngersZhuuuu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Nice catch

@mridulm

mridulm commented Apr 24, 2025

Copy link
Copy Markdown
Contributor

+CC @akpatnam25

@FMX FMX left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@RexXiong

RexXiong commented May 9, 2025

Copy link
Copy Markdown
Contributor

After thorough testing, we eventually find the issue and have reserved the problem. ❤️❤️❤️ Thanks @leixm cc @turboFei

@RexXiong RexXiong changed the title [CELEBORN-1948] Worker handle replica data callback should not ignore HARD_SPLIT [CELEBORN-1948] Fix the issue where replica may lose data when HARD_SPLIT occurs during handlePushMergeData. May 9, 2025
@RexXiong RexXiong closed this in 7542adf May 9, 2025
@RexXiong

RexXiong commented May 9, 2025

Copy link
Copy Markdown
Contributor

Thanks, merge to main(v0.6.0)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

0.6.0 bug Something isn't working module:worker

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants