Skip to content

[CELEBORN-1577][FOLLOWUP] Fix backward compatiblity issue with interrupt shuffle#3675

Closed
s0nskar wants to merge 3 commits into
apache:mainfrom
s0nskar:fix_interrupt
Closed

[CELEBORN-1577][FOLLOWUP] Fix backward compatiblity issue with interrupt shuffle#3675
s0nskar wants to merge 3 commits into
apache:mainfrom
s0nskar:fix_interrupt

Conversation

@s0nskar

@s0nskar s0nskar commented May 6, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Fix backward compatibility issue with interrupt shuffle by checking if the reason is nonEmpty.

Why are the changes needed?

If someone uses a new client with old server which is not sending CheckQuotaResponse in HeartbeatFromApplicationResponse then proto uses default value to build CheckQuotaResponse with isAvailable=false and reason="". In this case the job will always fails without breaching the quota, we should not fail the job if the reason is empty to make it backward compatible.

Does this PR resolve a correctness bug?

No

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tested in local setup.

@RexXiong

RexXiong commented May 7, 2026

Copy link
Copy Markdown
Contributor

Using reason.nonEmpty as a proxy for "server supports quota check" is fragile — reason is a business description field, not a compatibility signal. If a legitimate quota-exceeded response happens to have an empty reason, it would be silently ignored; conversely, any future change to the reason format could break this logic.

Since quotaInterruptShuffleEnabled defaults to false, users must explicitly opt in. In a new-client/old-server scenario, users should simply not enable this config. This is an explicit and semantically clear approach, rather than relying on an implicit convention around the reason field.

If we do need robust backward compatibility, a better approach would be to add an explicit capability field (e.g., quotaCheckSupported) to the heartbeat response. But that's a larger change and can be addressed separately.

Reviewed by Claude Code

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the client-side quota interrupt handling in ApplicationHeartbeater to be backward compatible with older masters that don’t populate checkQuotaResponse in HeartbeatFromApplicationResponse, preventing accidental job cancellation due to protobuf default values.

Changes:

  • Gate shuffle interruption on CheckQuotaResponse.reason.nonEmpty in addition to !isAvailable.
  • Avoid treating protobuf default isAvailable=false, reason="" (from missing message field) as a real quota-exceeded signal.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread client/src/main/scala/org/apache/celeborn/client/ApplicationHeartbeater.scala Outdated
@SteNicholas

Copy link
Copy Markdown
Member

@s0nskar, please rebase the latest main branch for Grafana Dashboard CI.

@SteNicholas

Copy link
Copy Markdown
Member

@s0nskar, could you please help to update?

@SteNicholas SteNicholas force-pushed the main branch 2 times, most recently from 1d92a40 to cf8d472 Compare May 27, 2026 02:11
@s0nskar

s0nskar commented May 28, 2026

Copy link
Copy Markdown
Contributor Author

Ack, will work on this.

@SteNicholas

Copy link
Copy Markdown
Member

@s0nskar, any update?

@s0nskar

s0nskar commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

@SteNicholas / @RexXiong PTAL. Instead of adding a new field, i have made changes in the parsing of HeartbeatFromApplicationResponse.

@RexXiong RexXiong left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@SteNicholas SteNicholas left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@SteNicholas

Copy link
Copy Markdown
Member

Thanks. Merged to main(v0.7.0).

@s0nskar s0nskar mentioned this pull request Jun 8, 2026
2 tasks
SteNicholas pushed a commit that referenced this pull request Jun 8, 2026
### What changes were proposed in this pull request?

Fix PbSerDeUtilsTest failure for spark4 jobs

```
Error:  /home/runner/work/celeborn/celeborn/common/src/test/scala/org/apache/celeborn/common/util/PbSerDeUtilsTest.scala:851: comparing values of types Boolean and Boolean using `equals` unsafely bypasses cooperative equality; use `==` instead
Error: [ERROR] one error found
```
https://github.com/apache/celeborn/actions/runs/27126367363/job/80055898287?pr=3720

### Why are the changes needed?

After this change #3675, some of tests are failing.

### Does this PR resolve a correctness bug?

- [ ] Yes

### Does this PR introduce _any_ user-facing change?

- [ ] Yes

### How was this patch tested?

Existing UTs.

Closes #3722 from s0nskar/CELEBORN-1577_fix_test.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Nicholas Jiang <programgeek@163.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants