Skip to content

[CELEBORN-2310] Reject RESERVE_SLOTS when disks are full#3666

Open
saurabhd336 wants to merge 6 commits intoapache:mainfrom
saurabhd336:diskFullReserveSlotsRejection
Open

[CELEBORN-2310] Reject RESERVE_SLOTS when disks are full#3666
saurabhd336 wants to merge 6 commits intoapache:mainfrom
saurabhd336:diskFullReserveSlotsRejection

Conversation

@saurabhd336
Copy link
Copy Markdown
Contributor

@saurabhd336 saurabhd336 commented Apr 16, 2026

What changes were proposed in this pull request?

Disk full only lead to HARD_SPLITs as a response to writes. However, doesn't lead to reserve slot rejections. This means too many write retries (due to HARD_SPLITs on each write attempt) leads to wasted network I/O. We can reject RESERVE_SLOT during disk full to avoid the wasted data write network IO.

Why are the changes needed?

Reject reserve slots during disk full, avoid unnecessary network IO.

Does this PR resolve a correctness bug?

No.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added UTs, CI.

@saurabhd336 saurabhd336 changed the title Reject RESERVE_SLOTS when disks are full [CELEBORN-2310] Reject RESERVE_SLOTS when disks are full Apr 16, 2026
@SteNicholas SteNicholas requested a review from Copilot April 17, 2026 11:49
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the worker’s local-disk “availability” checks to treat disks with actualUsableSpace <= 0 as unusable, so RESERVE_SLOTS can be rejected earlier during disk-full scenarios and avoid wasted push/write network I/O.

Changes:

  • Refine healthyWorkingDirs() to exclude disks that are HEALTHY but have actualUsableSpace <= 0.
  • Refine createDiskFile() directory selection to avoid using a suggested mount point when its disk has no usable space.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 113 to +117
def healthyWorkingDirs(): List[File] =
disksSnapshot().filter(_.status == DiskStatus.HEALTHY).flatMap(_.dirs)
disksSnapshot()
.filter(diskInfo =>
(diskInfo.status == DiskStatus.HEALTHY) && (diskInfo.actualUsableSpace > 0))
.flatMap(_.dirs)
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

healthyWorkingDirs() now filters by actualUsableSpace > 0, which changes behavior in reserve-slot handling. There’s existing test coverage for updateDiskInfos() in StorageManagerSuite, but no test exercising this new predicate (e.g., disk status HEALTHY with actualUsableSpace == 0 should produce an empty healthyWorkingDirs). Adding a targeted unit test would prevent regressions and ensure RESERVE_SLOTS rejection works as intended.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is meaningful for the case when disks are already full. Agreed that we should add some UTs in StorageManagerSuite.scala

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test to make sure createFile fails even if diskInfo.status is HEALTHY, if the usable space is <= 0.

Comment on lines +114 to +117
disksSnapshot()
.filter(diskInfo =>
(diskInfo.status == DiskStatus.HEALTHY) && (diskInfo.actualUsableSpace > 0))
.flatMap(_.dirs)
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The disk-availability predicate (status == HEALTHY) && (actualUsableSpace > 0) is now duplicated here and again in createDiskFile. To avoid future divergence (e.g., if the definition of "writable" changes), consider centralizing this check in a small helper like isDiskWritable(diskInfo) and reuse it in both places.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

@SteNicholas
Copy link
Copy Markdown
Member

Review: [CELEBORN-2310] Reject RESERVE_SLOTS when disks are full

PR: #3666 | Author: saurabhd336 | Change: +6 / -2 in StorageManager.scala

  • What it does

Adds an actualUsableSpace > 0 check in two places so that workers reject RESERVE_SLOTS requests upfront when local disks are full, instead of accepting the reservation and only triggering a HARD_SPLIT after wasted network I/O on write.

  • Looks good

    • Correct and low-risk — if a disk has zero usable space, rejecting early is the right call. The staleness window of actualUsableSpace matches the existing DiskStatus.HEALTHY refresh cycle, so no new race conditions are introduced.
    • Narrow scope — surgical change in the right places (healthyWorkingDirs() and createDiskFile()).
  • Issues / suggestions

    1. Duplicated predicate — (status == HEALTHY) && (actualUsableSpace > 0) now appears in two separate locations. Extract a helper like isDiskWritable(diskInfo): Boolean to centralize this and prevent future divergence.
    2. No unit tests — A test for healthyWorkingDirs() excluding a HEALTHY disk with actualUsableSpace = 0 would be straightforward and would guard against regressions.
    3. Formatting nit — The multi-line if in createDiskFile splits awkwardly mid-.equals(). Prefer aligning each condition on its own line:
      if (diskInfo != null &&
      diskInfo.status.equals(DiskStatus.HEALTHY) &&
      diskInfo.actualUsableSpace > 0) {
    4. Threshold consideration — > 0 is a fine start, but a disk with 1 byte free will pass the check and immediately fail on write. A configurable minimum space threshold could be a good follow-up.
  • Verdict
    Approve with minor comments. The logic is sound and the change is valuable. Adding a unit test and extracting the duplicated predicate would make it merge-ready.

@SteNicholas
Copy link
Copy Markdown
Member

SteNicholas commented Apr 20, 2026

@saurabhd336, thanks for contribution. Please address above comments of Copilot and Claude Code.

} else {
if (suggestedMountPoint.isEmpty) {
logDebug(s"Location suggestedMountPoint is not set, return all healthy working dirs.")
} else if (diskInfo == null) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is largely unrelated to the change, but fixes a potential NPE if diskInfo for the suggestedMountPoint is not found.

@saurabhd336
Copy link
Copy Markdown
Contributor Author

@SteNicholas Addressed comments. PTAL!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants