[CELEBORN-2310] Reject RESERVE_SLOTS when disks are full #3666
saurabhd336 wants to merge 6 commits into apache:main
Conversation
Pull request overview
This PR updates the worker’s local-disk “availability” checks to treat disks with actualUsableSpace <= 0 as unusable, so RESERVE_SLOTS can be rejected earlier during disk-full scenarios and avoid wasted push/write network I/O.
Changes:
- Refine healthyWorkingDirs() to exclude disks that are HEALTHY but have actualUsableSpace <= 0.
- Refine createDiskFile() directory selection to avoid using a suggested mount point when its disk has no usable space.
```diff
  def healthyWorkingDirs(): List[File] =
-   disksSnapshot().filter(_.status == DiskStatus.HEALTHY).flatMap(_.dirs)
+   disksSnapshot()
+     .filter(diskInfo =>
+       (diskInfo.status == DiskStatus.HEALTHY) && (diskInfo.actualUsableSpace > 0))
+     .flatMap(_.dirs)
```
healthyWorkingDirs() now filters by actualUsableSpace > 0, which changes behavior in reserve-slot handling. There’s existing test coverage for updateDiskInfos() in StorageManagerSuite, but no test exercising this new predicate (e.g., disk status HEALTHY with actualUsableSpace == 0 should produce an empty healthyWorkingDirs). Adding a targeted unit test would prevent regressions and ensure RESERVE_SLOTS rejection works as intended.
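For illustration, a minimal self-contained sketch of such a test; DiskStatus and DiskInfo here are simplified stand-ins rather than Celeborn's real classes, and the actual test in StorageManagerSuite would exercise StorageManager directly:

```scala
import java.io.File

// Self-contained sketch; DiskStatus/DiskInfo are simplified stand-ins,
// not Celeborn's real types.
object HealthyWorkingDirsSketch {
  object DiskStatus extends Enumeration { val HEALTHY, CRITICAL_ERROR = Value }
  case class DiskInfo(status: DiskStatus.Value, actualUsableSpace: Long, dirs: List[File])

  // Mirrors the new predicate in StorageManager.healthyWorkingDirs().
  def healthyWorkingDirs(disks: List[DiskInfo]): List[File] =
    disks
      .filter(d => d.status == DiskStatus.HEALTHY && d.actualUsableSpace > 0)
      .flatMap(_.dirs)

  def main(args: Array[String]): Unit = {
    // HEALTHY but full: should yield no dirs, so RESERVE_SLOTS gets rejected.
    val fullDisk = DiskInfo(DiskStatus.HEALTHY, 0L, List(new File("/mnt/disk1")))
    assert(healthyWorkingDirs(List(fullDisk)).isEmpty)

    // HEALTHY with space: dirs are returned as before.
    val okDisk = DiskInfo(DiskStatus.HEALTHY, 1024L, List(new File("/mnt/disk2")))
    assert(healthyWorkingDirs(List(okDisk)).nonEmpty)
  }
}
```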
This PR is meaningful for the case when disks are already full. Agreed that we should add some UTs in StorageManagerSuite.scala.
Added a test to make sure createFile fails when the usable space is <= 0, even if diskInfo.status is HEALTHY.
```scala
disksSnapshot()
  .filter(diskInfo =>
    (diskInfo.status == DiskStatus.HEALTHY) && (diskInfo.actualUsableSpace > 0))
  .flatMap(_.dirs)
```
The disk-availability predicate (status == HEALTHY) && (actualUsableSpace > 0) is now duplicated here and again in createDiskFile. To avoid future divergence (e.g., if the definition of "writable" changes), consider centralizing this check in a small helper like isDiskWritable(diskInfo) and reuse it in both places.
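A minimal sketch of what that could look like inside StorageManager, reusing its existing disksSnapshot(); isDiskWritable is the reviewer's suggested name, not code from this PR:

```scala
// Hypothetical helper centralizing the "disk is writable" predicate,
// so healthyWorkingDirs() and createDiskFile() cannot drift apart.
private def isDiskWritable(diskInfo: DiskInfo): Boolean =
  diskInfo.status == DiskStatus.HEALTHY && diskInfo.actualUsableSpace > 0

def healthyWorkingDirs(): List[File] =
  disksSnapshot().filter(isDiskWritable).flatMap(_.dirs)
```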
Review: [CELEBORN-2310] Reject RESERVE_SLOTS when disks are full PR: #3666 | Author: saurabhd336 | Change: +6 / -2 in StorageManager.scala
Adds an actualUsableSpace > 0 check in two places so that workers reject RESERVE_SLOTS requests upfront when local disks are full, instead of accepting the reservation and only triggering a HARD_SPLIT after wasted network I/O on write.
@saurabhd336, thanks for the contribution. Please address the above comments from Copilot and Claude Code.
```scala
} else {
  if (suggestedMountPoint.isEmpty) {
    logDebug(s"Location suggestedMountPoint is not set, return all healthy working dirs.")
  } else if (diskInfo == null) {
```
This is largely unrelated to the change, but fixes a potential NPE if diskInfo for the suggestedMountPoint is not found.
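For illustration, a self-contained sketch of the guarded selection; the fallback-to-healthy-dirs behavior and the selectDirs name are assumptions for the sketch, not this PR's exact code:

```scala
import java.io.File

// Hypothetical, simplified sketch of the guarded directory selection in
// createDiskFile: an unknown suggested mount point (diskInfo == null) or a
// full disk falls back to the healthy working dirs instead of raising an NPE.
object DirSelectionSketch {
  case class DiskInfo(actualUsableSpace: Long, dirs: List[File])

  def selectDirs(
      suggestedMountPoint: String,
      diskInfo: DiskInfo, // may be null when the mount point is not found
      healthyWorkingDirs: () => List[File]): List[File] =
    if (suggestedMountPoint.isEmpty || diskInfo == null || diskInfo.actualUsableSpace <= 0)
      healthyWorkingDirs()
    else
      diskInfo.dirs
}
```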
@SteNicholas Addressed comments. PTAL!
What changes were proposed in this pull request?
A full disk currently only leads to HARD_SPLIT responses to writes; it does not lead to reserve-slot rejections. This means clients keep retrying writes (hitting a HARD_SPLIT on each attempt), wasting network I/O. Rejecting RESERVE_SLOTS while disks are full avoids this wasted data-write network I/O.
Why are the changes needed?
To reject reserve slots while disks are full, avoiding unnecessary network I/O.
Does this PR resolve a correctness bug?
No.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added UTs, CI.