[CELEBORN-2310] Reject RESERVE_SLOTS when disks are full #3666
saurabhd336 wants to merge 6 commits into apache:main
Conversation
Pull request overview
This PR updates the worker’s local-disk “availability” checks to treat disks with actualUsableSpace <= 0 as unusable, so RESERVE_SLOTS can be rejected earlier during disk-full scenarios and avoid wasted push/write network I/O.
Changes:
- Refine healthyWorkingDirs() to exclude disks that are HEALTHY but have actualUsableSpace <= 0.
- Refine createDiskFile() directory selection to avoid using a suggested mount point when its disk has no usable space.
```diff
  def healthyWorkingDirs(): List[File] =
-   disksSnapshot().filter(_.status == DiskStatus.HEALTHY).flatMap(_.dirs)
+   disksSnapshot()
+     .filter(diskInfo =>
+       (diskInfo.status == DiskStatus.HEALTHY) && (diskInfo.actualUsableSpace > 0))
+     .flatMap(_.dirs)
```
healthyWorkingDirs() now filters by actualUsableSpace > 0, which changes behavior in reserve-slot handling. There’s existing test coverage for updateDiskInfos() in StorageManagerSuite, but no test exercising this new predicate (e.g., disk status HEALTHY with actualUsableSpace == 0 should produce an empty healthyWorkingDirs). Adding a targeted unit test would prevent regressions and ensure RESERVE_SLOTS rejection works as intended.
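For illustration, a minimal self-contained sketch of such a test; DiskStatus and DiskInfo here are simplified stand-ins rather than Celeborn's real classes, and the actual test in StorageManagerSuite would exercise StorageManager directly:

```scala
import java.io.File

// Self-contained sketch; DiskStatus/DiskInfo are simplified stand-ins,
// not Celeborn's real types.
object HealthyWorkingDirsSketch {
  object DiskStatus extends Enumeration { val HEALTHY, CRITICAL_ERROR = Value }
  case class DiskInfo(status: DiskStatus.Value, actualUsableSpace: Long, dirs: List[File])

  // Mirrors the new predicate in StorageManager.healthyWorkingDirs().
  def healthyWorkingDirs(disks: List[DiskInfo]): List[File] =
    disks
      .filter(d => d.status == DiskStatus.HEALTHY && d.actualUsableSpace > 0)
      .flatMap(_.dirs)

  def main(args: Array[String]): Unit = {
    // HEALTHY but full: should yield no dirs, so RESERVE_SLOTS gets rejected.
    val fullDisk = DiskInfo(DiskStatus.HEALTHY, 0L, List(new File("/mnt/disk1")))
    assert(healthyWorkingDirs(List(fullDisk)).isEmpty)

    // HEALTHY with space: dirs are returned as before.
    val okDisk = DiskInfo(DiskStatus.HEALTHY, 1024L, List(new File("/mnt/disk2")))
    assert(healthyWorkingDirs(List(okDisk)).nonEmpty)
  }
}
```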
This PR is meaningful for the case when disks are already full. Agreed that we should add some UTs in StorageManagerSuite.scala.
Added a test to make sure createFile fails when the usable space is <= 0, even if diskInfo.status is HEALTHY.
```scala
disksSnapshot()
  .filter(diskInfo =>
    (diskInfo.status == DiskStatus.HEALTHY) && (diskInfo.actualUsableSpace > 0))
  .flatMap(_.dirs)
```
The disk-availability predicate (status == HEALTHY) && (actualUsableSpace > 0) is now duplicated here and again in createDiskFile. To avoid future divergence (e.g., if the definition of "writable" changes), consider centralizing this check in a small helper like isDiskWritable(diskInfo) and reuse it in both places.
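A minimal sketch of what that could look like inside StorageManager, reusing its existing disksSnapshot(); isDiskWritable is the reviewer's suggested name, not code from this PR:

```scala
// Hypothetical helper centralizing the "disk is writable" predicate,
// so healthyWorkingDirs() and createDiskFile() cannot drift apart.
private def isDiskWritable(diskInfo: DiskInfo): Boolean =
  diskInfo.status == DiskStatus.HEALTHY && diskInfo.actualUsableSpace > 0

def healthyWorkingDirs(): List[File] =
  disksSnapshot().filter(isDiskWritable).flatMap(_.dirs)
```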
Review: [CELEBORN-2310] Reject RESERVE_SLOTS when disks are full PR: #3666 | Author: saurabhd336 | Change: +6 / -2 in StorageManager.scala
Adds an actualUsableSpace > 0 check in two places so that workers reject RESERVE_SLOTS requests upfront when local disks are full, instead of accepting the reservation and only triggering a HARD_SPLIT after wasted network I/O on write.
@saurabhd336, thanks for the contribution. Please address the above comments from Copilot and Claude Code.
```scala
} else {
  if (suggestedMountPoint.isEmpty) {
    logDebug(s"Location suggestedMountPoint is not set, return all healthy working dirs.")
  } else if (diskInfo == null) {
```
This is largely unrelated to the change, but fixes a potential NPE if diskInfo for the suggestedMountPoint is not found.
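For illustration, a self-contained sketch of the guarded selection; the fallback-to-healthy-dirs behavior and the selectDirs name are assumptions for the sketch, not this PR's exact code:

```scala
import java.io.File

// Hypothetical, simplified sketch of the guarded directory selection in
// createDiskFile: an unknown suggested mount point (diskInfo == null) or a
// full disk falls back to the healthy working dirs instead of raising an NPE.
object DirSelectionSketch {
  case class DiskInfo(actualUsableSpace: Long, dirs: List[File])

  def selectDirs(
      suggestedMountPoint: String,
      diskInfo: DiskInfo, // may be null when the mount point is not found
      healthyWorkingDirs: () => List[File]): List[File] =
    if (suggestedMountPoint.isEmpty || diskInfo == null || diskInfo.actualUsableSpace <= 0)
      healthyWorkingDirs()
    else
      diskInfo.dirs
}
```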
@SteNicholas Addressed comments. PTAL!
What changes were proposed in this pull request?
A full disk currently only leads to HARD_SPLIT responses to writes; it does not lead to reserve-slot rejections. This means clients keep retrying writes (hitting a HARD_SPLIT on each attempt), wasting network I/O. Rejecting RESERVE_SLOTS while disks are full avoids this wasted data-write network I/O.
Why are the changes needed?
To reject reserve slots while disks are full, avoiding unnecessary network I/O.
Does this PR resolve a correctness bug?
No.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added UTs, CI.