[Data] Add descriptive error when using local:// paths with a zero-resource head node#60709
[Data] Add descriptive error when using local:// paths with a zero-resource head node#60709KaisennHu wants to merge 1 commit intoray-project:masterfrom
local:// paths with a zero-resource head node#60709Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a helpful validation check to provide a more descriptive error message when attempting to use local:// paths with a head node that has zero resources. The implementation adds a new utility function _validate_head_node_resources_for_local_scheduling and integrates it into the read and write paths. The changes are well-implemented and include corresponding regression tests. The code is clear and addresses the issue effectively. I have one minor suggestion for improving code clarity.
iamjustinhsu
left a comment
There was a problem hiding this comment.
thanks for the contribution! left some feedback below
0638b78 to
f1e2735
Compare
Thanks for the feedback! All addressed. |
iamjustinhsu
left a comment
There was a problem hiding this comment.
I added more feedback, after you address those, feel free to ping me again
| if not head_node: | ||
| # The head node metadata is unavailable (e.g., during shutdown). Fall back | ||
| # to the default behavior and let Ray surface its own error. | ||
| return |
There was a problem hiding this comment.
For my understanding, do u have a script of when this can occur? (head_node is None , BUT next(...) doesn't throw a StopIteration exception?)
There was a problem hiding this comment.
Thanks for the question. next(..., None) is intentional, so it returns None (no StopIteration) when no matching head node is visible. That can happen during shutdown/teardown or before head resources are fully registered, so we fall back and let Ray surface its own error.
4c4c02f to
c590e9a
Compare
|
@iamjustinhsu Thanks for the detailed review. I've implemented the suggested changes. Let me know if there's anything else! |
| if num_gpus > 0: | ||
| required_resources["GPU"] = float(num_gpus) | ||
| if memory > 0: | ||
| required_resources["memory"] = float(memory) |
There was a problem hiding this comment.
Missing None handling for standard resource arguments
Medium Severity
The standard resources (num_cpus, num_gpus, memory) use .get() with a default value, but this only applies when the key is absent. If ray_remote_args contains an explicit None value (e.g., {"num_cpus": None}), the .get() returns None, and the subsequent comparison like num_cpus > 0 raises a TypeError. This is inconsistent with the custom resources handling at lines 404-405, which explicitly checks for and skips None values.
| and "node:__internal_head__" in node.get("Resources", {}) | ||
| ), | ||
| None, | ||
| ) |
There was a problem hiding this comment.
Validation checks head node but tasks scheduled elsewhere
Medium Severity
The validation function explicitly searches for the head node using node:__internal_head__ in resources, but the actual NodeAffinitySchedulingStrategy is set using ray.get_runtime_context().get_node_id(), which returns the current node (driver's node). If the driver is running on a non-head node (e.g., in a multi-node cluster where a script runs from a worker node), the validation checks resources on the wrong node. This could cause false positives (blocking operations that would succeed) or false negatives (allowing operations that will fail).
Additional Locations (2)
…source head node Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
c590e9a to
8fd4696
Compare
|
|
||
| # Include any additional custom resources requested. | ||
| custom_resources = ray_remote_args.get("resources", {}) | ||
| for name, amount in custom_resources.items(): |
There was a problem hiding this comment.
Missing None handling for resources dict causes AttributeError
Medium Severity
Similar to the standard resource fields, if ray_remote_args contains {"resources": None}, the .get("resources", {}) call returns None (since the key exists), and then custom_resources.items() raises AttributeError: 'NoneType' object has no attribute 'items'. The code handles None for individual resource amounts within the dict (line 404-405), but not for the case where the entire resources value is None.


Description
When users read from or write to
local://paths, Ray Data schedules these tasks on the head node usingNodeAffinitySchedulingStrategy(head_node_id, soft=False). If the head node has no logical resources (a recommended best practice to avoid head OOM), these tasks become unschedulable and produce a confusing error:Add a clear, actionable error message that explains why the operation failed and how to resolve it.
Solution
_validate_head_node_resources_for_local_schedulingthat inspects the final mergedray_remote_argsand raises a clear, actionableValueErrorwhen an operation pinned to the head node requires resources the head node lacksmerge_resources_to_ray_remote_argsfor head-pinned readsread_datasourceand writesDataset.write_datasink, so explicit user settings (e.g.,num_cpus=0) are respectedlocal://scenario and assert the descriptive error is raised@tianyi-ge
Related issues
Closes #60698