[Data] Add descriptive error when using local:// paths with a zero-resource head node #60709

Open
KaisennHu wants to merge 1 commit into ray-project:master from KaisennHu:local-path-err

@KaisennHu
Contributor

Description

When users read from or write to local:// paths, Ray Data schedules these tasks on the head node using NodeAffinitySchedulingStrategy(head_node_id, soft=False). If the head node has no logical resources (a recommended best practice to avoid head OOM), these tasks become unschedulable and produce a confusing error:

ray.exceptions.TaskUnschedulableError: The task is not schedulable: The node specified
via NodeAffinitySchedulingStrategy doesn't exist any more or is infeasible, and soft=False
was specified. task_id=..., task_name=Write

Add a clear, actionable error message that explains why the operation failed and how to resolve it.

Solution

  • Add a helper, _validate_head_node_resources_for_local_scheduling, that inspects the final merged ray_remote_args and raises a clear, actionable ValueError when an operation pinned to the head node requires resources the head node lacks.
  • Call this validation after merge_resources_to_ray_remote_args for head-pinned reads (read_datasource) and writes (Dataset.write_datasink), so that explicit user settings (e.g., num_cpus=0) are respected.
  • Include regression tests that reproduce the zero-resource head node + local:// scenario and assert that the descriptive error is raised.
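The validation described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the function names `collect_required` and `check_head_can_run`, the error wording, and the choice to treat absent or None arguments as "nothing requested" are all assumptions. The node dictionaries are hand-written stand-ins shaped like entries from `ray.nodes()`, so the sketch runs without a cluster.

```python
from typing import Any, Dict, List

# Resource key that marks the head node (as quoted in the review snippets).
HEAD_MARKER = "node:__internal_head__"


def collect_required(ray_remote_args: Dict[str, Any]) -> Dict[str, float]:
    """Gather every positive resource request from the merged remote args."""
    required: Dict[str, float] = {}
    for arg, name in (("num_cpus", "CPU"), ("num_gpus", "GPU"), ("memory", "memory")):
        # Assumption: an absent or None value means nothing is requested.
        amount = ray_remote_args.get(arg) or 0
        if amount > 0:
            required[name] = float(amount)
    for name, amount in (ray_remote_args.get("resources") or {}).items():
        if amount:
            required[name] = float(amount)
    return required


def check_head_can_run(nodes: List[Dict[str, Any]], ray_remote_args: Dict[str, Any]) -> None:
    """Raise ValueError if the head node lacks a resource the task requests."""
    head = next(
        (n for n in nodes if n.get("Alive") and HEAD_MARKER in n.get("Resources", {})),
        None,
    )
    if head is None:
        return  # no head metadata visible; let the scheduler report its own error
    total = head.get("Resources", {})
    missing = {k: v for k, v in collect_required(ray_remote_args).items() if total.get(k, 0) < v}
    if missing:
        # Hypothetical message text, written for this sketch only.
        raise ValueError(
            f"local:// operations are pinned to the head node, which lacks {missing}. "
            "Give the head node these resources or request fewer (e.g. num_cpus=0)."
        )


# A head node that advertises only its marker resource, i.e. zero logical resources.
zero_resource_head = [{"NodeID": "head", "Alive": True, "Resources": {HEAD_MARKER: 1.0}}]
check_head_can_run(zero_resource_head, {"num_cpus": 0})  # passes: nothing requested
```

Calling `check_head_can_run(zero_resource_head, {"num_cpus": 1})` raises the ValueError, which is the zero-resource head node + local:// case the regression tests target.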

@tianyi-ge

Related issues

Closes #60698

@KaisennHu KaisennHu requested a review from a team as a code owner February 3, 2026 14:29
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a helpful validation check to provide a more descriptive error message when attempting to use local:// paths with a head node that has zero resources. The implementation adds a new utility function _validate_head_node_resources_for_local_scheduling and integrates it into the read and write paths. The changes are well-implemented and include corresponding regression tests. The code is clear and addresses the issue effectively. I have one minor suggestion for improving code clarity.

@ray-gardener ray-gardener bot added the data (Ray Data-related issues) and community-contribution (Contributed by the community) labels Feb 3, 2026
@iamjustinhsu iamjustinhsu self-assigned this Feb 4, 2026
@bveeramani bveeramani self-assigned this Feb 4, 2026
Contributor

@iamjustinhsu iamjustinhsu left a comment

thanks for the contribution! left some feedback below

@KaisennHu
Contributor Author

> thanks for the contribution! left some feedback below

Thanks for the feedback! All addressed.

Contributor

@iamjustinhsu iamjustinhsu left a comment

I added more feedback. After you address those comments, feel free to ping me again.

Comment on lines +422 to +430
if not head_node:
    # The head node metadata is unavailable (e.g., during shutdown). Fall back
    # to the default behavior and let Ray surface its own error.
    return
Contributor

For my understanding, do you have a script showing when this can occur (head_node is None, but next(...) doesn't throw a StopIteration exception)?

Contributor Author

Thanks for the question. next(..., None) is intentional, so it returns None (no StopIteration) when no matching head node is visible. That can happen during shutdown/teardown or before head resources are fully registered, so we fall back and let Ray surface its own error.
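The difference between the two forms of `next` mentioned in this reply can be seen in a small standalone sketch. The `nodes` list is a hand-written stand-in for cluster metadata in which no head entry is visible, and the `matches` helper is hypothetical.

```python
# Stand-in node list with no head node entry visible (illustrative data only).
nodes = [{"NodeID": "worker-1", "Resources": {"CPU": 4.0}}]


def matches():
    """Yield nodes that carry the head-node marker resource."""
    return (n for n in nodes if "node:__internal_head__" in n.get("Resources", {}))


# With a default, next() returns that default when the generator is empty.
head_node = next(matches(), None)

# Without a default, the same lookup raises StopIteration instead.
try:
    next(matches())
    stop_iteration_raised = False
except StopIteration:
    stop_iteration_raised = True
```

Here `head_node` ends up as None and `stop_iteration_raised` as True, so an `if not head_node: return` guard is reachable only because the default was supplied.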

@KaisennHu KaisennHu force-pushed the local-path-err branch 2 times, most recently from 4c4c02f to c590e9a February 7, 2026 03:06
@KaisennHu
Contributor Author

@iamjustinhsu Thanks for the detailed review. I've implemented the suggested changes. Let me know if there's anything else!

if num_gpus > 0:
    required_resources["GPU"] = float(num_gpus)
if memory > 0:
    required_resources["memory"] = float(memory)

Missing None handling for standard resource arguments

Medium Severity

The standard resources (num_cpus, num_gpus, memory) use .get() with a default value, but this only applies when the key is absent. If ray_remote_args contains an explicit None value (e.g., {"num_cpus": None}), the .get() returns None, and the subsequent comparison like num_cpus > 0 raises a TypeError. This is inconsistent with the custom resources handling at lines 404-405, which explicitly checks for and skips None values.
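One way to guard against the explicit-None case described here is sketched below. The helper name `numeric_arg` and the default values are assumptions made for illustration, as is the choice to treat None as "not set"; the PR may resolve this differently.

```python
from typing import Any, Dict


def numeric_arg(ray_remote_args: Dict[str, Any], key: str, default: float) -> float:
    """Read a numeric resource argument, treating an explicit None as unset."""
    value = ray_remote_args.get(key, default)
    if value is None:
        # dict.get only applies its default when the key is absent, so an
        # explicit None has to be handled separately.
        value = default
    return float(value)


args = {"num_cpus": None}

# The failure mode: the key exists, so .get() returns None, not the default.
raw = args.get("num_cpus", 1)

# The guarded read falls back to the default and stays comparable with 0.
num_cpus = numeric_arg(args, "num_cpus", 1.0)
needs_cpu = num_cpus > 0
```

An explicit zero is still respected: `numeric_arg({"num_cpus": 0}, "num_cpus", 1.0)` returns 0.0 rather than the default.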

        and "node:__internal_head__" in node.get("Resources", {})
    ),
    None,
)

Validation checks head node but tasks scheduled elsewhere

Medium Severity

The validation function explicitly searches for the head node using node:__internal_head__ in resources, but the actual NodeAffinitySchedulingStrategy is set using ray.get_runtime_context().get_node_id(), which returns the current node (driver's node). If the driver is running on a non-head node (e.g., in a multi-node cluster where a script runs from a worker node), the validation checks resources on the wrong node. This could cause false positives (blocking operations that would succeed) or false negatives (allowing operations that will fail).
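The mismatch described in this comment can be illustrated with stand-in data. In the sketch below, the node dictionaries, the node IDs, and the helpers `find_head` and `find_by_id` are all hypothetical; `driver_node_id` stands in for the value that `ray.get_runtime_context().get_node_id()` would return on a driver running on a worker node.

```python
from typing import Any, Dict, List, Optional

HEAD_MARKER = "node:__internal_head__"

# Illustrative cluster: a zero-resource head node and one worker with CPUs.
nodes = [
    {"NodeID": "head-id", "Alive": True, "Resources": {HEAD_MARKER: 1.0}},
    {"NodeID": "worker-id", "Alive": True, "Resources": {"CPU": 4.0}},
]
driver_node_id = "worker-id"  # assumed: the driver script runs on the worker


def find_head(nodes: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Look up the node by the head marker, as the validation is said to do."""
    return next((n for n in nodes if n["Alive"] and HEAD_MARKER in n["Resources"]), None)


def find_by_id(nodes: List[Dict[str, Any]], node_id: str) -> Optional[Dict[str, Any]]:
    """Look up the node the task is actually pinned to."""
    return next((n for n in nodes if n["Alive"] and n["NodeID"] == node_id), None)


head = find_head(nodes)
pinned = find_by_id(nodes, driver_node_id)

head_cpus = head["Resources"].get("CPU", 0)      # what a head-based check sees
pinned_cpus = pinned["Resources"].get("CPU", 0)  # what the scheduler would see
```

In this setup a head-based check would report zero CPUs even though the pinned node has four, which is the false-positive case the comment describes. Validating against the node selected by ID is one possible way to keep the check and the scheduling strategy aligned.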

Additional Locations (2)

…source head node

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


# Include any additional custom resources requested.
custom_resources = ray_remote_args.get("resources", {})
for name, amount in custom_resources.items():

Missing None handling for resources dict causes AttributeError

Medium Severity

Similar to the standard resource fields, if ray_remote_args contains {"resources": None}, the .get("resources", {}) call returns None (since the key exists), and then custom_resources.items() raises AttributeError: 'NoneType' object has no attribute 'items'. The code handles None for individual resource amounts within the dict (lines 404-405), but not for the case where the entire resources value is None.
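A common idiom for this case is to normalize the value with `or {}` before iterating. The sketch below is an illustration only; the helper name `custom_resources_of` is hypothetical and not taken from the PR.

```python
from typing import Any, Dict


def custom_resources_of(ray_remote_args: Dict[str, Any]) -> Dict[str, Any]:
    """Return the custom resources dict, treating absent or None as empty."""
    return ray_remote_args.get("resources") or {}


args = {"resources": None}

# The failure mode: the key exists, so the {} default is never used.
raw = args.get("resources", {})

# The normalized value is always safe to iterate.
names = [name for name, _ in custom_resources_of(args).items()]
```

A populated dict passes through unchanged, so `custom_resources_of({"resources": {"accel": 1}})` still returns `{"accel": 1}`.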

Labels: community-contribution (Contributed by the community), data (Ray Data-related issues)

Successfully merging this pull request may close this issue: [Data] Add descriptive error when using local:// paths with a zero-resource head node

3 participants