-
Notifications
You must be signed in to change notification settings - Fork 7.1k
[Data] DefaultClusterAutoscalerV2 raises KeyError: 'CPU' Fix #60208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request effectively resolves a KeyError by safely accessing the 'CPU' resource using dict.get() with a default value. This change is crucial for preventing crashes on nodes that may not explicitly define 'CPU' resources, such as dedicated GPU nodes. For enhanced robustness and consistency, a similar safe access pattern is recommended for the 'memory' resource.
python/ray/data/_internal/cluster_autoscaler/default_cluster_autoscaler_v2.py
Outdated
Show resolved
Hide resolved
| node_resource_spec = _NodeResourceSpec.of( | ||
| cpu=r["CPU"], gpu=r.get("GPU", 0), mem=r["memory"] | ||
| cpu=r.get("CPU", 0), gpu=r.get("GPU", 0), mem=r.get("memory", 0) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a regression test? e.g., a test where ray.nodes() returns nodes with no logical resources?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey yes that makes sense , sorry for missing that , I will add that first thing tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I am sorry I know this is the right platform for all these questions,
- Is there a specific Python version (e.g., 3.9, 3.10) that is most stable/recommended for development right now?
- I noticed the dev guide covers Linux, macOS, and Windows. Given that Ray has a complex C++ backend with OS-level dependencies, how does the project ensure changes on one OS don't break others? Is there anything specific I should watch out for on macOS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, no need to apologize for asking questions!
I'd recommend using 3.10, because it's the lowest supported version.
I think we test against different operating systems as part of our release process. AFAIK most Ray Data devs use Mac, so you should be good on that front
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added the test as needed please review again, sorry for the delay was facing issues with python 3.13 for deps installation for testing but switching to 3.10 helped. Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad for using 2 github accounts will be using this one from now on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nw!
bveeramani
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice
Signed-off-by: Limark Dcunha <[email protected]>
Signed-off-by: Limark Dcunha <[email protected]>
|
@bveeramani I have fixed all the issues but I don't seem to have privileges to merge to master ,can you please take care of it Thanks |
|
Done, ty for the contribution! |
This PR updates DefaultClusterAutoscalerV2 to safely handle nodes with 0 logical CPUs by replacing direct dictionary access (r["CPU"]) with r.get("CPU", 0), preventing crashes on dedicated GPU nodes.
This fix has been discussed firsthand with @bveeramani.
"Fixes #60166"