Conversation

@limarkdcunha
Contributor

This PR updates DefaultClusterAutoscalerV2 to safely handle nodes with 0 logical CPUs by replacing direct dictionary access (r["CPU"]) with r.get("CPU", 0), preventing crashes on dedicated GPU nodes.
This fix has been discussed firsthand with @bveeramani.

"Fixes #60166"

@limarkdcunha requested a review from a team as a code owner on January 16, 2026 04:46
Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request effectively resolves a KeyError by safely accessing the 'CPU' resource using dict.get() with a default value. This change is crucial for preventing crashes on nodes that may not explicitly define 'CPU' resources, such as dedicated GPU nodes. For enhanced robustness and consistency, a similar safe access pattern is recommended for the 'memory' resource.

  node_resource_spec = _NodeResourceSpec.of(
-     cpu=r["CPU"], gpu=r.get("GPU", 0), mem=r["memory"]
+     cpu=r.get("CPU", 0), gpu=r.get("GPU", 0), mem=r.get("memory", 0)
  )
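
The failure mode discussed in this review can be reproduced without Ray. The following is a minimal sketch, assuming a node's resources arrive as a plain dict in which a resource with a value of 0 is simply absent. The dicts and the two helper functions are hypothetical and are not part of the PR.

```python
# Hypothetical resource dicts, shaped like the "Resources" entry of a node
# returned by ray.nodes(). The values are made up for illustration.
cpu_node = {"CPU": 8.0, "memory": 16 * 1024**3}
gpu_only_node = {"GPU": 4.0, "memory": 32 * 1024**3}  # no "CPU" key at all


def read_cpu_strict(resources):
    # Direct indexing raises KeyError when the key is absent.
    return resources["CPU"]


def read_cpu_safe(resources):
    # dict.get falls back to the default (0) when the key is absent.
    return resources.get("CPU", 0)


try:
    read_cpu_strict(gpu_only_node)
    strict_failed = False
except KeyError:
    strict_failed = True

print(strict_failed)                 # True
print(read_cpu_safe(gpu_only_node))  # 0
print(read_cpu_safe(cpu_node))       # 8.0
```

The same reasoning applies to "memory": if that key can ever be missing, `r["memory"]` would fail in the same way, which is the motivation for the `r.get("memory", 0)` suggestion above.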
Member

Can you add a regression test? e.g., a test where ray.nodes() returns nodes with no logical resources?
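
A regression test along these lines might look roughly like the sketch below. It does not use the real autoscaler or the test that was eventually merged: `NodeSpec` and `collect_node_specs` are hypothetical stand-ins, and the fake node dicts only assume the "Alive" and "Resources" keys that ray.nodes() entries carry. The `memory` default follows the review suggestion rather than confirmed merged code.

```python
from collections import namedtuple
from unittest.mock import MagicMock

# Hypothetical stand-in for the autoscaler's internal node resource spec.
NodeSpec = namedtuple("NodeSpec", ["cpu", "gpu", "mem"])


def collect_node_specs(list_nodes):
    """Build one NodeSpec per alive node, defaulting missing resources to 0."""
    specs = []
    for node in list_nodes():
        if not node.get("Alive", False):
            continue
        r = node.get("Resources", {})
        specs.append(
            NodeSpec(
                cpu=r.get("CPU", 0),
                gpu=r.get("GPU", 0),
                mem=r.get("memory", 0),  # assumption: follows the review suggestion
            )
        )
    return specs


def test_nodes_without_logical_resources():
    # Stand-in for ray.nodes(): a normal node, a GPU-only node with no "CPU"
    # key, and a node that reports no logical resources at all.
    fake_nodes = MagicMock(
        return_value=[
            {"Alive": True, "Resources": {"CPU": 4, "memory": 1000}},
            {"Alive": True, "Resources": {"GPU": 2, "memory": 2000}},
            {"Alive": True, "Resources": {}},
        ]
    )
    specs = collect_node_specs(fake_nodes)
    assert specs == [
        NodeSpec(cpu=4, gpu=0, mem=1000),
        NodeSpec(cpu=0, gpu=2, mem=2000),
        NodeSpec(cpu=0, gpu=0, mem=0),
    ]


test_nodes_without_logical_resources()
```

In a real test, the mock would presumably be patched in place of ray.nodes() so the autoscaler itself is exercised.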

Contributor Author

Hey, yes, that makes sense. Sorry for missing that; I will add it first thing tomorrow.

Contributor Author

Also, I'm sorry, I know this isn't the right platform for these questions, but:

  1. Is there a specific Python version (e.g., 3.9, 3.10) that is most stable/recommended for development right now?
  2. I noticed the dev guide covers Linux, macOS, and Windows. Given that Ray has a complex C++ backend with OS-level dependencies, how does the project ensure changes on one OS don't break others? Is there anything specific I should watch out for on macOS?

Member

Ah, no need to apologize for asking questions!

I'd recommend using 3.10, because it's the lowest supported version.

I think we test against different operating systems as part of our release process. AFAIK most Ray Data devs use Mac, so you should be good on that front.

I have added the test as requested; please review again. Sorry for the delay: I was having trouble installing the test dependencies with Python 3.13, but switching to 3.10 helped. Thanks!

Contributor Author

My bad for using two GitHub accounts. I will be using this one from now on.

Member

Nw!

@ray-gardener (bot) added the labels core (Issues that should be addressed in Ray Core), data (Ray Data-related issues), and community-contribution (Contributed by the community) on Jan 16, 2026
Member

@bveeramani left a comment

Nice

@bveeramani enabled auto-merge (squash) on January 18, 2026 00:08
@github-actions (bot) disabled auto-merge on January 18, 2026 00:08
@github-actions (bot) added the go label (add ONLY when ready to merge, run all tests) on Jan 18, 2026
@limarkdcunha
Contributor Author

@bveeramani I have fixed all the issues, but I don't seem to have privileges to merge to master. Can you please take care of it? Thanks!

@bveeramani merged commit bf4b2d0 into ray-project:master on Jan 18, 2026
6 checks passed
@bveeramani
Member

Done, ty for the contribution!

Labels

community-contribution (Contributed by the community), core (Issues that should be addressed in Ray Core), data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests)

Development

Successfully merging this pull request may close these issues.

[Data] DefaultClusterAutoscalerV2 raises KeyError: 'CPU' on nodes with 0 logical CPU resources

3 participants