[ci] feat: add npu unit test #4626
base: main
Conversation
Code Review
This pull request adds NPU support to the unit tests by abstracting device-specific logic into utility functions. The changes primarily replace hardcoded `"cuda"` references with calls to new utilities such as `get_device_name`. While the overall approach is sound, I've identified a few critical issues in the implementation and usage of some of the new helper functions that could cause test failures in environments without NPU support. Detailed comments and code suggestions are provided below to address these problems and improve the robustness of the device detection logic.
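For context on the utilities being discussed, here is a minimal sketch of a safe device-detection helper in the spirit of `get_device_name`. It is an illustration only, not necessarily how verl implements it:

```python
import torch


def get_device_name() -> str:
    """Return "cuda", "npu", or "cpu" for the current machine (sketch only)."""
    if torch.cuda.is_available():
        return "cuda"
    # torch.npu only exists once the optional torch_npu package has been
    # imported, so guard the attribute access before calling is_available().
    if hasattr(torch, "npu") and torch.npu.is_available():
        return "npu"
    return "cpu"
```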
```python
def get_device_name_ray() -> str:
    """Function that gets the torch.device based on the current machine.
    This currently only supports CPU, CUDA, NPU.
    Returns:
        device
    """
    if torch.cuda.is_available():
        device = "GPU"
    elif torch.npu.is_available():
        device = "NPU"
    return device
```
This new helper function `get_device_name_ray` has two critical issues that will cause tests to crash:

- **Unsafe NPU check:** the call `torch.npu.is_available()` will raise an `AttributeError` if the `torch.npu` module is not present (e.g., in a CUDA-only environment).
- **`UnboundLocalError`:** if neither CUDA nor NPU is available, the `device` variable is never assigned, leading to an `UnboundLocalError` on the return statement.

Please use a safe method to check for NPU availability and ensure there is a fallback return value (e.g., `"CPU"`) to prevent crashes.
Suggested change:

```diff
 def get_device_name_ray() -> str:
     """Function that gets the torch.device based on the current machine.
     This currently only supports CPU, CUDA, NPU.
     Returns:
         device
     """
+    from verl.utils.device import is_npu_available
     if torch.cuda.is_available():
-        device = "GPU"
-    elif torch.npu.is_available():
-        device = "NPU"
-    return device
+        return "GPU"
+    elif is_npu_available:
+        return "NPU"
+    return "CPU"
```
Fixed.
```python
if torch.cuda.is_available():
    torch.cuda.set_device(cls.rank)
    cls.device = torch.device(f"cuda:{cls.rank}")
elif torch.npu.is_available():
    torch.npu.set_device(cls.rank)
    cls.device = torch.device(f"npu:{cls.rank}")
else:
    cls.device = torch.device("cpu")
```
The call `torch.npu.is_available()` is unsafe and will raise an `AttributeError` if the `torch.npu` module is not available (e.g., in a CUDA-only or CPU-only environment). This will cause the test to crash. Please use a safe way to check for NPU availability, for example by using `get_device_name` from `verl.utils.device`, to ensure the code runs correctly across different hardware environments.
Suggested change:

```diff
-if torch.cuda.is_available():
-    torch.cuda.set_device(cls.rank)
-    cls.device = torch.device(f"cuda:{cls.rank}")
-elif torch.npu.is_available():
-    torch.npu.set_device(cls.rank)
-    cls.device = torch.device(f"npu:{cls.rank}")
-else:
-    cls.device = torch.device("cpu")
+from verl.utils.device import get_device_name
+
+device_name = get_device_name()
+if device_name == "cuda":
+    torch.cuda.set_device(cls.rank)
+    cls.device = torch.device(f"cuda:{cls.rank}")
+elif device_name == "npu":
+    torch.npu.set_device(cls.rank)
+    cls.device = torch.device(f"npu:{cls.rank}")
+else:
+    cls.device = torch.device("cpu")
```
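As a design note on the suggestion above: the branch still calls each backend's `set_device` directly. An accessor that returns the backend namespace would collapse the branch further. The `get_torch_device` name below is purely illustrative and is not confirmed to exist in verl:

```python
import torch


def get_torch_device():
    """Illustrative helper: return the torch backend namespace, or None on CPU-only machines."""
    if torch.cuda.is_available():
        return torch.cuda
    if hasattr(torch, "npu") and torch.npu.is_available():
        return torch.npu
    return None


# Sketch of the setUpClass logic using the helper (mirrors the suggestion above):
# backend = get_torch_device()
# if backend is not None:
#     backend.set_device(cls.rank)
#     cls.device = torch.device(f"{get_device_name()}:{cls.rank}")
# else:
#     cls.device = torch.device("cpu")
```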
Fixed.
tests/utils/test_torch_functional.py (Outdated)
```diff
 # each rank holds tensor [rank+1]
-local = torch.tensor([float(rank + 1)], device=f"cuda:{rank}")
+local = torch.tensor([float(rank + 1)], device=get_device_name())
```
For consistency and clarity, it's better to explicitly specify the device with its index, similar to how it's done in the `_worker_mask` function in this same file (`device=f"{get_device_name()}:{rank}"`). While the current code works because `set_device` is called earlier, being explicit avoids reliance on implicit state and makes the code easier to understand and maintain.
Suggested change:

```diff
-local = torch.tensor([float(rank + 1)], device=get_device_name())
+local = torch.tensor([float(rank + 1)], device=f"{get_device_name()}:{rank}")
```
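For illustration, a minimal worker following the explicit `{device}:{rank}` pattern the reviewer recommends. The function name and arguments are hypothetical and assume the process group has already been initialized:

```python
import torch
import torch.distributed as dist


def _worker_sum(rank: int, world_size: int, device_name: str) -> None:
    # Pin the tensor to this rank's device explicitly rather than relying on
    # an earlier set_device() call.
    device = "cpu" if device_name == "cpu" else f"{device_name}:{rank}"
    local = torch.tensor([float(rank + 1)], device=device)
    # Each rank contributes rank + 1, so the sum is 1 + 2 + ... + world_size.
    dist.all_reduce(local, op=dist.ReduceOp.SUM)
    assert local.item() == world_size * (world_size + 1) / 2
```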
Fixed.
What does this PR do?
Checklist Before Starting
Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, like `[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If applicable, add `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
API and Usage Example
# Add code snippet or script demonstrating how to use this

Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Request CI via the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)