Conversation
Fix issue: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, npu:0 and cpu!
Code Review
This pull request introduces a monkey patch for `qwen3_vl_moe` to address a device mismatch error on NPU when using FSDP with parameter offloading. The patch correctly identifies that `grid_thw.device` should be used as the target device instead of `self.pos_embed.weight.device`. However, I found a critical issue in the implementation of the patch that will cause a runtime error; please see my comment for details.
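The core idea of the patch, roughly, is to derive the target device from a forward input rather than from a parameter that FSDP may have offloaded to CPU. A minimal, CPU-runnable sketch of that pattern (names hypothetical, not the actual `qwen3_vl_moe` code):

```python
import torch

def pick_target_device(grid_thw: torch.Tensor, pos_embed_weight: torch.Tensor) -> torch.device:
    # With FSDP parameter offloading, `pos_embed_weight.device` can report `cpu`
    # while the inputs already sit on `npu:0`; the input tensor's device is the
    # one the computation should follow.
    return grid_thw.device

grid_thw = torch.tensor([[1, 4, 8]])    # would live on npu:0 in the PR's setting
offloaded_weight = torch.zeros(256, 8)  # stands in for a weight offloaded to cpu
device = pick_target_device(grid_thw, offloaded_weight)
pos_ids = torch.arange(grid_thw[0, 1].item(), device=device)  # stays with the input
```

On a CPU-only machine both devices coincide, but under offloading the two calls above would otherwise land on different devices and trigger the reported RuntimeError.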
```python
h_idxs = torch.linspace(0, self.num_grid_per_side - 1, h)
w_idxs = torch.linspace(0, self.num_grid_per_side - 1, w)
```
The `steps` argument of `torch.linspace` must be an integer, but `h` and `w` are 0-dimensional tensors obtained by iterating over `grid_hs` and `grid_ws`. This will raise a `TypeError` and crash the program; use `.item()` to convert them to Python integers.
Additionally, performing these calculations on the CPU inside the loop and then transferring the results to the target device is inefficient. Consider performing the computations directly on `grid_thw.device` to avoid unnecessary data transfers between CPU and NPU.
```diff
- h_idxs = torch.linspace(0, self.num_grid_per_side - 1, h)
- w_idxs = torch.linspace(0, self.num_grid_per_side - 1, w)
+ h_idxs = torch.linspace(0, self.num_grid_per_side - 1, h.item())
+ w_idxs = torch.linspace(0, self.num_grid_per_side - 1, w.item())
```
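A hedged, CPU-runnable sketch of what the suggested change amounts to (variable names are approximate, not the actual `qwen3_vl_moe` code): `steps` becomes a Python int via `.item()`, and the index tensors are allocated directly on `grid_thw`'s device rather than built on CPU and copied over.

```python
import torch

num_grid_per_side = 16                 # grid resolution of the learned pos embed
grid_thw = torch.tensor([[1, 4, 8]])   # one (t, h, w) row per image; cpu here, npu:0 in the PR
device = grid_thw.device               # target device taken from the input

h_idxs_list, w_idxs_list = [], []
for _, h, w in grid_thw:
    # .item() turns the 0-dim tensors into Python ints, as `steps` requires,
    # and device=... allocates on the target device with no CPU round trip.
    h_idxs = torch.linspace(0, num_grid_per_side - 1, h.item(), device=device)
    w_idxs = torch.linspace(0, num_grid_per_side - 1, w.item(), device=device)
    h_idxs_list.append(h_idxs)
    w_idxs_list.append(w_idxs)
```

On NPU the `device=` argument is what avoids the per-iteration host-to-device copy the review comment warns about.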
I think we should fix NPU FSDP load/offload instead of patching a specific model. cc @ji-huazhong
+1 |
What does this PR do?
Checklist Before Starting
- [ ] Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI).
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`; multiple modules are written like `[megatron, fsdp, doc]`.
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`.
  - If the PR breaks any API, prepend `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`.

Test
API and Usage Example
```python
# Add code snippet or script demonstrating how to use this
```

Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- [ ] Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`.
- [ ] Once the PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- [ ] If the PR modifies the `recipe` submodule, also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.