[recipe, megatron, fsdp] Support fully async on NPU #4658
base: main
Conversation
Code Review
This pull request refactors the checkpointing and parameter synchronization logic to enable fully asynchronous training on NPUs by replacing ray.util.collective with a vllm-based stateless process group. This is a significant and necessary change for NPU support. The implementation is mostly solid, introducing a new distributed_util.py and adapting existing components. However, I've identified a critical issue where the device is hardcoded to 'npu' in a couple of places, which would break compatibility with other hardware like GPUs. Addressing this will make the solution robust and hardware-agnostic.
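For illustration of the reviewer's point, a hardware-agnostic way to pick the device (instead of hardcoding 'npu') could look like the minimal sketch below. `infer_device_name` is a hypothetical helper assuming torch_npu is installed on Ascend hosts; if the repo already exposes a device utility, that should be preferred.

```python
import torch


def infer_device_name() -> str:
    """Hypothetical helper: detect the accelerator type at runtime instead of hardcoding 'npu'."""
    # torch_npu registers the torch.npu namespace when it is installed.
    if hasattr(torch, "npu") and torch.npu.is_available():
        return "npu"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = torch.device(infer_device_name())
```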
Force-pushed from a650c49 to 170feb6.
Co-authored-by: hswei88 <[email protected]>
Force-pushed from 170feb6 to a7d49ea.
| f" offload model to cpu cost {offload_duration} seconds" | ||
| ) | ||
|
|
||
| @register(dispatch_mode=Dispatch.ONE_TO_ALL, blocking=False) |
This function looks the same in the fsdp and megatron workers; can we move it to a shared location and use it in both backends?
This could easily be done once this recipe migrates to engine workers. If we moved it into the Worker class now, it would mean another class to maintain in the recipe.
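For illustration only, the shared helper under discussion might be factored out roughly as in the sketch below. The names and module location are hypothetical; the real function lives in the PR's worker code and logs the offload duration as shown in the diff above.

```python
# Hypothetical shared utility (e.g. a common util module used by both the
# fsdp and megatron workers); names and location are illustrative only.
import logging
import time

import torch

logger = logging.getLogger(__name__)


def offload_model_to_cpu(model: torch.nn.Module) -> float:
    """Move model parameters and buffers to CPU and return the elapsed seconds."""
    start = time.perf_counter()
    model.to("cpu")
    offload_duration = time.perf_counter() - start
    logger.info(f"offload model to cpu cost {offload_duration} seconds")
    return offload_duration


# Each worker would then keep only a thin method decorated with
# @register(dispatch_mode=Dispatch.ONE_TO_ALL, blocking=False)
# that calls offload_model_to_cpu(self.module).
```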
What does this PR do?
Because ray.util.collective is not yet supported on NPU, this PR replaces the ray collective group with a vLLM stateless process group to support fully async training and the checkpoint engine.
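As a rough sketch of the approach, assuming vLLM's `StatelessProcessGroup` and `PyNcclCommunicator` APIs: the PR's new `distributed_util.py` presumably wraps an NPU/HCCL-backed equivalent, and all names, hosts, and ports below are illustrative placeholders.

```python
# Sketch of weight sync over a vLLM stateless process group. On NPU the NCCL
# communicator would be replaced by the Ascend/HCCL equivalent from vllm-ascend.
import torch
from vllm.distributed.utils import StatelessProcessGroup
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator


def build_weight_sync_group(host: str, port: int, rank: int, world_size: int,
                            device: torch.device) -> PyNcclCommunicator:
    # The stateless group coordinates over a TCP store and is independent of the
    # default torch.distributed group, so trainer and rollout workers can join it
    # without touching their own training/inference process groups.
    pg = StatelessProcessGroup.create(host=host, port=port, rank=rank,
                                      world_size=world_size)
    return PyNcclCommunicator(pg, device=device)


# Trainer rank 0 broadcasts updated weights; rollout workers receive them:
# comm = build_weight_sync_group("trainer-host", 29600, rank, world_size, device)
# for name, param in model.named_parameters():
#     comm.broadcast(param.data, src=0)
```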
Checklist Before Starting
- Title format: `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, e.g. `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API, add `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
We evaluated the performance and throughput of fully async training with Qwen3-30B-A3B.


With the FSDP backend, we achieved an even higher reward than a previous collocated training run (the pink line represents fully async training, while the others represent collocated runs).
As for throughput, comparing 64-rank fully async training to 128-rank collocated training, we achieved a more than 4x per-rank throughput gain, which means the 64-rank fully async run is about 2x faster than the collocated 128-rank run.
We also verified the recipe with Qwen3-0.6B in the cases below:
API and Usage Example
```python
# Add code snippet or script demonstrating how to use this
```

Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Request CI in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)