[trainer] feat: Implemented VeomniEngine as an alternative training backend #4072
Conversation
Code Review
This pull request introduces VeomniEngine as a new training backend, which is a significant addition. The changes include new configuration files, modifications to existing configuration dataclasses, and the core implementation of the VeomniEngine. The implementation is still a draft with some commented-out code and NotImplementedErrors. I've identified a few critical issues that would cause runtime errors and a high-severity issue related to model evaluation mode. Addressing these will be crucial for making the engine functional.
@@ -0,0 +1,39 @@
# Target class for this configuration
_target_: verl.workers.config.VeOmniOptimizerConfig
Can we reuse fsdp optimizer config: verl/trainer/config/optim/fsdp.yaml?
> Can we reuse fsdp optimizer config: verl/trainer/config/optim/fsdp.yaml?
Directly using the FSDP optimizer does not work in the EP case; I still need to figure out why. Based on the error message, it seems there are two device meshes when EP is enabled: EP and (DP_shard, SP). They cannot share the same device_mesh, and the optimizer cannot work in this situation. So I might keep using VeOmni's optimizer for the EP adaptation.
In VeOmni, EP and non-EP parameters are optimized separately by distinct optimizers, as seen here: https://github.com/ByteDance-Seed/VeOmni/blob/889cb3379a1143f4aa178ff55dbb3b1bcb788135/veomni/optim/optimizer.py#L311 @wuxibin89
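To make the split-optimizer idea concrete, here is a minimal sketch (not VeOmni's or this PR's code) of keeping EP and non-EP parameters in separate optimizers. The name-based matching on "experts" and the AdamW hyperparameters are illustrative assumptions, not the library's actual logic.

```python
import torch
import torch.nn as nn


def build_split_optimizers(model: nn.Module, lr: float = 1e-5, weight_decay: float = 0.0):
    # Illustrative sketch: EP (expert-parallel) weights get their own optimizer,
    # since they are sharded over a different device mesh than the (dp_shard, sp) weights.
    ep_params, dense_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Matching "experts" in the parameter name is only a stand-in heuristic for MoE expert weights.
        (ep_params if "experts" in name else dense_params).append(param)

    # Build each optimizer only if its group is non-empty (dense-only models have no EP params).
    ep_optimizer = torch.optim.AdamW(ep_params, lr=lr, weight_decay=weight_decay) if ep_params else None
    dense_optimizer = torch.optim.AdamW(dense_params, lr=lr, weight_decay=weight_decay) if dense_params else None
    return ep_optimizer, dense_optimizer
```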
Great and beautiful work!!!
if parallel_state.get_parallel_state().ulysses_size > 1:
    return parallel_state.get_parallel_state().device_mesh["dp"].get_local_rank()
else:
    return torch.distributed.get_rank()
Why is dp_rank torch.distributed.get_rank() when ulysses_size == 1?
Do we support EP/PP?
> Do we support EP/PP?
Only SP and EP work right now.
> Why is dp_rank torch.distributed.get_rank() when ulysses_size == 1?
Lmao, I copied this function from the FSDPEngine style and forgot to refine it. And yeah, I can simply return parallel_state.get_parallel_state().device_mesh.get_local_rank("dp"). Will fix it later.
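For reference, a minimal sketch of the simplified version described above; the function name is hypothetical, and `parallel_state` is assumed to be the same module used in the snippet under review.

```python
def get_data_parallel_rank() -> int:
    # Hypothetical helper: read the DP rank straight from the device mesh,
    # instead of branching on ulysses_size as in the current snippet.
    return parallel_state.get_parallel_state().device_mesh.get_local_rank("dp")
```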
What does this PR do?
This PR introduces an implementation of VeOmniEngine for VERL, providing an alternative to the existing FSDP engine.
We plan to integrate the VeOmni engine in two phases. The first phase (part of this PR) completes the engine code development and performs basic validation via SFT. The second phase finishes the integration of the RL workflow and supplements the relevant documentation.
Checklist Before Starting
- Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI).
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`; multiple modules are listed like `[megatron, fsdp, doc]`.
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`.
  - If the PR breaks any API, add `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`.
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Request CI via the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)