[trainer] feat: Implemented VeomniEngine as an alternative training backend #4072
Conversation
Code Review
This pull request introduces VeomniEngine as a new training backend, which is a significant addition. The changes include new configuration files, modifications to existing configuration dataclasses, and the core implementation of the VeomniEngine. The implementation is still a draft with some commented-out code and NotImplementedErrors. I've identified a few critical issues that would cause runtime errors and a high-severity issue related to model evaluation mode. Addressing these will be crucial for making the engine functional.
@@ -0,0 +1,39 @@
# Target class for this configuration
_target_: verl.workers.config.VeOmniOptimizerConfig
Can we reuse fsdp optimizer config: verl/trainer/config/optim/fsdp.yaml?
Directly using the FSDP optimizer does not work in the EP case; I still need to figure out why. Based on the error message, it seems there are two device meshes when we enable EP: EP and (DP_shard, SP). They cannot share the same device mesh, and the optimizer cannot work in that situation. So I might keep using VeOmni's optimizer for the EP adaptation.
In VeOmni, the optimization of EP and non-EP parameters is managed separately by distinct optimizers, as seen here: https://github.com/ByteDance-Seed/VeOmni/blob/889cb3379a1143f4aa178ff55dbb3b1bcb788135/veomni/optim/optimizer.py#L311 @wuxibin89
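The separate-optimizer approach described above can be sketched as follows. This is a hypothetical illustration, not VeOmni's actual API: the name-based expert detection (matching "experts" in the parameter name) and the AdamW choice are assumptions made for the example.

```python
import torch


def build_optimizers(model, lr=1e-4):
    """Split parameters into EP and non-EP groups, each managed by its own
    optimizer, so the two groups can live on different device meshes.
    Hypothetical sketch: real EP detection depends on how the framework tags
    expert-parallel parameters; here we simply match 'experts' in the name."""
    ep_params, dense_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (ep_params if "experts" in name else dense_params).append(p)
    # One optimizer per group; either may be None if the group is empty.
    dense_opt = torch.optim.AdamW(dense_params, lr=lr) if dense_params else None
    ep_opt = torch.optim.AdamW(ep_params, lr=lr) if ep_params else None
    return dense_opt, ep_opt
```

In a training step, both optimizers would then be stepped together, each seeing only the gradients of its own parameter group.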
Great and beautiful work!!!
if parallel_state.get_parallel_state().ulysses_size > 1:
    return parallel_state.get_parallel_state().device_mesh["dp"].get_local_rank()
else:
    return torch.distributed.get_rank()
Why is dp_rank torch.distributed.get_rank() when ulysses_size == 1?
Do we support EP/PP?

Only SP and EP work right now.
Why is dp_rank torch.distributed.get_rank() when ulysses_size == 1?

Lmao, I copied this function from FSDPEngine and forgot to refine it. And yeah, I can simply return parallel_state.get_parallel_state().device_mesh.get_local_rank("dp"). Will fix it later.
What does this PR do?
This PR introduces an implementation of a VeOmniEngine for VERL, providing an alternative to the existing FSDP engine.
We plan to integrate the VeOmni engine in two phases. The first phase (as part of this PR) is to complete the engine code development and conduct basic validation via SFT. The second phase is to finish the integration of the RL workflow and supplement the relevant documentation.
Checklist Before Starting
[{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, like [megatron, fsdp, doc]
- {type} is in feat, fix, refactor, chore, test
- For breaking changes, add [BREAKING] to the beginning of the title, e.g. [BREAKING][fsdp, megatron] feat: dynamic batching

Test
API and Usage Example
# Add code snippet or script demonstrating how to use this

Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review; otherwise, the reviewer might deprioritize this PR.
- pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
- ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)