
Conversation

@A1waysBeenHere (Contributor) commented Nov 10, 2025

What does this PR do?

This PR introduces an implementation of VeOmniEngine for verl, providing an alternative to the existing FSDP engine.

We plan to integrate the VeOmni engine in two phases. The first phase (this PR) completes the engine code and performs basic validation via SFT. The second phase will finish the integration into the RL workflow and supplement the relevant documentation.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces VeomniEngine as a new training backend, which is a significant addition. The changes include new configuration files, modifications to existing configuration dataclasses, and the core implementation of the VeomniEngine. The implementation is still a draft with some commented-out code and NotImplementedErrors. I've identified a few critical issues that would cause runtime errors and a high-severity issue related to model evaluation mode. Addressing these will be crucial for making the engine functional.

@A1waysBeenHere A1waysBeenHere marked this pull request as draft November 11, 2025 01:29
@A1waysBeenHere A1waysBeenHere changed the title [Draft] Implemented VeomniEngine as a alternative training backend [Trainer] Implemented VeomniEngine as a alternative training backend Nov 11, 2025
@A1waysBeenHere A1waysBeenHere changed the title [Trainer] Implemented VeomniEngine as a alternative training backend [trainer] Implemented VeomniEngine as a alternative training backend Nov 11, 2025
@A1waysBeenHere A1waysBeenHere force-pushed the veomni_engine_loss_cal branch 3 times, most recently from 97cd3f7 to eed3b99 on November 12, 2025 02:32
@A1waysBeenHere A1waysBeenHere changed the title [trainer] Implemented VeomniEngine as a alternative training backend [WIP][trainer] Implemented VeomniEngine as a alternative training backend Nov 15, 2025
@A1waysBeenHere A1waysBeenHere force-pushed the veomni_engine_loss_cal branch 7 times, most recently from 1043cb1 to 002f03c on November 26, 2025 03:37
@A1waysBeenHere A1waysBeenHere changed the title [WIP][trainer] Implemented VeomniEngine as a alternative training backend [trainer]Implemented VeomniEngine as a alternative training backend Nov 28, 2025
@A1waysBeenHere A1waysBeenHere marked this pull request as ready for review November 29, 2025 03:27
@A1waysBeenHere A1waysBeenHere force-pushed the veomni_engine_loss_cal branch 4 times, most recently from 09c92ff to 33a2a63 on November 29, 2025 08:11
@ji-huazhong ji-huazhong force-pushed the veomni_engine_loss_cal branch 2 times, most recently from 00548ce to 7b5e767 on December 17, 2025 01:43
@ji-huazhong ji-huazhong force-pushed the veomni_engine_loss_cal branch from 7b5e767 to 1ed46d6 on December 17, 2025 02:23
@A1waysBeenHere A1waysBeenHere force-pushed the veomni_engine_loss_cal branch 2 times, most recently from edf4161 to bd01f7c on December 18, 2025 07:42
@@ -0,0 +1,39 @@
# Target class for this configuration
_target_: verl.workers.config.VeOmniOptimizerConfig
Collaborator

Can we reuse fsdp optimizer config: verl/trainer/config/optim/fsdp.yaml?

@A1waysBeenHere (Contributor, Author) commented Dec 18, 2025

Can we reuse fsdp optimizer config: verl/trainer/config/optim/fsdp.yaml?

Directly using the FSDP optimizer does not work for the EP case; I still need to figure out why. Based on the error message, it seems there are two device meshes when EP is enabled, one for EP and one for (DP_shard, SP). They cannot share the same device mesh, and the optimizer cannot work in this situation. So I might keep using VeOmni's optimizer for the EP adaptation.

Collaborator

In VeOmni, EP and non-EP parameters are optimized separately by distinct optimizers, as seen here: https://github.com/ByteDance-Seed/VeOmni/blob/889cb3379a1143f4aa178ff55dbb3b1bcb788135/veomni/optim/optimizer.py#L311 @wuxibin89
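
For reference, a minimal sketch in plain PyTorch (not VeOmni's actual API) of the pattern described above: EP and non-EP parameters get separate optimizers, since the two groups live on different device meshes. The `"experts"` name check is a hypothetical way to identify expert weights.

```python
import torch
from torch import nn


def build_optimizers(model: nn.Module, lr: float = 1e-5):
    """Split trainable parameters into EP (expert) and non-EP groups, one optimizer each."""
    ep_params, dense_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Hypothetical heuristic: expert weights carry "experts" in their name.
        (ep_params if "experts" in name else dense_params).append(param)

    dense_opt = torch.optim.AdamW(dense_params, lr=lr)
    ep_opt = torch.optim.AdamW(ep_params, lr=lr) if ep_params else None
    return dense_opt, ep_opt


def optimizer_step(dense_opt, ep_opt=None):
    # After backward, step both optimizers; gradients for each group are
    # reduced over that group's own device mesh by the training framework.
    for opt in (dense_opt, ep_opt):
        if opt is not None:
            opt.step()
            opt.zero_grad(set_to_none=True)
```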

@A1waysBeenHere A1waysBeenHere force-pushed the veomni_engine_loss_cal branch 4 times, most recently from 2eb7e08 to a1d02fd on December 18, 2025 13:14
@vermouth1992 (Collaborator)

Great and beautiful work!!!

if parallel_state.get_parallel_state().ulysses_size > 1:
    return parallel_state.get_parallel_state().device_mesh["dp"].get_local_rank()
else:
    return torch.distributed.get_rank()
Collaborator

Why is dp_rank torch.distributed.get_rank() when ulysses_size == 1?

Collaborator

Do we support EP/PP?

Contributor (Author)

Do we support EP/PP?

Only SP and EP work right now.

@A1waysBeenHere (Contributor, Author) commented Dec 19, 2025

Why is dp_rank torch.distributed.get_rank() when ulysses_size == 1?

Lmao, I copied this function following the FSDPEngine style and forgot to refine it. And yeah, I can simply return parallel_state.get_parallel_state().device_mesh.get_local_rank("dp"). Will fix it later.
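
A minimal sketch of the proposed fix, assuming `parallel_state` is the same module used in the quoted snippet above (the module path and the final implementation in the PR may differ):

```python
# Sketch only: take the DP rank from the device mesh in both cases, instead of
# falling back to torch.distributed.get_rank() when ulysses_size == 1.
def get_dp_rank() -> int:
    state = parallel_state.get_parallel_state()
    return state.device_mesh.get_local_rank("dp")
```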

@wuxibin89 wuxibin89 merged commit 81a1ab1 into volcengine:main Dec 19, 2025
160 of 214 checks passed