[trainer] feat: Add Torchtitan as alternative training engine #5051
wuxibin89 merged 33 commits into verl-project:main
Conversation
Code Review
The pull request introduces the Torchtitan engine, model, and optimizer configurations, along with their implementation. The changes involve adding new dataclasses and integrating them into the existing system. While the overall structure is in place, several critical issues related to configuration consistency, attribute definitions, and potential runtime errors have been identified. These issues primarily stem from mismatches between the new Torchtitan configurations and the expectations of the underlying Torchtitan and Verl utility functions. Addressing these will be crucial for the correct functioning of the new engine.
    context_parallel_degree=engine_config.context_parallel_size,
)

The Parallelism constructor and ParallelDims in _init_device_mesh expect an expert_tensor_parallel_degree (or etp). However, TorchtitanEngineConfig does not define an expert_tensor_parallel_size attribute. This omission will lead to an AttributeError when constructing the parallelism configuration. Please add expert_tensor_parallel_size to TorchtitanEngineConfig with an appropriate default value.

Suggested change:
    expert_parallel_degree=engine_config.expert_parallel_size,
    expert_tensor_parallel_degree=engine_config.expert_tensor_parallel_size,
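The requested fix can be sketched as a dataclass field with a safe default. The class body below is illustrative only (the real TorchtitanEngineConfig has many more fields); the point is that the attribute must exist so the Parallelism constructor can read it.

```python
from dataclasses import dataclass

# Illustrative sketch, not verl's full config: expert_tensor_parallel_size
# must be defined with a default of 1 (disabled) so that reading it in the
# Parallelism/ParallelDims construction never raises AttributeError.
@dataclass
class TorchtitanEngineConfig:
    tensor_parallel_size: int = 1
    expert_parallel_size: int = 1
    expert_tensor_parallel_size: int = 1  # the previously missing attribute

cfg = TorchtitanEngineConfig()
print(cfg.expert_tensor_parallel_size)  # 1
```

With the field present, the engine code can pass `engine_config.expert_tensor_parallel_size` through unchanged, and users who don't use expert tensor parallelism are unaffected by the default of 1.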
input_ids_rmpad_rolled = torch.roll(input_ids_rmpad, shifts=-1, dims=1)

if self.use_ulysses_sp:

The attribute self.use_ulysses_sp is accessed but not defined in TorchTitanEngineWithLMHead or its base class TorchTitanEngine. This will cause an AttributeError at runtime. This flag is used to control Ulysses sequence parallelism logic. Please ensure it is properly defined and initialized.

Suggested change:
if hasattr(self, "use_ulysses_sp") and self.use_ulysses_sp:

assert isinstance(self.engine, TorchTitanEngine)
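Rather than guarding every access with hasattr, the flag can be defined once in __init__. A minimal sketch, assuming the flag can be derived from a sequence-parallel size at construction time (the constructor parameter here is illustrative, not verl's actual signature):

```python
class TorchTitanEngine:
    def __init__(self, ulysses_sequence_parallel_size: int = 1):
        # Define the flag explicitly in __init__ so later reads like
        # `if self.use_ulysses_sp:` can never raise AttributeError,
        # even in subclasses such as TorchTitanEngineWithLMHead.
        self.use_ulysses_sp = ulysses_sequence_parallel_size > 1

print(TorchTitanEngine().use_ulysses_sp)  # False
print(TorchTitanEngine(ulysses_sequence_parallel_size=2).use_ulysses_sp)  # True
```

Initializing the attribute in the base class also makes the hasattr guard in the suggestion unnecessary.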
# Reshard the root FSDP module
if self.engine.engine_config.fsdp_size > 1:

The self.engine.engine_config.fsdp_size attribute is accessed in EngineEvalModeCtx but TorchtitanEngineConfig does not have an fsdp_size attribute. This will cause an AttributeError at runtime. If FSDP resharding logic is needed for Torchtitan, fsdp_size or an equivalent parameter should be added to TorchtitanEngineConfig.

Suggested change:
if hasattr(self.engine.engine_config, "fsdp_size") and self.engine.engine_config.fsdp_size > 1:

"""Zero gradients."""
dist_utils.clip_grad_norm_(
    [p for m in self.module for p in m.parameters()],
    self.job_config.training.max_norm,

The self.job_config attribute is accessed within optimizer_zero_grad but was not assigned to self in the __init__ method. The JobConfig instance was created as a local variable config. This will result in an AttributeError. Please assign the JobConfig instance to self.job_config in __init__.
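The fix amounts to storing the locally built JobConfig on self. A hedged sketch of the pattern (SimpleNamespace stands in for torchtitan's JobConfig, and the class and method names are illustrative):

```python
from types import SimpleNamespace

class TorchtitanOptimizer:
    def __init__(self, max_norm: float):
        # Before the fix, `config` was only a local variable and was lost
        # when __init__ returned; assigning it to self.job_config makes it
        # reachable from other methods.
        config = SimpleNamespace(training=SimpleNamespace(max_norm=max_norm))
        self.job_config = config

    def optimizer_zero_grad(self) -> float:
        # Stand-in for the clip_grad_norm_ call: it only needs
        # self.job_config to exist.
        return self.job_config.training.max_norm

opt = TorchtitanOptimizer(max_norm=1.0)
print(opt.optimizer_zero_grad())  # 1.0
```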
verl/workers/config/model.py
Outdated
class TorchtitanModelConfig(BaseConfig):
    name: str = "llama3"
    flavor: str = "debugmodel"
    hf_assets_path: str = "./tests/assets/tokenizer"

The default value for hf_assets_path is "./tests/assets/tokenizer". Using a path inside the tests/assets directory as the default for a production configuration class is problematic: the path is intended for testing and will likely fail in non-test environments where these assets don't exist. Please provide a more suitable default path, or make this field mandatory if there's no universal default.

Suggested change:
-    hf_assets_path: str = "./tests/assets/tokenizer"
+    hf_assets_path: str = ""
verl/workers/config/engine.py
Outdated
mixed_precision (bool): Mixed precision configuration for FSDP, default None
data_parallel_size (int): FSDP group size, default 1

The docstring for mixed_precision states "default None", but the field is defined as mixed_precision: bool = False. This creates a discrepancy between the documentation and the actual implementation. Please update the docstring to reflect the boolean type and its default value.

Suggested change:
- mixed_precision (bool): Mixed precision configuration for FSDP, default None
+ mixed_precision (bool): Mixed precision configuration for FSDP, default False
verl/workers/config/optimizer.py
Outdated
@dataclass
class TorchtitanOptimizerConfig(OptimizerConfig):
    """VeOmni optimizer configuration extending base OptimizerConfig.

The docstring for TorchtitanOptimizerConfig incorrectly states "VeOmni optimizer configuration". It should be "Torchtitan optimizer configuration" to match the class name.

Suggested change:
-    """VeOmni optimizer configuration extending base OptimizerConfig.
+    """Torchtitan optimizer configuration extending base OptimizerConfig.
verl/workers/config/engine.py
Outdated
Args:
    wrap_policy (Dict[str, Any]): Configuration for FSDP wrap policy.
    reshard_after_forward (str): The policy for applying `reshard_after_forward` within an FSDP setup, default "default"

The docstring for reshard_after_forward specifies (str) as its type, but the field is defined as Literal["default", "always", "never"]. The docstring should accurately reflect the Literal type for clarity.

Suggested change:
-    reshard_after_forward (str): The policy for applying `reshard_after_forward` within an FSDP setup, default "default"
+    reshard_after_forward (Literal["default", "always", "never"]): The policy for applying `reshard_after_forward` within an FSDP setup, default "default"
verl/workers/config/engine.py
Outdated
wrap_policy (Dict[str, Any]): Configuration for FSDP wrap policy.
reshard_after_forward (str): The policy for applying `reshard_after_forward` within an FSDP setup, default "default"
forward_prefetch (bool): Whether to prefetch parameters for next forward pass, default False
use_orig_params (bool): Whether to use original parameters when initialize FSDP1, default False

The docstring for use_orig_params mentions "FSDP1". Given that Torchtitan is described as using "FSDP2 + TP + PP", this reference might be outdated or misleading. Please clarify whether this parameter is still relevant for Torchtitan's FSDP2 implementation, or update the description accordingly.

Suggested change:
- use_orig_params (bool): Whether to use original parameters when initialize FSDP1, default False
+ use_orig_params (bool): Whether to use original parameters when initializing FSDP, default False
data_parallel_replicate_size (int): Data parallel replicate size, default 1
data_parallel_shard_size (int): Data parallel shard degree, default 1
tensor_parallel_size (int): Tensor parallel size, default 1
expert_parallel_size (int): Expert parallel size, default 1
pipeline_parallel_size (int): Pipeline parallel size, default 1
context_parallel_size (int): Ring-attn context parallel size, default 1
strategy (str): Strategy to use for distributed training, default "torchtitan"

The descriptions for data_parallel_size, data_parallel_replicate_size, data_parallel_shard_size, tensor_parallel_size, expert_parallel_size, pipeline_parallel_size, and context_parallel_size in the docstring refer to "FSDP group size" or similar FSDP-specific terms. These are general parallelism parameters, and the descriptions should be more generic to avoid confusion, or the FSDP reference should be removed if it's not directly related to FSDP for Torchtitan.

Suggested change:
data_parallel_size (int): Data parallel group size, default 1
data_parallel_replicate_size (int): Data parallel replicate size, default 1
data_parallel_shard_size (int): Data parallel shard degree, default 1
tensor_parallel_size (int): Tensor parallel size, default 1
expert_parallel_size (int): Expert parallel size, default 1
pipeline_parallel_size (int): Pipeline parallel size, default 1
context_parallel_size (int): Context parallel size, default 1
Force-pushed a2504b4 to 1b42553
Force-pushed 40da68e to 7960604
verl/utils/torch_functional.py
Outdated
@@ -743,6 +573,8 @@ def get_cosine_schedule_with_warmup(
    assert init_lr_ratio >= 0 and init_lr_ratio <= 1.0

    def lr_lambda(current_step):
        # 0-indexed step, hence + 1 adjustments
        current_step += 1

See the lr discussions here: pytorch/torchtitan#2333 (comment)

Can we still stay aligned with megatron if we make this change?

If not, shall we make this configurable?
/gemini review
Code Review
This pull request introduces Torchtitan as a new training engine, which is a significant feature addition. The implementation includes the necessary configuration files, the core engine logic, and updates to the end-to-end test script. While the overall structure is good, I've identified a critical bug in the test script that will prevent it from running, a high-risk change to a shared utility function that could cause unintended side effects, and a maintainability concern regarding the use of monkey-patching in the engine's implementation. Please address these points to ensure the stability and maintainability of the codebase.
engine=${backend} \
model=torchtitan_model \
model.attn_type=varlen \
model.hf_assets_path=${MODEL_PATH}

There's a missing backslash \ at the end of this line. In a multi-line shell command, each line except the last must end with a \. Without it, the command ends at this line and the shell interprets the next line (optim=${backend} \) as a separate, invalid command, causing the script to fail.

Suggested change:
- model.hf_assets_path=${MODEL_PATH}
+ model.hf_assets_path=${MODEL_PATH} \
verl/utils/torch_functional.py
Outdated
# # 0-indexed step, hence + 1 adjustments
current_step += 1

Modifying the current_step within the shared utility function get_cosine_schedule_with_warmup introduces a significant risk of unintended side effects. This change effectively converts the step counting from 0-indexed to 1-indexed for all callers of this function, which could break the learning rate scheduling for other engines (e.g., FSDP) that rely on the original behavior. A safer approach would be to handle the step indexing within the specific calling code that requires 1-based indexing, or to create a new, separate scheduler utility (e.g., get_cosine_schedule_with_warmup_1_indexed) to avoid impacting existing functionality.
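A third option is to make the indexing explicit via a parameter, so existing 0-indexed callers are untouched while Torchtitan can opt in. A sketch against a simplified linear-warmup + cosine schedule (not verl's actual signature):

```python
import math

def make_cosine_lr_lambda(num_warmup_steps: int, num_training_steps: int,
                          one_indexed: bool = False):
    """Return an lr_lambda multiplier. `one_indexed=True` reproduces the
    `current_step += 1` shift without changing 0-indexed callers."""
    def lr_lambda(current_step: int) -> float:
        if one_indexed:
            current_step += 1  # opt-in shift, scoped to this caller only
        if current_step < num_warmup_steps:
            return current_step / max(1, num_warmup_steps)
        progress = (current_step - num_warmup_steps) / max(
            1, num_training_steps - num_warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return lr_lambda

zero_idx = make_cosine_lr_lambda(10, 100)
one_idx = make_cosine_lr_lambda(10, 100, one_indexed=True)
print(zero_idx(0))  # 0.0  (warmup starts from zero, original behavior)
print(one_idx(0))   # 0.1  (first step already contributes one warmup tick)
```

The returned callable can be passed to torch.optim.lr_scheduler.LambdaLR; the only behavioral difference between the two variants is the off-by-one in the warmup ramp.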
import torchtitan.protocols.train_spec as train_spec_module

original_get_train_spec = train_spec_module.get_train_spec

def _get_train_spec_without_dataloader(model_name):
    train_spec = original_get_train_spec(model_name)
    train_spec.build_dataloader_fn = None
    return train_spec

train_spec_module.get_train_spec = _get_train_spec_without_dataloader

This global monkey-patch of torchtitan.protocols.train_spec.get_train_spec is risky and can lead to maintenance issues. It makes the code dependent on the internal implementation details of torchtitan. If torchtitan's train_spec module is refactored in a future version, this will break in a non-obvious way. It would be safer to investigate whether torchtitan.train.Trainer can be initialized without a dataloader through its public API, or whether a modified train_spec object can be constructed and passed without patching the module globally. If patching is unavoidable, it should be scoped as narrowly as possible and clearly documented as a fragile dependency.
# 3. Inline comments (after a field on the same line) are not allowed.
# 4. Indentation level is respected for nested fields.

_target_: verl.workers.config.TorchtitanModelConfig

Why not reuse verl/trainer/config/model/hf_model.yaml?
"""Move model and/or optimizer to CPU or GPU."""
super().to(device=device, model=model, optimizer=optimizer, grad=grad)

if self.engine_config.forward_only:

Does torchtitan share the same load/offload mechanism with the fsdp backend? In the fsdp backend, we use CPUOffloadPolicy for the forward_only reference model; I don't see any CPU offload policy here.

Nice catch. In Titan it's passed through here: https://github.com/pytorch/torchtitan/blob/27930cb612df4d6ebafdb21909749a9694bc167e/torchtitan/config/job_config.py#L270. Will update this in the next commit.
@@ -97,6 +122,10 @@ elif [ "$backend" = "veomni" ]; then
    ENGINE_CONFIG="$VEOMNI_ENGINE_CONFIG"
    echo "Using veomni engine"
    exp_name=gsm8k-${backend}-sp${SP_SIZE}-fsdp${FSDP_SIZE}-pad-${PAD_MODE}-use_remove_padding-${USE_REMOVE_PADDING}-mode-${mode}
elif [ "$backend" = "torchtitan" ]; then

Please verify the different parallelism configurations in tests/special_e2e/sft/test_sft_engine_all.sh.

Sounds good. I will incorporate TP/SP with this PR; for other parallelism, there will be separate PRs.
if hasattr(model_args, "attn_mask_type"):
    model_args.attn_mask_type = self.model_config.attn_mask_type

model = Model(

Is there a model support list in torchtitan? Or does it support any Hugging Face model?

This is the model list we support: https://github.com/pytorch/torchtitan/tree/main/torchtitan/models. Right now it doesn't support HF models, but it should be relatively easy if a user wants to add one. We are also actively working on adding VLMs. cc @shuhuayu @tianyu-l

We have a model backend aiming to support transformers models out of the box:
https://github.com/pytorch/torchtitan/tree/main/torchtitan/experiments/transformers_modeling_backend
data_parallel_size: 1

# Data parallel replicate size
data_parallel_replicate_size: 1

Is there any document explaining these parallelism settings?
@acisseJZhong Since there's quite some work to do, please open an issue to track the torchtitan integration roadmap. #4880

Created roadmap here: #5306. @wuxibin89 please feel free to add any items I might be missing.
# test with torchtitan fsdp=1
echo "run with tp1 pp1 cp1 fsdp2 num_gpus2"
BACKEND=torchtitan TP_SIZE=1 PP_SIZE=1 CP_SIZE=1 FSDP_SIZE=2 NUM_GPUS=2 bash tests/special_e2e/sft/run_sft_engine.sh
Well... it breaks the NPU and VLM CI; I think we'd better temporarily disable it in CI until we're ready. Check out these CI workflows:
.github/workflows/e2e_sft_llm_ascend.yml
.github/workflows/e2e_sft_llm.yml
.github/workflows/e2e_sft_vlm.yml
Ah, it's because I didn't pip install torchtitan in the CI flow. Let me install it and see if it passes.
Seems CI needs manual approval after I added pip install torchtitan. I will disable the torchtitan run in CI for now; we can add it back once we have everything, including RL, ready.
sanity check failed: python3 tests/special_sanity/check_device_api_usage.py --directory ./verl
@@ -0,0 +1,25 @@
# Format checks enforced on CI:

Please use verl/trainer/config/model/hf_model.yaml. All models should start from Hugging Face.

I think we can add backend-specific fields in hf_model.yaml, e.g.:

torchtitan:
  name: qwen3
  flavor: "0.6B"

This is not desired. We should strictly start from the Hugging Face naming and checkpoint, because this is where people create their models.

The name and flavor fields are required by Titan to get the corresponding train spec and model args. It does a 1-1 mapping to Hugging Face models. See more here: https://github.com/pytorch/torchtitan/blob/fde830de29c34c55b4cdc0209ac51f5b8084244e/torchtitan/models/llama3/__init__.py#L51
@vermouth1992 @wuxibin89 let me know if you have better ideas, but for now I think what the user needs to do is just explicitly pass in name and flavor. It still aligns with the HF naming and ckpt.
**Goal:** This PR makes the changes so that we can integrate Torchtitan as a trainer in Verl: verl-project/verl#5051

**Major changes:**
1. ~~Change LR schedule to be 0-indexed instead of 1-indexed, to align with Verl's [fsdp util](https://github.com/verl-project/verl/blob/d987199906f09ba53139df13e4528b2d575ec4ce/verl/utils/torch_functional.py#L745). See more analysis in https://docs.google.com/document/d/1YiFUvIa_JqTYpBd2Xj7ReH3Bw6wS07nKldycBX--uVE/edit?usp=sharing~~ ==> We decided not to change Titan's LR scheduler behavior.
<img width="993" height="571" alt="image" src="https://github.com/user-attachments/assets/e4012dbd-5624-45ff-b82b-a6225b91e1c0" />
2. ~~Add `position_block_causal` attn mask type, which creates a block causal mask based on `position_id` for both varlen and flex attention: [transformers reference](https://github.com/huggingface/transformers/blob/0c89522f2af2f85cf997193645a1e727d6b8c1d7/src/transformers/masking_utils.py#L708)~~ ==> this is added in Verl's Torchtitan Engine code instead.

**Todos:**
1. Enable PP. Right now [`pp_schedule.eval()`](https://github.com/pytorch/pytorch/blob/03406903616077227734f772d682fc6027513ecf/torch/distributed/pipelining/schedules.py#L402) does the microbatch split for us, as it takes in the whole batch. However, in verl we split the batch into microbatches before PP, and we'd love to pass in a list of pre-split microbatches to the PP schedule. (thanks for @H-Huang's help)
Two failing CI tests seem irrelevant to this PR; will add titan engine CI after the RL trainer is enabled.
# Torchtitan backend configuration
# Only used when engine backend is set to "torchtitan"
torchtitan:

This is still not desirable. All the models, including names and flavors, must start from a single Hugging Face folder. We can introduce a general model_implementation dict so that users can write attn_type and attn_mask_type inside this sub-config.

Added a helper function to derive the model name and flavor from the HF config, and got rid of attn_mask_type since it's not used. For attn_type, I moved it to TorchtitanEngineConfig since it's a more torchtitan-specific field (I don't want other training engines to have this field). Please let me know if you have different opinions @vermouth1992
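A helper like the one described above could look roughly like this. This is a hypothetical sketch, not verl's actual implementation: the function name and the lookup table are invented for illustration, and real code would inspect the HF config object rather than hard-code model ids.

```python
def derive_titan_model_spec(hf_model_name: str) -> tuple[str, str]:
    """Map a Hugging Face model id to the (name, flavor) pair that
    torchtitan uses to look up its train spec and model args."""
    # Illustrative mapping only; a real helper would derive this from
    # fields of the Hugging Face config (architecture, hidden size, ...).
    known = {
        "Qwen/Qwen3-0.6B": ("qwen3", "0.6B"),
        "meta-llama/Llama-3.1-8B": ("llama3", "8B"),
    }
    if hf_model_name not in known:
        raise ValueError(f"no torchtitan spec registered for {hf_model_name}")
    return known[hf_model_name]

print(derive_titan_model_spec("Qwen/Qwen3-0.6B"))  # ('qwen3', '0.6B')
```

This keeps the user-facing config anchored on the single Hugging Face model id, while torchtitan's name/flavor lookup stays an internal detail.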
…roject#5051)

### What does this PR do?

Integrate Torchtitan as a new training engine in Verl. This PR implements the basic APIs needed by Torchtitan Engine, and tested the SFT trainer in verl (qwen3 0.6b):

- Torchtitan Engine matches exactly with the FSDP engine for the SFT trainer
- `use_remove_padding=True` matches `use_remove_padding=False`
- TP/SP and FSDP work with both varlen and flex attention; numerics match with single process.

**Relevant PRs:**
- Torchtitan side changes in pytorch/torchtitan#2333.
- RFC for engine interfaces verl-project#1371
- Training engine interface design verl-project#1977
- Add Veomini Engine verl-project#4072

**Todos:** See roadmap here: verl-project#5306
- [ ] enable parallelism: enable PP, EP, CP
- [ ] make Torchtitan Engine work with RL trainer
- [ ] test multimodal input (ref: https://github.com/verl-project/verl/pull/4492/changes)

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

```
MODEL_ID=Qwen/Qwen3-0.6B BACKEND=torchtitan bash tests/special_e2e/sft/run_sft_engine.sh
MODEL_ID=Qwen/Qwen3-0.6B BACKEND=fsdp bash tests/special_e2e/sft/run_sft_engine.sh
```

`use_remove_padding=True`
<img width="1372" height="658" alt="image" src="https://github.com/user-attachments/assets/42c01ce6-f561-4c81-a562-e412a51ac296" />

`use_remove_padding=False`
<img width="1353" height="610" alt="image" src="https://github.com/user-attachments/assets/c8db130e-626c-4d38-8932-9b3218431da3" />

Test TP and FSDP
<img width="1324" height="683" alt="image" src="https://github.com/user-attachments/assets/12f8414d-041c-41fb-b915-012ed75c4adb" />

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
- [ ] If your PR is related to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.
After verl-project#5051 added a `dp_group is not None` guard in rearrange_micro_batches, the FSDP actor/critic calls to prepare_dynamic_batch (which do not pass dp_group) silently skipped the num_micro_batches all_reduce across data-parallel ranks. Under dynamic batching with uneven sequence lengths across DP ranks, this causes different ranks to compute different numbers of micro-batches. Since FSDP performs reduce-scatter on every backward() call, mismatched micro-batch counts lead to a deadlock where one rank waits for the other to participate in a collective that never comes. This is the same root cause as verl-project#5451 which fixed the megatron backend. This PR applies the equivalent fix to the FSDP backend. Fix: Pass the data-parallel process group to prepare_dynamic_batch in both dp_actor.py and dp_critic.py to restore proper DP synchronization of micro-batch counts.
### What does this PR do?

Integrate Torchtitan as a new training engine in verl. This PR implements the basic APIs needed by the Torchtitan engine and tests the SFT trainer in verl (Qwen3 0.6B):

- The Torchtitan engine matches the FSDP engine exactly for the SFT trainer
- `use_remove_padding=True` matches `use_remove_padding=False`
- TP/SP and FSDP work with both varlen and flex attention; numerics match the single-process run

**Relevant PRs:**

- Torchtitan-side changes in pytorch/torchtitan#2333
- RFC for engine interfaces: verl-project#1371
- Training engine interface design: verl-project#1977
- Add VeOmni engine: verl-project#4072

**Todos:** see the roadmap in verl-project#5306.

- [ ] Enable more parallelism: PP, EP, CP
- [ ] Make the Torchtitan engine work with the RL trainer
- [ ] Test multimodal input (ref: https://github.com/verl-project/verl/pull/4492/changes)

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`
  - If this PR involves multiple modules, separate them with `,`, like `[megatron, fsdp, doc]`
  - `{type}` is one of `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

```
MODEL_ID=Qwen/Qwen3-0.6B BACKEND=torchtitan bash tests/special_e2e/sft/run_sft_engine.sh
MODEL_ID=Qwen/Qwen3-0.6B BACKEND=fsdp bash tests/special_e2e/sft/run_sft_engine.sh
```

`use_remove_padding=True`

<img width="1372" height="658" alt="image" src="https://github.com/user-attachments/assets/42c01ce6-f561-4c81-a562-e412a51ac296" />

`use_remove_padding=False`

<img width="1353" height="610" alt="image" src="https://github.com/user-attachments/assets/c8db130e-626c-4d38-8932-9b3218431da3" />

Test TP and FSDP

<img width="1324" height="683" alt="image" src="https://github.com/user-attachments/assets/12f8414d-041c-41fb-b915-012ed75c4adb" />

> For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes, if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
- [ ] If your PR is related to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.
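The `use_remove_padding` comparison above hinges on packing variable-length sequences into a single flattened token stream with cumulative-length offsets, which is the layout varlen attention kernels consume instead of a padded batch. A minimal pure-Python sketch of that layout, assuming hypothetical helper names (`pack_sequences`, `unpack_sequences` are illustrative, not verl's actual API):

```python
# Sketch of the "remove padding" (sequence packing) layout used by varlen
# attention. Function names are illustrative, not verl's actual API.

def pack_sequences(seqs):
    """Flatten variable-length sequences and record cumulative offsets."""
    flat = []
    cu_seqlens = [0]  # tokens of sequence i live in flat[cu_seqlens[i]:cu_seqlens[i+1]]
    for seq in seqs:
        flat.extend(seq)
        cu_seqlens.append(len(flat))
    return flat, cu_seqlens

def unpack_sequences(flat, cu_seqlens):
    """Recover the original sequences from the packed layout."""
    return [flat[s:e] for s, e in zip(cu_seqlens[:-1], cu_seqlens[1:])]

if __name__ == "__main__":
    seqs = [[1, 2, 3], [4, 5], [6]]
    flat, cu = pack_sequences(seqs)
    print(flat)  # [1, 2, 3, 4, 5, 6]
    print(cu)    # [0, 3, 5, 6]
    print(unpack_sequences(flat, cu) == seqs)  # True
```

Because the packed stream carries no pad tokens, a correct implementation must produce numerics identical to the padded path, which is exactly what the `use_remove_padding=True` vs `=False` curves check.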
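The TP × FSDP layouts tested above require the engine config to describe a set of parallelism degrees whose product divides the world size, with the leftover ranks forming the data-parallel (FSDP) dimension. A hedged sketch of that bookkeeping, with illustrative field names (not verl's actual `TorchtitanEngineConfig`):

```python
from dataclasses import dataclass

@dataclass
class EngineParallelism:
    # Illustrative only: mirrors the kind of degrees a Torchtitan-style
    # engine config carries; not verl's actual TorchtitanEngineConfig.
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1
    context_parallel_size: int = 1

    def data_parallel_size(self, world_size: int) -> int:
        """Ranks left over after TP/PP/CP become the FSDP dimension."""
        denom = (self.tensor_parallel_size
                 * self.pipeline_parallel_size
                 * self.context_parallel_size)
        if world_size % denom != 0:
            raise ValueError(
                f"world_size {world_size} is not divisible by "
                f"tp*pp*cp = {denom}"
            )
        return world_size // denom

if __name__ == "__main__":
    p = EngineParallelism(tensor_parallel_size=2)
    print(p.data_parallel_size(8))  # 4
```

Validating the divisibility up front (rather than letting mesh construction fail later) is the kind of check that makes a missing or mismatched degree surface as a clear config error instead of a runtime `AttributeError` deep in device-mesh setup.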