[trainer] feat: Add Torchtitan as alternative training engine #5051
wuxibin89 merged 33 commits into verl-project:main
Conversation
Code Review
The pull request introduces the Torchtitan engine, model, and optimizer configurations, along with their implementation. The changes involve adding new dataclasses and integrating them into the existing system. While the overall structure is in place, several critical issues related to configuration consistency, attribute definitions, and potential runtime errors have been identified. These issues primarily stem from mismatches between the new Torchtitan configurations and the expectations of the underlying Torchtitan and Verl utility functions. Addressing these will be crucial for the correct functioning of the new engine.
    context_parallel_degree=engine_config.context_parallel_size,
)

The Parallelism constructor and ParallelDims in _init_device_mesh expect an expert_tensor_parallel_degree (or etp). However, TorchtitanEngineConfig does not define an expert_tensor_parallel_size attribute. This omission will lead to an AttributeError when constructing the parallelism configuration. Please add expert_tensor_parallel_size to TorchtitanEngineConfig with an appropriate default value.

Suggested change:
    expert_parallel_degree=engine_config.expert_parallel_size,
    expert_tensor_parallel_degree=engine_config.expert_tensor_parallel_size,
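The requested fix can be sketched as a dataclass field with a safe default. The class body below is illustrative only (the real TorchtitanEngineConfig has many more fields); the point is that the attribute must exist so the Parallelism constructor can read it.

```python
from dataclasses import dataclass

# Illustrative sketch, not verl's full config: expert_tensor_parallel_size
# must be defined with a default of 1 (disabled) so that reading it in the
# Parallelism/ParallelDims construction never raises AttributeError.
@dataclass
class TorchtitanEngineConfig:
    tensor_parallel_size: int = 1
    expert_parallel_size: int = 1
    expert_tensor_parallel_size: int = 1  # the previously missing attribute

cfg = TorchtitanEngineConfig()
print(cfg.expert_tensor_parallel_size)  # 1
```

With the field present, the engine code can pass `engine_config.expert_tensor_parallel_size` through unchanged, and users who don't use expert tensor parallelism are unaffected by the default of 1.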
input_ids_rmpad_rolled = torch.roll(input_ids_rmpad, shifts=-1, dims=1)

if self.use_ulysses_sp:

The attribute self.use_ulysses_sp is accessed but not defined in TorchTitanEngineWithLMHead or its base class TorchTitanEngine. This will cause an AttributeError at runtime. This flag is used to control Ulysses sequence parallelism logic. Please ensure it is properly defined and initialized.

Suggested change:
if hasattr(self, "use_ulysses_sp") and self.use_ulysses_sp:

assert isinstance(self.engine, TorchTitanEngine)
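Rather than guarding every access with hasattr, the flag can be defined once in __init__. A minimal sketch, assuming the flag can be derived from a sequence-parallel size at construction time (the constructor parameter here is illustrative, not verl's actual signature):

```python
class TorchTitanEngine:
    def __init__(self, ulysses_sequence_parallel_size: int = 1):
        # Define the flag explicitly in __init__ so later reads like
        # `if self.use_ulysses_sp:` can never raise AttributeError,
        # even in subclasses such as TorchTitanEngineWithLMHead.
        self.use_ulysses_sp = ulysses_sequence_parallel_size > 1

print(TorchTitanEngine().use_ulysses_sp)  # False
print(TorchTitanEngine(ulysses_sequence_parallel_size=2).use_ulysses_sp)  # True
```

Initializing the attribute in the base class also makes the hasattr guard in the suggestion unnecessary.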
# Reshard the root FSDP module
if self.engine.engine_config.fsdp_size > 1:

The self.engine.engine_config.fsdp_size attribute is accessed in EngineEvalModeCtx but TorchtitanEngineConfig does not have an fsdp_size attribute. This will cause an AttributeError at runtime. If FSDP resharding logic is needed for Torchtitan, fsdp_size or an equivalent parameter should be added to TorchtitanEngineConfig.

Suggested change:
if hasattr(self.engine.engine_config, "fsdp_size") and self.engine.engine_config.fsdp_size > 1:

"""Zero gradients."""
dist_utils.clip_grad_norm_(
    [p for m in self.module for p in m.parameters()],
    self.job_config.training.max_norm,

The self.job_config attribute is accessed within optimizer_zero_grad but was not assigned to self in the __init__ method. The JobConfig instance was created as a local variable config. This will result in an AttributeError. Please assign the JobConfig instance to self.job_config in __init__.
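The fix amounts to storing the locally built JobConfig on self. A hedged sketch of the pattern (SimpleNamespace stands in for torchtitan's JobConfig, and the class and method names are illustrative):

```python
from types import SimpleNamespace

class TorchtitanOptimizer:
    def __init__(self, max_norm: float):
        # Before the fix, `config` was only a local variable and was lost
        # when __init__ returned; assigning it to self.job_config makes it
        # reachable from other methods.
        config = SimpleNamespace(training=SimpleNamespace(max_norm=max_norm))
        self.job_config = config

    def optimizer_zero_grad(self) -> float:
        # Stand-in for the clip_grad_norm_ call: it only needs
        # self.job_config to exist.
        return self.job_config.training.max_norm

opt = TorchtitanOptimizer(max_norm=1.0)
print(opt.optimizer_zero_grad())  # 1.0
```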
verl/workers/config/model.py
Outdated
class TorchtitanModelConfig(BaseConfig):
    name: str = "llama3"
    flavor: str = "debugmodel"
    hf_assets_path: str = "./tests/assets/tokenizer"

The default value for hf_assets_path is "./tests/assets/tokenizer". Using a path inside the tests/assets directory as the default for a production configuration class is problematic: the path is intended for testing and will likely fail in non-test environments where these assets don't exist. Please provide a more suitable default path, or make this field mandatory if there's no universal default.

Suggested change:
-    hf_assets_path: str = "./tests/assets/tokenizer"
+    hf_assets_path: str = ""
verl/workers/config/engine.py
Outdated
mixed_precision (bool): Mixed precision configuration for FSDP, default None
data_parallel_size (int): FSDP group size, default 1

The docstring for mixed_precision states "default None", but the field is defined as mixed_precision: bool = False. This creates a discrepancy between the documentation and the actual implementation. Please update the docstring to reflect the boolean type and its default value.

Suggested change:
- mixed_precision (bool): Mixed precision configuration for FSDP, default None
+ mixed_precision (bool): Mixed precision configuration for FSDP, default False
verl/workers/config/optimizer.py
Outdated
@dataclass
class TorchtitanOptimizerConfig(OptimizerConfig):
    """VeOmni optimizer configuration extending base OptimizerConfig.

The docstring for TorchtitanOptimizerConfig incorrectly states "VeOmni optimizer configuration". It should be "Torchtitan optimizer configuration" to match the class name.

Suggested change:
-    """VeOmni optimizer configuration extending base OptimizerConfig.
+    """Torchtitan optimizer configuration extending base OptimizerConfig.
verl/workers/config/engine.py
Outdated
Args:
    wrap_policy (Dict[str, Any]): Configuration for FSDP wrap policy.
    reshard_after_forward (str): The policy for applying `reshard_after_forward` within an FSDP setup, default "default"

The docstring for reshard_after_forward specifies (str) as its type, but the field is defined as Literal["default", "always", "never"]. The docstring should accurately reflect the Literal type for clarity.

Suggested change:
-    reshard_after_forward (str): The policy for applying `reshard_after_forward` within an FSDP setup, default "default"
+    reshard_after_forward (Literal["default", "always", "never"]): The policy for applying `reshard_after_forward` within an FSDP setup, default "default"
verl/workers/config/engine.py
Outdated
wrap_policy (Dict[str, Any]): Configuration for FSDP wrap policy.
reshard_after_forward (str): The policy for applying `reshard_after_forward` within an FSDP setup, default "default"
forward_prefetch (bool): Whether to prefetch parameters for next forward pass, default False
use_orig_params (bool): Whether to use original parameters when initialize FSDP1, default False

The docstring for use_orig_params mentions "FSDP1". Given that Torchtitan is described as using "FSDP2 + TP + PP", this reference might be outdated or misleading. Please clarify whether this parameter is still relevant for Torchtitan's FSDP2 implementation, or update the description accordingly.

Suggested change:
- use_orig_params (bool): Whether to use original parameters when initialize FSDP1, default False
+ use_orig_params (bool): Whether to use original parameters when initializing FSDP, default False
data_parallel_replicate_size (int): Data parallel replicate size, default 1
data_parallel_shard_size (int): Data parallel shard degree, default 1
tensor_parallel_size (int): Tensor parallel size, default 1
expert_parallel_size (int): Expert parallel size, default 1
pipeline_parallel_size (int): Pipeline parallel size, default 1
context_parallel_size (int): Ring-attn context parallel size, default 1
strategy (str): Strategy to use for distributed training, default "torchtitan"

The descriptions for data_parallel_size, data_parallel_replicate_size, data_parallel_shard_size, tensor_parallel_size, expert_parallel_size, pipeline_parallel_size, and context_parallel_size in the docstring refer to "FSDP group size" or similar FSDP-specific terms. These are general parallelism parameters, and the descriptions should be more generic to avoid confusion, or the FSDP reference should be removed if it's not directly related to FSDP for Torchtitan.

Suggested change:
data_parallel_size (int): Data parallel group size, default 1
data_parallel_replicate_size (int): Data parallel replicate size, default 1
data_parallel_shard_size (int): Data parallel shard degree, default 1
tensor_parallel_size (int): Tensor parallel size, default 1
expert_parallel_size (int): Expert parallel size, default 1
pipeline_parallel_size (int): Pipeline parallel size, default 1
context_parallel_size (int): Context parallel size, default 1
Force-pushed a2504b4 to 1b42553
Force-pushed 40da68e to 7960604
verl/utils/torch_functional.py
Outdated
@@ -743,6 +573,8 @@ def get_cosine_schedule_with_warmup(
    assert init_lr_ratio >= 0 and init_lr_ratio <= 1.0

    def lr_lambda(current_step):
        # 0-indexed step, hence + 1 adjustments
        current_step += 1

See the lr discussions here: pytorch/torchtitan#2333 (comment)

Can we still stay aligned with megatron if we make this change?

If not, shall we make this configurable?
/gemini review
Code Review
This pull request introduces Torchtitan as a new training engine, which is a significant feature addition. The implementation includes the necessary configuration files, the core engine logic, and updates to the end-to-end test script. While the overall structure is good, I've identified a critical bug in the test script that will prevent it from running, a high-risk change to a shared utility function that could cause unintended side effects, and a maintainability concern regarding the use of monkey-patching in the engine's implementation. Please address these points to ensure the stability and maintainability of the codebase.
engine=${backend} \
model=torchtitan_model \
model.attn_type=varlen \
model.hf_assets_path=${MODEL_PATH}

There's a missing backslash \ at the end of this line. In a multi-line shell command, each line except the last must end with a \. Without it, the command ends at this line and the shell interprets the next line (optim=${backend} \) as a separate, invalid command, causing the script to fail.

Suggested change:
- model.hf_assets_path=${MODEL_PATH}
+ model.hf_assets_path=${MODEL_PATH} \
verl/utils/torch_functional.py
Outdated
# # 0-indexed step, hence + 1 adjustments
current_step += 1

Modifying the current_step within the shared utility function get_cosine_schedule_with_warmup introduces a significant risk of unintended side effects. This change effectively converts the step counting from 0-indexed to 1-indexed for all callers of this function, which could break the learning rate scheduling for other engines (e.g., FSDP) that rely on the original behavior. A safer approach would be to handle the step indexing within the specific calling code that requires 1-based indexing, or to create a new, separate scheduler utility (e.g., get_cosine_schedule_with_warmup_1_indexed) to avoid impacting existing functionality.
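A third option is to make the indexing explicit via a parameter, so existing 0-indexed callers are untouched while Torchtitan can opt in. A sketch against a simplified linear-warmup + cosine schedule (not verl's actual signature):

```python
import math

def make_cosine_lr_lambda(num_warmup_steps: int, num_training_steps: int,
                          one_indexed: bool = False):
    """Return an lr_lambda multiplier. `one_indexed=True` reproduces the
    `current_step += 1` shift without changing 0-indexed callers."""
    def lr_lambda(current_step: int) -> float:
        if one_indexed:
            current_step += 1  # opt-in shift, scoped to this caller only
        if current_step < num_warmup_steps:
            return current_step / max(1, num_warmup_steps)
        progress = (current_step - num_warmup_steps) / max(
            1, num_training_steps - num_warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return lr_lambda

zero_idx = make_cosine_lr_lambda(10, 100)
one_idx = make_cosine_lr_lambda(10, 100, one_indexed=True)
print(zero_idx(0))  # 0.0  (warmup starts from zero, original behavior)
print(one_idx(0))   # 0.1  (first step already contributes one warmup tick)
```

The returned callable can be passed to torch.optim.lr_scheduler.LambdaLR; the only behavioral difference between the two variants is the off-by-one in the warmup ramp.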
import torchtitan.protocols.train_spec as train_spec_module

original_get_train_spec = train_spec_module.get_train_spec

def _get_train_spec_without_dataloader(model_name):
    train_spec = original_get_train_spec(model_name)
    train_spec.build_dataloader_fn = None
    return train_spec

train_spec_module.get_train_spec = _get_train_spec_without_dataloader

This global monkey-patch of torchtitan.protocols.train_spec.get_train_spec is risky and can lead to maintenance issues. It makes the code dependent on the internal implementation details of torchtitan. If torchtitan's train_spec module is refactored in a future version, this will break in a non-obvious way. It would be safer to investigate whether torchtitan.train.Trainer can be initialized without a dataloader through its public API, or whether a modified train_spec object can be constructed and passed without patching the module globally. If patching is unavoidable, it should be scoped as narrowly as possible and clearly documented as a fragile dependency.
# 3. Inline comments (after a field on the same line) are not allowed.
# 4. Indentation level is respected for nested fields.

_target_: verl.workers.config.TorchtitanModelConfig

Why not reuse verl/trainer/config/model/hf_model.yaml?
"""Move model and/or optimizer to CPU or GPU."""
super().to(device=device, model=model, optimizer=optimizer, grad=grad)

if self.engine_config.forward_only:

Does torchtitan share the same load/offload mechanism with the fsdp backend? In the fsdp backend, we use CPUOffloadPolicy for the forward_only reference model; I don't see any CPU offload policy here.

Nice catch. In Titan it's passed through here: https://github.com/pytorch/torchtitan/blob/27930cb612df4d6ebafdb21909749a9694bc167e/torchtitan/config/job_config.py#L270. Will update this in the next commit.
@@ -97,6 +122,10 @@ elif [ "$backend" = "veomni" ]; then
    ENGINE_CONFIG="$VEOMNI_ENGINE_CONFIG"
    echo "Using veomni engine"
    exp_name=gsm8k-${backend}-sp${SP_SIZE}-fsdp${FSDP_SIZE}-pad-${PAD_MODE}-use_remove_padding-${USE_REMOVE_PADDING}-mode-${mode}
elif [ "$backend" = "torchtitan" ]; then

Please verify the different parallelism configurations in tests/special_e2e/sft/test_sft_engine_all.sh.

Sounds good. I will incorporate TP/SP with this PR; for other parallelism, there will be separate PRs.
if hasattr(model_args, "attn_mask_type"):
    model_args.attn_mask_type = self.model_config.attn_mask_type

model = Model(

Is there a model support list in torchtitan? Or does it support any Hugging Face model?

This is the model list we support: https://github.com/pytorch/torchtitan/tree/main/torchtitan/models. Right now it doesn't support HF models, but it should be relatively easy if a user wants to add one. We are also actively working on adding VLMs. cc @shuhuayu @tianyu-l

We have a model backend aiming to support transformers models out of the box:
https://github.com/pytorch/torchtitan/tree/main/torchtitan/experiments/transformers_modeling_backend
data_parallel_size: 1

# Data parallel replicate size
data_parallel_replicate_size: 1

Is there any document explaining these parallelism settings?
@acisseJZhong Since there's quite some work to do, please open an issue to track the torchtitan integration roadmap. #4880

Created roadmap here: #5306. @wuxibin89 please feel free to add any items I might be missing.
# test with torchtitan fsdp=1
echo "run with tp1 pp1 cp1 fsdp2 num_gpus2"
BACKEND=torchtitan TP_SIZE=1 PP_SIZE=1 CP_SIZE=1 FSDP_SIZE=2 NUM_GPUS=2 bash tests/special_e2e/sft/run_sft_engine.sh
Well... it breaks the NPU and VLM CI; I think we'd better temporarily disable it in CI until we're ready. Check out these CI workflows:
.github/workflows/e2e_sft_llm_ascend.yml
.github/workflows/e2e_sft_llm.yml
.github/workflows/e2e_sft_vlm.yml
Ah, it's because I didn't pip install torchtitan in the CI flow. Let me install it and see if it passes.
Seems CI needs manual approval after I added pip install torchtitan. I will disable the torchtitan run in CI for now; we can add it back once we have everything, including RL, ready.
sanity check failed: python3 tests/special_sanity/check_device_api_usage.py --directory ./verl
@@ -0,0 +1,25 @@
# Format checks enforced on CI:

Please use verl/trainer/config/model/hf_model.yaml. All models should start from Hugging Face.

I think we can add backend-specific fields in hf_model.yaml, e.g.:

torchtitan:
  name: qwen3
  flavor: "0.6B"

This is not desired. We should strictly start from the Hugging Face naming and checkpoint, because this is where people create their models.

The name and flavor fields are required by Titan to get the corresponding train spec and model args. It does a 1-1 mapping to Hugging Face models. See more here: https://github.com/pytorch/torchtitan/blob/fde830de29c34c55b4cdc0209ac51f5b8084244e/torchtitan/models/llama3/__init__.py#L51
@vermouth1992 @wuxibin89 let me know if you have better ideas, but for now I think what the user needs to do is just explicitly pass in name and flavor. It still aligns with the HF naming and ckpt.
**Goal:** This PR makes the changes so that we can integrate Torchtitan as a trainer in Verl: verl-project/verl#5051

**Major changes:**
1. ~~Change LR schedule to be 0-indexed instead of 1-indexed, to align with Verl's [fsdp util](https://github.com/verl-project/verl/blob/d987199906f09ba53139df13e4528b2d575ec4ce/verl/utils/torch_functional.py#L745). See more analysis in https://docs.google.com/document/d/1YiFUvIa_JqTYpBd2Xj7ReH3Bw6wS07nKldycBX--uVE/edit?usp=sharing~~ ==> We decided not to change Titan's LR scheduler behavior.
<img width="993" height="571" alt="image" src="https://github.com/user-attachments/assets/e4012dbd-5624-45ff-b82b-a6225b91e1c0" />
2. ~~Add `position_block_causal` attn mask type, which creates a block causal mask based on `position_id` for both varlen and flex attention: [transformers reference](https://github.com/huggingface/transformers/blob/0c89522f2af2f85cf997193645a1e727d6b8c1d7/src/transformers/masking_utils.py#L708)~~ ==> this is added in Verl's Torchtitan Engine code instead.

**Todos:**
1. Enable PP. Right now [`pp_schedule.eval()`](https://github.com/pytorch/pytorch/blob/03406903616077227734f772d682fc6027513ecf/torch/distributed/pipelining/schedules.py#L402) does the microbatch split for us, as it takes in the whole batch. However, in verl we split the batch into microbatches before PP, and we'd love to pass in a list of pre-split microbatches to the PP schedule. (thanks for @H-Huang's help)
Two failing CI tests seem irrelevant to this PR; will add titan engine CI after the RL trainer is enabled.
# Torchtitan backend configuration
# Only used when engine backend is set to "torchtitan"
torchtitan:

This is still not desirable. All the models, including names and flavors, must start from a single Hugging Face folder. We can introduce a general model_implementation dict so that users can write attn_type and attn_mask_type inside this sub-config.

Added a helper function to derive the model name and flavor from the HF config, and got rid of attn_mask_type since it's not used. For attn_type, I moved it to TorchtitanEngineConfig since it's a more torchtitan-specific field (I don't want other training engines to have this field). Please let me know if you have different opinions @vermouth1992
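A helper like the one described above could look roughly like this. This is a hypothetical sketch, not verl's actual implementation: the function name and the lookup table are invented for illustration, and real code would inspect the HF config object rather than hard-code model ids.

```python
def derive_titan_model_spec(hf_model_name: str) -> tuple[str, str]:
    """Map a Hugging Face model id to the (name, flavor) pair that
    torchtitan uses to look up its train spec and model args."""
    # Illustrative mapping only; a real helper would derive this from
    # fields of the Hugging Face config (architecture, hidden size, ...).
    known = {
        "Qwen/Qwen3-0.6B": ("qwen3", "0.6B"),
        "meta-llama/Llama-3.1-8B": ("llama3", "8B"),
    }
    if hf_model_name not in known:
        raise ValueError(f"no torchtitan spec registered for {hf_model_name}")
    return known[hf_model_name]

print(derive_titan_model_spec("Qwen/Qwen3-0.6B"))  # ('qwen3', '0.6B')
```

This keeps the user-facing config anchored on the single Hugging Face model id, while torchtitan's name/flavor lookup stays an internal detail.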
…roject#5051)

### What does this PR do?

Integrate Torchtitan as a new training engine in Verl. This PR implements the basic APIs needed by Torchtitan Engine, and tested the SFT trainer in verl (qwen3 0.6b):

- Torchtitan Engine matches exactly with the FSDP engine for the SFT trainer
- `use_remove_padding=True` matches `use_remove_padding=False`
- TP/SP and FSDP work with both varlen and flex attention; numerics match with single process.

**Relevant PRs:**
- Torchtitan side changes in pytorch/torchtitan#2333.
- RFC for engine interfaces verl-project#1371
- Training engine interface design verl-project#1977
- Add Veomini Engine verl-project#4072

**Todos:** See roadmap here: verl-project#5306
- [ ] enable parallelism: enable PP, EP, CP
- [ ] make Torchtitan Engine work with RL trainer
- [ ] test multimodal input (ref: https://github.com/verl-project/verl/pull/4492/changes)

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

```
MODEL_ID=Qwen/Qwen3-0.6B BACKEND=torchtitan bash tests/special_e2e/sft/run_sft_engine.sh
MODEL_ID=Qwen/Qwen3-0.6B BACKEND=fsdp bash tests/special_e2e/sft/run_sft_engine.sh
```

`use_remove_padding=True`
<img width="1372" height="658" alt="image" src="https://github.com/user-attachments/assets/42c01ce6-f561-4c81-a562-e412a51ac296" />

`use_remove_padding=False`
<img width="1353" height="610" alt="image" src="https://github.com/user-attachments/assets/c8db130e-626c-4d38-8932-9b3218431da3" />

Test TP and FSDP
<img width="1324" height="683" alt="image" src="https://github.com/user-attachments/assets/12f8414d-041c-41fb-b915-012ed75c4adb" />

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
- [ ] If your PR is related to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.
After verl-project#5051 added a `dp_group is not None` guard in rearrange_micro_batches, the FSDP actor/critic calls to prepare_dynamic_batch (which do not pass dp_group) silently skipped the num_micro_batches all_reduce across data-parallel ranks. Under dynamic batching with uneven sequence lengths across DP ranks, this causes different ranks to compute different numbers of micro-batches. Since FSDP performs reduce-scatter on every backward() call, mismatched micro-batch counts lead to a deadlock where one rank waits for the other to participate in a collective that never comes. This is the same root cause as verl-project#5451 which fixed the megatron backend. This PR applies the equivalent fix to the FSDP backend. Fix: Pass the data-parallel process group to prepare_dynamic_batch in both dp_actor.py and dp_critic.py to restore proper DP synchronization of micro-batch counts.
### What does this PR do?

Integrate Torchtitan as a new training engine in verl. This PR implements the basic APIs needed by the Torchtitan engine and tests the SFT trainer in verl (Qwen3 0.6B):

- The Torchtitan engine matches the FSDP engine exactly for the SFT trainer
- `use_remove_padding=True` matches `use_remove_padding=False`
- TP/SP and FSDP work with both varlen and flex attention; numerics match the single-process run

**Relevant PRs:**

- Torchtitan-side changes in pytorch/torchtitan#2333
- RFC for engine interfaces: verl-project#1371
- Training engine interface design: verl-project#1977
- Add VeOmni engine: verl-project#4072

**Todos:** see the roadmap in verl-project#5306.

- [ ] Enable more parallelism: PP, EP, CP
- [ ] Make the Torchtitan engine work with the RL trainer
- [ ] Test multimodal input (ref: https://github.com/verl-project/verl/pull/4492/changes)

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`
  - If this PR involves multiple modules, separate them with `,`, like `[megatron, fsdp, doc]`
  - `{type}` is one of `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

```
MODEL_ID=Qwen/Qwen3-0.6B BACKEND=torchtitan bash tests/special_e2e/sft/run_sft_engine.sh
MODEL_ID=Qwen/Qwen3-0.6B BACKEND=fsdp bash tests/special_e2e/sft/run_sft_engine.sh
```

`use_remove_padding=True`

<img width="1372" height="658" alt="image" src="https://github.com/user-attachments/assets/42c01ce6-f561-4c81-a562-e412a51ac296" />

`use_remove_padding=False`

<img width="1353" height="610" alt="image" src="https://github.com/user-attachments/assets/c8db130e-626c-4d38-8932-9b3218431da3" />

Test TP and FSDP

<img width="1324" height="683" alt="image" src="https://github.com/user-attachments/assets/12f8414d-041c-41fb-b915-012ed75c4adb" />

> For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes, if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
- [ ] If your PR is related to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.
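The `use_remove_padding` comparison above hinges on packing variable-length sequences into a single flattened token stream with cumulative-length offsets, which is the layout varlen attention kernels consume instead of a padded batch. A minimal pure-Python sketch of that layout, assuming hypothetical helper names (`pack_sequences`, `unpack_sequences` are illustrative, not verl's actual API):

```python
# Sketch of the "remove padding" (sequence packing) layout used by varlen
# attention. Function names are illustrative, not verl's actual API.

def pack_sequences(seqs):
    """Flatten variable-length sequences and record cumulative offsets."""
    flat = []
    cu_seqlens = [0]  # tokens of sequence i live in flat[cu_seqlens[i]:cu_seqlens[i+1]]
    for seq in seqs:
        flat.extend(seq)
        cu_seqlens.append(len(flat))
    return flat, cu_seqlens

def unpack_sequences(flat, cu_seqlens):
    """Recover the original sequences from the packed layout."""
    return [flat[s:e] for s, e in zip(cu_seqlens[:-1], cu_seqlens[1:])]

if __name__ == "__main__":
    seqs = [[1, 2, 3], [4, 5], [6]]
    flat, cu = pack_sequences(seqs)
    print(flat)  # [1, 2, 3, 4, 5, 6]
    print(cu)    # [0, 3, 5, 6]
    print(unpack_sequences(flat, cu) == seqs)  # True
```

Because the packed stream carries no pad tokens, a correct implementation must produce numerics identical to the padded path, which is exactly what the `use_remove_padding=True` vs `=False` curves check.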
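The TP × FSDP layouts tested above require the engine config to describe a set of parallelism degrees whose product divides the world size, with the leftover ranks forming the data-parallel (FSDP) dimension. A hedged sketch of that bookkeeping, with illustrative field names (not verl's actual `TorchtitanEngineConfig`):

```python
from dataclasses import dataclass

@dataclass
class EngineParallelism:
    # Illustrative only: mirrors the kind of degrees a Torchtitan-style
    # engine config carries; not verl's actual TorchtitanEngineConfig.
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1
    context_parallel_size: int = 1

    def data_parallel_size(self, world_size: int) -> int:
        """Ranks left over after TP/PP/CP become the FSDP dimension."""
        denom = (self.tensor_parallel_size
                 * self.pipeline_parallel_size
                 * self.context_parallel_size)
        if world_size % denom != 0:
            raise ValueError(
                f"world_size {world_size} is not divisible by "
                f"tp*pp*cp = {denom}"
            )
        return world_size // denom

if __name__ == "__main__":
    p = EngineParallelism(tensor_parallel_size=2)
    print(p.data_parallel_size(8))  # 4
```

Validating the divisibility up front (rather than letting mesh construction fail later) is the kind of check that makes a missing or mismatched degree surface as a clear config error instead of a runtime `AttributeError` deep in device-mesh setup.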