[trainer] feat: Add Torchtitan as alternative training engine#5051

Merged
wuxibin89 merged 33 commits into verl-project:main from acisseJZhong:torchtitan_engine
Feb 20, 2026

Conversation

@acisseJZhong
Collaborator

@acisseJZhong acisseJZhong commented Jan 27, 2026

What does this PR do?

Integrate Torchtitan as a new training engine in verl. This PR implements the basic APIs needed by the Torchtitan engine and tests the SFT trainer in verl (Qwen3 0.6B):

  • Torchtitan Engine matches exactly with the FSDP engine for the SFT trainer
  • use_remove_padding=True matches use_remove_padding=False
  • TP/SP and FSDP work with both varlen and flex attention; numerics match the single-process run.

Relevant PRs:

  • Torchtitan side changes in pytorch/torchtitan#2333
  • RFC for engine interfaces: #1371
  • Training engine interface design: #1977
  • Add VeOmni Engine: #4072

Todos:
See roadmap here: #5306

  • enable parallelism: PP, EP, CP
  • make Torchtitan Engine work with RL trainer
  • test multimodal input (ref: https://github.com/verl-project/verl/pull/4492/changes)

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

MODEL_ID=Qwen/Qwen3-0.6B BACKEND=torchtitan bash tests/special_e2e/sft/run_sft_engine.sh
MODEL_ID=Qwen/Qwen3-0.6B BACKEND=fsdp bash tests/special_e2e/sft/run_sft_engine.sh

use_remove_padding=True
[image]
use_remove_padding=False
[image]

Test TP and FSDP
[image]

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@CLAassistant

CLAassistant commented Jan 27, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request introduces the Torchtitan engine, model, and optimizer configurations, along with their implementation. The changes involve adding new dataclasses and integrating them into the existing system. While the overall structure is in place, several critical issues related to configuration consistency, attribute definitions, and potential runtime errors have been identified. These issues primarily stem from mismatches between the new Torchtitan configurations and the expectations of the underlying Torchtitan and Verl utility functions. Addressing these will be crucial for the correct functioning of the new engine.

Comment on lines +121 to +122
context_parallel_degree=engine_config.context_parallel_size,
)
Contributor

critical

The Parallelism constructor and ParallelDims in _init_device_mesh expect an expert_tensor_parallel_degree (or etp). However, TorchtitanEngineConfig does not define an expert_tensor_parallel_size attribute. This omission will lead to an AttributeError when constructing the parallelism configuration. Please add expert_tensor_parallel_size to TorchtitanEngineConfig with an appropriate default value.

            expert_parallel_degree=engine_config.expert_parallel_size,
            expert_tensor_parallel_degree=engine_config.expert_tensor_parallel_size,
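A minimal sketch of the suggested fix, assuming TorchtitanEngineConfig is a plain dataclass (the other fields shown are an illustrative subset, not the full config):

```python
from dataclasses import dataclass


@dataclass
class TorchtitanEngineConfig:
    # illustrative subset of the existing parallelism fields
    tensor_parallel_size: int = 1
    expert_parallel_size: int = 1
    context_parallel_size: int = 1
    # new field: required because torchtitan's Parallelism/ParallelDims
    # expect an expert_tensor_parallel_degree (etp)
    expert_tensor_parallel_size: int = 1


cfg = TorchtitanEngineConfig()
# building the parallelism kwargs no longer raises AttributeError
parallel_kwargs = dict(
    expert_parallel_degree=cfg.expert_parallel_size,
    expert_tensor_parallel_degree=cfg.expert_tensor_parallel_size,
)
```

A default of 1 keeps the new field a no-op for non-MoE runs.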


input_ids_rmpad_rolled = torch.roll(input_ids_rmpad, shifts=-1, dims=1)

if self.use_ulysses_sp:
Contributor

critical

The attribute self.use_ulysses_sp is accessed but not defined in TorchTitanEngineWithLMHead or its base class TorchTitanEngine. This will cause an AttributeError at runtime. This flag is used to control Ulysses sequence parallelism logic. Please ensure it is properly defined and initialized.

            if hasattr(self, "use_ulysses_sp") and self.use_ulysses_sp:
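An alternative to the hasattr guard is to define the flag once at construction time; a minimal sketch, assuming the engine config carries a Ulysses SP size field (the field name here is hypothetical):

```python
from types import SimpleNamespace


class TorchTitanEngineSketch:
    """Illustrative stand-in for the engine base class."""

    def __init__(self, engine_config):
        # Hypothetical field name: derive the flag once in __init__ so
        # forward-path code can read self.use_ulysses_sp unconditionally.
        sp_size = getattr(engine_config, "ulysses_sequence_parallel_size", 1)
        self.use_ulysses_sp = sp_size > 1


plain = TorchTitanEngineSketch(SimpleNamespace())
sp = TorchTitanEngineSketch(SimpleNamespace(ulysses_sequence_parallel_size=2))
print(plain.use_ulysses_sp, sp.use_ulysses_sp)  # False True
```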

assert isinstance(self.engine, TorchTitanEngine)

# Reshard the root FSDP module
if self.engine.engine_config.fsdp_size > 1:
Contributor

critical

The self.engine.engine_config.fsdp_size attribute is accessed in EngineEvalModeCtx but TorchtitanEngineConfig does not have an fsdp_size attribute. This will cause an AttributeError at runtime. If FSDP resharding logic is needed for Torchtitan, fsdp_size or an equivalent parameter should be added to TorchtitanEngineConfig.

        if hasattr(self.engine.engine_config, "fsdp_size") and self.engine.engine_config.fsdp_size > 1:

"""Zero gradients."""
dist_utils.clip_grad_norm_(
[p for m in self.module for p in m.parameters()],
self.job_config.training.max_norm,
Contributor

critical

The self.job_config attribute is accessed within optimizer_zero_grad but was not assigned to self in the __init__ method. The JobConfig instance was created as a local variable config. This will result in an AttributeError. Please assign the JobConfig instance to self.job_config in __init__.

            [p for m in self.module for p in m.parameters()],
            self.job_config.training.max_norm,
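The fix is to keep a reference on the instance instead of a local variable; a sketch with a stand-in config object (the attribute layout mirrors the `training.max_norm` access above, but the stand-in itself is illustrative):

```python
from types import SimpleNamespace


class TorchTitanEngineSketch:
    def __init__(self, job_config):
        # assign the JobConfig to self rather than leaving it a local,
        # so methods like optimizer_zero_grad can read it later
        self.job_config = job_config

    def max_grad_norm(self):
        return self.job_config.training.max_norm


# stand-in for torchtitan's JobConfig, for illustration only
cfg = SimpleNamespace(training=SimpleNamespace(max_norm=1.0))
engine = TorchTitanEngineSketch(cfg)
print(engine.max_grad_norm())  # 1.0
```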

class TorchtitanModelConfig(BaseConfig):
name: str = "llama3"
flavor: str = "debugmodel"
hf_assets_path: str = "./tests/assets/tokenizer"
Contributor

critical

The default value for hf_assets_path is set to "./tests/assets/tokenizer". Using a path specifically located within the tests/assets directory as a default for a production configuration class is highly problematic. This path is intended for testing and will likely cause issues or failures in non-test environments where these assets might not exist or be appropriate. Please provide a more suitable default path or make this field mandatory if there's no universal default.

Suggested change
hf_assets_path: str = "./tests/assets/tokenizer"
hf_assets_path: str = ""

Comment on lines +286 to +287
mixed_precision (bool): Mixed precision configuration for FSDP, default None
data_parallel_size (int): FSDP group size, default 1
Contributor

high

The docstring for mixed_precision states "default None", but the field is defined as mixed_precision: bool = False. This creates a discrepancy between the documentation and the actual implementation. Please update the docstring to reflect the boolean type and its default value.

Suggested change
mixed_precision (bool): Mixed precision configuration for FSDP, default None
data_parallel_size (int): FSDP group size, default 1
mixed_precision (bool): Mixed precision configuration for FSDP, default False


@dataclass
class TorchtitanOptimizerConfig(OptimizerConfig):
"""VeOmni optimizer configuration extending base OptimizerConfig.
Contributor

high

The docstring for TorchtitanOptimizerConfig incorrectly states "VeOmni optimizer configuration". It should be "Torchtitan optimizer configuration" to match the class name.

Suggested change
"""VeOmni optimizer configuration extending base OptimizerConfig.
"""Torchtitan optimizer configuration extending base OptimizerConfig.


Args:
wrap_policy (Dict[str, Any]): Configuration for FSDP wrap policy.
reshard_after_forward (str): The policy for applying `reshard_after_forward` within an FSDP setup, default "default"
Contributor

high

The docstring for reshard_after_forward specifies (str) as its type, but the field is defined as Literal["default", "always", "never"]. The docstring should accurately reflect the Literal type for clarity.

Suggested change
reshard_after_forward (str): The policy for applying `reshard_after_forward` within an FSDP setup, default "default"
reshard_after_forward (Literal["default", "always", "never"]): The policy for applying `reshard_after_forward` within an FSDP setup, default "default"

wrap_policy (Dict[str, Any]): Configuration for FSDP wrap policy.
reshard_after_forward (str): The policy for applying `reshard_after_forward` within an FSDP setup, default "default"
forward_prefetch (bool): Whether to prefetch parameters for next forward pass, default False
use_orig_params (bool): Whether to use original parameters when initialize FSDP1, default False
Contributor

high

The docstring for use_orig_params mentions "FSDP1". Given that Torchtitan is described as using "FSDP2 + TP + PP", this reference might be outdated or misleading. Please clarify if this parameter is still relevant for Torchtitan's FSDP2 implementation or update the description accordingly.

Suggested change
use_orig_params (bool): Whether to use original parameters when initialize FSDP1, default False
use_orig_params (bool): Whether to use original parameters when initializing FSDP, default False

Comment on lines +288 to +294
data_parallel_replicate_size (int): Data parallel replicate size, default 1
data_parallel_shard_size (int): Data parallel shard degree, default 1
tensor_parallel_size (int): Tensor parallel size, default 1
expert_parallel_size (int): Expert parallel size, default 1
pipeline_parallel_size (int): Pipeline parallel size, default 1
context_parallel_size (int): Ring-attn context parallel size, default 1
strategy (str): Strategy to use for distributed training, default "torchtitan"
Contributor

high

The descriptions for data_parallel_size, data_parallel_replicate_size, data_parallel_shard_size, tensor_parallel_size, expert_parallel_size, pipeline_parallel_size, and context_parallel_size in the docstring refer to "FSDP group size" or similar FSDP-specific terms. These are general parallelism parameters, and the descriptions should be more generic to avoid confusion, or the FSDP reference should be removed if it's not directly related to FSDP for Torchtitan.

Suggested change
data_parallel_replicate_size (int): Data parallel replicate size, default 1
data_parallel_shard_size (int): Data parallel shard degree, default 1
tensor_parallel_size (int): Tensor parallel size, default 1
expert_parallel_size (int): Expert parallel size, default 1
pipeline_parallel_size (int): Pipeline parallel size, default 1
context_parallel_size (int): Ring-attn context parallel size, default 1
strategy (str): Strategy to use for distributed training, default "torchtitan"
data_parallel_size (int): Data parallel group size, default 1
data_parallel_replicate_size (int): Data parallel replicate size, default 1
data_parallel_shard_size (int): Data parallel shard degree, default 1
tensor_parallel_size (int): Tensor parallel size, default 1
expert_parallel_size (int): Expert parallel size, default 1
pipeline_parallel_size (int): Pipeline parallel size, default 1
context_parallel_size (int): Context parallel size, default 1
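These degrees are not independent: following torchtitan's usual convention, their product must equal the world size. A small sanity-check sketch (function name and values are hypothetical):

```python
def check_world_size(world_size, **degrees):
    """Verify that the parallel degrees multiply to the world size."""
    product = 1
    for name, degree in degrees.items():
        assert degree >= 1, f"{name} must be >= 1"
        product *= degree
    return product == world_size


# e.g. 8 GPUs split as FSDP shard 2 x TP 2 x PP 2 x CP 1
ok = check_world_size(
    8,
    data_parallel_shard_size=2,
    tensor_parallel_size=2,
    pipeline_parallel_size=2,
    context_parallel_size=1,
)
print(ok)  # True
```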

@acisseJZhong acisseJZhong marked this pull request as draft January 27, 2026 23:38
@acisseJZhong acisseJZhong marked this pull request as ready for review February 6, 2026 23:51
@acisseJZhong acisseJZhong force-pushed the torchtitan_engine branch 2 times, most recently from 40da68e to 7960604 on February 7, 2026 01:20
@acisseJZhong acisseJZhong changed the title [WIP] Try Add Torchtitan Engine [trainer] feat: Add Torchtitan as alternative training engine Feb 10, 2026
@@ -743,6 +573,8 @@ def get_cosine_schedule_with_warmup(
assert init_lr_ratio >= 0 and init_lr_ratio <= 1.0

def lr_lambda(current_step):
# # 0-indexed step, hence + 1 adjustments
current_step += 1
Collaborator Author

see lr discussions here pytorch/torchtitan#2333 (comment)

Collaborator

Can we still align with Megatron if we make this change?

Collaborator

If not, then shall we make this configurable?

@wuxibin89
Collaborator

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces Torchtitan as a new training engine, which is a significant feature addition. The implementation includes the necessary configuration files, the core engine logic, and updates to the end-to-end test script. While the overall structure is good, I've identified a critical bug in the test script that will prevent it from running, a high-risk change to a shared utility function that could cause unintended side effects, and a maintainability concern regarding the use of monkey-patching in the engine's implementation. Please address these points to ensure the stability and maintainability of the codebase.

engine=${backend} \
model=torchtitan_model \
model.attn_type=varlen \
model.hf_assets_path=${MODEL_PATH}
Contributor

critical

There's a missing backslash \ at the end of this line. In a multi-line shell variable assignment, each line except the last one must end with a \. Without it, the shell will concatenate this line with the next one (optim=${backend} \), leading to an invalid command argument and causing the script to fail.

Suggested change
model.hf_assets_path=${MODEL_PATH}
model.hf_assets_path=${MODEL_PATH} \

Collaborator

@acisseJZhong Please resolve this comment.

Comment on lines +576 to +577
# # 0-indexed step, hence + 1 adjustments
current_step += 1
Contributor

high

Modifying the current_step within the shared utility function get_cosine_schedule_with_warmup introduces a significant risk of unintended side effects. This change effectively converts the step counting from 0-indexed to 1-indexed for all callers of this function, which could break the learning rate scheduling for other engines (e.g., FSDP) that rely on the original behavior. A safer approach would be to handle the step indexing within the specific calling code that requires 1-based indexing, or to create a new, separate scheduler utility (e.g., get_cosine_schedule_with_warmup_1_indexed) to avoid impacting existing functionality.
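The difference the `+ 1` makes is easiest to see side by side. A sketch of a cosine-with-warmup lambda with a configurable indexing offset (the `one_indexed` knob and function name are hypothetical, illustrating the "make it configurable" option rather than verl's actual utility):

```python
import math


def cosine_warmup_lambda(num_warmup, num_total, one_indexed=False):
    """Return an lr multiplier function; one_indexed shifts the step by +1."""

    def lr_lambda(step):
        if one_indexed:
            step += 1  # the shift under discussion
        if step < num_warmup:
            return step / max(1, num_warmup)
        progress = (step - num_warmup) / max(1, num_total - num_warmup)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    return lr_lambda


zero = cosine_warmup_lambda(10, 100)
one = cosine_warmup_lambda(10, 100, one_indexed=True)
print(zero(0), one(0))  # 0.0 0.1 -- the very first step already differs
```

With the shift applied globally, every caller's schedule moves by one step, which is why scoping it to the callers that need 1-based indexing is safer.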

Comment on lines +95 to +104
import torchtitan.protocols.train_spec as train_spec_module

original_get_train_spec = train_spec_module.get_train_spec

def _get_train_spec_without_dataloader(model_name):
train_spec = original_get_train_spec(model_name)
train_spec.build_dataloader_fn = None
return train_spec

train_spec_module.get_train_spec = _get_train_spec_without_dataloader
Contributor

high

This global monkey-patch of torchtitan.protocols.train_spec.get_train_spec is risky and can lead to maintenance issues. It makes the code dependent on the internal implementation details of torchtitan. If torchtitan's train_spec module is refactored in a future version, this will break in a non-obvious way. It would be safer to investigate if torchtitan.train.Trainer can be initialized without a dataloader through its public API, or if a modified train_spec object can be constructed and passed without patching the module globally. If patching is unavoidable, it should be scoped as narrowly as possible and clearly documented as a fragile dependency.
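If patching is unavoidable, it can at least be scoped narrowly. A sketch using a context manager so the patch only lives for one block (the stand-in module below is hypothetical, standing in for torchtitan.protocols.train_spec, which may not be importable here):

```python
from contextlib import contextmanager
from types import SimpleNamespace
from unittest import mock


@contextmanager
def patched_train_spec(train_spec_module):
    """Scope the get_train_spec patch to one block instead of the whole process."""
    original = train_spec_module.get_train_spec

    def _without_dataloader(model_name):
        spec = original(model_name)
        spec.build_dataloader_fn = None  # drop the dataloader, keep the rest
        return spec

    with mock.patch.object(train_spec_module, "get_train_spec", _without_dataloader):
        yield


# stand-in module for illustration
fake_module = SimpleNamespace(
    get_train_spec=lambda name: SimpleNamespace(build_dataloader_fn="real_fn")
)
with patched_train_spec(fake_module):
    assert fake_module.get_train_spec("llama3").build_dataloader_fn is None
# outside the context the original function is restored
assert fake_module.get_train_spec("llama3").build_dataloader_fn == "real_fn"
```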

engine=${backend} \
model=torchtitan_model \
model.attn_type=varlen \
model.hf_assets_path=${MODEL_PATH}
Collaborator

@acisseJZhong Please resolve this comment.

# 3. Inline comments (after a field on the same line) are not allowed.
# 4. Indentation level is respected for nested fields.

_target_: verl.workers.config.TorchtitanModelConfig
Collaborator

Why not reuse verl/trainer/config/model/hf_model.yaml?

"""Move model and/or optimizer to CPU or GPU."""
super().to(device=device, model=model, optimizer=optimizer, grad=grad)

if self.engine_config.forward_only:
Collaborator

Does torchtitan share same load/offload mechanism with fsdp backend? In fsdp backend, we use CPUOffloadPolicy for forward_only reference model, I don't see any cpu offload policy here.

Collaborator Author

Nice catch. In Titan it's passed through here: https://github.com/pytorch/torchtitan/blob/27930cb612df4d6ebafdb21909749a9694bc167e/torchtitan/config/job_config.py#L270. Will update this in the next commit.

@@ -97,6 +122,10 @@ elif [ "$backend" = "veomni" ]; then
ENGINE_CONFIG="$VEOMNI_ENGINE_CONFIG"
echo "Using veomni engine"
exp_name=gsm8k-${backend}-sp${SP_SIZE}-fsdp${FSDP_SIZE}-pad-${PAD_MODE}-use_remove_padding-${USE_REMOVE_PADDING}-mode-${mode}
elif [ "$backend" = "torchtitan" ]; then
Collaborator

Please verify different parallelism in tests/special_e2e/sft/test_sft_engine_all.sh

Collaborator Author

Sounds good. I will incorporate TP/SP in this PR; the other parallelisms will have separate PRs.

if hasattr(model_args, "attn_mask_type"):
model_args.attn_mask_type = self.model_config.attn_mask_type

model = Model(
Collaborator

Is there a model support list in torchtitan? or does it support any huggingface model?

Collaborator Author

@acisseJZhong acisseJZhong Feb 13, 2026

This is the model list we support: https://github.com/pytorch/torchtitan/tree/main/torchtitan/models. Right now it doesn't support HF models, but it should be relatively easy to add one if a user wants to.

We are also actively working on adding VLMs. cc @shuhuayu @tianyu-l


We have a model backend aiming to support transformers models out-of-box.
https://github.com/pytorch/torchtitan/tree/main/torchtitan/experiments/transformers_modeling_backend

data_parallel_size: 1

# Data parallel replicate size
data_parallel_replicate_size: 1
Collaborator

Is there any document explain these parallelism?

Collaborator Author

@wuxibin89
Collaborator

wuxibin89 commented Feb 12, 2026

@acisseJZhong Since there's quite a lot of work to do, please open an issue to track the torchtitan integration roadmap. #4880

@acisseJZhong
Collaborator Author

Created the roadmap here: #5306. @wuxibin89, please feel free to add any items I might be missing.


# test with torchtitan fsdp=1
echo "run with tp1 pp1 cp1 fsdp2 num_gpus2"
BACKEND=torchtitan TP_SIZE=1 PP_SIZE=1 CP_SIZE=1 FSDP_SIZE=2 NUM_GPUS=2 bash tests/special_e2e/sft/run_sft_engine.sh
Collaborator

@wuxibin89 wuxibin89 Feb 13, 2026

Well... it breaks the NPU and VLM CI. I think we'd better temporarily disable it in CI until we're ready. Check out the CI workflows:

.github/workflows/e2e_sft_llm_ascend.yml
.github/workflows/e2e_sft_llm.yml
.github/workflows/e2e_sft_vlm.yml

Collaborator Author

Ah, it's because I didn't pip install torchtitan in the CI flow. Let me install it and see if it passes.

Collaborator Author

Seems CI needs manual approval after I added pip install torchtitan. I will disable the torchtitan run in CI for now; we can add it back once we have everything, including RL, ready.

@wuxibin89
Collaborator

sanity check failed:

python3 tests/special_sanity/check_device_api_usage.py --directory ./verl

@@ -0,0 +1,25 @@
# Format checks enforced on CI:
Collaborator

Please use verl/trainer/config/model/hf_model.yaml. All models should start from huggingface.

Collaborator

I think we can add backend-specific fields in hf_model.yaml, e.g.:

torchtitan:
  name: qwen3
  flavor: "0.6B"

Collaborator

This is not desirable. We should strictly start from huggingface naming and checkpoints, because this is where people create their models.

Collaborator Author

@acisseJZhong acisseJZhong Feb 14, 2026

The name and flavor fields are required by Titan to get the corresponding train spec and model args. There is a 1-1 mapping to Hugging Face models. See more here: https://github.com/pytorch/torchtitan/blob/fde830de29c34c55b4cdc0209ac51f5b8084244e/torchtitan/models/llama3/__init__.py#L51

@vermouth1992 @wuxibin89 let me know if you have better ideas, but for now I think all the user needs to do is explicitly pass in name and flavor. It still aligns with the HF naming and ckpt.


acisseJZhong added a commit to pytorch/torchtitan that referenced this pull request Feb 13, 2026
**Goal:** This PR makes the changes so that we can integrate Torchtitan
as a trainer to Verl: verl-project/verl#5051

**Major changes:** 
1. ~~Change LR schedule to be 0 indexed instead of 1 indexed; to align
with Verl's [fsdp util
](https://github.com/verl-project/verl/blob/d987199906f09ba53139df13e4528b2d575ec4ce/verl/utils/torch_functional.py#L745)
See more analysis in
https://docs.google.com/document/d/1YiFUvIa_JqTYpBd2Xj7ReH3Bw6wS07nKldycBX--uVE/edit?usp=sharing~~
==> We decide not to change Titan's LR Scheduler behavior.
<img width="993" height="571" alt="image"
src="https://github.com/user-attachments/assets/e4012dbd-5624-45ff-b82b-a6225b91e1c0"
/>


2. ~~add `position_block_causal` attn mask type, which creates block
causal mask based on `position_id` for both varlen and flex attention:
[transformers
reference](https://github.com/huggingface/transformers/blob/0c89522f2af2f85cf997193645a1e727d6b8c1d7/src/transformers/masking_utils.py#L708)~~
==> this is added in Verl's Torchtitan Engine code instead


**Todos:**
1. Enable PP, right now [`pp_schedule.eval()`
](https://github.com/pytorch/pytorch/blob/03406903616077227734f772d682fc6027513ecf/torch/distributed/pipelining/schedules.py#L402)does
the microbatch split for us, as it takes in the whole batch. However, in
verl we split batch into microbatches before pp, and we'd love to pass
in a list of pre-split microbatches to pp schedule. (thanks for
@H-Huang's help)
@acisseJZhong
Collaborator Author

The two failing CI tests seem unrelated to this PR; will add Titan engine CI after the RL trainer is enabled.


# Torchtitan backend configuration
# Only used when engine backend is set to "torchtitan"
torchtitan:
Collaborator

This is still not desirable. All the models, including names and flavors, must start from a single huggingface folder. We can introduce a general model_implementation dict so that users can put attn_type and attn_mask_type inside this sub-config.

Collaborator Author

@acisseJZhong acisseJZhong Feb 14, 2026

Added a helper function to derive the model name and flavor from the HF config, and got rid of attn_mask_type since it's not used. For attn_type, I moved it to TorchtitanEngineConfig since it's a more torchtitan-specific field (I don't want other training engines to have this field). Please let me know if you have different opinions @vermouth1992

@wuxibin89 wuxibin89 merged commit f5c34bb into verl-project:main Feb 20, 2026
87 of 129 checks passed
Superjomn pushed a commit to Superjomn/verl that referenced this pull request Mar 2, 2026
SchumiDing pushed a commit to SchumiDing/verl that referenced this pull request Mar 2, 2026
[trainer] feat: Add Torchtitan as alternative training engine (verl-project#5051)

### What does this PR do?
Integrate Torchtitan as a new training engine in verl. This PR
implements the basic APIs needed by the Torchtitan engine and tests the
SFT trainer in verl (Qwen3 0.6B):
- the Torchtitan engine matches the FSDP engine exactly for the SFT trainer
- `use_remove_padding=True` matches `use_remove_padding=False`
- TP/SP and FSDP work with both varlen and flex attention; numerics
match the single-process run.

**Relevant PRs:**
- Torchtitan side changes in
pytorch/torchtitan#2333.
- RFC for engine interfaces
verl-project#1371
- Training engine interface design
verl-project#1977
- Add Veomini Engine verl-project#4072

**Todos:**
See the roadmap here: verl-project#5306
- [ ] enable more parallelisms: PP, EP, CP
- [ ] make the Torchtitan engine work with the RL trainer
- [ ] test multimodal input (ref:
https://github.com/verl-project/verl/pull/4492/changes)

dubin555 added a commit to dubin555/verl that referenced this pull request Mar 6, 2026
After verl-project#5051 added a `dp_group is not None` guard in
rearrange_micro_batches, the FSDP actor/critic calls to
prepare_dynamic_batch (which do not pass dp_group) silently
skipped the num_micro_batches all_reduce across data-parallel
ranks.

Under dynamic batching with uneven sequence lengths across DP
ranks, this causes different ranks to compute different numbers
of micro-batches. Since FSDP performs reduce-scatter on every
backward() call, mismatched micro-batch counts lead to a
deadlock where one rank waits for the other to participate in
a collective that never comes.

This is the same root cause as verl-project#5451 which fixed the megatron
backend. This PR applies the equivalent fix to the FSDP backend.

Fix: Pass the data-parallel process group to prepare_dynamic_batch
in both dp_actor.py and dp_critic.py to restore proper DP
synchronization of micro-batch counts.
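The synchronization this fix restores can be sketched single-process. The helper names below are illustrative, and the plain `max` stands in for the `num_micro_batches` all_reduce (typically a MAX reduction) that is performed over the data-parallel process group:

```python
import math


def local_num_micro_batches(seq_lens: list[int], max_tokens: int) -> int:
    """Micro-batch count a single DP rank would derive from its own sequences."""
    return max(1, math.ceil(sum(seq_lens) / max_tokens))


def synced_num_micro_batches(per_rank_seq_lens: list[list[int]], max_tokens: int) -> int:
    """Mimic all_reduce(MAX) across DP ranks: every rank must execute the same
    number of backward() calls, or FSDP's per-backward reduce-scatter deadlocks."""
    return max(local_num_micro_batches(s, max_tokens) for s in per_rank_seq_lens)


# Rank 0 holds long sequences, rank 1 short ones: without the all_reduce the
# ranks disagree on the micro-batch count and hang waiting on each other's
# collectives.
ranks = [[4096, 4096, 4096], [512, 512, 512]]
print([local_num_micro_batches(s, 8192) for s in ranks])  # [2, 1] unsynced
print(synced_num_micro_batches(ranks, 8192))              # 2 after sync
```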
guillemgt pushed a commit to guillemgt/verl that referenced this pull request Mar 9, 2026
[trainer] feat: Add Torchtitan as alternative training engine (verl-project#5051)
guillemgt added a commit to guillemgt/verl that referenced this pull request Mar 9, 2026
[trainer] feat: Add Torchtitan as alternative training engine (verl-project#5051)