[trainer] feat: Add Torchtitan as alternative training engine #5051
Changes from 28 commits
**New file** (`@@ -0,0 +1,65 @@`) — Torchtitan engine configuration:

```yaml
# Target class for this configuration
_target_: verl.workers.config.TorchtitanEngineConfig

# Policy for wrapping the model
wrap_policy:
  # Minimum number of parameters to trigger wrapping a layer with FSDP
  min_num_params: 0

# The policy for applying `reshard_after_forward` within an FSDP setup
# Options: "default", "always", "never"
reshard_after_forward: default

# Prefetch the next forward-pass all-gather before the current forward computation
forward_prefetch: false

# Whether to use original parameters
use_orig_params: false

# Mixed precision configuration for FSDP
mixed_precision: false

# Whether to use torch compile
use_torch_compile: true

# Whether to use entropy_from_logits_with_chunking
entropy_from_logits_with_chunking: false

# Whether to use entropy checkpointing
entropy_checkpointing: false

# Data parallel size (FSDP group size)
data_parallel_size: 1

# Data parallel replicate size
data_parallel_replicate_size: 1

# Data parallel shard size
data_parallel_shard_size: 1

# Tensor parallel size
tensor_parallel_size: 1

# Expert parallel size
expert_parallel_size: 1

# Pipeline parallel size
pipeline_parallel_size: 1

# Context parallel size
context_parallel_size: 1

# Strategy
strategy: torchtitan

# Random seed for reproducibility
seed: 42

# Whether to enable full determinism for distributed training; only for debugging
full_determinism: false

# Whether to run forward only
forward_only: false

# Mixed precision training param dtype
dtype: bfloat16
```

> **Collaborator** (on `data_parallel_replicate_size`): Is there any documentation explaining these parallelism settings?
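As a rough illustration of how the parallelism sizes in the config above compose (a hypothetical sketch, not code from this PR): the replicate, shard, tensor, pipeline, and context dimensions multiply to give the world size the mesh requires.

```python
# Hypothetical sketch: how the parallelism sizes in the engine config
# might compose into a required world size. Not code from this PR.

def required_world_size(cfg: dict) -> int:
    """Product of the mesh dimensions from the config.

    Assumes data_parallel_replicate_size and data_parallel_shard_size
    together form the data-parallel group (HSDP-style), and that the
    remaining dimensions multiply independently.
    """
    dims = [
        cfg.get("data_parallel_replicate_size", 1),
        cfg.get("data_parallel_shard_size", 1),
        cfg.get("tensor_parallel_size", 1),
        cfg.get("pipeline_parallel_size", 1),
        cfg.get("context_parallel_size", 1),
    ]
    world = 1
    for d in dims:
        world *= d
    return world

# Example: 2-way FSDP sharding x 4-way tensor parallel -> 8 ranks
cfg = {"data_parallel_shard_size": 2, "tensor_parallel_size": 4}
print(required_world_size(cfg))  # 8
```

With all sizes left at their default of 1 (as in the file above), a single rank suffices.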
**Modified file** (hunk `@@ -95,3 +95,19 @@`, context `mtp:`) — adds a Torchtitan backend section:

```yaml
  method: mtp
  num_speculative_tokens: 1

# Torchtitan backend configuration
# Only used when engine backend is set to "torchtitan"
torchtitan:

  # model name for torchtitan (e.g., "qwen3", "llama3")
  name: null

  # model flavor/size (e.g., "0.6B", "8B")
  flavor: null

  # attention type (e.g., "sdpa", "flex", "varlen")
  attn_type: sdpa

  # attention mask type (e.g., "causal", "block_causal")
  attn_mask_type: causal
```
|---|---|---|
| @@ -0,0 +1,35 @@ | ||
| # Target class for this configuration | ||
| _target_: verl.workers.config.TorchtitanOptimizerConfig | ||
wuxibin89 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| # Optimizer name | ||
| name: AdamW | ||
|
|
||
| # Learning rate | ||
| lr: 1e-3 | ||
|
|
||
| # LR warmup steps ratio | ||
| lr_warmup_steps_ratio: 0.0 | ||
|
|
||
| # Total training steps | ||
| total_training_steps: -1 | ||
|
|
||
| # Weight decay | ||
| weight_decay: 0.01 | ||
|
|
||
| # LR warmup steps | ||
| lr_warmup_steps: -1 | ||
|
|
||
| # Betas for Adam optimizer | ||
| betas: [0.9, 0.999] | ||
|
|
||
| # Clip gradient | ||
| clip_grad: 1.0 | ||
|
|
||
| # Epsilon for Adam optimizer | ||
| eps: 1e-8 | ||
|
|
||
| # Decay type: "linear", "sqrt", or "cosine" | ||
| decay_type: linear | ||
|
|
||
| # Minimum LR factor for cosine schedule | ||
| min_lr_factor: 0.0 | ||
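The warmup and decay fields above suggest a schedule of linear warmup followed by a configurable decay. A minimal sketch of such a schedule (hypothetical, not the verl/torchtitan implementation; it assumes warmup steps are resolved from `lr_warmup_steps_ratio` when `lr_warmup_steps` is -1):

```python
import math

# Hypothetical sketch of the schedule the optimizer config implies:
# linear warmup, then "linear", "sqrt", or "cosine" decay toward
# lr * min_lr_factor. Not the actual verl/torchtitan implementation.

def lr_at_step(step, lr=1e-3, total_steps=1000,
               warmup_steps=-1, warmup_ratio=0.0,
               decay_type="linear", min_lr_factor=0.0):
    if warmup_steps < 0:
        # resolve warmup length from the ratio, as the -1 default suggests
        warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        return lr * (step + 1) / warmup_steps  # linear warmup
    # progress through the decay phase, clamped to [0, 1]
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    t = min(t, 1.0)
    if decay_type == "linear":
        factor = 1.0 - t
    elif decay_type == "sqrt":
        factor = 1.0 - math.sqrt(t)
    elif decay_type == "cosine":
        factor = 0.5 * (1.0 + math.cos(math.pi * t))
    else:
        raise ValueError(f"unknown decay_type: {decay_type!r}")
    # never decay below min_lr_factor * lr
    return lr * (min_lr_factor + (1.0 - min_lr_factor) * factor)

print(lr_at_step(0, warmup_ratio=0.1))   # first warmup step, lr/100
print(lr_at_step(1000, decay_type="cosine", min_lr_factor=0.1))  # floor, lr/10
```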
> **Reviewer:** Please verify the different parallelism modes in `tests/special_e2e/sft/test_sft_engine_all.sh`.

> **Author:** Sounds good. I will incorporate TP/SP in this PR; other parallelism modes will come in separate PRs.