Version Requirement: ms-swift >= 3.11
If you are new to GRPO, please refer to the GRPO documentation first.
Megatron GRPO currently supports the following features:
- Training Modes: Full parameter training and LoRA fine-tuning
- Parallelism Strategies: Context Parallelism (CP), Pipeline Parallelism (PP), Tensor Parallelism (TP), and Expert Parallelism (EP)
- Inference Acceleration: vLLM colocate mode and server mode
- Model Support: Compatible with LLMs and MLLMs (multimodal large models) in Megatron Swift
- Algorithm Support: Covers most features of Swift GRPO
The following parameters or features will be gradually supported in future versions:
- Entropy-related Configuration: e.g., `top_entropy_quantile`, `log_entropy`
- Reward Model / Reward Model Plugin
- Multi-turn Rollout Scheduling (`multi_turn_scheduler`): multi-turn conversation policy optimization
- Virtual Pipeline Parallelism (VPP)
- Reference Model Synchronization (`sync_ref_model`)
- Async Generate (`async_generate`)
- `num_iterations`
- SwanLab Logging Integration
Notes on individual parameters:
- `use_vllm`: Megatron GRPO supports rollout inference using vLLM only.
- `move_model_batches`: this parameter is specific to DeepSpeed ZeRO-3 optimization and has no effect in the Megatron architecture.
As in ms-swift GRPO, all batch-size-related parameters in Megatron GRPO are completion-level: they count the number of completions generated by the model, not the number of prompts.
The following table compares the batch-related parameters between ms-swift and Megatron-SWIFT:
| ms-swift Parameter | Megatron-SWIFT Parameter | Description |
|---|---|---|
| `per_device_train_batch_size` | `micro_batch_size` | Training batch size per GPU (completion-level) |
| `gradient_accumulation_steps` | - | Gradient accumulation steps; in Megatron-SWIFT this is already folded into the `global_batch_size` calculation |
| - | `global_batch_size` | Global batch size (completion-level). Megatron-SWIFT: `micro_batch_size × dp_size × gradient_accumulation_steps`; ms-swift: `per_device_train_batch_size × world_size × gradient_accumulation_steps` |
| `num_generations` | `num_generations` | Number of completions generated per prompt |
| `steps_per_generation` | `steps_per_generation` | Ratio of rollout batch size to training batch size. Note: in ms-swift, this must be an integer multiple of `gradient_accumulation_steps` |
| `generation_batch_size` | `generation_batch_size` | Batch size during the rollout phase (completion-level); must be an integer multiple of `global_batch_size` |
The following formulas are used to calculate batch sizes in Megatron GRPO:
- Data Parallel Size: `dp_size = world_size / (TP × PP × CP)`
- Global Batch Size: `global_batch_size = micro_batch_size × dp_size × gradient_accumulation_steps`
- Generation Batch Size: `generation_batch_size = global_batch_size × steps_per_generation`
- Rollout Prompt Count: `num_rollout_prompts = generation_batch_size / num_generations`
- Training Prompt Count: `num_train_prompts = global_batch_size / num_generations`
- Training Prompt Count per DP Group: `num_prompts_per_dp_group = global_batch_size / num_generations / dp_size`
Note: In Megatron GRPO, `num_prompts_per_dp_group` must be an integer multiple of `micro_batch_size` to ensure proper batch allocation during training.
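The formulas and the divisibility constraint above can be sketched as a small helper. This is an illustrative function, not part of ms-swift; the parameter names simply mirror the documented ones.

```python
def grpo_batch_sizes(world_size, tp, pp, cp,
                     micro_batch_size, gradient_accumulation_steps,
                     num_generations, steps_per_generation):
    """Illustrative sketch of the Megatron GRPO batch-size formulas."""
    # dp_size = world_size / (TP × PP × CP)
    dp_size = world_size // (tp * pp * cp)
    # All sizes below are completion-level, per the table above.
    global_batch_size = micro_batch_size * dp_size * gradient_accumulation_steps
    generation_batch_size = global_batch_size * steps_per_generation
    num_rollout_prompts = generation_batch_size // num_generations
    num_train_prompts = global_batch_size // num_generations
    num_prompts_per_dp_group = num_train_prompts // dp_size
    # Constraint: prompts per DP group must be a multiple of micro_batch_size.
    assert num_prompts_per_dp_group % micro_batch_size == 0, (
        "num_prompts_per_dp_group must be an integer multiple of micro_batch_size")
    return {
        "dp_size": dp_size,
        "global_batch_size": global_batch_size,
        "generation_batch_size": generation_batch_size,
        "num_rollout_prompts": num_rollout_prompts,
        "num_train_prompts": num_train_prompts,
        "num_prompts_per_dp_group": num_prompts_per_dp_group,
    }

# Example: 8 GPUs with TP=2, PP=1, CP=1 gives dp_size=4;
# micro_batch_size=2 and 4 accumulation steps give global_batch_size=32.
sizes = grpo_batch_sizes(world_size=8, tp=2, pp=1, cp=1,
                         micro_batch_size=2, gradient_accumulation_steps=4,
                         num_generations=4, steps_per_generation=2)
```

Changing `num_generations` to 8 in this example would make `num_prompts_per_dp_group` equal 1, which is not a multiple of `micro_batch_size=2`, so the assertion would flag an invalid configuration.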
For more parameters, please refer to the Command-line Parameters documentation.
For training scripts, please refer to Megatron GRPO Scripts.