Version Requirement: ms-swift >= 3.11
If you are new to GRPO, please refer to the GRPO documentation first.
Megatron GRPO currently supports the following features:
- Training Modes: Full parameter training and LoRA fine-tuning
- Parallelism Strategies: Context Parallelism (CP), Pipeline Parallelism (PP), Tensor Parallelism (TP), and Expert Parallelism (EP)
- Inference Acceleration: vLLM colocate mode and server mode
- Model Support: Compatible with LLMs and MLLMs (multimodal large models) in Megatron Swift
- Algorithm Support: Covers most features of Swift GRPO
The following parameters or features will be gradually supported in future versions:
- Entropy-related Configuration: e.g., `top_entropy_quantile`, `log_entropy`
- Reward Model / Reward Model Plugin
- Multi-turn Rollout Scheduling (`multi_turn_scheduler`): multi-turn conversation policy optimization
- Virtual Pipeline Parallelism (VPP)
- Reference Model Synchronization (`sync_ref_model`)
- Async Generate (`async_generate`)
- `num_iterations`
- SwanLab Logging Integration
Notes on individual parameters:
- `use_vllm`: Megatron GRPO supports rollout inference using vLLM only.
- `move_model_batches`: this parameter is specific to DeepSpeed ZeRO-3 optimization and has no effect in the Megatron architecture.
As in ms-swift GRPO, all batch-size-related parameters in Megatron GRPO are completion-level: they count the number of completions generated by the model, not the number of prompts.
The following table compares the batch-related parameters between ms-swift and Megatron-SWIFT:
| ms-swift Parameter | Megatron-SWIFT Parameter | Description |
|---|---|---|
| `per_device_train_batch_size` | `micro_batch_size` | Training batch size per GPU (completion-level) |
| `gradient_accumulation_steps` | - | Gradient accumulation steps; in Megatron-SWIFT this is already folded into the `global_batch_size` calculation |
| - | `global_batch_size` | Global batch size (completion-level). Megatron-SWIFT: `micro_batch_size × dp_size × gradient_accumulation_steps`; ms-swift: `per_device_train_batch_size × world_size × gradient_accumulation_steps` |
| `num_generations` | `num_generations` | Number of completions generated per prompt |
| `steps_per_generation` | `steps_per_generation` | Ratio of rollout batch size to training batch size. Note: in ms-swift, this must be an integer multiple of `gradient_accumulation_steps` |
| `generation_batch_size` | `generation_batch_size` | Batch size during the rollout phase (completion-level); must be an integer multiple of `global_batch_size` |
The following formulas are used to calculate batch sizes in Megatron GRPO:
- Data Parallel Size: `dp_size = world_size / (TP × PP × CP)`
- Global Batch Size: `global_batch_size = micro_batch_size × dp_size × gradient_accumulation_steps`
- Generation Batch Size: `generation_batch_size = global_batch_size × steps_per_generation`
- Rollout Prompt Count: `num_rollout_prompts = generation_batch_size / num_generations`
- Training Prompt Count: `num_train_prompts = global_batch_size / num_generations`
- Training Prompt Count per DP Group: `num_prompts_per_dp_group = global_batch_size / num_generations / dp_size`
Note: In Megatron GRPO, `num_prompts_per_dp_group` must be an integer multiple of `micro_batch_size` to ensure proper batch allocation during training.
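The formulas and the divisibility constraint above can be sketched as a small helper. This is an illustrative function, not part of ms-swift; the parameter names simply mirror the documented ones.

```python
def grpo_batch_sizes(world_size, tp, pp, cp,
                     micro_batch_size, gradient_accumulation_steps,
                     num_generations, steps_per_generation):
    """Illustrative sketch of the Megatron GRPO batch-size formulas."""
    # dp_size = world_size / (TP × PP × CP)
    dp_size = world_size // (tp * pp * cp)
    # All sizes below are completion-level, per the table above.
    global_batch_size = micro_batch_size * dp_size * gradient_accumulation_steps
    generation_batch_size = global_batch_size * steps_per_generation
    num_rollout_prompts = generation_batch_size // num_generations
    num_train_prompts = global_batch_size // num_generations
    num_prompts_per_dp_group = num_train_prompts // dp_size
    # Constraint: prompts per DP group must be a multiple of micro_batch_size.
    assert num_prompts_per_dp_group % micro_batch_size == 0, (
        "num_prompts_per_dp_group must be an integer multiple of micro_batch_size")
    return {
        "dp_size": dp_size,
        "global_batch_size": global_batch_size,
        "generation_batch_size": generation_batch_size,
        "num_rollout_prompts": num_rollout_prompts,
        "num_train_prompts": num_train_prompts,
        "num_prompts_per_dp_group": num_prompts_per_dp_group,
    }

# Example: 8 GPUs with TP=2, PP=1, CP=1 gives dp_size=4;
# micro_batch_size=2 and 4 accumulation steps give global_batch_size=32.
sizes = grpo_batch_sizes(world_size=8, tp=2, pp=1, cp=1,
                         micro_batch_size=2, gradient_accumulation_steps=4,
                         num_generations=4, steps_per_generation=2)
```

Changing `num_generations` to 8 in this example would make `num_prompts_per_dp_group` equal 1, which is not a multiple of `micro_batch_size=2`, so the assertion would flag an invalid configuration.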
For more parameters, please refer to the Command-line Parameters documentation.
For training scripts, please refer to Megatron GRPO Scripts.