
Megatron GRPO

Version Requirement: ms-swift >= 3.11

If you are new to GRPO, please refer to the GRPO documentation first.

Megatron GRPO currently supports the following features:

  • Training Modes: Full parameter training and LoRA fine-tuning
  • Parallelism Strategies: Context Parallelism (CP), Pipeline Parallelism (PP), Tensor Parallelism (TP), and Expert Parallelism (EP)
  • Inference Acceleration: vLLM colocate mode and server mode
  • Model Support: Compatible with LLMs and MLLMs (multimodal large models) in Megatron Swift
  • Algorithm Support: Covers most features of Swift GRPO

The following parameters or features will be gradually supported in future versions:

  • Entropy-related Configuration: e.g., top_entropy_quantile, log_entropy
  • Reward Model / Reward Model Plugin
  • Multi-turn Rollout Scheduling (multi_turn_scheduler): Multi-turn conversation policy optimization
  • Virtual Pipeline Parallelism (VPP)
  • Reference Model Synchronization (sync_ref_model)
  • Async Generate (async_generate)
  • num_iterations
  • SwanLab Logging Integration

⚠️ Note: The following parameters are not effective in Megatron GRPO:

  • use_vllm: Megatron GRPO always uses vLLM for rollout inference, so this flag has no effect.
  • move_model_batches: This parameter is specific to DeepSpeed ZeRO-3 optimization and is invalid in the Megatron architecture.

Similar to ms-swift GRPO, all batch-size-related parameters in Megatron GRPO are defined at the completion level: they count the completions generated by the model, not the prompts.
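A quick arithmetic sketch of what completion-level counting means in practice (illustrative only; the variable names mirror the parameter names in this document, not an actual ms-swift API):

```python
# Illustrative arithmetic only; names mirror the docs' parameters.
num_generations = 8               # completions sampled per prompt
per_device_train_batch_size = 16  # completion-level: counts completions, not prompts

# A training batch of 16 completions per GPU therefore covers only
# 16 / 8 = 2 distinct prompts on that GPU.
prompts_per_device = per_device_train_batch_size // num_generations
print(prompts_per_device)  # 2
```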

Parameter Comparison

The following table compares the batch-related parameters between ms-swift and Megatron-SWIFT:

| ms-swift Parameter | Megatron-SWIFT Parameter | Description |
| --- | --- | --- |
| per_device_train_batch_size | micro_batch_size | Training batch size per GPU (completion-level) |
| gradient_accumulation_steps | - | Gradient accumulation steps; already folded into the global_batch_size calculation in Megatron-SWIFT |
| - | global_batch_size | Global batch size (completion-level). Megatron-SWIFT: micro_batch_size × dp_size × gradient_accumulation_steps; ms-swift: per_device_train_batch_size × world_size × gradient_accumulation_steps |
| num_generations | num_generations | Number of completions generated per prompt |
| steps_per_generation | steps_per_generation | Ratio of rollout batch size to training batch size. Note: in ms-swift, this must be an integer multiple of gradient_accumulation_steps |
| generation_batch_size | generation_batch_size | Batch size during the rollout phase (completion-level); must be an integer multiple of global_batch_size |

The following formulas are used to calculate batch sizes in Megatron GRPO:

  • Data Parallel Size: dp_size = world_size / (TP × PP × CP)
  • Global Batch Size: global_batch_size = micro_batch_size × dp_size × gradient_accumulation_steps
  • Generation Batch Size: generation_batch_size = global_batch_size × steps_per_generation
  • Rollout Prompt Count: num_rollout_prompts = generation_batch_size / num_generations
  • Training Prompt Count: num_train_prompts = global_batch_size / num_generations
  • Training Prompt Count per DP Group: num_prompts_per_dp_group = global_batch_size / num_generations / dp_size

Note: In Megatron GRPO, num_prompts_per_dp_group must be an integer multiple of micro_batch_size to ensure proper batch allocation during training.
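As a sanity check, the formulas and the divisibility constraint above can be wired into a small helper. This is a hypothetical sketch for working out a configuration on paper, not part of ms-swift or Megatron-SWIFT:

```python
# Hypothetical helper that simply evaluates the formulas above;
# it is not an actual ms-swift / Megatron-SWIFT API.
def grpo_batch_plan(world_size, tp, pp, cp, micro_batch_size,
                    gradient_accumulation_steps, steps_per_generation,
                    num_generations):
    dp_size = world_size // (tp * pp * cp)
    global_batch_size = micro_batch_size * dp_size * gradient_accumulation_steps
    generation_batch_size = global_batch_size * steps_per_generation
    num_rollout_prompts = generation_batch_size // num_generations
    num_train_prompts = global_batch_size // num_generations
    num_prompts_per_dp_group = num_train_prompts // dp_size
    # Constraint from the note above: the per-DP-group prompt count
    # must be an integer multiple of micro_batch_size.
    if num_prompts_per_dp_group % micro_batch_size != 0:
        raise ValueError(
            "num_prompts_per_dp_group must be a multiple of micro_batch_size")
    return {
        "dp_size": dp_size,
        "global_batch_size": global_batch_size,
        "generation_batch_size": generation_batch_size,
        "num_rollout_prompts": num_rollout_prompts,
        "num_train_prompts": num_train_prompts,
        "num_prompts_per_dp_group": num_prompts_per_dp_group,
    }

# Example: 8 GPUs with TP=2 gives dp_size = 4;
# global_batch_size = 2 * 4 * 8 = 64 completions,
# generation_batch_size = 128 -> 16 rollout prompts, 8 training prompts.
plan = grpo_batch_plan(world_size=8, tp=2, pp=1, cp=1, micro_batch_size=2,
                       gradient_accumulation_steps=8, steps_per_generation=2,
                       num_generations=8)
print(plan["global_batch_size"], plan["num_rollout_prompts"])  # 64 16
```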

For more parameters, please refer to the Command-line Parameters documentation.

For training scripts, please refer to Megatron GRPO Scripts.