Release v0.6.0

@terrykong terrykong released this 30 Apr 19:23
· 3 commits to r0.6.0 since this release
5fb5889

📝 Blog

NeMo RL: Run High throughput Reinforcement Learning with End to End FP8 Precision

✨ Highlights

Container

Both linux/amd64 and linux/arm64 Docker containers are available on NGC as nvcr.io/nvidia/nemo-rl:v0.6.0.

Here are the major software components included in the container:

Software Component    Version
NeMo-RL               0.6.0
NeMo-Gym              0.3.0rc0+1a4912e
NeMo-Automodel        0.3.0rc0+92635e7
Megatron-Bridge       0.5.0+95e5f38
Megatron-Core         0.18.0+d30c3ae
PyTorch               2.10.0
vLLM                  0.17.1

The NeMo-RL container is built on top of the nvcr.io/nvidia/cuda-dl-base:25.05-cuda12.9-devel-ubuntu24.04 base image.

If you would like to build this container, or nightly containers, yourself, we provide the exact instructions we use at https://docs.nvidia.com/nemo/rl/latest/docker.html#release-image.

LoRA for GRPO and DPO

Building on the LoRA SFT support introduced in v0.5, NeMo RL v0.6 extends LoRA (Low-Rank Adaptation) to GRPO and DPO workflows. This enables parameter-efficient reinforcement learning and preference optimization with minimal modifications to existing recipes. LoRA for GRPO and DPO is supported with both the Megatron backend and the DTensor V2 (Automodel) backend.

Megatron LoRA GRPO:

policy:
  megatron_cfg:
    enabled: true
    peft:
      enabled: true
      dim: 128
      alpha: 512
      exclude_modules: ['*out_proj*']

DTensor V2 LoRA GRPO:

policy:
  dtensor_cfg:
    lora_cfg:
      enabled: true
      dim: 128
      alpha: 512
      exclude_modules: ['*out_proj*']
      match_all_linear: false
      use_triton: false
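
To give a feel for what dim: 128 and alpha: 512 mean, here is a minimal sketch of the usual LoRA arithmetic. It assumes the standard LoRA formulation (adapter matrices A of shape dim x d_in and B of shape d_out x dim, with the update scaled by alpha / dim). The 4096 x 4096 layer size and the helper names are hypothetical and are not taken from NeMo RL.

```python
def lora_param_counts(d_in: int, d_out: int, dim: int) -> tuple[int, int]:
    """Return (full_params, lora_params) for a single linear layer.

    Assumes the standard LoRA factorization: A is (dim x d_in), B is (d_out x dim).
    """
    full = d_in * d_out
    lora = dim * (d_in + d_out)
    return full, lora


def lora_scale(alpha: float, dim: int) -> float:
    """Scaling applied to the low-rank update B @ A in standard LoRA."""
    return alpha / dim


# Hypothetical 4096 x 4096 projection with the dim/alpha values from the configs above.
full, lora = lora_param_counts(d_in=4096, d_out=4096, dim=128)
print(f"full: {full}, lora: {lora}, trainable fraction: {lora / full:.4f}")
print(f"update scale: {lora_scale(alpha=512, dim=128)}")
```

Under these assumptions the adapter trains about 6% of the parameters of that one layer, and the low-rank update is multiplied by 4.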

Example recipes:

GDPO: Multi-Reward RL Training

NeMo RL v0.6 introduces GDPO (Group reward-Decoupled Normalization Policy Optimization), a reinforcement learning method designed for multi-reward training. While existing approaches commonly apply GRPO in multi-reward settings, they can lead to reward advantage collapse, reducing training signal resolution and causing unstable or failed convergence. GDPO resolves this by decoupling reward normalization across individual rewards, preserving their relative differences and enabling more faithful preference optimization.

To enable GDPO:

grpo:
  adv_estimator:
    name: "gdpo"
    normalize_rewards: true
    use_leave_one_out_baseline: false

Note that this method only has an effect when training involves more than one reward function. GDPO also supports async RL mode. See the GRPO guide for details.
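
The difference between the two normalization orders can be illustrated with a small sketch. It assumes GRPO sums the rewards and then normalizes within the prompt group, while GDPO normalizes each reward within the group and then sums. The function names are hypothetical, and any further normalization the real implementation applies is omitted.

```python
import math
from typing import List


def group_normalize(values: List[float]) -> List[float]:
    """Zero-mean, unit-std normalization within one prompt group."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0.0:
        return [0.0 for _ in values]  # identical rewards carry no signal
    return [(v - mean) / std for v in values]


def grpo_advantages(rewards: List[List[float]]) -> List[float]:
    """rewards[i][k] is reward k for rollout i. Sum first, then normalize."""
    return group_normalize([sum(r) for r in rewards])


def gdpo_advantages(rewards: List[List[float]]) -> List[float]:
    """Normalize each reward separately within the group, then sum."""
    num_rewards = len(rewards[0])
    per_reward = [
        group_normalize([r[k] for r in rewards]) for k in range(num_rewards)
    ]
    return [sum(col[i] for col in per_reward) for i in range(len(rewards))]


# Two groups of two rollouts, each scored by two binary rewards.
group_a = [[0.0, 0.0], [1.0, 0.0]]  # better rollout satisfies one reward
group_b = [[0.0, 0.0], [1.0, 1.0]]  # better rollout satisfies both rewards

print(grpo_advantages(group_a), grpo_advantages(group_b))  # same advantages
print(gdpo_advantages(group_a), gdpo_advantages(group_b))  # different advantages
```

In this toy case the summed-reward normalization assigns both groups the same advantages, whereas the decoupled version keeps the "satisfied both rewards" rollout distinguishable from the "satisfied one reward" rollout.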

ProRLv2

NeMo RL v0.6 adds the ProRLv2 configuration pattern (blog), which bundles GRPO with a set of stability and efficiency techniques commonly used for long-horizon RL fine-tuning:

  • DAPO dynamic sampling: skip prompt-groups with zero reward variance
  • Decoupled (asymmetric) clipping: different lower/upper clip bounds for better exploration
  • Token-level policy gradient loss
  • Importance sampling correction: ICE-POP / seq-mask-tis for backend-mismatch filtering
  • Reinforce++-Baseline: decoupled local/global advantage normalization
  • "Stop properly" penalty for truncated responses

To launch a run with this configuration:

uv run examples/run_grpo_math.py --config examples/configs/prorlv2.yaml

For the full walkthrough, see the ProRLv2 guide.
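
Two of the techniques listed above, dynamic sampling and decoupled clipping, are simple enough to sketch. The code below is only an illustration under stated assumptions: the function names are hypothetical, the clip bounds 0.2 and 0.28 are example values rather than the values in prorlv2.yaml, and the objective is the generic PPO-style clipped surrogate.

```python
from typing import Dict, List


def filter_zero_variance_groups(
    groups: Dict[str, List[float]],
) -> Dict[str, List[float]]:
    """Dynamic sampling: drop prompt groups whose rollouts all got the same reward.

    Such groups have zero reward variance, so their group-normalized
    advantages are all zero and they contribute no gradient signal.
    """
    return {prompt: r for prompt, r in groups.items() if max(r) != min(r)}


def clipped_pg_objective(
    ratio: float,
    advantage: float,
    clip_low: float = 0.2,    # example value (assumption)
    clip_high: float = 0.28,  # example value (assumption)
) -> float:
    """PPO-style clipped surrogate with separate lower and upper clip bounds."""
    clipped_ratio = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return min(ratio * advantage, clipped_ratio * advantage)


groups = {"prompt_1": [1.0, 1.0, 1.0], "prompt_2": [0.0, 1.0, 0.0]}
kept = filter_zero_variance_groups(groups)
print(list(kept))  # only the group with mixed rewards remains

# A larger upper bound lets positively rewarded tokens move further than the
# lower bound lets negatively rewarded tokens move.
print(clipped_pg_objective(ratio=1.5, advantage=1.0))
print(clipped_pg_objective(ratio=0.5, advantage=-1.0))
```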

Speculative Decoding

NeMo RL now supports speculative decoding for rollout acceleration, including methods such as external draft models, Eagle3, and MTP. A smaller draft model runs in vLLM and proposes tokens that the policy model verifies, speeding up generation. Two modes are available:

  • Offline: a fixed draft model is used only for faster generation; the RL loop does not update it.
  • Online: the draft model is trained alongside the policy, which NeMo RL currently supports only for Eagle3. NeMo RL attaches an Eagle3 draft model to the Megatron policy worker, trains it together with the policy, and refits both policy and draft weights into vLLM, which keeps the drafter aligned with RL updates.
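
The propose-and-verify idea can be sketched with toy models. This is a simplified greedy variant, not vLLM's implementation: both "models" are hypothetical functions that map a token list to the next token, and a real system verifies all proposed tokens in a single batched forward pass rather than one call per token.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def speculative_greedy_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    num_speculative_tokens: int = 3,
) -> Tuple[List[int], int]:
    """Return (tokens, number_of_verification_rounds)."""
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes a short continuation.
        proposal: List[int] = []
        for _ in range(num_speculative_tokens):
            proposal.append(draft(tokens + proposal))

        # 2. The target model checks the proposal position by position.
        rounds += 1
        accepted: List[int] = []
        for token in proposal:
            expected = target(tokens + accepted)
            if token == expected:
                accepted.append(token)
            else:
                accepted.append(expected)  # replace the first mismatch and stop
                break
        else:
            # Every proposal matched, so the verification also yields one extra token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], rounds


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except after token 4, to force one rejection.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


output, rounds = speculative_greedy_decode(toy_target, toy_draft, [0], max_new_tokens=8)
print(output, rounds)
```

The output matches what the target model would generate on its own, but it takes fewer verification rounds than generated tokens. How much is saved depends on how often the drafter agrees with the policy, which is why online training keeps the drafter aligned as the policy changes.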

Generation-only example:

policy:
  generation:
    backend: "vllm"
    vllm_kwargs:
      speculative_config:
        method: "eagle3"
        model: /path/to/eagle3-draft
        num_speculative_tokens: 3

Online draft training example:

policy:
  megatron_cfg:
    enabled: true
  draft:
    enabled: true
    model_name: ${policy.generation.vllm_kwargs.speculative_config.model}
    loss_weight: 1.0
  generation:
    backend: "vllm"
    vllm_kwargs:
      speculative_config:
        method: "eagle3"
        model: /path/to/eagle3-draft
        num_speculative_tokens: 3
        draft_tensor_parallel_size: 1

Example recipe: examples/configs/recipes/llm/grpo-qwen3-1.7b-1n8g-megatron-eagle3.yaml. For the full guide, see the Eagle3 Speculative Decoding documentation.

SGLang Inference Backend

NeMo RL now supports SGLang as a generation backend alongside vLLM and Megatron inference. SGLang can be used for GRPO rollouts with a simple config change:

policy:
  generation:
    backend: "sglang"
    sglang_cfg:
      model_path: ${policy.model_name}
      gpus_per_server: 1
      dtype: ${policy.precision}
      context_length: 512
      mem_fraction_static: 0.7

SGLang is currently supported with the DTensor V2 (Automodel) policy backend only. We are actively working with the SGLang team on improving this integration and adding support for the Megatron backend.

Example recipes: grpo-qwen3-0.6b-1n8g-sglang.yaml, grpo-qwen2.5-math-1.5b-instruct-1n8g-fsdp2tp1-sglang.yaml.

Muon Optimizer

NeMo RL now supports the Muon (MomentUm Orthogonalized by Newton-Schulz) optimizer for SFT and RL training. Muon applies Newton-Schulz orthogonalization to momentum-based updates, which achieves higher sample efficiency than AdamW. Muon is supported with the Megatron backend.

policy:
  megatron_cfg:
    enabled: true
    optimizer:
      optimizer: "dist_muon"
      muon_momentum: 0.95
      muon_scale_mode: "spectral"
      muon_num_ns_steps: 5
      use_distributed_optimizer: false
      use_precision_aware_optimizer: false

For the full guide, see the Muon Optimizer documentation.
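
For intuition about the orthogonalization step, here is a minimal pure-Python sketch. It uses the classical cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X Xᵀ X, which is an assumption made for simplicity. Practical Muon implementations typically use a tuned higher-order polynomial, and this is not NeMo RL's or Megatron's code. The 2 x 2 momentum matrix is a made-up example.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]


def newton_schulz_orthogonalize(m: Matrix, num_steps: int = 5) -> Matrix:
    """Push the singular values of m toward 1 while keeping its singular vectors."""
    norm = math.sqrt(sum(v * v for row in m for v in row))
    if norm == 0.0:
        return [list(row) for row in m]
    # Dividing by the Frobenius norm bounds all singular values by 1,
    # which keeps the cubic iteration inside its convergence region.
    x = [[v / norm for v in row] for row in m]
    for _ in range(num_steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * x[i][j] - 0.5 * xxt_x[i][j] for j in range(len(x[0]))]
            for i in range(len(x))
        ]
    return x


momentum = [[3.0, 1.0], [1.0, 2.0]]  # hypothetical momentum buffer
update = newton_schulz_orthogonalize(momentum, num_steps=5)
gram = matmul(update, transpose(update))
print(gram)  # close to the identity matrix
```

After five steps the product of the update with its transpose is close to the identity, meaning the update has roughly equal magnitude along every direction instead of being dominated by a few large singular values.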

YaRN Long-Context Training

YaRN (Yet another RoPE extensioN) extends a model's usable context window beyond the length it was pretrained on by rescaling RoPE frequencies. NeMo RL supports YaRN RoPE scaling for SFT, GRPO, DPO, RM, and distillation workflows via the Megatron backend.

policy:
  max_total_sequence_length: 65536
  megatron_cfg:
    enabled: true
  hf_config_overrides:
    rope_scaling:
      rope_type: yarn
      rope_theta: 1000000
      factor: ${div:${policy.max_total_sequence_length},${policy.hf_config_overrides.rope_scaling.original_max_position_embeddings}}
      original_max_position_embeddings: 40960
      truncate: true
      beta_fast: 32
      beta_slow: 1
      mscale: 1
      mscale_all_dim: 0

Example recipes: grpo-qwen2.5-1.5B-4n8g-megatron-yarn-256k.yaml, sft-qwen3-0.6B-1n8g-megatron-yarn-64k.yaml. For the full guide, see the YaRN documentation.
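
As a worked example of the configuration above, the factor expression resolves to 65536 / 40960 = 1.6. The sketch below shows how such a factor is commonly applied in YaRN: high-frequency RoPE dimensions are left unchanged, low-frequency dimensions are divided by the factor, and a linear ramp controlled by beta_fast and beta_slow blends between the two. The head dimension of 128 and the function name are assumptions, the attention scaling controlled by mscale is omitted, and this is not the Megatron implementation.

```python
import math
from typing import List


def yarn_inv_freq(
    head_dim: int,
    rope_theta: float,
    factor: float,
    original_max_position_embeddings: int,
    beta_fast: float = 32.0,
    beta_slow: float = 1.0,
) -> List[float]:
    """Per-dimension RoPE inverse frequencies after YaRN-style rescaling."""

    def correction_dim(num_rotations: float) -> float:
        # Dimension index whose wavelength completes `num_rotations` turns
        # over the original context length.
        return (
            head_dim
            * math.log(original_max_position_embeddings / (num_rotations * 2 * math.pi))
            / (2 * math.log(rope_theta))
        )

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), head_dim - 1)
    if low == high:
        high += 0.001  # avoid a zero-width ramp

    inv_freq = []
    for i in range(head_dim // 2):
        base = rope_theta ** (-2.0 * i / head_dim)
        ramp = min(max((i - low) / (high - low), 0.0), 1.0)
        # ramp == 0: keep the original frequency; ramp == 1: divide it by factor.
        inv_freq.append(base * (1.0 - ramp) + (base / factor) * ramp)
    return inv_freq


max_total_sequence_length = 65536
original_max = 40960
factor = max_total_sequence_length / original_max
print(factor)

freqs = yarn_inv_freq(
    head_dim=128,  # assumption, not taken from the recipe
    rope_theta=1_000_000.0,
    factor=factor,
    original_max_position_embeddings=original_max,
)
print(freqs[0], freqs[-1])
```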

Chunked Linear Cross-Entropy Fusion Loss

This release adds a memory-efficient cross-entropy loss that computes the loss directly from hidden states. It chunks the sequence dimension, projects each chunk to logits on the fly, computes per-token log probabilities, and discards the logits before moving to the next chunk. This significantly extends the maximum trainable sequence length (e.g. from <65K to >100K tokens) while producing loss values numerically equivalent to the unchunked computation.

It is now supported for both SFT and DPO workflows with the Megatron backend:

policy:
  megatron_cfg:
    use_linear_ce_fusion_loss: true
    linear_ce_fusion_chunk_size: 256

Example recipes: sft-qwen2.5-math7b-1n8g-megatron_chunked_linear_ce_loss.yaml, dpo-qwen2.5-math7b-1n8g-megatron_chunked_linear_ce_loss.yaml.
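
The chunking idea can be sketched in a few lines of pure Python. This is an illustration only, with hypothetical function names and toy data. The real implementation is fused and runs on GPU tensors, but the structure is the same: only one chunk of logits exists at a time, and the result matches the unchunked loss.

```python
import math
from typing import List

Matrix = List[List[float]]


def project(hidden_states: Matrix, weight: Matrix) -> Matrix:
    """Logits = hidden_states @ weight.T, where weight is (vocab x hidden)."""
    return [
        [sum(h * w for h, w in zip(hs, row)) for row in weight]
        for hs in hidden_states
    ]


def nll_from_logits(logits: Matrix, targets: List[int]) -> List[float]:
    """Per-token negative log-likelihood using a stable log-sum-exp."""
    losses = []
    for row, target in zip(logits, targets):
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        losses.append(log_z - row[target])
    return losses


def full_linear_ce(hidden: Matrix, weight: Matrix, targets: List[int]) -> float:
    logits = project(hidden, weight)  # (seq_len x vocab) materialized at once
    return sum(nll_from_logits(logits, targets)) / len(targets)


def chunked_linear_ce(
    hidden: Matrix, weight: Matrix, targets: List[int], chunk_size: int
) -> float:
    total = 0.0
    for start in range(0, len(hidden), chunk_size):
        end = start + chunk_size
        chunk_logits = project(hidden[start:end], weight)  # (chunk_size x vocab)
        total += sum(nll_from_logits(chunk_logits, targets[start:end]))
        # chunk_logits goes out of scope here, before the next chunk is projected
    return total / len(targets)


hidden = [[0.1 * i, -0.2 * i, 0.3] for i in range(7)]  # toy (seq_len=7, hidden=3)
weight = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5], [0.3, 0.3, 0.3], [-1.0, 0.5, 0.0]]
targets = [0, 1, 2, 3, 0, 1, 2]

print(full_linear_ce(hidden, weight, targets))
print(chunked_linear_ce(hidden, weight, targets, chunk_size=3))
```

Peak logits memory scales with chunk_size x vocab instead of seq_len x vocab, which is where the longer trainable sequence length comes from.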

Model Support

Nemotron

  • Nemotron Nano v3 is now supported on main. See this guide for reproducible instructions on how to post-train the Nemotron 3 Nano model with NeMo RL.
  • Nemotron Super v3 is supported on the super-v3 branch. See the Nemotron 3 Super guide for details.

Qwen3.5 and GLM-4.7-Flash

NeMo RL adds GRPO training support for Qwen3.5 dense and MoE models (both LLM and VLM), and GLM-4.7-Flash. Example recipes:

For the full model support matrix, please refer to our model support documentation.

⚡ Performance Optimizations

  • Fused sequence packing for loss: A new fuse_loss option under the sequence_packing config eliminates the overhead of unpacking packed sequences to compute each sequence's loss separately (#1904).
  • Reduced memory footprint for ChunkedDistributedLogProb: Optimized the chunked distributed log-probability computation to reduce peak GPU memory usage (#1895).
  • Shard concat overhead reduction: Reduced overhead in the shard concatenation operation used during distributed training data sharding (#2002).
  • MoE alltoall token dispatcher default: Changed the default MoE token dispatcher type to alltoall for improved MoE model performance (#2004).

View the v0.6.0 performance numbers from our published recipes at https://docs.nvidia.com/nemo/rl/latest/about/performance-summary.html .

SWE-RL Benchmark

NeMo RL now includes a SWE-RL release benchmark that demonstrates a long-context, multi-step RL rollout. See the performance numbers here; the accompanying recipe and scripts are in #2327. SWE support is currently available on the super-v3 branch.

To replicate SWE RL on the Nemotron Super V3 model, see this guide.

Notable Additions

  • Top-p and top-k sampling in GRPO: Users can now configure top-p and top-k sampling parameters in GRPO, enabling more controlled sampling during training (#2053).

  • Configurable attention backend for Megatron: A new attention_backend config parameter for the Megatron training backend allows users to select different attention implementations (e.g. FlashAttention, TransformerEngine DotProductAttention) (#1628).

  • LoRA checkpoint merge and HF export: New tooling to merge LoRA adapter weights back into a base Megatron checkpoint and export as a standalone Hugging Face checkpoint, enabling deployment of LoRA-trained models without the separate adapter at inference time (#2173).

  • save_optimizer flag: A new save_optimizer boolean in the checkpoint config (default: true). When set to false, optimizer state is excluded from checkpoints, reducing checkpoint size and save time (#1843).

  • Fault tolerance launcher: NeMo RL integrates with nvidia-resiliency-ext for automatic fault tolerance and recovery for distributed training runs. Install via the nvrx optional extra and use the ft_launcher to get heartbeat monitoring, automatic restarts, and recovery from checkpoints. See the Fault Tolerance Launcher Guide.

  • Major dependency upgrades: Python ≥3.13.13, PyTorch 2.10.0, Ray 2.54.0, Transformers 5.3.0, vLLM 0.17.1, SGLang 0.5.10. These enable compatibility with the latest ecosystem and unlock new features across all backends.

  • System prompt support in math data processor: Added system_prompt support to math_hf_data_processor (#2216).
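
To illustrate what merging LoRA adapter weights into a base checkpoint means (see the LoRA checkpoint merge item above), here is a minimal sketch of the underlying arithmetic. It assumes the standard LoRA formulation, W_merged = W + (alpha / dim) * B A. The function names and toy matrices are hypothetical and unrelated to the actual merge tooling.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def matvec(m: Matrix, x: List[float]) -> List[float]:
    return [sum(v * xi for v, xi in zip(row, x)) for row in m]


def merge_lora(w: Matrix, a: Matrix, b: Matrix, alpha: float, dim: int) -> Matrix:
    """Fold the low-rank update into the base weight: W + (alpha / dim) * B @ A."""
    scale = alpha / dim
    ba = matmul(b, a)
    return [
        [w[i][j] + scale * ba[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def adapter_forward(
    w: Matrix, a: Matrix, b: Matrix, alpha: float, dim: int, x: List[float]
) -> List[float]:
    """Forward pass with the adapter kept separate from the base weight."""
    scale = alpha / dim
    base = matvec(w, x)
    low_rank = matvec(b, matvec(a, x))
    return [bi + scale * li for bi, li in zip(base, low_rank)]


w = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # base weight, (d_out=2 x d_in=3)
a = [[1.0, 2.0, 3.0]]                    # LoRA A, (dim=1 x d_in=3)
b = [[1.0], [2.0]]                       # LoRA B, (d_out=2 x dim=1)
x = [1.0, 1.0, 1.0]

merged = merge_lora(w, a, b, alpha=2.0, dim=1)
print(matvec(merged, x))
print(adapter_forward(w, a, b, alpha=2.0, dim=1, x=x))
```

Both forward passes give the same output, which is why a merged checkpoint can be served without loading the adapter separately.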

Notable Fixes

  • Fixed a checkpoint loading bug in Megatron LoRA GRPO (#2075).
  • Fixed FP8 _apply_state_dict_to_model for correct checkpoint restoration (#2233).
  • Fixed use_linear_ce_fusion_loss when used with certain configurations (#2232).
  • Fixed GPT-OSS export and bumped Megatron-Bridge for compatibility (#2257).
  • Fixed Gemma3 model support (#2185).
  • Fixed make_sequence_length_divisible_by in config (#2135).
  • Fixed async GRPO offload (#2119).
  • Fixed Megatron checkpoint loading without optimizer and improved warning detection (#2159).
  • Allowed wandb config value changes on resume (#2137).
  • Addressed security vulnerabilities and CVEs (#2236, #2214, #2201).

📊 Release Runs

We have provided TensorBoard logs for our release runs to give you a head start on what to expect from our recipes.

To view these TensorBoard logs easily, we've provided a Google Colab notebook that downloads and serves them.

Full Changelog: v0.5.0...v0.6.0