@@ -43,27 +43,25 @@ The performance data includes:

 ---

-## Nemo RL v0.5
+## Nemo RL v0.6

 ### H100 BF16 Benchmarks
-* GRPO Dataset: [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2); DAPO dataset: [DAPOMath17k](https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k)
+* GRPO Dataset: [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2); DAPO dataset: [DAPOMath17k](https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k); SWE dataset: refer to [Nemotron super-v3 documentation - stage 2.2](https://github.com/NVIDIA-NeMo/RL/blob/super-v3/docs/guides/nemotron-3-super.md#stage-22---swe-2-64-nodes)
 * System: DGX-H100
 * Precision: Training BF16, Generation BF16
 * Training Backend: Megatron-core.

 | Algorithm | Model | On/Off policy| T-Max Sequence Length| G-Average Seq len| #-GPUs| G-GBS| T-GBS| Generation [TP,PP] | Training [TP,CP,EP,PP,VPP] | Tokens / sec / GPU| Total Step time(s)|
 | --------- | ------- | -------- | ----- | ----- | ------| ---- | ---- | ---- | ---- | --- | ---|
-| GRPO | LLAMA3.1_8B| On policy | 4,096 | 1,019 | 16 | 2,048| 512 | [1,1] | [1,1,1,1,1,2,n/a] | 1,581 | 92.8|
-| GRPO | LLAMA3.1_8B| 1-step Off | 4,096 | 1,123 | 16 | 2,048| 512 | [1,1] | [1,1,1,1,1,1,n/a] | 2,478 | 64.8|
-| GRPO | DeepSeek V3| On policy | 1,536 | 744 | 256 | 512 | 512 | [32,1] | [1,1,16,16,n/a] | 12.7 | 134|
-| GRPO | DeepSeek V3| 1-step Off | 1,536 | 738 | 512 | 512 | 512 | [32,1] | [1,1,16,16,n/a] | 13.1 | 64.9|
-| DAPO | DeepSeek V3| On policy | 1,536 | 974 | 512 | 512 | 512 | [64,1] | [8,4,32,8,n/a] | 2.45 | 458|
-| GRPO | Qwen3-235B | On policy | 8,192 | 5,700 | 128 | 512 | 512 | [16,1] | [2,2,16,8,n/a] | 54.1 | 431|
-| GRPO | Qwen3-235B | 1-step Off | 8,192 | 5,707 | 256 | 512 | 512 | [8,1] | [4,1,16,8,n/a] | 58.7 | 203|
-| GRPO | Qwen3-30B3A| On policy | 4,096 | 3,196 | 32 | 2,048| 512 | [2,1] | [1,1,8,1,n/a] | 1066 | 198|
-| GRPO | Qwen3-30B3A| 1-step Off | 4,096 | 3,201 | 32 | 2,048| 512 | [2,1] | [1,1,8,2,n/a] | 1391 | 154|
-| GRPO | Qwen3-32B | On policy | 4,096 | 3,251 | 32 | 2,048| 512 | [4,1] | [4,1,1,4,n/a] | 571 | 376|
-| GRPO | Qwen3-32B | 1-step Off | 4,096 | 3,252 | 64 | 2,048| 512 | [4,1] | [4,1,1,4,n/a] | 538 | 200|
+| GRPO | DeepSeek V3| On policy | 1,536 | 701 | 256 | 512 | 512 | [32,1] | [1,1,16,16,n/a] | 12.1 | 134|
+| GRPO | DeepSeek V3| On policy | 1,536 | 697 | 512 | 512 | 512 | [32,1] | [1,1,16,16,n/a] | 7.24 | 111|
+| GRPO | DeepSeek V3| 1-step Off | 1,536 | 710 | 512 | 512 | 512 | [32,1] | [1,1,16,16,n/a] | 12.8 | 64.1|
+| GRPO | Qwen3-235B | On policy | 8,192 | 5,698 | 128 | 512 | 512 | [16,1] | [2,2,16,8,n/a] | 58.9 | 395|
+| GRPO | Qwen3-235B | On policy | 8,192 | 5,713 | 256 | 512 | 512 | [16,1] | [2,2,16,8,n/a] | 37.4 | 312|
+| GRPO | Qwen3-235B | 1-step Off | 8,192 | 5,721 | 256 | 512 | 512 | [8,1] | [4,1,16,8,n/a] | 58.7 | 231|
+| GRPO | Qwen3-30B3A| On policy | 4,096 | 3,203 | 32 | 2,048| 512 | [2,1] | [1,1,8,1,n/a] | 1102 | 192|
+| GRPO | Qwen3-30B3A| 1-step Off | 4,096 | 3,201 | 32 | 2,048| 512 | [2,1] | [1,1,8,2,n/a] | 1414 | 152|
+| GRPO | Qwen3-30B3A| 8-step Off | 4,096 | 3,206 | 192 | 2,048| 512 | [2,1] | [1,1,8,1,n/a] | 1025 | 34.5|

 ### H100 FP8 Benchmarks
 * GRPO Dataset: [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2)
@@ -73,8 +71,7 @@ The performance data includes:

 | Algorithm | Model | On/Off policy| T-Max Sequence Length| G-Average Seq len| #-GPUs| G-GBS| T-GBS| Generation [TP,PP] | Training [TP,CP,EP,PP,VPP] | Tokens / sec / GPU| Total Step time(s)|
 | --------- | ------- | -------- | ----- | ----- | ------| ---- | ---- | ---- | ---- | --- | ---|
-| GRPO | LLAMA3.1_8B| 1-step Off | 4,096 | 1,128 | 16 | 2,048| 512 | [1,1] | [1,1,1,1,1,1,n/a] | 3,052 | 53.0|
-| GRPO | DeepSeek V3| 1-step Off | 1,536 | 761 | 512 | 512 | 512 | [16,1] | [1,1,16,16,n/a] | 14.1 | 67.6|
+| GRPO | DeepSeek V3| 1-step Off | 1,536 | 721 | 512 | 512 | 512 | [16,1] | [1,1,16,16,n/a] | 14.1 | 59.2|

 ### GB200 BF16 Benchmarks
 * GRPO Dataset: [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2)
@@ -84,18 +81,18 @@ The performance data includes:

 | Algorithm | Model | On/Off policy| T-Max Sequence Length| G-Average Seq len| #-GPUs| G-GBS| T-GBS| Generation [TP,PP] | Training [TP,CP,EP,PP,VPP] | Tokens / sec / GPU| Total Step time(s)|
 | --------- | ------- | -------- | ----- | ----- | ------| ---- | ---- | ---- | ---- | --- | ---|
-| GRPO | LLAMA3.1_8B| On policy | 4,096 | 1,066 | 8 | 2,048| 512 | [1,1] | [1,1,1,1,1,1,n/a] | 3,359 | 91.0|
-| GRPO | LLAMA3.1_8B| 1-step Off | 4,096 | 1,107 | 8 | 2,048| 512 | [1,1] | [1,1,1,1,1,1,n/a] | 4,463 | 71.1|
-| GRPO | DeepSeek V3| On policy | 1,536 | 996 | 128 | 512 | 512 | [32,1] | [1,1,16,8,n/a] | 34.3 | 128|
-| GRPO | DeepSeek V3| 1-step Off | 1,536 | 994 | 256 | 512 | 512 | [16,1] | [1,1,16,8,n/a] | 31.7 | 64.5|
-| GRPO | Qwen3-235B | On policy | 8,192 | 5,711 | 64 | 512 | 512 | [8,1] | [2,2,16,4,n/a] | 140 | 332|
-| GRPO | Qwen3-235B | 1-step Off | 8,192 | 5,711 | 128 | 512 | 512 | [8,1] | [4,1,16,4,n/a] | 87.9 | 268|
-| GRPO | Qwen3-30B3A| On policy | 4,096 | 3,198 | 16 | 2,048| 512 | [1,1] | [1,1,16,1,n/a] | 1,822 | 232|
-| GRPO | Qwen3-30B3A| 1-step Off | 4,096 | 3,204 | 32 | 2,048| 512 | [1,1] | [1,1,16,1,n/a] | 1,558 | 136|
-| GRPO | Qwen3-32B | On policy | 4,096 | 3,253 | 16 | 2,048| 512 | [1,1] | [2,1,1,1,n/a] | 1,127 | 381|
-| GRPO | Qwen3-32B | 1-step Off | 4,096 | 3,258 | 32 | 2,048| 512 | [1,1] | [2,1,1,1,n/a] | 1,025 | 210|
+| GRPO | DeepSeek V3| On policy | 1,536 | 711 | 128 | 512 | 512 | [32,1] | [1,1,16,8,n/a] | 30.2 | 108|
+| GRPO | DeepSeek V3| On policy | 1,536 | 700 | 256 | 512 | 512 | [32,1] | [1,1,16,8,n/a] | 16.4 | 98.7|
+| GRPO | DeepSeek V3| 1-step Off | 1,536 | 708 | 256 | 512 | 512 | [16,1] | [1,1,16,8,n/a] | 26.7 | 61.7|
+| GRPO | Qwen3-235B | On policy | 8,192 | 5,709 | 64 | 512 | 512 | [8,1] | [2,2,16,4,n/a] | 163 | 286|
+| GRPO | Qwen3-235B | On policy | 8,192 | 5,693 | 128 | 512 | 512 | [8,1] | [2,2,16,4,n/a] | 67.4 | 345|
+| GRPO | Qwen3-235B | 1-step Off | 8,192 | 5,705 | 128 | 512 | 512 | [8,1] | [4,1,16,4,n/a] | 85.5 | 278|
+| GRPO | Qwen3-30B3A| On policy | 4,096 | 3,199 | 16 | 2,048| 512 | [1,1] | [1,1,16,1,n/a] | 1,910 | 221|
+| GRPO | Qwen3-30B3A| 1-step Off | 4,096 | 3,197 | 16 | 2,048| 512 | [1,1] | [1,1,16,1,n/a] | 1,406 | 301|
+| SWE | Nemotron-3-Nano-30B-A3B| 1-step Off | 131,072 | 31,599 | 128 | 512 | 512 | [8,1] | [8,8,8,1,n/a] | 37.5 | 430|

 Note:

 * All Mixture-of-Experts (MoE) models are trained without token dropping (dropless).
 * The following metrics are averaged over 5 steps: G-Average Seq len, Tokens / sec / GPU, Total Step time(s). Because of the averaging, the numbers in the table do not exactly match the equation stated in Performance Metrics above, but the difference is small.
+* The pretrained checkpoint for DeepSeek V3 changed (see [docs/guides/deepseek.md](https://github.com/NVIDIA-NeMo/RL/blob/r0.6.0/docs/guides/deepseek.md)), which lowers the Average Seq len. The reported throughput is therefore not comparable across versions; use equivalent checkpoints for comparison. For example, with the newer checkpoint, `DeepSeek V3 on-policy GRPO #-GPUs: 128` runs at `26.1 Tokens / sec / GPU` on v0.5.0 compared to `30.2 Tokens / sec / GPU` on v0.6.0.
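
As a rough illustration of the notes above, the sketch below works through the version-to-version arithmetic for the quoted DeepSeek V3 figures. It also shows why a metric averaged over steps need not match a formula evaluated on averaged inputs. It assumes throughput has the generic form tokens / time, which may differ from the Performance Metrics equation, and the per-step values are hypothetical, not measured.

```python
from statistics import mean


def relative_change(old: float, new: float) -> float:
    """Fractional change going from `old` to `new`."""
    return (new - old) / old


# Figures quoted in the note: DeepSeek V3 on-policy GRPO, 128 GPUs,
# both versions run with the newer checkpoint (Tokens / sec / GPU).
v05_throughput = 26.1
v06_throughput = 30.2
gain = relative_change(v05_throughput, v06_throughput)
print(f"v0.5.0 -> v0.6.0: {gain:.1%}")  # prints "v0.5.0 -> v0.6.0: 15.7%"

# Hypothetical per-step values (NOT measured data), assuming the generic
# form throughput = tokens / time rather than the document's exact equation.
step_tokens = [1000.0, 2000.0]
step_times = [10.0, 40.0]

# Averaging the per-step throughput ...
mean_of_ratios = mean(t / s for t, s in zip(step_tokens, step_times))
# ... differs from applying the formula to the averaged inputs.
ratio_of_means = mean(step_tokens) / mean(step_times)
print(mean_of_ratios, ratio_of_means)  # prints "75.0 60.0"
```

The gap between the two averages grows with step-to-step variation, so small differences between the table and the equation are expected when steps are similar in length.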