Commit 5fb5889

svcnvidia-nemo-ci, guyueh1, and parthmannan authored
cp: docs: Perf page update for v0.6 (2346) into r0.6.0 (#2364)
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>
Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>
Co-authored-by: Parth Mannan <pmannan@nvidia.com>
1 parent fbbbbd5 commit 5fb5889

1 file changed

Lines changed: 22 additions & 25 deletions

docs/about/performance-summary.md

@@ -43,27 +43,25 @@ The performance data includes:
 
 ---
 
-## Nemo RL v0.5
+## Nemo RL v0.6
 
 ### H100 BF16 Benchmarks
-* GRPO Dataset: [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2); DAPO dataset: [DAPOMath17k](https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k)
+* GRPO Dataset: [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2); DAPO dataset: [DAPOMath17k](https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k); SWE dataset: refer to [Nemotron super-v3 documentation - stage 2.2](https://github.com/NVIDIA-NeMo/RL/blob/super-v3/docs/guides/nemotron-3-super.md#stage-22---swe-2-64-nodes)
 * System: DGX-H100
 * Precision: Training BF16, Generation BF16
 * Training Backend: Megatron-core.
 
 | Algorithm | Model |On/Off policy|T-Max Sequence Length|G-Average Seq len|#-GPUs|G-GBS|T-GBS|Generation [TP,PP]|Training [TP,CP,EP,PP,VPP]|Tokens / sec / GPU|Total Step time(s)|
 |--------- |------- |-------- |----- |----- |------|---- |---- |---- |---- |--- |---|
-| GRPO |LLAMA3.1_8B|On policy |4,096 |1,019 |16 |2,048|512 |[1,1] |[1,1,1,1,1,2,n/a] |1,581 | 92.8|
-| GRPO |LLAMA3.1_8B|1-step Off |4,096 |1,123 |16 |2,048|512 |[1,1] |[1,1,1,1,1,1,n/a] |2,478 | 64.8|
-| GRPO |DeepSeek V3|On policy |1,536 |744 |256 |512 |512 |[32,1] |[1,1,16,16,n/a] |12.7 | 134|
-| GRPO |DeepSeek V3|1-step Off |1,536 |738 |512 |512 |512 |[32,1] |[1,1,16,16,n/a] |13.1 | 64.9|
-| DAPO |DeepSeek V3|On policy |1,536 |974 |512 |512 |512 |[64,1] |[8,4,32,8,n/a] |2.45 | 458|
-| GRPO |Qwen3-235B |On policy |8,192 |5,700 |128 |512 |512 |[16,1] |[2,2,16,8,n/a] |54.1 | 431|
-| GRPO |Qwen3-235B |1-step Off |8,192 |5,707 |256 |512 |512 |[8,1] |[4,1,16,8,n/a] |58.7 | 203|
-| GRPO |Qwen3-30B3A|On policy |4,096 |3,196 |32 |2,048|512 |[2,1] |[1,1,8,1,n/a] |1066 | 198|
-| GRPO |Qwen3-30B3A|1-step Off |4,096 |3,201 |32 |2,048|512 |[2,1] |[1,1,8,2,n/a] |1391 | 154|
-| GRPO |Qwen3-32B |On policy |4,096 |3,251 |32 |2,048|512 |[4,1] |[4,1,1,4,n/a] |571 | 376|
-| GRPO |Qwen3-32B |1-step Off |4,096 |3,252 |64 |2,048|512 |[4,1] |[4,1,1,4,n/a] |538 | 200|
+| GRPO |DeepSeek V3|On policy |1,536 |701 |256 |512 |512 |[32,1] |[1,1,16,16,n/a] |12.1 | 134|
+| GRPO |DeepSeek V3|On policy |1,536 |697 |512 |512 |512 |[32,1] |[1,1,16,16,n/a] |7.24 | 111|
+| GRPO |DeepSeek V3|1-step Off |1,536 |710 |512 |512 |512 |[32,1] |[1,1,16,16,n/a] |12.8 | 64.1|
+| GRPO |Qwen3-235B |On policy |8,192 |5,698 |128 |512 |512 |[16,1] |[2,2,16,8,n/a] |58.9 | 395|
+| GRPO |Qwen3-235B |On policy |8,192 |5,713 |256 |512 |512 |[16,1] |[2,2,16,8,n/a] |37.4 | 312|
+| GRPO |Qwen3-235B |1-step Off |8,192 |5,721 |256 |512 |512 |[8,1] |[4,1,16,8,n/a] |58.7 | 231|
+| GRPO |Qwen3-30B3A|On policy |4,096 |3,203 |32 |2,048|512 |[2,1] |[1,1,8,1,n/a] |1102 | 192|
+| GRPO |Qwen3-30B3A|1-step Off |4,096 |3,201 |32 |2,048|512 |[2,1] |[1,1,8,2,n/a] |1414 | 152|
+| GRPO |Qwen3-30B3A|8-step Off |4,096 |3,206 |192 |2,048|512 |[2,1] |[1,1,8,1,n/a] |1025 | 34.5|
 
 ### H100 FP8 Benchmarks
 * GRPO Dataset: [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2)
@@ -73,8 +71,7 @@ The performance data includes:
 
 | Algorithm | Model |On/Off policy|T-Max Sequence Length|G-Average Seq len|#-GPUs|G-GBS|T-GBS|Generation [TP,PP]|Training [TP,CP,EP,PP,VPP]|Tokens / sec / GPU|Total Step time(s)|
 |--------- |------- |-------- |----- |----- |------|---- |---- |---- |---- |--- |---|
-| GRPO |LLAMA3.1_8B|1-step Off |4,096 |1,128 |16 |2,048|512 |[1,1] |[1,1,1,1,1,1,n/a] |3,052 | 53.0|
-| GRPO |DeepSeek V3|1-step Off |1,536 |761 |512 |512 |512 |[16,1] |[1,1,16,16,n/a] |14.1 | 67.6|
+| GRPO |DeepSeek V3|1-step Off |1,536 |721 |512 |512 |512 |[16,1] |[1,1,16,16,n/a] |14.1 | 59.2|
 
 ### GB200 BF16 Benchmarks
 * GRPO Dataset: [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2)
@@ -84,18 +81,18 @@ The performance data includes:
 
 | Algorithm | Model |On/Off policy|T-Max Sequence Length|G-Average Seq len|#-GPUs|G-GBS|T-GBS|Generation [TP,PP]|Training [TP,CP,EP,PP,VPP]|Tokens / sec / GPU|Total Step time(s)|
 |--------- |------- |-------- |----- |----- |------|---- |---- |---- |---- |--- |---|
-| GRPO |LLAMA3.1_8B|On policy |4,096 |1,066 |8 |2,048|512 |[1,1] |[1,1,1,1,1,1,n/a] |3,359 | 91.0|
-| GRPO |LLAMA3.1_8B|1-step Off |4,096 |1,107 |8 |2,048|512 |[1,1] |[1,1,1,1,1,1,n/a] |4,463 | 71.1|
-| GRPO |DeepSeek V3|On policy |1,536 |996 |128 |512 |512 |[32,1] |[1,1,16,8,n/a] |34.3 | 128|
-| GRPO |DeepSeek V3|1-step Off |1,536 |994 |256 |512 |512 |[16,1] |[1,1,16,8,n/a] |31.7 | 64.5|
-| GRPO |Qwen3-235B |On policy |8,192 |5,711 |64 |512 |512 |[8,1] |[2,2,16,4,n/a] |140 | 332|
-| GRPO |Qwen3-235B |1-step Off |8,192 |5,711 |128 |512 |512 |[8,1] |[4,1,16,4,n/a] |87.9 | 268|
-| GRPO |Qwen3-30B3A|On policy |4,096 |3,198 |16 |2,048|512 |[1,1] |[1,1,16,1,n/a] |1,822 | 232|
-| GRPO |Qwen3-30B3A|1-step Off |4,096 |3,204 |32 |2,048|512 |[1,1] |[1,1,16,1,n/a] |1,558 | 136|
-| GRPO |Qwen3-32B |On policy |4,096 |3,253 |16 |2,048|512 |[1,1] |[2,1,1,1,n/a] |1,127 | 381|
-| GRPO |Qwen3-32B |1-step Off |4,096 |3,258 |32 |2,048|512 |[1,1] |[2,1,1,1,n/a] |1,025 | 210|
+| GRPO |DeepSeek V3|On policy |1,536 |711 |128 |512 |512 |[32,1] |[1,1,16,8,n/a] |30.2 | 108|
+| GRPO |DeepSeek V3|On policy |1,536 |700 |256 |512 |512 |[32,1] |[1,1,16,8,n/a] |16.4 | 98.7|
+| GRPO |DeepSeek V3|1-step Off |1,536 |708 |256 |512 |512 |[16,1] |[1,1,16,8,n/a] |26.7 | 61.7|
+| GRPO |Qwen3-235B |On policy |8,192 |5,709 |64 |512 |512 |[8,1] |[2,2,16,4,n/a] |163 | 286|
+| GRPO |Qwen3-235B |On policy |8,192 |5,693 |128 |512 |512 |[8,1] |[2,2,16,4,n/a] |67.4 | 345|
+| GRPO |Qwen3-235B |1-step Off |8,192 |5,705 |128 |512 |512 |[8,1] |[4,1,16,4,n/a] |85.5 | 278|
+| GRPO |Qwen3-30B3A|On policy |4,096 |3,199 |16 |2,048|512 |[1,1] |[1,1,16,1,n/a] |1,910 | 221|
+| GRPO |Qwen3-30B3A|1-step Off |4,096 |3,197 |16 |2,048|512 |[1,1] |[1,1,16,1,n/a] |1,406 | 301|
+| SWE |Nemotron-3-Nano-30B-A3B|1-step Off |131,072 |31,599 |128 |512 |512 |[8,1] |[8,8,8,1,n/a] |37.5 | 430|
 
 Note:
 
 * All Mixture-of-expert (MoE) model training uses token drop-less.
 * The following metrics are extracted from the average of 5 steps: G-Average Seq len, Tokens/sec/gpu, Total Step time(s). Because of the averaging, the numbers in the table do not completely match the equation stated in Performance Metrics above but the difference is small.
+* There was a change in pretrained checkpoint (see [docs/guides/deepseek.md](https://github.com/NVIDIA-NeMo/RL/blob/r0.6.0/docs/guides/deepseek.md)) for DeepSeek V3 leading to lower Average Seq len. The reported throughput is not comparable across versions. Please use equivalent checkpoints for comparison. For example, using the newer checkpoint `DeepSeek V3 on-policy GRPO #-GPUs: 128` v0.5.0 performs at `26.1 Tokens / sec / GPU` compared to v0.6.0 at `30.2 Tokens / sec / GPU`.
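
The like-for-like comparison in the DeepSeek V3 checkpoint note can be turned into a relative change with a few lines of Python. This is only a sketch: the helper `relative_change` and the variable names are hypothetical and are not part of NeMo RL. The two throughput values are the ones quoted in the note (DeepSeek V3, on-policy GRPO, 128 GPUs, both measured with the newer checkpoint), and the v0.6.0 figure matches the 128-GPU DeepSeek V3 row of the GB200 BF16 table.

```python
def relative_change(old: float, new: float) -> float:
    """Return the fractional change from old to new (0.10 means +10%)."""
    if old == 0:
        raise ValueError("old value must be non-zero")
    return (new - old) / old


# Tokens / sec / GPU quoted in the note, both using the newer checkpoint.
v05_tokens_per_sec_per_gpu = 26.1  # v0.5.0
v06_tokens_per_sec_per_gpu = 30.2  # v0.6.0

gain = relative_change(v05_tokens_per_sec_per_gpu, v06_tokens_per_sec_per_gpu)
print(f"v0.6.0 vs v0.5.0: {gain:+.1%}")
```

As the note says, the same calculation on rows taken from different releases' tables would not be meaningful for DeepSeek V3, because those rows were measured with different checkpoints.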
