@@ -33,12 +33,113 @@ Below are performance benchmarks for various large language models organized by
3333
3434The performance data includes:
3535
36- - ** Pre-training Performance** : Throughput metrics for various model sizes and architectures
36+ - ** Pre-training Performance** : Throughput metrics for various model sizes and architectures[ ^ moe-training-note ]
3737- ** System Configurations** : Results across different GPU systems (DGX-GB300, DGX-GB200, DGX-B300, DGX-B200, DGX-H100)
3838- ** Precision Options** : Performance comparisons between different precision modes (BF16, FP8, MXFP8, NVFP4)
3939
4040---
4141
42+ ## 26.04.01 NeMo Container
43+
44+ ### Pre-Training Performance
45+
46+ #### Model: LLAMA3_70B
47+
48+ | System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU |
49+ | --------| --------| -----------| -----| -----| -----------------| ------| ----| ----| ----| ----| ----| -----------------------| -------------------------|
50+ | DGX-GB300 | 64 | FP8 | 256 | 2 | 8192 | 64 | 1 | 1 | 1 | n/a | n/a | 5248 | 2348 |
51+ | DGX-GB300 | 64 | MXFP8 | 256 | 1 | 8192 | 0 | 1 | 4 | 1 | 5 | n/a | 4864 | 2186 |
52+ | DGX-GB300 | 64 | NVFP4 | 256 | 1 | 8192 | 0 | 1 | 4 | 1 | 5 | n/a | 7296 | 3253 |
53+ | DGX-GB200 | 64 | FP8 | 256 | 2 | 8192 | 64 | 1 | 1 | 1 | n/a | n/a | 4224 | 1892 |
54+ | DGX-GB200 | 64 | MXFP8 | 256 | 1 | 8192 | 0 | 2 | 4 | 1 | 5 | n/a | 3712 | 1664 |
55+ | DGX-GB200 | 64 | NVFP4 | 256 | 1 | 8192 | 0 | 2 | 4 | 1 | 5 | n/a | 4864 | 2202 |
56+ | DGX-H100 | 64 | FP8 | 256 | 1 | 8192 | 0 | 4 | 8 | 1 | 5 | n/a | 1664 | 731 |
57+
58+ #### Model: LLAMA3.1_405B
59+
60+ | System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU |
61+ | --------| --------| -----------| -----| -----| -----------------| ------| ----| ----| ----| ----| ----| -----------------------| -------------------------|
62+ | DGX-GB300 | 256 | FP8 | 1536 | 1 | 8192 | 0 | 4 | 8 | 1 | 4 | n/a | 1024 | 2617 |
63+ | DGX-GB300 | 256 | MXFP8 | 1536 | 1 | 8192 | 0 | 2 | 8 | 2 | 4 | n/a | 960 | 2453 |
64+ | DGX-GB300 | 256 | NVFP4 | 1536 | 1 | 8192 | 0 | 4 | 8 | 1 | 4 | n/a | 1440 | 3653 |
65+ | DGX-GB200 | 256 | FP8 | 1536 | 1 | 8192 | 0 | 4 | 16 | 1 | 4 | n/a | 864 | 2144 |
66+ | DGX-GB200 | 256 | MXFP8 | 1536 | 1 | 8192 | 0 | 4 | 16 | 1 | 8 | n/a | 800 | 1994 |
67+ | DGX-GB200 | 256 | NVFP4 | 1536 | 1 | 8192 | 0 | 4 | 16 | 1 | 8 | n/a | 1184 | 2960 |
68+ | DGX-H100 | 1024 | FP8 | 1536 | 1 | 8192 | 0 | 8 | 8 | 2 | 8 | n/a | 328 | 827 |
69+
70+ #### Model: DeepSeekV3
71+
72+ | System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU |
73+ | --------| --------| -----------| -----| -----| -----------------| ------| ----| ----| ----| ----| ----| -----------------------| -------------------------|
74+ | DGX-GB300 | 256 | MXFP8 | 4096 | 2 | 4096 | 0 | 1 | 2 | 1 | 8 | 32 | 4992 | 1298 |
75+ | DGX-GB200 | 256 | MXFP8 | 4096 | 1 | 4096 | 0 | 1 | 4 | 1 | 4 | 64 | 4256 | 1106 |
76+ | DGX-B300 | 256 | MXFP8 | 4096 | 2 | 4096 | 0 | 1 | 8 | 1 | n/a | 8 | 3456 | 898 |
77+ | DGX-B200 | 256 | MXFP8 | 4096 | 1 | 4096 | 0 | 1 | 8 | 1 | 2 | 32 | 3328 | 864 |
78+
79+ #### Model: GPT OSS 120B
80+
81+ | System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU |
82+ | --------| --------| -----------| -----| -----| -----------------| ------| ----| ----| ----| ----| ----| -----------------------| -------------------------|
83+ | DGX-GB300 | 64 | BF16 | 1280 | 4 | 4096 | 0 | 1 | 1 | 1 | n/a | 64 | 19200 | 523 |
84+ | DGX-GB200 | 64 | BF16 | 1280 | 4 | 4096 | 0 | 1 | 1 | 1 | n/a | 64 | 16640 | 452 |
85+ | DGX-B300 | 64 | BF16 | 1280 | 4 | 4096 | 0 | 1 | 1 | 1 | n/a | 8 | 15232 | 414 |
86+ | DGX-B200 | 64 | BF16 | 1280 | 4 | 4096 | 0 | 1 | 1 | 1 | n/a | 8 | 13568 | 369 |
87+ | DGX-H100 | 64 | BF16 | 1280 | 1 | 4096 | 0 | 1 | 4 | 1 | n/a | 8 | 5824 | 158 |
88+
89+ #### Model: Qwen3_30B_a3B
90+
91+ | System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU |
92+ | --------| --------| -----------| -----| -----| -----------------| ------| ----| ----| ----| ----| ----| -----------------------| -------------------------|
93+ | DGX-GB300 | 8 | MXFP8 | 512 | 8 | 4096 | 0 | 1 | 1 | 1 | n/a | 8 | 31744 | 729 |
94+ | DGX-GB200 | 8 | MXFP8 | 512 | 4 | 4096 | 0 | 1 | 1 | 1 | n/a | 8 | 26112 | 599 |
95+ | DGX-B300 | 8 | MXFP8 | 512 | 8 | 4096 | 0 | 1 | 1 | 1 | n/a | 8 | 30720 | 704 |
96+ | DGX-B200 | 8 | MXFP8 | 512 | 4 | 4096 | 0 | 1 | 1 | 1 | n/a | 8 | 27136 | 619 |
97+ | DGX-H100 | 16 | FP8 | 1024 | 1 | 4096 | 0 | 1 | 1 | 1 | n/a | 16 | 8960 | 206 |
98+
99+ #### Model: Qwen3_235B_a22B
100+
101+ | System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU |
102+ | --------| --------| -----------| -----| -----| -----------------| ------| ----| ----| ----| ----| ----| -----------------------| -------------------------|
103+ | DGX-GB300 | 256 | MXFP8 | 8192 | 2 | 4096 | 0 | 1 | 4 | 1 | 12 | 32 | 6944 | 1029 |
104+ | DGX-GB200 | 256 | MXFP8 | 8192 | 1 | 4096 | 0 | 1 | 8 | 1 | 3 | 32 | 5680 | 840 |
105+ | DGX-B300 | 256 | MXFP8 | 8192 | 2 | 4096 | 0 | 1 | 8 | 1 | n/a | 8 | 5936 | 878 |
106+ | DGX-B200 | 256 | MXFP8 | 8192 | 1 | 4096 | 0 | 1 | 8 | 1 | n/a | 8 | 3776 | 560 |
107+ | DGX-H100 | 256 | FP8 | 8192 | 1 | 4096 | 0 | 2 | 8 | 1 | 4 | 32 | 1712 | 253 |
108+
109+ #### Model: Kimi_K2
110+
111+ | System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU |
112+ | --------| --------| -----------| -----| -----| -----------------| ------| ----| ----| ----| ----| ----| -----------------------| -------------------------|
113+ | DGX-GB300 | 256 | MXFP8 | 4096 | 2 | 4096 | 0 | 1 | 4 | 1 | 4 | 64 | 5328 | 1088 |
114+
115+ - Muon optimizer was used for pre-training Kimi-K2.
116+
117+ #### Model: Nemotron_3_Nano
118+
119+ | System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU |
120+ | --------| --------| -----------| -----| -----| -----------------| ------| ----| ----| ----| ----| ----| -----------------------| -------------------------|
121+ | DGX-GB300 | 8 | MXFP8 | 512 | 4 | 8192 | 0 | 1 | 1 | 1 | n/a | 8 | 37888 | 845 |
122+ | DGX-GB200 | 8 | MXFP8 | 512 | 2 | 8192 | 0 | 1 | 1 | 1 | n/a | 8 | 32768 | 725 |
123+ | DGX-B300 | 8 | MXFP8 | 512 | 4 | 8192 | 0 | 1 | 1 | 1 | n/a | 8 | 35840 | 794 |
124+ | DGX-B200 | 8 | MXFP8 | 512 | 2 | 8192 | 0 | 1 | 1 | 1 | n/a | 8 | 32768 | 726 |
125+ | DGX-H100 | 16 | FP8 | 1024 | 1 | 8192 | 0 | 1 | 1 | 1 | n/a | 8 | 14336 | 321 |
126+
127+ #### Model: Nemotron_3_Super
128+
129+ | System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU |
130+ | --------| --------| -----------| -----| -----| -----------------| ------| ----| ----| ----| ----| ----| -----------------------| -------------------------|
131+ | DGX-GB300 | 64 | MXFP8 | 512 | 1 | 8192 | 0 | 1 | 1 | 1 | n/a | 64 | 9344 | 795 |
132+ | DGX-GB300 | 64 | NVFP4 | 512 | 1 | 8192 | 0 | 1 | 1 | 1 | n/a | 64 | 9600 | 817 |
133+ | DGX-GB200 | 64 | MXFP8 | 512 | 1 | 8192 | 0 | 2 | 1 | 1 | n/a | 64 | 6656 | 564 |
134+ | DGX-GB200 | 64 | NVFP4 | 512 | 1 | 8192 | 0 | 2 | 1 | 1 | n/a | 64 | 6784 | 574 |
135+ | DGX-B300 | 64 | MXFP8 | 512 | 1 | 8192 | 0 | 1 | 1 | 1 | n/a | 8 | 7296 | 623 |
136+ | DGX-B300 | 64 | NVFP4 | 512 | 1 | 8192 | 0 | 1 | 1 | 1 | n/a | 8 | 7424 | 634 |
137+ | DGX-B200 | 64 | MXFP8 | 512 | 1 | 8192 | 0 | 1 | 1 | 1 | n/a | 64 | 6400 | 542 |
138+ | DGX-B200 | 64 | NVFP4 | 512 | 1 | 8192 | 0 | 2 | 1 | 1 | n/a | 64 | 5632 | 475[ ^ nemotron-3-super-b200-nvfp4-note ] |
139+
140+ [ ^ moe-training-note ] : In MoE training benchmarks, we force-balance the token distribution among experts and all benchmarks are token-dropless.
141+ [ ^ nemotron-3-super-b200-nvfp4-note ] : Mapping used for MXFP8 precision could not fit for NVFP4 precision for this model. We expect to achieve better performance for NVFP4 precision in future when NVFP4 param gather is supported.
142+
42143## 26.04 NeMo Container
43144
44145### Pre-Training Performance
0 commit comments