Commit 7ec6c28

add nemotron3 super docs (#2757)
Signed-off-by: Li Ding <liding@nvidia.com>
1 parent 35d2e00 commit 7ec6c28

File tree

4 files changed, +323 -0 lines changed


README.md

Lines changed: 1 addition & 0 deletions
@@ -155,6 +155,7 @@ For more details on supported models, see our documentation:
 | [Moonlight](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/src/megatron/bridge/models/deepseek) || ✅ ([16B](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/src/megatron/bridge/recipes/moonlight/moonlight_16b.py)) | ✅ ([16B](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/src/megatron/bridge/recipes/moonlight/moonlight_16b.py)) |
 | [Nemotron](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/src/megatron/bridge/models/nemotron) || Coming soon | Coming soon |
 | [Nemotron-nano-v3](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/nano-v3/src/megatron/bridge/models/nemotronh) || ✅ ([30B-A3B](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/nano-v3/src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py)) | ✅ ([A3B](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/nano-v3/src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py)) |
+| [Nemotron-super-v3](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/super-v3/src/megatron/bridge/models/nemotronh) || ✅ ([120B-A12B](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/super-v3/src/megatron/bridge/recipes/nemotronh/nemotron_3_super.py)) | ✅ ([A12B](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/super-v3/src/megatron/bridge/recipes/nemotronh/nemotron_3_super.py)) |
 | [Nemotron-H](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/src/megatron/bridge/models/nemotronh) || ✅ ([4B/8B/47B/56B](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/src/megatron/bridge/recipes/nemotronh/nemotronh.py)) | Coming soon |
 | [Nemotron Nano v2](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/src/megatron/bridge/models/nemotronh) || ✅ ([9B/12B](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py)) | Coming soon |
 | [Nemotron Nano v2 VL](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/src/megatron/bridge/models/nemotron_vl) || Coming soon | ✅ ([9B/12B](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/src/megatron/bridge/recipes/nemotron_vl/nemotron_nano_v2_vl.py)) |

docs/models/llm/README.md

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@ Megatron Bridge supports the following LLM families:
 | **Mistral** | [mistral.md](mistral.md) | Mistral AI models |
 | **Moonlight** | [moonlight.md](moonlight.md) | Moonlight model family |
 | **Nemotron-3** | [nemotron3.md](nemotron3.md) | NVIDIA Nemotron-3 models |
+| **Nemotron-3 Super** | [nemotron3-super.md](nemotron3-super.md) | NVIDIA Nemotron-3 Super models |
 | **Nemotron-H** | [nemotronh.md](nemotronh.md) | NVIDIA Nemotron-H models |
 | **OLMoE** | [olmoe.md](olmoe.md) | OLMoE (Open Language Model - Mixture of Experts) |
 | **Qwen** | [qwen.md](qwen.md) | Alibaba Cloud Qwen model family |

docs/models/llm/index.md

Lines changed: 1 addition & 0 deletions
@@ -16,6 +16,7 @@ llama-nemotron.md
 mistral.md
 moonlight.md
 nemotron3.md
+nemotron3-super.md
 nemotronh.md
 olmoe.md
 qwen.md

docs/models/llm/nemotron3-super.md

Lines changed: 320 additions & 0 deletions
@@ -0,0 +1,320 @@
# Nemotron 3 Super

[Nemotron 3 Super](https://huggingface.co/collections/nvidia/nvidia-nemotron-v3) is a large language model (LLM) trained by NVIDIA, designed to deliver strong agentic, reasoning, and conversational capabilities. It employs a hybrid **Latent Mixture-of-Experts (LatentMoE)** architecture that interleaves Mamba-2 and MoE layers with select attention layers. Unlike the Nano model, the Super model incorporates **Multi-Token Prediction (MTP)** layers for faster text generation and improved quality, and it is trained with **NVFP4** quantization to maximize compute efficiency. The model has **12B active parameters** and **120B total parameters**.

NeMo Megatron Bridge supports pretraining, full-parameter fine-tuning, and LoRA fine-tuning of this model. The fine-tuned model can be converted back to the 🤗 Hugging Face format for downstream evaluation.

```{important}
Please use the custom container `nvcr.io/nvidia/nemo:26.02.nemotron_3_super` when working with this model.

Run all commands from `/opt/Megatron-Bridge` (e.g. `docker run -w /opt/Megatron-Bridge ...`).
```

## Getting the Latest Code

For the best experience, it is recommended to use the latest code from the `super-v3` branch. There are two ways to do this:

### Option 1: Update the Code Inside the Container

Launch the container and update the code in-place:

```bash
# Pull the latest changes from the super-v3 branch
cd /opt/Megatron-Bridge
git pull origin super-v3
```

### Option 2: Mount the Repo from Host

This approach lets you work with the code on your host machine and mount it into the container at runtime.

**Step 1 — Pull the latest `super-v3` branch on the host:**

```bash
git checkout super-v3 && git pull origin super-v3
```

**Step 2 — Mount the repo when launching the container:**

```bash
MEGATRON_BRIDGE_PATH=/path/to/Megatron-Bridge  # set this to your local clone

docker run --rm -it \
    -v $MEGATRON_BRIDGE_PATH:/opt/Megatron-Bridge \
    -w /opt/Megatron-Bridge \
    nvcr.io/nvidia/nemo:26.02.nemotron_3_super \
    bash
```

---

## Conversion with 🤗 Hugging Face

### Import HF → Megatron

To import the HF model to your desired `$MEGATRON_PATH`, run the following command:

```bash
HF_MODEL=/path/to/hf/model
MEGATRON_PATH=/path/to/output/megatron/ckpt

torchrun --nproc-per-node=8 examples/conversion/convert_checkpoints.py import \
    --hf-model $HF_MODEL \
    --megatron-path $MEGATRON_PATH \
    --tp 1 \
    --ep 8
```

Notes:
- The default parallelism is TP=1, EP=8 (Expert Parallel)
- Adjust `--nproc-per-node` based on your available GPUs
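Rather than hard-coding the process count, it can be derived from the GPUs visible on the node. This is a convenience sketch, not part of the official scripts; the `nvidia-smi` probe and the fallback of 8 are assumptions:

```shell
# Derive --nproc-per-node from the visible GPUs (sketch; assumes nvidia-smi exists).
NUM_GPUS=$(nvidia-smi -L 2>/dev/null | wc -l)
# Fall back to the 8-GPU default used throughout this page if detection fails.
[ "$NUM_GPUS" -gt 0 ] || NUM_GPUS=8
echo "launching with --nproc-per-node=$NUM_GPUS"
```

With this, `torchrun --nproc-per-node=$NUM_GPUS ...` adapts to the node instead of assuming 8 GPUs.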

### Export Megatron → HF

```bash
HF_MODEL=/path/to/hf/model
MEGATRON_PATH=/path/to/trained/megatron/ckpt
OUTPUT_PATH=/path/to/output/hf/ckpt

torchrun --nproc-per-node=8 examples/conversion/convert_checkpoints.py export \
    --hf-model $HF_MODEL \
    --megatron-path $MEGATRON_PATH \
    --hf-path $OUTPUT_PATH \
    --tp 1 \
    --ep 8
```

### Roundtrip Testing

To verify the correctness of import/export conversions:

```bash
HF_MODEL=/path/to/hf/model
MEGATRON_PATH=/path/to/megatron/ckpt

torchrun --nproc-per-node=8 examples/conversion/hf_megatron_roundtrip_multi_gpu.py \
    --hf-model-id $HF_MODEL \
    --megatron-load-path $MEGATRON_PATH \
    --tp 1 \
    --ep 8 \
    --trust-remote-code
```

### Compare HF and Megatron Outputs

To compare outputs between HF and Megatron models:

```bash
HF_MODEL=/path/to/hf/model
MEGATRON_PATH=/path/to/megatron/ckpt

torchrun --nproc-per-node=8 examples/conversion/compare_hf_and_megatron/compare.py \
    --hf_model_path $HF_MODEL \
    --megatron_model_path $MEGATRON_PATH \
    --prompt "Hello who are " \
    --tp 8 \
    --ep 8 \
    --trust_remote_code
```

## Pretraining Examples

### Pretraining with Real Data

```bash
BLEND_PATH=/path/to/dataset/blend.json
CHECKPOINT_DIR=/path/to/checkpoints

torchrun --nproc-per-node=8 examples/models/nemotron_3/pretrain_nemotron_3_super.py \
    --per-split-data-args-path=${BLEND_PATH} \
    logger.wandb_project=your_project \
    logger.wandb_entity=nvidia \
    logger.log_interval=5 \
    checkpoint.load=${CHECKPOINT_DIR} \
    checkpoint.save=${CHECKPOINT_DIR} \
    checkpoint.save_interval=100 \
    train.global_batch_size=8 \
    train.micro_batch_size=1 \
    train.train_iters=1280 \
    scheduler.lr_warmup_iters=128 \
    scheduler.lr_decay_iters=1152 \
    scheduler.lr_wsd_decay_iters=1152 \
    model.tensor_model_parallel_size=4 \
    model.context_parallel_size=1 \
    model.expert_model_parallel_size=64 \
    model.sequence_parallel=True
```

Notes:
- **GPU Requirements**: Requires B200 GPUs for NVFP4 support. A minimum of 8 nodes (64 GPUs) is required
- The default parallelism settings are TP=4, EP=64, PP=1, CP=1 with sequence parallel enabled
- Expert parallelism (EP) is set to 64 for the MoE architecture
- Adjust batch sizes and iteration counts based on your training requirements
- Make sure to set up WandB credentials if using WandB logging
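The exact schema of `blend.json` is determined by the recipe's `--per-split-data-args-path` handling and is not reproduced here. As a purely illustrative sketch (the field names and the weight/prefix layout below are assumptions modeled on Megatron-style per-split blends, not the confirmed format), such a file might map each split to weighted dataset prefixes:

```json
{
  "train": ["0.7", "/data/corpus_a_text_document", "0.3", "/data/corpus_b_text_document"],
  "valid": ["1.0", "/data/valid_text_document"],
  "test": ["1.0", "/data/test_text_document"]
}
```

Consult `examples/models/nemotron_3/pretrain_nemotron_3_super.py` for the format it actually expects.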

### Pretraining with Mock Data

For quick testing without a dataset:

```bash
CHECKPOINT_DIR=/path/to/checkpoints

torchrun --nproc-per-node=8 examples/models/nemotron_3/pretrain_nemotron_3_super.py \
    logger.wandb_project=your_project \
    logger.wandb_entity=nvidia \
    checkpoint.load=${CHECKPOINT_DIR} \
    checkpoint.save=${CHECKPOINT_DIR} \
    checkpoint.save_interval=100 \
    train.global_batch_size=128 \
    train.train_iters=100 \
    scheduler.lr_warmup_iters=10 \
    model.hybrid_override_pattern="MEME*ME" \
    model.num_layers=7
```

Notes:
- If `--per-split-data-args-path` is not specified, a mock dataset is used
- The `hybrid_override_pattern` option can be used to customize the MoE layer pattern
- Useful for debugging and testing the training pipeline
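The override pattern and `num_layers` must stay consistent: assuming one character per layer (the per-character meanings, e.g. `M` for a Mamba-2 layer and `E` for a MoE layer, are an inference from the architecture description rather than something this page specifies), the pattern length should equal `num_layers`. A quick sanity check:

```shell
# Sanity check: the override pattern should have one character per layer.
PATTERN="MEME*ME"
NUM_LAYERS=7

if [ "${#PATTERN}" -ne "$NUM_LAYERS" ]; then
    echo "pattern length ${#PATTERN} does not match num_layers=$NUM_LAYERS" >&2
    exit 1
fi
echo "pattern OK: ${#PATTERN} layers"
```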


## Finetuning Recipes

### Full Parameter Fine-Tuning

```bash
MEGATRON_PATH=/path/to/pretrained/megatron/ckpt
CHECKPOINT_DIR=/path/to/finetuned/checkpoints

torchrun --nproc-per-node=8 examples/models/nemotron_3/finetune_nemotron_3_super.py \
    logger.wandb_project=your_project \
    logger.wandb_entity=nvidia \
    logger.log_interval=5 \
    checkpoint.load=${CHECKPOINT_DIR} \
    checkpoint.save=${CHECKPOINT_DIR} \
    checkpoint.save_interval=50 \
    train.global_batch_size=16 \
    train.train_iters=200 \
    scheduler.lr_warmup_iters=10 \
    model.tensor_model_parallel_size=4 \
    model.sequence_parallel=True \
    checkpoint.pretrained_checkpoint=$MEGATRON_PATH
```

Notes:
- Default parallelism is TP=4, EP=8, PP=1, CP=1 with sequence parallel enabled
- By default, the [SQuAD](https://huggingface.co/datasets/rajpurkar/squad) dataset is used
- Fine-tuning requires a pretrained Megatron checkpoint, which can be obtained from the "Import HF → Megatron" section above
- Adjust `global_batch_size` and parallelism settings based on your GPU memory and requirements
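When tuning `global_batch_size` against the parallelism settings, the usual Megatron constraint applies: the global batch must be divisible by `micro_batch_size × data_parallel_size`, where data-parallel size is the world size divided by TP × PP (expert parallelism runs inside the data-parallel dimension, so it does not enter this check). A sketch for the single-node defaults above; the micro batch size of 1 is an assumption, since the fine-tuning command does not set it explicitly:

```shell
# Sketch: derive data-parallel size and gradient-accumulation steps for the
# single-node fine-tuning defaults (8 GPUs, TP=4, PP=1).
WORLD_SIZE=8
TP=4
PP=1
MICRO_BATCH=1    # assumption: not set explicitly in the command above
GLOBAL_BATCH=16

DP=$(( WORLD_SIZE / (TP * PP) ))
if [ $(( GLOBAL_BATCH % (MICRO_BATCH * DP) )) -ne 0 ]; then
    echo "global_batch_size must be divisible by $(( MICRO_BATCH * DP ))" >&2
    exit 1
fi
echo "DP=$DP, gradient accumulation steps=$(( GLOBAL_BATCH / (MICRO_BATCH * DP) ))"
```

Here DP=2, so `global_batch_size=16` implies 8 gradient-accumulation steps per iteration.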


### LoRA Fine-Tuning

To enable LoRA fine-tuning, pass `--peft lora` to the script:

```bash
MEGATRON_PATH=/path/to/pretrained/megatron/ckpt
CHECKPOINT_DIR=/path/to/lora/checkpoints

torchrun --nproc-per-node=8 examples/models/nemotron_3/finetune_nemotron_3_super.py \
    --peft lora \
    logger.wandb_project=your_project \
    logger.wandb_entity=nvidia \
    logger.log_interval=5 \
    checkpoint.load=${CHECKPOINT_DIR} \
    checkpoint.save=${CHECKPOINT_DIR} \
    checkpoint.save_interval=100 \
    train.global_batch_size=4 \
    train.train_iters=200 \
    model.tensor_model_parallel_size=4 \
    model.context_parallel_size=2 \
    model.sequence_parallel=True \
    scheduler.lr_warmup_iters=30 \
    checkpoint.pretrained_checkpoint=$MEGATRON_PATH
```

Notes:
- By default, the target modules are the linear layers `["linear_qkv", "linear_proj", "linear_fc1", "linear_fc2", "in_proj", "out_proj"]` in the model
- LoRA fine-tuning uses less memory and can work with smaller batch sizes
- Consider using Context Parallel (CP) for longer sequences


## Quantization (PTQ and QAT)

```{important}
Quantization support requires the latest code from the `super-v3` branch. See [Getting the Latest Code](#getting-the-latest-code) for instructions.
```

Nemotron 3 Super supports four quantization configurations:

| Config Name | Format | Description |
|---|---|---|
| `mamba_moe_fp8_aggressive` | FP8 | Aggressive FP8 quantization for Mamba-MoE |
| `mamba_moe_fp8_conservative` | FP8 | Conservative FP8 quantization for Mamba-MoE |
| `mamba_moe_nvfp4_aggressive` | NVFP4 | Aggressive NVFP4 quantization for Mamba-MoE |
| `mamba_moe_nvfp4_conservative` | NVFP4 | Conservative NVFP4 quantization for Mamba-MoE |

Pass the desired config name to `quantize.py` via `--export-quant-cfg`.
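Since the config name is just a CLI argument, a sweep over all four configurations can be scripted. A sketch that prints one `quantize.py` invocation per config (the paths are placeholders, and each command is printed for review; drop the `echo` to actually run them):

```shell
# Print a quantize.py command for each supported quantization config.
HF_MODEL=/path/to/hf/model
SAVE_ROOT=/path/to/quantized/ckpts

CONFIGS="mamba_moe_fp8_aggressive mamba_moe_fp8_conservative \
mamba_moe_nvfp4_aggressive mamba_moe_nvfp4_conservative"

N=0
for CFG in $CONFIGS; do
    N=$(( N + 1 ))
    echo torchrun --nproc_per_node=8 examples/quantization/quantize.py \
        --hf-model-id "$HF_MODEL" \
        --export-quant-cfg "$CFG" \
        --megatron-save-path "$SAVE_ROOT/$CFG" \
        --pp 1 --tp 8 --ep 8 --trust-remote-code
done
echo "$N commands generated"
```

Saving each config to its own subdirectory (`$SAVE_ROOT/$CFG`) keeps the four checkpoints from overwriting one another.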

### Quantize

```bash
export HF_MODEL=/path/to/hf/model
export MEGATRON_SAVE_PATH=/path/to/quantized/megatron/ckpt

torchrun --nproc_per_node=8 examples/quantization/quantize.py \
    --hf-model-id $HF_MODEL \
    --export-quant-cfg mamba_moe_nvfp4_conservative \
    --megatron-save-path $MEGATRON_SAVE_PATH \
    --pp 1 \
    --tp 8 \
    --ep 8 \
    --trust-remote-code
```

### Verify with PTQ Generate

```bash
torchrun --nproc_per_node=8 examples/quantization/ptq_generate.py \
    --hf-model-id $HF_MODEL \
    --megatron-load-path $MEGATRON_SAVE_PATH \
    --pp 1 \
    --tp 8 \
    --ep 8 \
    --trust-remote-code
```

Notes:
- For multi-node setups (e.g. 2 nodes with 8× H100), increase `--pp` accordingly (e.g. `--pp 2`) and use a job scheduler like SLURM to launch across nodes.
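The multi-node launch itself is cluster-specific. As a rough SLURM sketch (the `#SBATCH` directives, rendezvous flags, and port are assumptions to adapt to your cluster, and the final command is printed rather than executed so it can be reviewed first):

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# Hypothetical 2-node x 8-GPU layout with --pp raised to 2, as suggested above.
NODES=2
GPUS_PER_NODE=8
TOTAL_GPUS=$(( NODES * GPUS_PER_NODE ))
echo "world size: $TOTAL_GPUS GPUs"

# Remove the leading 'echo' to actually launch under srun.
echo srun torchrun \
    --nnodes="$NODES" \
    --nproc-per-node="$GPUS_PER_NODE" \
    --rdzv-backend=c10d \
    --rdzv-endpoint="${MASTER_ADDR:-localhost}:29500" \
    examples/quantization/ptq_generate.py \
    --hf-model-id "$HF_MODEL" \
    --megatron-load-path "$MEGATRON_SAVE_PATH" \
    --pp 2 --tp 8 --ep 8 --trust-remote-code
```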

### Export Quantized Megatron Checkpoint → HF

After quantization, export the Megatron checkpoint back to Hugging Face format:

```bash
HF_MODEL=/path/to/hf/model
MEGATRON_LOAD_PATH=/path/to/quantized/megatron/ckpt
EXPORT_DIR=/path/to/output/hf/ckpt

torchrun --nproc_per_node=8 examples/quantization/export.py \
    --hf-model-id $HF_MODEL \
    --megatron-load-path $MEGATRON_LOAD_PATH \
    --export-dir $EXPORT_DIR \
    --pp 8 \
    --dtype bfloat16 \
    --trust-remote-code
```

### Quantization-Aware Training (QAT)

After quantization, further improve model quality with QAT by continuing training from the quantized Megatron checkpoint:

```bash
MEGATRON_PATH=/path/to/quantized/megatron/ckpt
CHECKPOINT_DIR=/path/to/qat/checkpoints

torchrun --nproc-per-node=8 examples/models/nemotron_3/qat_nemotron_3_super.py \
    --megatron-load-path=${MEGATRON_PATH} \
    --seq-length=8192 \
    --packed-sequence \
    logger.wandb_project=your_project \
    logger.wandb_entity=nvidia \
    logger.log_interval=5 \
    checkpoint.load=${CHECKPOINT_DIR} \
    checkpoint.save=${CHECKPOINT_DIR} \
    checkpoint.save_interval=50 \
    train.global_batch_size=16 \
    train.train_iters=200 \
    scheduler.lr_warmup_iters=10 \
    model.tensor_model_parallel_size=4 \
    model.sequence_parallel=True
```
