# Nemotron 3 Super

[Nemotron 3 Super](https://huggingface.co/collections/nvidia/nvidia-nemotron-v3) is a large language model (LLM) trained by NVIDIA, designed to deliver strong agentic, reasoning, and conversational capabilities. It employs a hybrid **Latent Mixture-of-Experts (LatentMoE)** architecture, interleaving Mamba-2 and MoE layers along with select Attention layers. Unlike the Nano model, the Super model incorporates **Multi-Token Prediction (MTP)** layers for faster text generation and improved quality, and it is trained using **NVFP4** quantization to maximize compute efficiency. The model has **12B active parameters** and **120B parameters in total**.

NeMo Megatron Bridge supports pretraining, full-parameter fine-tuning, and LoRA fine-tuning of this model. The fine-tuned model can be converted back to the 🤗 Hugging Face format for downstream evaluation.

```{important}
Please use the custom container `nvcr.io/nvidia/nemo:26.02.nemotron_3_super` when working with this model.

Run all commands from `/opt/Megatron-Bridge` (e.g. `docker run -w /opt/Megatron-Bridge ...`).
```

## Getting the Latest Code

For the best experience, it is recommended to use the latest code from the `super-v3` branch. There are two ways to do this:

### Option 1: Update the Code Inside the Container

Launch the container and update the code in place:

```bash
# Pull the latest changes from the super-v3 branch
cd /opt/Megatron-Bridge
git pull origin super-v3
```

### Option 2: Mount the Repo from Host

This approach lets you work with the code on your host machine and mount it into the container at runtime.

**Step 1 — Pull the latest `super-v3` branch on the host:**

```bash
git checkout super-v3 && git pull origin super-v3
```

**Step 2 — Mount the repo when launching the container:**

```bash
MEGATRON_BRIDGE_PATH=/path/to/Megatron-Bridge  # set this to your local clone

docker run --rm -it \
  -v $MEGATRON_BRIDGE_PATH:/opt/Megatron-Bridge \
  -w /opt/Megatron-Bridge \
  nvcr.io/nvidia/nemo:26.02.nemotron_3_super \
  bash
```

---

## Conversion with 🤗 Hugging Face

### Import HF → Megatron
To import the HF model into a Megatron checkpoint at `$MEGATRON_PATH`, run the following command:

```bash
HF_MODEL=/path/to/hf/model
MEGATRON_PATH=/path/to/output/megatron/ckpt

torchrun --nproc-per-node=8 examples/conversion/convert_checkpoints.py import \
  --hf-model $HF_MODEL \
  --megatron-path $MEGATRON_PATH \
  --tp 1 \
  --ep 8
```

Notes:
- The default parallelism is TP=1, EP=8 (Expert Parallel).
- Adjust `--nproc-per-node` based on your available GPUs.

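Since `--nproc-per-node` should match the GPUs visible on the node, one option is to derive it automatically. A minimal sketch (the fallback value of 8 is illustrative, and assumes `nvidia-smi` is on the `PATH` inside the container):

```shell
# Derive a GPU count from nvidia-smi; fall back to 8 when no GPUs
# (or no nvidia-smi) are found, e.g. when run outside the container.
NPROC=$(nvidia-smi -L 2>/dev/null | grep -c '^GPU' || true)
[ "$NPROC" -gt 0 ] 2>/dev/null || NPROC=8
echo "using --nproc-per-node=${NPROC}"
```

The value can then be passed as `torchrun --nproc-per-node=$NPROC ...`.
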
### Export Megatron → HF
```bash
HF_MODEL=/path/to/hf/model
MEGATRON_PATH=/path/to/trained/megatron/ckpt
OUTPUT_PATH=/path/to/output/hf/ckpt

torchrun --nproc-per-node=8 examples/conversion/convert_checkpoints.py export \
  --hf-model $HF_MODEL \
  --megatron-path $MEGATRON_PATH \
  --hf-path $OUTPUT_PATH \
  --tp 1 \
  --ep 8
```

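A quick way to sanity-check the result is to verify that the output directory contains the files a Hugging Face checkpoint needs. The helper below is a hypothetical convenience, not part of the repo, and only checks for `config.json` (the minimal marker of an HF checkpoint layout):

```shell
# check_export: hypothetical helper that verifies an exported directory
# contains a config.json before handing it to downstream evaluation.
check_export() {
  if [ -f "$1/config.json" ]; then
    echo "export looks complete: $1"
  else
    echo "config.json not found in $1" >&2
    return 1
  fi
}
```

For example, run `check_export $OUTPUT_PATH` after the export command above.
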
### Roundtrip Testing
To verify the correctness of import/export conversions:

```bash
HF_MODEL=/path/to/hf/model
MEGATRON_PATH=/path/to/megatron/ckpt

torchrun --nproc-per-node=8 examples/conversion/hf_megatron_roundtrip_multi_gpu.py \
  --hf-model-id $HF_MODEL \
  --megatron-load-path $MEGATRON_PATH \
  --tp 1 \
  --ep 8 \
  --trust-remote-code
```

### Compare HF and Megatron Outputs
To compare outputs between the HF and Megatron models:

```bash
HF_MODEL=/path/to/hf/model
MEGATRON_PATH=/path/to/megatron/ckpt

torchrun --nproc-per-node=8 examples/conversion/compare_hf_and_megatron/compare.py \
  --hf_model_path $HF_MODEL \
  --megatron_model_path $MEGATRON_PATH \
  --prompt "Hello who are " \
  --tp 8 \
  --ep 8 \
  --trust_remote_code
```

## Pretraining Examples

### Pretraining with Real Data
```bash
BLEND_PATH=/path/to/dataset/blend.json
CHECKPOINT_DIR=/path/to/checkpoints

torchrun --nproc-per-node=8 examples/models/nemotron_3/pretrain_nemotron_3_super.py \
  --per-split-data-args-path=${BLEND_PATH} \
  logger.wandb_project=your_project \
  logger.wandb_entity=nvidia \
  logger.log_interval=5 \
  checkpoint.load=${CHECKPOINT_DIR} \
  checkpoint.save=${CHECKPOINT_DIR} \
  checkpoint.save_interval=100 \
  train.global_batch_size=8 \
  train.micro_batch_size=1 \
  train.train_iters=1280 \
  scheduler.lr_warmup_iters=128 \
  scheduler.lr_decay_iters=1152 \
  scheduler.lr_wsd_decay_iters=1152 \
  model.tensor_model_parallel_size=4 \
  model.context_parallel_size=1 \
  model.expert_model_parallel_size=64 \
  model.sequence_parallel=True
```

Notes:
- **GPU Requirements**: B200 GPUs are required for NVFP4 support, with a minimum of 8 nodes (64 GPUs).
- The default parallelism settings are TP=4, EP=64, PP=1, CP=1 with sequence parallel enabled.
- Expert parallelism (EP) is set to 64 for the MoE architecture.
- Adjust batch sizes and iteration counts based on your training requirements.
- Make sure to set up WandB credentials if using WandB logging.

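In the command above, the example sets warmup plus decay iterations equal to the total training iterations. A quick check of that arithmetic (values copied from the command; this mirrors the example's choice, not a framework requirement):

```shell
# Warmup + decay iterations cover train_iters exactly in the example above.
TRAIN_ITERS=1280
WARMUP_ITERS=128
DECAY_ITERS=1152
if [ $((WARMUP_ITERS + DECAY_ITERS)) -eq "$TRAIN_ITERS" ]; then
  echo "schedule ok: ${WARMUP_ITERS} warmup + ${DECAY_ITERS} decay = ${TRAIN_ITERS}"
else
  echo "schedule mismatch" >&2
fi
```

If you change `train.train_iters`, rescale the scheduler values accordingly.
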
### Pretraining with Mock Data
For quick testing without a dataset:

```bash
CHECKPOINT_DIR=/path/to/checkpoints

torchrun --nproc-per-node=8 examples/models/nemotron_3/pretrain_nemotron_3_super.py \
  logger.wandb_project=your_project \
  logger.wandb_entity=nvidia \
  checkpoint.load=${CHECKPOINT_DIR} \
  checkpoint.save=${CHECKPOINT_DIR} \
  checkpoint.save_interval=100 \
  train.global_batch_size=128 \
  train.train_iters=100 \
  scheduler.lr_warmup_iters=10 \
  model.hybrid_override_pattern="MEME*ME" \
  model.num_layers=7
```

Notes:
- If `--per-split-data-args-path` is not specified, a mock dataset is used.
- The `hybrid_override_pattern` option can be used to customize the hybrid layer pattern.
- This setup is useful for debugging and testing the training pipeline.

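The pattern string maps one character to one layer, so its length must match `model.num_layers`. Assuming the plausible reading that `M` denotes a Mamba-2 layer, `E` an MoE layer, and `*` an Attention layer (an assumption here, not stated in this doc), counting characters confirms the example pattern matches `model.num_layers=7`:

```shell
# Count layer types in a hybrid_override_pattern string. The character
# meanings (M = Mamba-2, E = MoE, * = Attention) are an assumption.
PATTERN="MEME*ME"
N_MAMBA=$(printf '%s' "$PATTERN" | tr -cd 'M' | wc -c)
N_MOE=$(printf '%s' "$PATTERN" | tr -cd 'E' | wc -c)
N_ATTN=$(printf '%s' "$PATTERN" | tr -cd '*' | wc -c)
echo "layers=${#PATTERN} mamba=${N_MAMBA} moe=${N_MOE} attention=${N_ATTN}"
```
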
## Finetuning Recipes

### Full Parameter Fine-Tuning
```bash
MEGATRON_PATH=/path/to/pretrained/megatron/ckpt
CHECKPOINT_DIR=/path/to/finetuned/checkpoints

torchrun --nproc-per-node=8 examples/models/nemotron_3/finetune_nemotron_3_super.py \
  logger.wandb_project=your_project \
  logger.wandb_entity=nvidia \
  logger.log_interval=5 \
  checkpoint.load=${CHECKPOINT_DIR} \
  checkpoint.save=${CHECKPOINT_DIR} \
  checkpoint.save_interval=50 \
  train.global_batch_size=16 \
  train.train_iters=200 \
  scheduler.lr_warmup_iters=10 \
  model.tensor_model_parallel_size=4 \
  model.sequence_parallel=True \
  checkpoint.pretrained_checkpoint=$MEGATRON_PATH
```

Notes:
- Default parallelism is TP=4, EP=8, PP=1, CP=1 with sequence parallel enabled.
- By default, the [SQuAD](https://huggingface.co/datasets/rajpurkar/squad) dataset is used.
- Fine-tuning requires a pretrained Megatron checkpoint, which can be obtained as described in the "Import HF → Megatron" section above.
- Adjust `global_batch_size` and parallelism settings based on your GPU memory and requirements.

### LoRA Fine-Tuning
To enable LoRA fine-tuning, pass `--peft lora` to the script:

```bash
MEGATRON_PATH=/path/to/pretrained/megatron/ckpt
CHECKPOINT_DIR=/path/to/lora/checkpoints

torchrun --nproc-per-node=8 examples/models/nemotron_3/finetune_nemotron_3_super.py \
  --peft lora \
  logger.wandb_project=your_project \
  logger.wandb_entity=nvidia \
  logger.log_interval=5 \
  checkpoint.load=${CHECKPOINT_DIR} \
  checkpoint.save=${CHECKPOINT_DIR} \
  checkpoint.save_interval=100 \
  train.global_batch_size=4 \
  train.train_iters=200 \
  model.tensor_model_parallel_size=4 \
  model.context_parallel_size=2 \
  model.sequence_parallel=True \
  scheduler.lr_warmup_iters=30 \
  checkpoint.pretrained_checkpoint=$MEGATRON_PATH
```

Notes:
- By default, the target modules are the linear layers `["linear_qkv", "linear_proj", "linear_fc1", "linear_fc2", "in_proj", "out_proj"]`.
- LoRA fine-tuning uses less memory and can work with smaller batch sizes.
- Consider using Context Parallel (CP) for longer sequences.
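
The memory savings come from training only low-rank adapters instead of the full weight matrices. A back-of-the-envelope comparison for a single linear layer (the dimensions and rank below are illustrative, not taken from this model's config):

```shell
# Trainable parameters: full linear (d_in x d_out) vs. LoRA adapters
# (rank x (d_in + d_out)). Dimensions and rank are illustrative only.
D_IN=4096
D_OUT=4096
RANK=32
FULL_PARAMS=$((D_IN * D_OUT))
LORA_PARAMS=$((RANK * (D_IN + D_OUT)))
echo "full=${FULL_PARAMS} lora=${LORA_PARAMS} reduction=$((FULL_PARAMS / LORA_PARAMS))x"
```

With these illustrative numbers, the adapters hold 64x fewer trainable parameters than the full layer.
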

## Quantization (PTQ and QAT)

```{important}
Quantization support requires the latest code from the `super-v3` branch. See [Getting the Latest Code](#getting-the-latest-code) for instructions.
```

Nemotron 3 Super supports four quantization configurations:

| Config Name | Format | Description |
|---|---|---|
| `mamba_moe_fp8_aggressive` | FP8 | Aggressive FP8 quantization for Mamba-MoE |
| `mamba_moe_fp8_conservative` | FP8 | Conservative FP8 quantization for Mamba-MoE |
| `mamba_moe_nvfp4_aggressive` | NVFP4 | Aggressive NVFP4 quantization for Mamba-MoE |
| `mamba_moe_nvfp4_conservative` | NVFP4 | Conservative NVFP4 quantization for Mamba-MoE |

Pass the desired config name to `quantize.py` via `--export-quant-cfg`.

### Quantize
```bash
export HF_MODEL=/path/to/hf/model
export MEGATRON_SAVE_PATH=/path/to/quantized/megatron/ckpt

torchrun --nproc_per_node=8 examples/quantization/quantize.py \
  --hf-model-id $HF_MODEL \
  --export-quant-cfg mamba_moe_nvfp4_conservative \
  --megatron-save-path $MEGATRON_SAVE_PATH \
  --pp 1 \
  --tp 8 \
  --ep 8 \
  --trust-remote-code
```

### Verify with PTQ Generate
```bash
torchrun --nproc_per_node=8 examples/quantization/ptq_generate.py \
  --hf-model-id $HF_MODEL \
  --megatron-load-path $MEGATRON_SAVE_PATH \
  --pp 1 \
  --tp 8 \
  --ep 8 \
  --trust-remote-code
```

Notes:
- For multi-node setups (e.g. 2 nodes with 8× H100), increase `--pp` accordingly (e.g. `--pp 2`) and use a job scheduler such as SLURM to launch across nodes.

### Export Quantized Megatron Checkpoint → HF

After quantization, export the Megatron checkpoint back to the Hugging Face format:

```bash
HF_MODEL=/path/to/hf/model
MEGATRON_LOAD_PATH=/path/to/quantized/megatron/ckpt
EXPORT_DIR=/path/to/output/hf/ckpt

torchrun --nproc_per_node=8 examples/quantization/export.py \
  --hf-model-id $HF_MODEL \
  --megatron-load-path $MEGATRON_LOAD_PATH \
  --export-dir $EXPORT_DIR \
  --pp 8 \
  --dtype bfloat16 \
  --trust-remote-code
```

### Quantization-Aware Training (QAT)

After quantization, you can further improve model quality with QAT by continuing training from a quantized Megatron checkpoint.

```bash
MEGATRON_PATH=/path/to/quantized/megatron/ckpt
CHECKPOINT_DIR=/path/to/qat/checkpoints

torchrun --nproc-per-node=8 examples/models/nemotron_3/qat_nemotron_3_super.py \
  --megatron-load-path=${MEGATRON_PATH} \
  --seq-length=8192 \
  --packed-sequence \
  logger.wandb_project=your_project \
  logger.wandb_entity=nvidia \
  logger.log_interval=5 \
  checkpoint.load=${CHECKPOINT_DIR} \
  checkpoint.save=${CHECKPOINT_DIR} \
  checkpoint.save_interval=50 \
  train.global_batch_size=16 \
  train.train_iters=200 \
  scheduler.lr_warmup_iters=10 \
  model.tensor_model_parallel_size=4 \
  model.sequence_parallel=True
```