diff --git a/docs/source/deployment-guide/deployment-guide-for-nemotron-3-super-on-trtllm.md b/docs/source/deployment-guide/deployment-guide-for-nemotron-3-super-on-trtllm.md new file mode 100644 index 00000000000..0a2e03b2e9c --- /dev/null +++ b/docs/source/deployment-guide/deployment-guide-for-nemotron-3-super-on-trtllm.md @@ -0,0 +1,268 @@ +# Deployment Guide for Nemotron v3 Super on TensorRT LLM - Blackwell & Hopper Hardware

## Introduction

This deployment guide provides step-by-step instructions for running the NVIDIA Nemotron v3 Super 120B-A12B model using TensorRT LLM. Nemotron v3 Super is a hybrid-architecture model that combines Mixture-of-Experts (MoE) layers with SSM (Mamba) and attention layers, delivering 120B total parameters with only 12B active parameters per token for efficient inference. This guide covers model access, environment setup, server configuration, and inference validation.

## Prerequisites

* GPU: NVIDIA Blackwell or Hopper Architecture
* OS: Linux
* Drivers: CUDA Driver 575 or later
* Docker with NVIDIA Container Toolkit installed
* Python3 and python3-pip (optional, for accuracy evaluation only)

## Models

* [NVIDIA-Nemotron-3-Super-120B-A12B-Base-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-Base-BF16)
* [NVIDIA-Nemotron-3-Super-120B-A12B-FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8)
* [NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4)

All models are available under the [nvidia/nvidia-nemotron-v3](https://huggingface.co/collections/nvidia/nvidia-nemotron-v3) collection on Hugging Face.

## GPU Requirements

Nemotron v3 Super 120B-A12B has 120B total parameters. The minimum number of GPUs required depends on the checkpoint precision:

| Checkpoint | Minimum GPUs (H100 80GB / H200 141GB) | Minimum GPUs (B200/GB200 192GB) |
|------------|---------------------------------------|---------------------------------|
| BF16 | 4x H100/H200 | 2x B200/GB200 |
| NVFP4 | 2x H100/H200 | 1x B200/GB200 |

## Deployment Steps

### Run Docker Container

Run the Docker container using the TensorRT LLM NVIDIA NGC image.

```shell
docker run --rm -it \
--ipc=host \
--gpus all \
-p 8000:8000 \
-v ~/.cache:/root/.cache:rw \
--name tensorrt_llm \
nvcr.io/nvidia/tensorrt-llm/release:x.y.z \
/bin/bash
```

Note:

* The command mounts your user `.cache` directory into the container. Model checkpoints are downloaded to `~/.cache/huggingface/hub/` by default, so this prevents having to redownload the weights each time you rerun the container. If the `~/.cache` directory doesn't exist, create it with `mkdir ~/.cache`.
* You can mount additional directories and paths using the `-v <host_path>:<container_path>` flag if needed, for example to mount previously downloaded weight paths.
* The command also maps port `8000` from the container to your host so you can access the LLM API endpoint from your host.
* See the TensorRT LLM container page on NVIDIA NGC for all the available containers. The containers published weekly from the main branch have an `rcN` suffix, while the monthly release that goes through QA testing has no `rcN` suffix. Use an `rc` release to get the latest model and feature support.
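
Optionally, once inside the container, you can run a quick sanity check to confirm that the GPUs are visible and that the TensorRT LLM Python package imports cleanly; this is a convenience step, not a requirement:

```shell
# Optional sanity check from inside the container
nvidia-smi
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```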

If you want to use the latest main branch, you can instead build and install TensorRT LLM from source; see [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html) for the steps.

### Recommended Performance Settings

We maintain YAML configuration files with recommended performance settings in the [`examples/configs`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs) directory. These config files are present in the TensorRT LLM container at the path `/app/tensorrt_llm/examples/configs`. You can use them out of the box, or adjust them to your specific use case.

```shell
TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/nemotron-3-super-throughput.yaml
```

Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.

````{admonition} Show code
:class: dropdown

```{literalinclude} ../../../examples/configs/curated/nemotron-3-super-throughput.yaml
---
language: shell
prepend: |
  EXTRA_LLM_API_FILE=/tmp/config.yml

  cat << EOF > ${EXTRA_LLM_API_FILE}
append: EOF
---
```
````

### Launch the TensorRT LLM Server

Below are example commands to launch the TensorRT LLM server with the Nemotron v3 Super model from within the container.

**NVFP4 model (recommended, lowest memory footprint):**

```shell
trtllm-serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 --host 0.0.0.0 --port 8000 --config ${EXTRA_LLM_API_FILE}
```

After the server is set up, clients can send prompt requests to it and receive results.

### LLM API Options (YAML Configuration)

These options provide control over TensorRT LLM's behavior and are set within the YAML file passed to the `trtllm-serve` command via the `--config` argument.

#### `tensor_parallel_size`

* **Description:** Sets the **tensor-parallel size**. This should typically match the number of GPUs you intend to use for a single model instance. For BF16, use 4 or more GPUs on H100/H200. For NVFP4, 2 GPUs on H100/H200 may suffice.

#### `moe_expert_parallel_size`

* **Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tensor_parallel_size`, this should generally match the number of GPUs you're using.

#### `kv_cache_free_gpu_memory_fraction`

* **Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
* **Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower.

#### `max_batch_size`

* **Description:** The maximum number of user requests that can be grouped into a single batch for processing. The actual batch size that can be achieved depends on the total sequence length (input + output).

#### `max_num_tokens`

* **Description:** The maximum total number of tokens (across all requests) allowed inside a single scheduled batch.

#### `max_seq_len`

* **Description:** The maximum possible sequence length for a single request, including both input and generated output tokens. This guide does not set it explicitly; it is inferred from the model config.
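
If you do want to bound it explicitly, for example to trade maximum context length for more KV cache headroom, one option is to add `max_seq_len` to a copy of the curated config; the file path and value below are illustrative only:

```shell
# Illustrative only: copy the curated config and cap the sequence length explicitly
cp ${EXTRA_LLM_API_FILE} /tmp/nemotron-3-super-custom.yaml
cat >> /tmp/nemotron-3-super-custom.yaml << EOF
max_seq_len: 65536
EOF
EXTRA_LLM_API_FILE=/tmp/nemotron-3-super-custom.yaml
```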
+ +#### `trust_remote_code` +* **Description:** Allows TensorRT LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API. + +#### `cuda_graph_config` + +* **Description**: A section for configuring CUDA graphs to optimize performance. + +* **Options**: + + * `enable_padding`: If `true`, input batches are padded to the nearest `cuda_graph_batch_size`. This can significantly improve performance. + + **Default**: `false` + + * `batch_sizes`: List of batch sizes for which CUDA graphs will be pre-captured. + + **Recommendation**: Set this to cover the range of batch sizes you expect in production. + +See the [`TorchLlmArgs` class](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.TorchLlmArgs) for the full list of options which can be used in the YAML configuration file. + +## Testing API Endpoint + +### Basic Test + +Start a new terminal on the host to test the TensorRT LLM server you just launched. + +You can query the health/readiness of the server using: + +```shell +curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health" +``` + +When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation. + +After the TensorRT LLM server is set up and shows Application startup complete, you can send requests to the server. + +```shell +curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ + "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4", + "messages": [ + { + "role": "user", + "content": "What is the capital of France?" + } + ], + "max_tokens": 512, + "temperature": 0.7, + "top_p": 0.95 +}' -w "\n" +``` + +Here is an example response: + +```json +{ + "id": "chatcmpl-abc123def456", + "object": "chat.completion", + "created": 1759022940, + "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "The capital of France is Paris. Paris is not only the capital but also the largest city in France, known for its rich history, culture, art, and iconic landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral." + }, + "logprobs": null, + "finish_reason": "stop" + } + ], + "usage": { + "prompt_tokens": 15, + "completion_tokens": 58, + "total_tokens": 73 + } +} +``` + +### Troubleshooting Tips + +* If you encounter CUDA out-of-memory errors, try reducing `max_batch_size`, `max_num_tokens`, or `kv_cache_free_gpu_memory_fraction`. +* Ensure your model checkpoints are compatible with the expected format. +* For performance issues, check GPU utilization with `nvidia-smi` while the server is running. +* If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed. +* For connection issues, make sure the server port (`8000` in this guide) is not being used by another application. +* Nemotron v3 Super is a hybrid SSM/attention model with MoE — ensure you have sufficient GPU memory for the full 120B parameter weights even though only 12B parameters are active per token. + +## Benchmarking Performance + +To benchmark the performance of your TensorRT LLM server you can leverage the built-in `benchmark_serving.py` script. To do this, first create a wrapper `bench.sh` script. 
+ +```shell +cat <<'EOF' > bench.sh +#!/usr/bin/env bash +set -euo pipefail + +# Adjust the model name based on which Nemotron v3 Super variant you're benchmarking +MODEL_NAME="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4" + +concurrency_list="1 2 4 8 16 32 64 128" +multi_round=5 +isl=1024 +osl=1024 +result_dir=/tmp/nemotron_super_output + +for concurrency in ${concurrency_list}; do + num_prompts=$((concurrency * multi_round)) + python -m tensorrt_llm.serve.scripts.benchmark_serving \ + --model ${MODEL_NAME} \ + --backend openai \ + --dataset-name "random" \ + --random-input-len ${isl} \ + --random-output-len ${osl} \ + --random-prefix-len 0 \ + --random-ids \ + --num-prompts ${num_prompts} \ + --max-concurrency ${concurrency} \ + --ignore-eos \ + --tokenize-on-client \ + --percentile-metrics "ttft,tpot,itl,e2el" +done +EOF +chmod +x bench.sh +``` + +To achieve max throughput, with attention DP on, one needs to sweep up to `concurrency = max_batch_size * num_gpus`. + +If you want to save the results to a file add the following options. + +```shell +--save-result \ +--result-dir "${result_dir}" \ +--result-filename "concurrency_${concurrency}.json" +``` + +For more benchmarking options see [benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py) + +Run `bench.sh` to begin a serving benchmark. This will take a long time if you run all the concurrencies mentioned in the above `bench.sh` script. + +```shell +./bench.sh +``` diff --git a/docs/source/deployment-guide/index.rst b/docs/source/deployment-guide/index.rst index 93e70564b77..0025b73f74f 100644 --- a/docs/source/deployment-guide/index.rst +++ b/docs/source/deployment-guide/index.rst @@ -32,6 +32,11 @@ This table is designed to provide a straightforward starting point; for detailed - Inference Scenario - Config - Command + * - `Nemotron v3 Super (NVFP4) `_ + - B200, GB200 + - Max Throughput + - `nemotron-3-super-throughput.yaml `_ + - ``trtllm-serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 --config ${TRTLLM_DIR}/examples/configs/curated/nemotron-3-super-throughput.yaml`` * - `DeepSeek-R1 `_ - H100, H200 - Max Throughput @@ -97,6 +102,7 @@ The deployment guides below provide more detailed instructions for serving speci :maxdepth: 1 :name: Deployment Guides + deployment-guide-for-nemotron-3-super-on-trtllm.md deployment-guide-for-deepseek-r1-on-trtllm.md deployment-guide-for-llama3.3-70b-on-trtllm.md deployment-guide-for-llama4-scout-on-trtllm.md diff --git a/docs/source/models/supported-models.md b/docs/source/models/supported-models.md index 74e38b380aa..a8565a7a8fc 100644 --- a/docs/source/models/supported-models.md +++ b/docs/source/models/supported-models.md @@ -23,7 +23,7 @@ The following is a table of supported models for the PyTorch backend: | `MixtralForCausalLM` | Mixtral | `mistralai/Mixtral-8x7B-v0.1` | | `MllamaForConditionalGeneration` | Llama 3.2 | `meta-llama/Llama-3.2-11B-Vision` | | `NemotronForCausalLM` | Nemotron-3, Nemotron-4, Minitron | `nvidia/Minitron-8B-Base` | -| `NemotronHForCausalLM` | Nemotron-3-Nano | `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8` | +| `NemotronHForCausalLM` | Nemotron-3-Nano, Nemotron-3-Super | `nvidia/nvidia-nemotron-v3` | | `NemotronNASForCausalLM` | NemotronNAS | `nvidia/Llama-3_3-Nemotron-Super-49B-v1` | | `Phi3ForCausalLM` | Phi-4 | `microsoft/Phi-4` | | `Qwen2ForCausalLM` | QwQ, Qwen2 | `Qwen/Qwen2-7B-Instruct` | @@ -50,6 +50,7 @@ Note: Support for other models may vary. 
Features marked "N/A" are not applicabl | `GptOssForCausalLM` | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes [^4] | Yes | Yes | Yes | N/A | Yes | Yes | | `Qwen3_5MoeForCausalLM` [^5] | Yes | Yes | Untested | Untested | Yes | No | No | No | Yes | Untested | Yes | N/A | Untested | Untested | | `Glm4MoeLiteForCausalLM` [^6] | Yes | Yes | Untested | Untested | Yes | No | No | No | Yes | Untested | Untested | N/A | Untested | Untested | +| `NemotronHForCausalLM` (Super) | Yes | Yes | Untested | Untested | Yes | Yes | No | No | Yes | Yes | Untested | N/A | Untested | Untested | [^1]: Chunked Prefill for MLA can only be enabled on SM100/SM103. [^2]: KV cache reuse for MLA can only be enabled on SM90/SM100/SM103 and in BF16/FP8 KV cache dtype. diff --git a/examples/configs/curated/nemotron-3-super-throughput.yaml b/examples/configs/curated/nemotron-3-super-throughput.yaml new file mode 100644 index 00000000000..835f6451f9c --- /dev/null +++ b/examples/configs/curated/nemotron-3-super-throughput.yaml @@ -0,0 +1,13 @@ +max_batch_size: 512 +max_num_tokens: 2048 +tensor_parallel_size: 4 +moe_expert_parallel_size: 4 +trust_remote_code: true +enable_attention_dp: true +cuda_graph_config: + enable_padding: true + max_batch_size: 256 +kv_cache_config: + free_gpu_memory_fraction: 0.8 + enable_block_reuse: false +num_postprocess_workers: 4 diff --git a/examples/models/core/nemotron/README_nemotron_super_v3.md b/examples/models/core/nemotron/README_nemotron_super_v3.md new file mode 100644 index 00000000000..077134bf50b --- /dev/null +++ b/examples/models/core/nemotron/README_nemotron_super_v3.md @@ -0,0 +1,197 @@ +# Nemotron Super V3 model + +## Table of Contents + +- [Overview](#overview) +- [Supported Hardware](#supported-hardware) +- [Usage](#usage) + - [Online serving example](#online-serving-example) + - [DGX Spark](#dgx-spark) + - [SSM Stochastic Rounding with MTP](#ssm-stochastic-rounding-with-mtp) + - [Offline inference example](#offline-inference-example) +- [Notes](#notes) + +## Overview + +The Nemotron Super V3 model uses a hybrid Mamba-Transformer MoE architecture with 120B total +parameters and only 12B active parameters per token, delivering efficient high-throughput inference. +It supports long context lengths and is optimized for complex, multi-document, and long-duration +applications. + +This document outlines the procedures for executing Nemotron Super V3 using TensorRT LLM. The +implementation supports both single and multi-GPU configurations via the PyTorch backend. +Additionally, ModelOpt was employed to derive NVFP4 checkpoints from the source checkpoint. +The model repositories are: +* [Base BF16 repository](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-Base-BF16) +* [BF16 repository](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16) +* [NVFP4 repository](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4) + +All models are available under the [nvidia/nvidia-nemotron-v3](https://huggingface.co/collections/nvidia/nvidia-nemotron-v3) collection on Hugging Face. + +Nemotron Super V3 supports the following features: +* BF16, NVFP4 model formats. +* Single and multi-GPU inference. +* Mixture-of-Experts (MoE) with expert parallelism. +* Hybrid SSM (Mamba) + Attention architecture. +* MTP (Multi-Token Prediction) speculative decoding. 
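
If you prefer to fetch a checkpoint ahead of time instead of letting the serving command download it on first launch, one option is the Hugging Face CLI (assuming `huggingface_hub` is installed in your environment); the NVFP4 repository is used here as an example:

```sh
# Optional: pre-download the NVFP4 checkpoint into the local Hugging Face cache
pip install -U "huggingface_hub[cli]"
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
```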

## Supported Hardware
- **NVIDIA Blackwell**: B200, GB200, DGX Spark
- **NVIDIA Hopper**: H100, H200

## Usage

### Online serving example

The server examples below follow the curated configuration file [nemotron-3-super-throughput.yaml](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/nemotron-3-super-throughput.yaml).

For the server:

```sh
# Example configuration (matching examples/configs/curated/nemotron-3-super-throughput.yaml):
cat > nemotron_super_v3.yaml << EOF
max_batch_size: 512
max_num_tokens: 2048
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
enable_attention_dp: true
cuda_graph_config:
  enable_padding: true
  max_batch_size: 256
kv_cache_config:
  free_gpu_memory_fraction: 0.8
  enable_block_reuse: false
num_postprocess_workers: 4
EOF

trtllm-serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--host 0.0.0.0 \
--port 8000 \
--config nemotron_super_v3.yaml
```

#### DGX Spark

For DGX Spark, a lighter single-GPU configuration can be used:

```sh
cat > ./extra-llm-api-config.yml << EOF
kv_cache_config:
  enable_block_reuse: false
cuda_graph_config:
  max_batch_size: 32
  enable_padding: true
moe_config:
  backend: CUTLASS
EOF

trtllm-serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--host 0.0.0.0 \
--port 8000 \
--extra_llm_api_options ./extra-llm-api-config.yml
```

#### SSM Stochastic Rounding with MTP

For long-context or high-throughput scenarios, enabling SSM stochastic rounding can improve output
quality by reducing numerical drift in the Mamba SSM state accumulation. This configuration also
enables MTP speculative decoding and chunked prefill for optimal performance.

```sh
cat > nemotron_super_v3_mtp.yaml << EOF
trust_remote_code: true
kv_cache_config:
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.8
  mamba_ssm_cache_dtype: float16
  mamba_ssm_stochastic_rounding: true
  mamba_ssm_philox_rounds: 5
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 5
  allow_advanced_sampling: true
cuda_graph_config:
  max_batch_size: 64
  enable_padding: true
moe_config:
  backend: TRTLLM
stream_interval: 10
enable_chunked_prefill: true
enable_attention_dp: true
num_postprocess_workers: 4
EOF

trtllm-serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--host 0.0.0.0 \
--port 8000 \
--config nemotron_super_v3_mtp.yaml
```

Key options:
* `mamba_ssm_stochastic_rounding`: Enables stochastic rounding for SSM state updates, improving numerical stability for long sequences.
* `mamba_ssm_philox_rounds`: Number of Philox RNG rounds used for stochastic rounding.
* `mamba_ssm_cache_dtype`: Sets the data type for the Mamba SSM cache.
* `speculative_config`: Enables MTP with next-token prediction layers and advanced sampling.
* `enable_chunked_prefill`: Enables chunked prefill for better memory efficiency.

### Offline inference example

Using the `quickstart_advanced.py` script with MTP (Multi-Token Prediction) speculative decoding:

```sh
python3 examples/llm-api/quickstart_advanced.py \
    --model_dir nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --disable_kv_cache_reuse \
    --max_batch_size=128 \
    --moe_backend=TRTLLM \
    --spec_decode_algo=MTP \
    --spec_decode_max_draft_len=3 \
    --use_one_model \
    --tp_size=8 \
    --moe_ep_size 8 \
    --apply_chat_template
```

Key options:
* `--spec_decode_algo=MTP --spec_decode_max_draft_len=3`: Enables MTP speculative decoding with 3 draft tokens for faster generation.
* `--tp_size=8 --moe_ep_size 8`: Uses 8-way tensor parallelism and expert parallelism.
* `--moe_backend=TRTLLM`: Uses the optimized TensorRT LLM MoE backend.
* `--apply_chat_template`: Applies the model's chat template.
* `--disable_kv_cache_reuse`: Required for hybrid SSM models.

## Notes

* Prefix caching (KV cache block reuse) is not supported for Nemotron Super V3 yet, so set `enable_block_reuse: false` when launching a server.
* For detailed deployment instructions, see the [deployment guide](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/deployment-guide/deployment-guide-for-nemotron-3-super-on-trtllm.md).
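
As a quick end-to-end check for any of the serving examples above, you can send a request to the OpenAI-compatible endpoint once the server reports that startup is complete; the prompt below is only an example:

```sh
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
  "messages": [{"role": "user", "content": "What is the capital of France?"}],
  "max_tokens": 64
}'
```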