
Commit c468d67 (parent: c8a5043)

TensorRT 10.6-GA OSS Release (#4238)
Signed-off-by: Kevin Chen <[email protected]>

File tree: 110 files changed, +8376 −1788 lines


.gitmodules (+1 −1)

@@ -9,4 +9,4 @@
 [submodule "parsers/onnx"]
 	path = parsers/onnx
 	url = https://github.com/onnx/onnx-tensorrt.git
-	branch = main
+	branch = release/10.6-GA
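Since the submodule now tracks a release branch rather than `main`, a checkout of this commit typically needs its submodule re-synced. A minimal sketch using standard git commands (not part of the diff):

```bash
# Re-read .gitmodules and refresh the pinned onnx-tensorrt checkout.
git submodule sync parsers/onnx
git submodule update --init --recursive
```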

CHANGELOG.md (+32)

@@ -1,5 +1,37 @@
 # TensorRT OSS Release Changelog

+## 10.6.0 GA - 2024-11-05
+Key Features and Updates:
+- Demo Changes
+  - demoBERT: The use of `fcPlugin` in demoBERT has been removed.
+  - demoBERT: All TensorRT plugins used in demoBERT (`CustomEmbLayerNormDynamic`, `CustomSkipLayerNormDynamic`, and `CustomQKVToContextDynamic`) now have versions that inherit from the `IPluginV3` interface classes. Users can opt in to these V3 plugins by passing `--use-v3-plugins` to the builder scripts.
+    - Opting in to the V3 plugins does not affect performance, I/O, or plugin attributes.
+    - There is a known issue in the V3 (version 4) `CustomQKVToContextDynamic` plugin in TensorRT 10.6.0 that causes an internal assertion error if either the batch or sequence dimension differs at runtime from the ones used to serialize the engine. See the "known issues" section of the [TensorRT 10.6.0 release notes](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/index.html#rel-10-6-0).
+    - For a smoother migration, the default behavior is still to use the deprecated `IPluginV2DynamicExt`-derived plugins when `--use-v3-plugins` is not specified in the builder scripts. The flag `--use-deprecated-plugins` was added as an explicit way to enforce the default behavior and is mutually exclusive with `--use-v3-plugins`.
+  - demoDiffusion
+    - Introduced BF16 and FP8 support for the [Flux.1-dev](demo/Diffusion#generate-an-image-guided-by-a-text-prompt-using-flux) pipeline.
+    - Expanded FP8 support on Ada platforms.
+    - Enabled LoRA adapter compatibility for SDv1.5, SDv2.1, and SDXL pipelines using Diffusers version 0.30.3.
+
+- Sample Changes
+  - Added the Python sample [quickly_deployable_plugins](samples/python/quickly_deployable_plugins), which demonstrates quickly deployable Python-based plugin definitions (QDPs) in TensorRT. QDPs are a simple, intuitive decorator-based approach to defining TensorRT plugins that requires drastically less code (a sketch follows this diff).
+
+- Plugin Changes
+  - `fcPlugin` has been deprecated. Its functionality is superseded by the [IMatrixMultiplyLayer](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_matrix_multiply_layer.html), which TensorRT provides natively.
+  - Migrated the `IPluginV2`-descendent version 1 of `CustomEmbLayerNormDynamic` to version 6, which implements `IPluginV3`.
+    - The newer versions preserve the attributes and I/O of the corresponding older plugin versions.
+    - The older plugin versions are deprecated and will be removed in a future release.
+
+- Parser Changes
+  - Updated the ONNX submodule to version 1.17.0.
+  - Fixed an issue where conditional layers were incorrectly being added.
+  - Updated local function metadata to contain more information.
+  - Added support for parsing nodes with Quickly Deployable Plugins.
+  - Fixed handling of optional outputs.
+
+- Tool Updates
+  - ONNX-GraphSurgeon updated to version 0.5.3.
+  - Polygraphy updated to version 0.49.14.
+
 ## 10.5.0 GA - 2024-09-30
 Key Features and Updates:
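To give a flavor of the QDP approach mentioned above, here is a minimal sketch. It assumes the decorator-based `tensorrt.plugin` module that the new quickly_deployable_plugins sample demonstrates; the plugin name `sample::elemwise_add` and the function names are illustrative, and the kernel launch is elided:

```python
from typing import Tuple

import tensorrt.plugin as trtp


@trtp.register("sample::elemwise_add")
def add_plugin_desc(inp0: trtp.TensorDesc, block_size: int) -> trtp.TensorDesc:
    # Shape/dtype inference: declare one output shaped and typed like the input.
    return inp0.like()


@trtp.impl("sample::elemwise_add")
def add_plugin_impl(
    inp0: trtp.Tensor, block_size: int, outputs: Tuple[trtp.Tensor, ...], stream: int
) -> None:
    # Launch the actual computation here (e.g., a Triton or CUDA kernel) using
    # the device buffers behind inp0 and outputs[0] on the given stream.
    ...
```

Compared with a full `IPluginV3` implementation, the two decorated functions stand in for roughly what would otherwise be a plugin class, a creator class, and registration boilerplate.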

README.md (+9 −9)

@@ -26,7 +26,7 @@ You can skip the **Build** section to enjoy TensorRT with Python.
 To build the TensorRT-OSS components, you will first need the following software packages.

 **TensorRT GA build**
-* TensorRT v10.5.0.18
+* TensorRT v10.6.0.26
 * Available from direct download links listed below

 **System Packages**

@@ -73,25 +73,25 @@ To build the TensorRT-OSS components, you will first need the following software
 If using the TensorRT OSS build container, TensorRT libraries are preinstalled under `/usr/lib/x86_64-linux-gnu` and you may skip this step.

 Else download and extract the TensorRT GA build from [NVIDIA Developer Zone](https://developer.nvidia.com) with the direct links below:
-- [TensorRT 10.5.0.18 for CUDA 11.8, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.5.0/tars/TensorRT-10.5.0.18.Linux.x86_64-gnu.cuda-11.8.tar.gz)
-- [TensorRT 10.5.0.18 for CUDA 12.6, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.5.0/tars/TensorRT-10.5.0.18.Linux.x86_64-gnu.cuda-12.6.tar.gz)
-- [TensorRT 10.5.0.18 for CUDA 11.8, Windows x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.5.0/zip/TensorRT-10.5.0.18.Windows.win10.cuda-11.8.zip)
-- [TensorRT 10.5.0.18 for CUDA 12.6, Windows x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.5.0/zip/TensorRT-10.5.0.18.Windows.win10.cuda-12.6.zip)
+- [TensorRT 10.6.0.26 for CUDA 11.8, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.6.0/tars/TensorRT-10.6.0.26.Linux.x86_64-gnu.cuda-11.8.tar.gz)
+- [TensorRT 10.6.0.26 for CUDA 12.6, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.6.0/tars/TensorRT-10.6.0.26.Linux.x86_64-gnu.cuda-12.6.tar.gz)
+- [TensorRT 10.6.0.26 for CUDA 11.8, Windows x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.6.0/zip/TensorRT-10.6.0.26.Windows.win10.cuda-11.8.zip)
+- [TensorRT 10.6.0.26 for CUDA 12.6, Windows x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.6.0/zip/TensorRT-10.6.0.26.Windows.win10.cuda-12.6.zip)

 **Example: Ubuntu 20.04 on x86-64 with cuda-12.6**

 ```bash
 cd ~/Downloads
-tar -xvzf TensorRT-10.5.0.18.Linux.x86_64-gnu.cuda-12.6.tar.gz
-export TRT_LIBPATH=`pwd`/TensorRT-10.5.0.18
+tar -xvzf TensorRT-10.6.0.26.Linux.x86_64-gnu.cuda-12.6.tar.gz
+export TRT_LIBPATH=`pwd`/TensorRT-10.6.0.26
 ```

 **Example: Windows on x86-64 with cuda-12.6**

 ```powershell
-Expand-Archive -Path TensorRT-10.5.0.18.Windows.win10.cuda-12.6.zip
-$env:TRT_LIBPATH="$pwd\TensorRT-10.5.0.18\lib"
+Expand-Archive -Path TensorRT-10.6.0.26.Windows.win10.cuda-12.6.zip
+$env:TRT_LIBPATH="$pwd\TensorRT-10.6.0.26\lib"
 ```

 ## Setting Up The Build Environment
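As a quick sanity check after extraction (a sketch, assuming the matching TensorRT 10.6 Python wheel is installed), the Python bindings should report the new version:

```bash
# Expect a 10.6.0.x version string when the 10.6 GA build is active.
python3 -c "import tensorrt; print(tensorrt.__version__)"
```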

VERSION (+1 −1)

@@ -1 +1 @@
-10.5.0.18
+10.6.0.26

demo/BERT/README.md (+16 −13; most of the changed lines only strip trailing whitespace and are shown below as unchanged context)

@@ -75,7 +75,7 @@ The following software version configuration has been tested:
 |Software|Version|
 |--------|-------|
 |Python|>=3.8|
-|TensorRT|10.5.0.18|
+|TensorRT|10.6.0.26|
 |CUDA|12.6|

 ## Setup
@@ -122,7 +122,7 @@ This demo BERT application can be run within the TensorRT OSS build container. I
 bash scripts/download_model.sh
 ```

 **Note:** Since the datasets and checkpoints are stored in the directory mounted from the host, they do *not* need to be downloaded each time the container is launched.

 **Warning:** If you encounter an error message stating, "Missing API key and missing Email Authentication. This command requires an API key or authentication via browser login", the recommended steps for resolution are as follows:
 * Generate an API key by logging in at https://ngc.nvidia.com/setup/api-key and copy the generated API key.

@@ -153,11 +153,11 @@ Completing these steps should resolve the error you encountered and allow the co
 jupyter notebook --ip 0.0.0.0 inference.ipynb
 ```
 Then, use your browser to open the link displayed. The link should look similar to: `http://127.0.0.1:8888/?token=<TOKEN>`

 6. Run inference with CUDA Graph support.

 A separate Python script, `inference_c.py`, is provided to run inference with CUDA Graph support. This is necessary because CUDA Graph is only supported through the CUDA C/C++ APIs, not pyCUDA. The `inference_c.py` script uses pybind11 to interface with C/C++ for CUDA Graph capturing and launching. The command-line interface is the same as that of `inference.py`, except for an extra `--enable-graph` option.

 ```bash
 mkdir -p build; pushd build
 cmake .. -DPYTHON_EXECUTABLE=$(which python)

@@ -167,11 +167,11 @@ Completing these steps should resolve the error you encountered and allow the co
 ```

 A separate C/C++ inference benchmark executable `perf` (compiled from `perf.cpp`) is provided to run inference benchmarks with CUDA Graph. The command-line interface is the same as that of `perf.py`, except for an extra `--enable_graph` option.

 ```bash
 build/perf -e engines/bert_large_128.engine -b 1 -s 128 -w 100 -i 1000 --enable_graph
 ```

 ### (Optional) Trying a different configuration
@@ -220,6 +220,9 @@ The `infer_c/` folder contains all the necessary C/C++ files required for CUDA G

 To view the available parameters for each script, you can use the help flag (`-h`).

+**Note:** In the builder scripts (`builder.py` and `builder_varseqlen.py`), the options `--use-deprecated-plugins` and `--use-v3-plugins` toggle the underlying implementation of the plugins used in demoBERT. They are mutually exclusive, and enabling either should not affect functionality or performance. `--use-deprecated-plugins` selects plugin versions that inherit from `IPluginV2DynamicExt`, while `--use-v3-plugins` selects plugin versions that inherit from the `IPluginV3` interface classes. If unspecified, `--use-deprecated-plugins` is used by default (an example invocation follows this hunk).
+
 ### TensorRT inference process

 As mentioned in the [Quick Start Guide](#quick-start-guide), two options are provided for running inference:
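As an illustration of the note above, a hedged example build: it reuses the fine-tuned checkpoint paths from the commands below and simply appends the opt-in flag; the output engine name `bert_large_384_fp16_v3.engine` is arbitrary:

```bash
# Build with the IPluginV3-based plugin versions (opt-in). Omitting the flag, or
# passing --use-deprecated-plugins instead, keeps the IPluginV2DynamicExt default.
mkdir -p engines && python3 builder.py -m models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1/model.ckpt -o engines/bert_large_384_fp16_v3.engine -b 1 -s 384 --fp16 -c models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1 --use-v3-plugins
```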
@@ -245,7 +248,7 @@ As mentioned in the [Quick Start Guide](#quick-start-guide), two options are pro
 **Xavier GPU**
 ```bash
 # Only supports SkipLayerNormPlugin running with INT8 I/O. Use -iln builder flag to enable.
 mkdir -p engines && python3 builder.py -m models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1/model.ckpt -o engines/bert_large_384_int8mix.engine -b 1 -s 384 --int8 --fp16 --strict -c models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1 --squad-json ./squad/train-v1.1.json -v models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1/vocab.txt --calib-num 100 -iln
 ```

 **Volta GPU**
@@ -278,13 +281,13 @@ As mentioned in the [Quick Start Guide](#quick-start-guide), two options are pro
 **Xavier GPU**
 ```bash
 # Only supports SkipLayerNormPlugin running with INT8 I/O. Use -iln builder flag to enable.
 mkdir -p engines && python3 builder.py -o engines/bert_large_384_int8mix.engine -b 1 -s 384 --int8 --fp16 --strict -c models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1 -v models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1/vocab.txt -x models/fine-tuned/bert_pyt_onnx_large_qa_squad11_amp_fake_quant_v1/bert_large_v1_1_fake_quant.onnx -iln
 ```

 **Volta GPU**
 ```bash
 # No support for QKVToContextPlugin or SkipLayerNormPlugin running with INT8 I/O. Don't specify -imh or -iln in builder flags.
 mkdir -p engines && python3 builder.py -o engines/bert_large_384_int8mix.engine -b 1 -s 384 --int8 --fp16 --strict -c models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1 -v models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1/vocab.txt -x models/fine-tuned/bert_pyt_onnx_large_qa_squad11_amp_fake_quant_v1/bert_large_v1_1_fake_quant.onnx
 ```

 This will build an engine with a maximum batch size of 1 (`-b 1`) and a sequence length of 384 (`-s 384`), using INT8 mixed-precision computation where possible (`--int8 --fp16 --strict`).
@@ -324,10 +327,10 @@ Note this is an experimental feature because we only support Xavier+ GPUs, also

 This will build an engine with a maximum batch size of 1 (`-b 1`) and a sequence length of 256 (`-s 256`), using INT8 precision computation where possible (`--int8`).

 3. Run inference

 Evaluate the F1 score and exact match score using the SQuAD dataset:

 ```bash
 python3 inference_varseqlen.py -e engines/bert_varseq_int8.engine -s 256 -sq ./squad/dev-v1.1.json -v models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1/vocab.txt -o ./predictions.json
 python3 squad/evaluate-v1.1.py squad/dev-v1.1.json ./predictions.json 90

@@ -345,11 +348,11 @@ Note this is an experimental feature because we only support Xavier+ GPUs, also
 python3 perf_varseqlen.py -e engines/bert_varseq_int8.engine -b 1 -s 256
 ```

 This will collect performance data using a batch size of 1 (`-b 1`) and a sequence length of 256 (`-s 256`).

 5. Collect performance data with CUDA Graph enabled

 We can use the same `inference_c.py` and `build/perf` to collect performance data with CUDA Graph enabled. The command line is the same as for runs without variable sequence length.

 ### Sparsity with Quantization Aware Training
