
Commit e350f6d

add docs for bf16

1 parent: 3b60c3c


docs/execution-providers/TensorRT-ExecutionProvider.md (15 additions, 2 deletions)
```diff
@@ -195,6 +195,7 @@ Ort::ThrowOnError(api.GetTensorRTProviderOptionsAsString(tensorrt_options,
 | **Precision and Performance** | | |
 | Set TensorRT EP GPU memory usage limit | [trt_max_workspace_size](./TensorRT-ExecutionProvider.md#trt_max_workspace_size) | int |
 | Enable FP16 precision for faster performance | [trt_fp16_enable](./TensorRT-ExecutionProvider.md#trt_fp16_enable) | bool |
+| Enable BF16 precision for faster performance | [trt_bf16_enable](./TensorRT-ExecutionProvider.md#trt_bf16_enable) | bool |
 | Enable INT8 precision for quantized inference | [trt_int8_enable](./TensorRT-ExecutionProvider.md#trt_int8_enable) | bool |
 | Name INT8 calibration table for non-QDQ models | [trt_int8_calibration_table_name](./TensorRT-ExecutionProvider.md#trt_int8_calibration_table_name) | string |
 | Use native TensorRT calibration tables | [trt_int8_use_native_calibration_table](./TensorRT-ExecutionProvider.md#trt_int8_use_native_calibration_table) | bool |
```
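
As context for the table above (not part of the commit itself), here is a minimal sketch of passing these provider options, including the new `trt_bf16_enable`, through the ONNX Runtime Python API. The model path is a placeholder, and `trt_bf16_enable` assumes an ORT build that includes this change:

```python
import onnxruntime as ort

# Hedged sketch: option names follow the table above; "model.onnx" is a
# placeholder model path chosen for illustration.
trt_options = {
    "trt_max_workspace_size": 2 << 30,  # 2 GiB workspace limit
    "trt_fp16_enable": False,
    "trt_bf16_enable": True,  # added by this commit
}
sess = ort.InferenceSession(
    "model.onnx",
    providers=[
        ("TensorrtExecutionProvider", trt_options),
        "CUDAExecutionProvider",  # fallback for nodes TensorRT does not take
    ],
)
print(sess.get_providers())
```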
```diff
@@ -303,6 +304,13 @@ TensorRT configurations can be set by execution provider options. It's useful wh
 
 > Note: not all Nvidia GPUs support FP16 precision.
 
+##### trt_bf16_enable
+
+* Description: enable BF16 mode in TensorRT.
+
+> Note: not all Nvidia GPUs support BF16 precision.
+
 ##### trt_int8_enable
 
 * Description: enable INT8 mode in TensorRT.
```
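
The BF16 note above can be checked programmatically: BF16 kernels generally require an Ampere-or-newer GPU (compute capability 8.0+). A hedged sketch using the optional `pynvml` package (an assumption; this doc does not reference it) to decide whether the flag is worth enabling:

```python
import pynvml

# Sketch: query GPU 0's compute capability before enabling trt_bf16_enable.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
pynvml.nvmlShutdown()

print(f"compute capability: {major}.{minor}")
enable_bf16 = major >= 8  # heuristic, not an official support matrix
```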
```diff
@@ -347,7 +355,7 @@ TensorRT configurations can be set by execution provider options. It's useful wh
 
 * The engine is cached when it is built for the first time, so the next time an inference session is created it can be loaded directly from the cache. To validate that a loaded engine is usable for the current inference, the engine profile is also cached and loaded along with the engine. If the current input shapes are within the range of the engine profile, the loaded engine can be safely used; otherwise, the profile cache is updated to cover the new shapes and the engine is recreated from the new profile (and refreshed in the engine cache).
 
-* Note that each engine is created for specific settings such as model path/name, precision (FP32/FP16/INT8, etc.), workspace, and profiles, and for a specific GPU, so it is not portable. Make sure those settings do not change; otherwise the engine must be rebuilt and cached again.
+* Note that each engine is created for specific settings such as model path/name, precision (FP32/FP16/BF16/INT8, etc.), workspace, and profiles, and for a specific GPU, so it is not portable. Make sure those settings do not change; otherwise the engine must be rebuilt and cached again.
 
 > **Warning: Please clean up any old engine and profile cache files (.engine and .profile) if any of the following changes:**
 >
```
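
Since precision flags (now including BF16) are among the settings baked into a cached engine, flipping `trt_bf16_enable` invalidates the cache. A sketch of combining caching and precision options, assuming the `trt_engine_cache_enable` / `trt_engine_cache_path` options documented elsewhere in this file and an arbitrary `./trt_cache` directory:

```python
import onnxruntime as ort

# Sketch: cache engines under ./trt_cache. If trt_bf16_enable (or any other
# precision flag) later changes, the cached engine no longer matches the
# settings and is rebuilt, per the warning above.
trt_options = {
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": "./trt_cache",
    "trt_bf16_enable": True,
}
sess = ort.InferenceSession(
    "model.onnx",  # placeholder path
    providers=[("TensorrtExecutionProvider", trt_options)],
)
```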
```diff
@@ -501,6 +509,8 @@ Following environment variables can be set for TensorRT execution provider. Clic
 
 * `ORT_TENSORRT_FP16_ENABLE`: Enable FP16 mode in TensorRT. 1: enabled, 0: disabled. Default value: 0. Note that not all Nvidia GPUs support FP16 precision.
 
+* `ORT_TENSORRT_BF16_ENABLE`: Enable BF16 mode in TensorRT. 1: enabled, 0: disabled. Default value: 0. Note that not all Nvidia GPUs support BF16 precision.
+
 * `ORT_TENSORRT_INT8_ENABLE`: Enable INT8 mode in TensorRT. 1: enabled, 0: disabled. Default value: 0. Note that not all Nvidia GPUs support INT8 precision.
 
 * `ORT_TENSORRT_INT8_CALIBRATION_TABLE_NAME`: Specify the INT8 calibration table file for non-QDQ models in INT8 mode. Note that a calibration table should not be provided for a QDQ model, because TensorRT does not allow a calibration table to be loaded if there is any Q/DQ node in the model. By default the name is empty.
```
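
These variables are read when the TensorRT provider is initialized, so they must be set before the session is created. A sketch of doing this from Python rather than the shell, equivalent in effect to the `export` lines shown later in this diff:

```python
import os

# Sketch: the environment must be prepared before the TensorRT EP is
# initialized by session creation.
os.environ["ORT_TENSORRT_BF16_ENABLE"] = "1"  # 1: enabled, 0: disabled
os.environ["ORT_TENSORRT_FP16_ENABLE"] = "0"

import onnxruntime as ort  # imported after the environment is set

sess = ort.InferenceSession(
    "model.onnx",  # placeholder path
    providers=["TensorrtExecutionProvider"],
)
```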
```diff
@@ -512,7 +522,7 @@ Following environment variables can be set for TensorRT execution provider. Clic
 
 * `ORT_TENSORRT_DLA_CORE`: Specify the DLA core to execute on. Default value: 0.
 
-* `ORT_TENSORRT_ENGINE_CACHE_ENABLE`: Enable TensorRT engine caching. The purpose of engine caching is to save engine build time, since TensorRT can take a long time to optimize and build an engine. The engine is cached when it is built for the first time, so the next time an inference session is created it can be loaded directly from the cache. To validate that a loaded engine is usable for the current inference, the engine profile is also cached and loaded along with the engine. If the current input shapes are within the range of the engine profile, the loaded engine can be safely used; otherwise, the profile cache is updated to cover the new shapes and the engine is recreated from the new profile (and refreshed in the engine cache). Note that each engine is created for specific settings such as model path/name, precision (FP32/FP16/INT8, etc.), workspace, and profiles, and for a specific GPU, so it is not portable; make sure those settings do not change, otherwise the engine must be rebuilt and cached again. 1: enabled, 0: disabled. Default value: 0.
+* `ORT_TENSORRT_ENGINE_CACHE_ENABLE`: Enable TensorRT engine caching. The purpose of engine caching is to save engine build time, since TensorRT can take a long time to optimize and build an engine. The engine is cached when it is built for the first time, so the next time an inference session is created it can be loaded directly from the cache. To validate that a loaded engine is usable for the current inference, the engine profile is also cached and loaded along with the engine. If the current input shapes are within the range of the engine profile, the loaded engine can be safely used; otherwise, the profile cache is updated to cover the new shapes and the engine is recreated from the new profile (and refreshed in the engine cache). Note that each engine is created for specific settings such as model path/name, precision (FP32/FP16/BF16/INT8, etc.), workspace, and profiles, and for a specific GPU, so it is not portable; make sure those settings do not change, otherwise the engine must be rebuilt and cached again. 1: enabled, 0: disabled. Default value: 0.
 * **Warning: Please clean up any old engine and profile cache files (.engine and .profile) if any of the following changes:**
   * Model changes (if there are any changes to the model topology, opset version, operators, etc.)
   * ORT version changes (e.g. moving from ORT version 1.8 to 1.9)
```
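
Acting on the warning above amounts to deleting the stale files. A sketch, assuming the illustrative `./trt_cache` directory used earlier:

```python
from pathlib import Path

# Sketch: remove stale TensorRT cache artifacts (.engine and .profile) after
# a model, ORT version, or hardware change, per the warning above.
cache_dir = Path("./trt_cache")  # assumed cache location
if cache_dir.is_dir():
    for f in cache_dir.iterdir():
        if f.suffix in {".engine", ".profile"}:
            f.unlink()
            print(f"removed {f}")
```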
```diff
@@ -564,6 +574,9 @@ export ORT_TENSORRT_MIN_SUBGRAPH_SIZE=5
 # Enable FP16 mode in TensorRT
 export ORT_TENSORRT_FP16_ENABLE=1
 
+# Enable BF16 mode in TensorRT
+export ORT_TENSORRT_BF16_ENABLE=1
+
 # Enable INT8 mode in TensorRT
 export ORT_TENSORRT_INT8_ENABLE=1
```
569582
