Release Torch-TensorRT v2.12.0 · pytorch/TensorRT

Torch-TensorRT 2.12.0 Linux x86-64 and Windows targets

PyTorch 2.12, CUDA 13.0, TensorRT 10.16, Python 3.10~3.13

Torch-TensorRT Wheels are available:

x86-64 Linux and Windows:
CUDA 13.0 + Python 3.10-3.13 is Available via PyPI

https://pypi.org/project/torch-tensorrt/

aarch64 SBSA Linux and Jetson Thor
CUDA 13.0 + Python 3.10–3.13 + Torch 2.12 + TensorRT 10.16

Available via PyPI: https://pypi.org/project/torch-tensorrt/
Available via PyTorch index: https://download.pytorch.org/whl/torch-tensorrt

Jetson Orin

There is no torch_tensorrt 2.9/2.10/2.11/2.12 release for Jetson Orin.
Please continue using the torch_tensorrt 2.8 release.

Torch-TensorRT-RTX 2.12.0 Linux x86-64 and Windows targets

PyTorch 2.12, CUDA 13.0, TensorRT-RTX 1.4, Python 3.10~3.13

Torch-TensorRT-RTX Wheels are available:

x86-64 Linux and Windows:
CUDA 13.0 + Python 3.10-3.13 is Available via PyPI

https://pypi.org/project/torch-tensorrt-rtx/

CUDA 13.0 + Python 3.10-3.13 is also available via the PyTorch index

https://download.pytorch.org/whl/torch-tensorrt-rtx

Native Distributed Collectives

In prior versions of Torch-TensorRT, distributed operators were backed by kernels provided by TensorRT-LLM, which the user needed to install manually. With Torch-TensorRT 2.12, many of these operations are natively supported, which means that in deployment only Torch-TensorRT needs to be installed.

The distributed infrastructure is designed to operate on top of torch.distributed. Once a graph is sharded, traced, and compiled, the torch.distributed device mesh can be passed to Torch-TensorRT compiled modules using the following API:

import os

import torch
import torch.distributed as dist
import torch_tensorrt

trt_model = torch.compile(model, backend="torch_tensorrt", ...)
_ = trt_model(inp)  # warmup — triggers engine build
dist.barrier()

with torch_tensorrt.distributed.distributed_context(dist.group.WORLD, trt_model) as dmodel:
    output = dmodel(inp)

dist.destroy_process_group()
os._exit(0)

This can also be done at compile time:

with torch_tensorrt.distributed.distributed_context(tp_group):
    trt_model = torch.compile(model, backend="torch_tensorrt", ...)
    output = trt_model(inp)

Note: use_distributed_trace is no longer necessary to compile multi-device models. Torch-TensorRT automatically recognizes distributed collectives and applies the setting for the user.

torchtrtrun

The distributed operations use the NCCL version distributed with PyTorch, which must be added to LD_PRELOAD before importing torch-tensorrt. As a convenience, we provide a tool, torchtrtrun, analogous to torchrun, that configures these libraries correctly and also lets users launch models distributed across multiple nodes.

For example:

# Node 0 (rank 0):
torchtrtrun --nproc_per_node=1 --nnodes=2 --node_rank=0 \
  --rdzv_endpoint=<node0-ip>:29500 \
  tensor_parallel_llama_multinode.py

# Node 1 (rank 1):
torchtrtrun --nproc_per_node=1 --nnodes=2 --node_rank=1 \
  --rdzv_endpoint=<node0-ip>:29500 \
  tensor_parallel_llama_multinode.py
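
For readers who cannot use torchtrtrun, the LD_PRELOAD requirement described above can be approximated by hand. The following is only a sketch under stated assumptions: it assumes the NCCL shared library ships inside a pip wheel under site-packages/nvidia/nccl/lib/ (the layout used by the nvidia-nccl wheels that PyTorch depends on), and the helper names find_nccl_library and env_with_preload are hypothetical, not part of Torch-TensorRT. It does not reproduce what torchtrtrun actually does internally.

```python
from pathlib import Path
from typing import Dict, Mapping, Optional


def find_nccl_library(site_packages: Path) -> Optional[Path]:
    """Return the first libnccl.so* under <site_packages>/nvidia/nccl/lib, if any.

    The directory layout is an assumption about how the NCCL wheel is installed.
    """
    lib_dir = Path(site_packages) / "nvidia" / "nccl" / "lib"
    if not lib_dir.is_dir():
        return None
    candidates = sorted(lib_dir.glob("libnccl.so*"))
    return candidates[0] if candidates else None


def env_with_preload(library: Path, base_env: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of base_env with `library` placed first in LD_PRELOAD."""
    env = dict(base_env)
    lib = str(library)
    # Keep any existing preloads, but avoid listing the same library twice.
    existing = [p for p in env.get("LD_PRELOAD", "").split(":") if p and p != lib]
    env["LD_PRELOAD"] = ":".join([lib] + existing)
    return env
```

Because the dynamic loader reads LD_PRELOAD when a process starts, the resulting environment has to be handed to a new process (for example via subprocess.run(cmd, env=env)) rather than set after Python is already running.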

Serialization and torch.export

Models that are sharded and then exported can be compiled and saved to disk before being loaded on a deployment system. By default, these modules attempt to bind to the default torch.distributed device mesh. If multiple valid device meshes are available, the distributed_context API shown above can be used to select a specific one to execute the engine.

More information on torch-tensorrt distributed collective support can be found here: https://docs.pytorch.org/TensorRT/tutorials/deployment/distributed_inference.html#multinode-inference

More information on native multi-device collectives can be found here: https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-with-transformers.html#multi-device-attention-preview-feature

ExecuTorch Support

Torch-TensorRT 2.12 introduces initial ExecuTorch integration for exporting and running TensorRT-accelerated models in ExecuTorch .pte format. Users can now save TensorRT-compiled ExportedProgram / FX models with:

torch_tensorrt.save(model, "model.pte", output_format="executorch")

This release adds a TensorRT ExecuTorch backend, partitioner, and serialization path that embeds TensorRT engine payloads directly into the .pte using the same engine metadata format as the Torch-TensorRT runtime. The release package also includes a C++ TensorRT ExecuTorch backend source package and a reference ExecuTorch runner showing how to load .pte files, initialize the TensorRT delegate, bind inputs/outputs, and execute inference without requiring Python at runtime.

Highlights:

New torch_tensorrt.executorch Python APIs: TensorRTBackend, TensorRTPartitioner, and get_edge_compile_config.
New output_format="executorch" save path for generating ExecuTorch .pte models.
Support for static-shape and TensorRT profile-based dynamic-shape export examples.
New native C++ TensorRT ExecuTorch backend and reference runner included in libtorchtrt.tar.gz.
Engines that require TensorRT output allocators, such as data-dependent output shape engines, are not supported yet.
In Torch-TensorRT 2.12, the ExecuTorch integration still depends on LibTorch in the native runtime path.
The next release, Torch-TensorRT 2.13, is planned to move this to a pure ExecuTorch backend implementation without the LibTorch runtime dependency.

Known Limitations

ExecuTorch support still depends on the Torch/LibTorch C++ libraries used by Torch-TensorRT; this release does not provide a pure ExecuTorch-only TensorRT deployment path.
TensorRT engine payloads larger than 2 GiB are not supported when embedded in an ExecuTorch .pte file.
Selecting a target device during ExecuTorch export is not currently supported. Exported .pte files default to cuda:0.

Comprehensive Attention Support

This release extends the TRT attention converters to support GQA/MQA and decode-phase attention based on TensorRT IAttentionLayer. Specifically, it covers all SDPA kernel variants, MHA/GQA/MQA attention patterns, causal vs non-causal masking, bool/float/broadcast mask shapes, decode-phase attention (seq_q=1), non-power-of-2 head dims, LLM-realistic configs, and multiple dtypes. This feature is enabled by default. If you want to turn it off, please set decompose_attention=True.
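
To make the MHA/GQA/MQA distinction above concrete, here is a small sketch of how query heads are usually mapped onto key/value heads. It follows the common convention of repeating each key/value head for a contiguous group of query heads; it is an illustration of the attention patterns, not the converter's implementation, and the function name kv_head_for_query_head is hypothetical.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Return the key/value head index that a given query head attends with.

    MHA: num_kv_heads == num_q_heads (one K/V head per query head)
    GQA: 1 < num_kv_heads < num_q_heads (query heads share K/V heads in groups)
    MQA: num_kv_heads == 1 (all query heads share a single K/V head)
    """
    if num_kv_heads <= 0 or num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a positive multiple of num_kv_heads")
    if not 0 <= q_head < num_q_heads:
        raise ValueError("q_head out of range")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


# Example: 8 query heads sharing 2 key/value heads (GQA).
gqa_mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
```

With 8 query heads and 2 key/value heads, the first four query heads use K/V head 0 and the last four use K/V head 1.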

Known Limitations

TensorRT 10.x (will be resolved in TRT 11.0) and TensorRT-RTX 1.4:
For TensorRT 10.x, with large causal k/v sequences (seq >= 512, is_causal=True) in FP16/BF16, IAttentionLayer produces ~80% element mismatch. As a workaround, FP32 is used for the scale factor. If you want to use the accurate dtype, set decompose_attention=True or upgrade to TRT 11.0 or later.

Comprehensive Complex Numerics Support

Torch-TensorRT can now compile models containing complex64/complex128 tensors end-to-end. TensorRT itself has no native complex dtype — a lowering pass intercepts complex subgraphs before partitioning and rewrites them into equivalent real arithmetic on a (..., 2) last-dim layout (real/imag interleaved), so the engine only sees standard float ops and callers don't have to change anything.
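
The (..., 2) layout described above is the same one torch.view_as_real produces: each complex element becomes a trailing pair [real, imag]. The pure-Python sketch below illustrates that layout on a flat list so it can run without PyTorch; the helper names are hypothetical and this is not the lowering pass itself.

```python
from typing import List, Sequence, Tuple


def pack_complex(values: Sequence[complex]) -> List[List[float]]:
    """Rewrite complex values as [real, imag] pairs (trailing dim of size 2)."""
    return [[z.real, z.imag] for z in values]


def unpack_complex(pairs: Sequence[Sequence[float]]) -> List[complex]:
    """Rebuild complex values from [real, imag] pairs."""
    return [complex(re, im) for re, im in pairs]


def real_layout_shape(complex_shape: Sequence[int]) -> Tuple[int, ...]:
    """Shape of the real-valued tensor that stands in for a complex tensor."""
    return tuple(complex_shape) + (2,)


packed = pack_complex([1 + 2j, 3 - 4j])
restored = unpack_complex(packed)
```

A complex tensor of shape (4, 8) therefore maps to a real tensor of shape (4, 8, 2), and the round trip through the pair layout is lossless.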

This unlocks compilation of models that use complex arithmetic for rotary position embeddings — Llama 3 (1D RoPE), and video generation transformers like CogVideoX, Wan, and HunyuanVideo (3D RoPE) — including under dynamic shapes and in distributed (multi-GPU) settings.

What's supported

Complex inputs (placeholders) and buffers (get_attr) are rewritten to real-valued equivalents. placeholder(complex64) becomes placeholder(float32) with an appended trailing dim of 2; complex buffers are replaced via torch.stack([t.real, t.imag], dim=-1). Dynamic-shape SymInts are preserved across the rewrite.
Complex multiply (aten.mul.Tensor between two complex operands) is decomposed to the standard identity (ac − bd) + (ad + bc)i.
view_as_complex / view_as_real are erased — they become identities once the layout is already (..., 2).
Shape-manipulation ops are handled with the trailing real/imag dim in mind: reshape / view / _unsafe_view, flatten, unsqueeze, squeeze, permute, transpose, t, cat, stack, select, slice, narrow, roll, flip, split, chunk, expand, repeat. Negative dim indices are auto-shifted by −1 so dim=-1 keeps meaning "the original last complex dim."
Math ops over complex inputs — mm / bmm / matmul (including complex×real and real×complex mixed forms), abs, angle, real, imag, sin/cos/exp/log on complex, reciprocal (for scalar / complex), sum.dim_IntList / mean.dim / prod.dim_int, ones_like / full_like (correctly initialise as 1+0i / fill+0i).
Engine outputs that stay complex — models that return a complex tensor without going through view_as_real are also detected via a forward scan that collects unbounded complex nodes.
Runtime I/O — complex inputs can be passed as-is at call time. The runtime modules automatically apply torch.view_as_real(x).contiguous() to complex inputs before handing them to the engine, and rebuild complex outputs on the way back.
truncate_double=True lowers complex128 to float32 (vs the default float64)
Refit caching — complex buffers via a last-dim slice-matching stage in _save_weight_mapping, plus a tuple-keyed (sd_key, last_dim, idx) lookup in construct_refit_mapping_from_weight_name_map. Verification picks the real-unpacked branch for the reference module when complex inputs are present.
Graceful fallback — complex ops the rewriter doesn't know how to handle now fall back to PyTorch execution rather than failing the compile.
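
Two of the rewrites listed above are simple enough to sketch in plain Python: the complex multiply identity applied to (real, imag) pairs, and the shift applied to negative dim indices once a trailing real/imag dim has been appended. The function names are hypothetical and the code is an illustration of the arithmetic, not the actual lowering pass.

```python
from typing import Tuple

Pair = Tuple[float, float]  # (real, imag)


def complex_mul_pairs(x: Pair, y: Pair) -> Pair:
    """(a + bi)(c + di) = (ac - bd) + (ad + bc)i, using real arithmetic only."""
    a, b = x
    c, d = y
    return (a * c - b * d, a * d + b * c)


def shift_negative_dim(dim: int) -> int:
    """Map a dim index on the complex tensor to the real (..., 2) layout.

    Non-negative dims are unchanged; negative dims move one step further
    from the end because the new last dim holds the real/imag pair.
    """
    return dim - 1 if dim < 0 else dim


product = complex_mul_pairs((1.0, 2.0), (3.0, 4.0))
reference = (1 + 2j) * (3 + 4j)
```

Here product matches (reference.real, reference.imag), and shift_negative_dim(-1) gives -2, which is the position of the original last complex dim in the real layout.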

Known limitations

Only view_as_real-anchored subgraphs and forward-scanned complex outputs are detected; complex arithmetic that escapes both paths still fails.
Complex convolution / batch norm are not rewritten — only elementwise and shape-manipulation patterns plus the matmul family.
Real parameters intentionally shaped (d, 2) that feed complex-arithmetic ops will not be auto-promoted to complex layout; route them through the standard view_as_complex → mul → view_as_real pattern instead.

Reference PR: #4119 — Comprehensive Complex Numerics Support.

Debugger

TensorRT API Capture
In this release, we have improved the TensorRT API Capture and Replay feature.
It lets you record the engine-building phase of your model and later replay the engine-build steps.

Capture:
Capture is disabled by default.
Enable it with the environment variable TORCHTRT_ENABLE_TENSORRT_API_CAPTURE=1:
TORCHTRT_ENABLE_TENSORRT_API_CAPTURE=1 python your_model_test.py
Capture and replay files are automatically saved under debuglogs/capture_replay/ (i.e., the capture_replay subdirectory of logging_dir). You should see capture.json and associated .bin files generated there.

Replay:
Use the tensorrt_player tool to replay the captured TensorRT engine build without the original framework:
tensorrt_player -j /absolute/path/to/shim.json -o /absolute/path/to/output_engine
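
The capture and replay steps above can be scripted. The sketch below only assembles the environment, locates the capture output, and builds the replay command line; it uses the environment variable, directory name, file names, and the -j / -o flags quoted in this section, while the helper names are hypothetical. It assumes the JSON file passed to tensorrt_player is whichever one the capture step produced.

```python
from pathlib import Path
from typing import Dict, List, Mapping, Optional, Tuple

CAPTURE_ENV_VAR = "TORCHTRT_ENABLE_TENSORRT_API_CAPTURE"


def capture_env(base_env: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of base_env with TensorRT API capture enabled."""
    env = dict(base_env)
    env[CAPTURE_ENV_VAR] = "1"
    return env


def find_capture_artifacts(logging_dir: str) -> Tuple[Optional[Path], List[Path]]:
    """Look for capture.json and .bin files in <logging_dir>/capture_replay."""
    capture_dir = Path(logging_dir) / "capture_replay"
    json_file = capture_dir / "capture.json"
    bin_files = sorted(capture_dir.glob("*.bin"))
    return (json_file if json_file.is_file() else None), bin_files


def replay_command(json_path: str, output_engine: str) -> List[str]:
    """Build the tensorrt_player command line using absolute paths."""
    return [
        "tensorrt_player",
        "-j", str(Path(json_path).resolve()),
        "-o", str(Path(output_engine).resolve()),
    ]
```

The environment from capture_env and the list from replay_command can each be passed to subprocess.run, the first when launching the model script and the second when replaying the build.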

Limitations:
This feature is currently restricted to Linux (x86-64 and aarch64).

More details are available at:
https://docs.pytorch.org/TensorRT/debugging/capture_and_replay.html

What's Changed

upgrade torch_tensorrt from 2.11 to 2.12 by @lanluo-nvidia in #4090
cache shape expressions for reexport by @narendasan in #4079
Adds a heuristic upper bound in the case of unbounded symints by @narendasan in #4083
fix: docs + new dep group by @narendasan in #4060
fix: remove refit validator by @zewenli98 in #4044
fix the windows ci issue by @lanluo-nvidia in #4097
Fix typo in symbolic shape expressions variable name by @narendasan in #4102
cherry pick rtx release build fix from 2.11 release to main by @lanluo-nvidia in #4106
Support extra_file arg by @cehongwang in #4064
L2 nccl tests failures by @apbose in #4101
docs: Update the docs to the new theme from PyTorch by @narendasan in #4110
docs: make some nav stuff a bit clearer and fix the version drop down by @narendasan in #4113
docs: backfill release documentation that we have been too lazy to ar… by @narendasan in #4114
Narendasan/push rtrzmkllxunv by @narendasan in #4115
fix: Fix the linting system to not constantly flag generated docs by @narendasan in #4120
converter: add sdpa, flash-sdpa, efficient-sdpa, and cudnn-sdpa converters by @zewenli98 in #4104
fix failures- cpu offloading casues device mismatch L2_dynamo_compile… by @apbose in #4089
Changed test_refit_cumsum test by @cehongwang in #4121
handle symbolic shape for non tensor inputs in symbolic shape extraction by @apbose in #4124
feat: Allow pulling the venv's torch after uv sync so that people don… by @narendasan in #4130
split index.Tensor converter for bool vs int indexing by @wenbingl in #4123
Added fuse_rms_norm lowering by @cehongwang in #4017
fix: validator introduced in #4132 by @narendasan in #4136
upgrade tensorrt to 10.16, tensorrt_rtx to 1.4 by @lanluo-nvidia in #4144
Fixed the problem that cannot get detailed build without default prof… by @cehongwang in #4160
fix : unwaive skipped/special TRT-RTX tests by @tp5uiuc in #4156
Add the capture replay feature improvement for 10.16 by @lanluo-nvidia in #4158
fix typo: rtx 1.4 has the wrong 1.3 lib by @lanluo-nvidia in #4171
fix: avoid unnecessary GPU tensor copy in prepare_inputs() by @SandSnip3r in #4146
Fix: run_llm.py reports several errors by @yizhuoz004 in #4163
chore(deps): bump transformers from 4.53.1 to 5.0.0rc3 in /docker/ngc_test by @dependabot[bot] in #4175
chore(deps): bump transformers from 4.53.1 to 5.0.0rc3 in /tests/modules by @dependabot[bot] in #4174
chore(deps): bump transformers from 4.53.0 to 5.0.0rc3 in /tools/perf by @dependabot[bot] in #4173
chore(deps): bump transformers from 4.53.1 to 5.0.0rc3 in /examples/dynamo by @dependabot[bot] in #4172
Add dynamic shape support to index_put by @narendasan in #4143
test: skip cumsum conversion tests on TensorRT-RTX by @tp5uiuc in #4182
Rather than passing a power value of 1 to the scale op, send None. by @SandSnip3r in #4181
fix(rtx): reintroduce BF16 support, fall back depthwise conv to PyTorch by @tp5uiuc in #4178
fix(test): skip cumsum tests on TensorRT-RTX Windows instead of xfail by @tp5uiuc in #4189
waive(test): skip refit tests that fail on TensorRT-RTX by @tp5uiuc in #4193
fix(rtx): add WAR to fall back grouped 3D deconvolutions to PyTorch by @tp5uiuc in #4188
feat: support attn_bias for efficient SDPA by @zewenli98 in #4131
feat: add runtime cache API for TensorRT-RTX by @tp5uiuc in #4180
feat: add dynamic shapes kernel specialization strategy for TRT-RTX by @tp5uiuc in #4184
MD-TRT Support, Compile/Export, C++ and Python by @narendasan in #4183
Comprehensive Complex Numerics Support by @narendasan in #4119
2.12 torch-tensorrt release cut by @lanluo-nvidia in #4206
Lluo/executorch 2.12 cherry pick by @lanluo-nvidia in #4238
upgrade torchvision from 0.26.0 to 0.27.0 by @lanluo-nvidia in #4250
pin to a specific executorch version by @lanluo-nvidia in #4248
cherry pick 4244 from main to release 2.12 by @lanluo-nvidia in #4259
cherry pick 4246 from main to release 2.12 by @lanluo-nvidia in #4261
add release build force run all tests by @lanluo-nvidia in #4267
cherry pick 4203 from main to release 2.12 by @lanluo-nvidia in #4268
fix: a few bugs in release/2.12 by @zewenli98 in #4270
Revert "cherry pick 4203 from main to release 2.12" by @lanluo-nvidia in #4275
fix the modelopt config issue in python 3.10/11/12 by @lanluo-nvidia in #4276
fix the windows issue by @lanluo-nvidia in #4281

New Contributors

@yizhuoz004 made their first contribution in #4163

Full Changelog: v2.11.0...v2.12.0
