Torch-TensorRT 2.12.0 Linux x86-64 and Windows targets
PyTorch 2.12, CUDA 13.0, TensorRT 10.16, Python 3.10~3.13
Torch-TensorRT Wheels are available:
x86-64 Linux and Windows:
CUDA 13.0 + Python 3.10-3.13 is Available via PyPI
aarch64 SBSA Linux and Jetson Thor
CUDA 13.0 + Python 3.10–3.13 + Torch 2.12 + TensorRT 10.16
- Available via PyPI: https://pypi.org/project/torch-tensorrt/
- Available via PyTorch index: https://download.pytorch.org/whl/torch-tensorrt
Jetson Orin
- no torch_tensorrt 2.9/2.10/2.11/2.12 release for Jetson Orin
- please continue using torch_tensorrt 2.8 release
Torch-TensorRT-RTX 2.12.0 Linux x86-64 and Windows targets
PyTorch 2.12, CUDA 13.0, TensorRT-RTX 1.4, Python 3.10~3.13
Torch-TensorRT-RTX Wheels are available:
x86-64 Linux and Windows:
CUDA 13.0 + Python 3.10-3.13 is available via PyPI
CUDA 13.0 + Python 3.10-3.13 is also available via the PyTorch index
Native Distributed Collectives
In prior versions of Torch-TensorRT, distributed operators were backed by kernels provided by TensorRT-LLM, which the user needed to install manually. With Torch-TensorRT 2.12, many of these operations are natively supported, which means that only Torch-TensorRT needs to be installed in deployment.
The distributed infrastructure is designed to operate on top of torch.distributed. Once a graph is sharded, traced, and compiled, the torch.distributed device mesh can be passed to torch-tensorrt compiled modules using the following API:
trt_model = torch.compile(model, backend="torch_tensorrt", ...)
_ = trt_model(inp) # warmup — triggers engine build
dist.barrier()
with torch_tensorrt.distributed.distributed_context(dist.group.WORLD, trt_model) as dmodel:
    output = dmodel(inp)
dist.destroy_process_group()
os._exit(0)
This can be done at compile time as well:
with torch_tensorrt.distributed.distributed_context(tp_group):
    trt_model = torch.compile(model, backend="torch_tensorrt", ...)
    output = trt_model(inp)
Note:
use_distributed_trace is no longer necessary to compile multi-device models; torch-tensorrt automatically recognizes distributed collectives and enables the setting for the user.
torchtrtrun
The distributed operations use the NCCL version distributed by PyTorch, which must be added to LD_PRELOAD before importing torch-tensorrt. As a convenience, we provide a tool, torchtrtrun, which is analogous to torchrun. It configures these libraries correctly and also allows users to launch models distributed across multiple nodes.
For example:
# Node 0 (rank 0):
torchtrtrun --nproc_per_node=1 --nnodes=2 --node_rank=0 \
--rdzv_endpoint=<node0-ip>:29500 \
tensor_parallel_llama_multinode.py
# Node 1 (rank 1):
torchtrtrun --nproc_per_node=1 --nnodes=2 --node_rank=1 \
--rdzv_endpoint=<node0-ip>:29500 \
tensor_parallel_llama_multinode.py
Serialization and torch.export
Models that are sharded and then exported can be compiled and saved to disk before being loaded on a deployment system. By default, these modules attempt to bind to the default torch distributed device mesh. If multiple valid device meshes are available, the distributed_context API shown above can be used to select the specific one that executes the engine.
More information on torch-tensorrt distributed collective support can be found here: https://docs.pytorch.org/TensorRT/tutorials/deployment/distributed_inference.html#multinode-inference
More information on native multi-device collectives can be found here: https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-with-transformers.html#multi-device-attention-preview-feature
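For users who launch processes without torchtrtrun, the LD_PRELOAD requirement described above could be handled by hand. The following is a minimal sketch, not the logic torchtrtrun actually uses: it assumes the NCCL shared library shipped alongside PyTorch can be found as a `libnccl.so*` file somewhere under the directories you pass in (for example your site-packages directories). The helper names `find_nccl_library` and `prepend_ld_preload` are hypothetical.

```python
import os
from pathlib import Path


def find_nccl_library(search_roots):
    """Return the first libnccl.so* file found under the given directories, or None.

    Assumption: NCCL was installed as part of the Python environment, so a
    recursive search of site-packages style directories can locate it.
    """
    for root in search_roots:
        root = Path(root)
        if not root.is_dir():
            continue
        matches = sorted(root.rglob("libnccl.so*"))
        if matches:
            return matches[0]
    return None


def prepend_ld_preload(lib_path, env=None):
    """Return a copy of env with lib_path placed first in LD_PRELOAD."""
    env = dict(os.environ if env is None else env)
    lib = str(lib_path)
    existing = env.get("LD_PRELOAD", "")
    # Keep any libraries that were already preloaded, without duplicating ours.
    others = [p for p in existing.split(":") if p and p != lib]
    env["LD_PRELOAD"] = ":".join([lib] + others)
    return env
```

A possible use is to call `find_nccl_library(site.getsitepackages())`, build the environment with `prepend_ld_preload`, and pass it as `env=` to `subprocess.run` when starting the worker script, since the dynamic loader reads LD_PRELOAD when a process starts.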
ExecuTorch Support
Torch-TensorRT 2.12 introduces initial ExecuTorch integration for exporting and running TensorRT-accelerated models in ExecuTorch .pte format. Users can now
save TensorRT-compiled ExportedProgram / FX models with:
torch_tensorrt.save(model, "model.pte", output_format="executorch")
This release adds a TensorRT ExecuTorch backend, partitioner, and serialization path that embeds TensorRT engine payloads directly into the .pte using the
same engine metadata format as the Torch-TensorRT runtime. The release package also includes a C++ TensorRT ExecuTorch backend source package and a reference
ExecuTorch runner showing how to load .pte files, initialize the TensorRT delegate, bind inputs/outputs, and execute inference without requiring Python at
runtime.
Highlights:
- New torch_tensorrt.executorch Python APIs: TensorRTBackend, TensorRTPartitioner, and get_edge_compile_config.
- New output_format="executorch" save path for generating ExecuTorch .pte models.
- Support for static-shape and TensorRT profile-based dynamic-shape export examples.
- New native C++ TensorRT ExecuTorch backend and reference runner included in libtorchtrt.tar.gz.
- Engines that require TensorRT output allocators, such as data-dependent output shape engines, are not supported yet.
- In Torch-TensorRT 2.12, the ExecuTorch integration still depends on LibTorch in the native runtime path.
The next release, Torch-TensorRT 2.13, is planned to move this to a pure ExecuTorch backend implementation without the LibTorch runtime dependency.
Known Limitations
- ExecuTorch support still depends on the Torch/LibTorch C++ libraries used by Torch-TensorRT; this release does not provide a pure
ExecuTorch-only TensorRT deployment path.
- TensorRT engine payloads larger than 2 GiB are not supported when embedded in an ExecuTorch .pte file.
- Selecting a target device during ExecuTorch export is not currently supported. Exported .pte files default to cuda:0.
Comprehensive Attention Support
This release extends the TRT attention converters to support GQA/MQA and decode-phase attention based on TensorRT IAttentionLayer. Specifically, it covers all SDPA kernel variants, MHA/GQA/MQA attention patterns, causal vs non-causal masking, bool/float/broadcast mask shapes, decode-phase attention (seq_q=1), non-power-of-2 head dims, LLM-realistic configs, and multiple dtypes. This feature is enabled by default. If you want to turn it off, please set decompose_attention=True.
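To make the MHA/GQA/MQA and decode-phase terminology concrete, here is a small pure-Python reference for one decode step (seq_q=1). It is an illustrative sketch of the attention math only, not the converter or the IAttentionLayer implementation; the function name `gqa_decode_attention` and the nested-list tensor layout are assumptions made for the example.

```python
import math


def gqa_decode_attention(q, k, v):
    """Grouped-query attention for a single decode step (seq_q = 1).

    q:    n_q_heads query vectors, each of length d
    k, v: n_kv_heads entries, each a list of seq_kv vectors of length d
    Query head h reads KV head h // (n_q_heads // n_kv_heads), so
    n_kv_heads == n_q_heads is MHA and n_kv_heads == 1 is MQA.
    """
    n_q, n_kv = len(q), len(k)
    if n_q % n_kv != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group = n_q // n_kv
    out = []
    for h, q_h in enumerate(q):
        keys, values = k[h // group], v[h // group]
        scale = 1.0 / math.sqrt(len(q_h))
        scores = [scale * sum(a * b for a, b in zip(q_h, key)) for key in keys]
        # Numerically stable softmax over the cached key positions.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        dim = len(values[0])
        out.append(
            [sum(w * val[i] for w, val in zip(weights, values)) / total for i in range(dim)]
        )
    return out
```

With a single query position that sits at the end of the sequence, a causal mask hides nothing, which is why the sketch applies no mask.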
Known Limitations
- TensorRT 10.x (will be resolved in TRT 11.0) and TensorRT-RTX 1.4:
For large causal k/v sequences (seq >= 512, is_causal=True) in FP16/BF16,
IAttentionLayer produces ~80% element mismatch. As a workaround, we use FP32 for
the scale factor. If you want to use the accurate dtype, please set decompose_attention=True
or upgrade to TRT 11.0 or later.
Comprehensive Complex Numerics Support
Torch-TensorRT can now compile models containing complex64/complex128 tensors end-to-end. TensorRT itself has no native complex dtype — a lowering pass intercepts complex subgraphs before partitioning and rewrites them into equivalent real arithmetic on a (..., 2) last-dim layout (real/imag interleaved), so the engine only sees standard float ops and callers don't have to change anything.
This unlocks compilation of models that use complex arithmetic for rotary position embeddings — Llama 3 (1D RoPE), and video generation transformers like CogVideoX, Wan, and HunyuanVideo (3D RoPE) — including under dynamic shapes and in distributed (multi-GPU) settings.
What's supported
- Complex inputs (placeholders) and buffers (get_attr) are rewritten to real-valued equivalents. placeholder(complex64) becomes placeholder(float32) with an appended trailing dim of 2; complex buffers are replaced via torch.stack([t.real, t.imag], dim=-1). Dynamic-shape SymInts are preserved across the rewrite.
- Complex multiply (aten.mul.Tensor between two complex operands) is decomposed to the standard identity (ac − bd) + (ad + bc)i.
- view_as_complex / view_as_real are erased; they become identities once the layout is already (..., 2).
- Shape-manipulation ops are handled with the trailing real/imag dim in mind: reshape / view / _unsafe_view, flatten, unsqueeze, squeeze, permute, transpose, t, cat, stack, select, slice, narrow, roll, flip, split, chunk, expand, repeat. Negative dim indices are auto-shifted by −1 so dim=-1 keeps meaning "the original last complex dim."
- Math ops over complex inputs: mm / bmm / matmul (including complex×real and real×complex mixed forms), abs, angle, real, imag, sin / cos / exp / log on complex, reciprocal (for scalar / complex), sum.dim_IntList / mean.dim / prod.dim_int, ones_like / full_like (correctly initialise as 1+0i / fill+0i).
- Engine outputs that stay complex: models that return a complex tensor without going through view_as_real are also detected via a forward scan that collects unbounded complex nodes.
- Runtime I/O: complex inputs can be passed as-is at call time. The runtime modules automatically apply torch.view_as_real(x).contiguous() to complex inputs before handing them to the engine, and rebuild complex outputs on the way back.
- truncate_double=True lowers complex128 to float32 (vs the default float64).
- Refit caching: complex buffers via a last-dim slice-matching stage in _save_weight_mapping, plus a tuple-keyed (sd_key, last_dim, idx) lookup in construct_refit_mapping_from_weight_name_map. Verification picks the real-unpacked branch for the reference module when complex inputs are present.
- Graceful fallback: complex ops the rewriter doesn't know how to handle now fall back to PyTorch execution rather than failing the compile.
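The core idea of the rewrite, complex values stored as trailing (real, imag) pairs and multiplied with real arithmetic only, can be illustrated in plain Python. This is a sketch of the identity and of the negative-dim shift described above, not the lowering pass itself; the helper names `to_real_pairs`, `from_real_pairs`, `complex_mul_real`, and `shift_dim` are hypothetical.

```python
def to_real_pairs(zs):
    """Pack complex scalars into the interleaved layout: one [real, imag] pair each."""
    return [[z.real, z.imag] for z in zs]


def from_real_pairs(pairs):
    """Rebuild complex scalars from [real, imag] pairs."""
    return [complex(re, im) for re, im in pairs]


def complex_mul_real(x, y):
    """Elementwise complex multiply on [real, imag] pairs using only real ops.

    (a + bi)(c + di) = (ac - bd) + (ad + bc)i
    """
    return [[a * c - b * d, a * d + b * c] for (a, b), (c, d) in zip(x, y)]


def shift_dim(dim):
    """Map a dim index on the complex tensor to the real (..., 2) tensor.

    Non-negative dims are unchanged; negative dims move one step further from
    the end because a trailing real/imag dim of size 2 has been appended.
    """
    return dim if dim >= 0 else dim - 1
```

For example, multiplying the packed forms of `1+2j` and `0.5+0.5j` with `complex_mul_real` and unpacking with `from_real_pairs` gives the same value as the native complex product, and `shift_dim(-1)` returns `-2`.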
Known limitations
- Only view_as_real-anchored subgraphs and forward-scanned complex outputs are detected; complex arithmetic that escapes both paths still fails.
- Complex convolution / batch norm are not rewritten; only elementwise and shape-manipulation patterns plus the matmul family are.
- Real parameters intentionally shaped (d, 2) that feed complex-arithmetic ops will not be auto-promoted to complex layout; route them through the standard view_as_complex → mul → view_as_real pattern instead.
Reference PR: #4119 — Comprehensive Complex Numerics Support.
Debugger
TensorRT API Capture
In this release, we have improved the TensorRT API Capture and Replay feature.
It allows you to record the engine-building phase of your model and later replay the engine-build steps.
Capture:
The capture feature is by default disabled.
You can enable the capture feature via environment variable: TORCHTRT_ENABLE_TENSORRT_API_CAPTURE=1
TORCHTRT_ENABLE_TENSORRT_API_CAPTURE=1 python your_model_test.py
Capture and replay files are automatically saved under debuglogs/capture_replay/ (i.e., the capture_replay subdirectory of logging_dir). You should see capture.json and the associated .bin files generated there.
Replay:
Use the tensorrt_player tool to replay the captured TensorRT engine build without the original framework:
tensorrt_player -j /absolute/path/to/shim.json -o /absolute/path/to/output_engine
Limitations:
- This feature is currently restricted to Linux (x86-64 and aarch64).
You can see more details in
https://docs.pytorch.org/TensorRT/debugging/capture_and_replay.html
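As a convenience around the capture and replay steps above, a script can check that the capture artifacts exist before assembling the tensorrt_player command with absolute paths. The sketch below is an assumption-laden helper, not part of Torch-TensorRT: `build_replay_command` is a hypothetical name, and the JSON file name is a parameter because the text above mentions capture.json while the replay example uses shim.json.

```python
from pathlib import Path


def build_replay_command(capture_dir, output_engine, json_name="capture.json"):
    """Return the tensorrt_player argument list for a captured engine build.

    Raises FileNotFoundError if the JSON file or its .bin companions are missing.
    Only the -j and -o flags shown in the release notes are used.
    """
    capture_dir = Path(capture_dir).resolve()
    json_path = capture_dir / json_name
    if not json_path.is_file():
        raise FileNotFoundError(f"capture file not found: {json_path}")
    if not any(capture_dir.glob("*.bin")):
        raise FileNotFoundError(f"no .bin files found in: {capture_dir}")
    return [
        "tensorrt_player",
        "-j", str(json_path),
        "-o", str(Path(output_engine).resolve()),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where tensorrt_player is installed.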
What's Changed
- upgrade torch_tensorrt from 2.11 to 2.12 by @lanluo-nvidia in #4090
- cache shape expressions for reexport by @narendasan in #4079
- Adds a heuristic upper bound in the case of unbounded symints by @narendasan in #4083
- fix: docs + new dep group by @narendasan in #4060
- fix: remove refit validator by @zewenli98 in #4044
- fix the windows ci issue by @lanluo-nvidia in #4097
- Fix typo in symbolic shape expressions variable name by @narendasan in #4102
- cherry pick rtx release build fix from 2.11 release to main by @lanluo-nvidia in #4106
- Support extra_file arg by @cehongwang in #4064
- L2 nccl tests failures by @apbose in #4101
- docs: Update the docs to the new theme from PyTorch by @narendasan in #4110
- docs: make some nav stuff a bit clearer and fix the version drop down by @narendasan in #4113
- docs: backfill release documentation that we have been too lazy to ar… by @narendasan in #4114
- Narendasan/push rtrzmkllxunv by @narendasan in #4115
- fix: Fix the linting system to not constantly flag generated docs by @narendasan in #4120
- converter: add sdpa, flash-sdpa, efficient-sdpa, and cudnn-sdpa converters by @zewenli98 in #4104
- fix failures- cpu offloading casues device mismatch L2_dynamo_compile… by @apbose in #4089
- Changed test_refit_cumsum test by @cehongwang in #4121
- handle symbolic shape for non tensor inputs in symbolic shape extraction by @apbose in #4124
- feat: Allow pulling the venv's torch after uv sync so that people don… by @narendasan in #4130
- split index.Tensor converter for bool vs int indexing by @wenbingl in #4123
- Added fuse_rms_norm lowering by @cehongwang in #4017
- fix: validator introduced in #4132 by @narendasan in #4136
- upgrade tensorrt to 10.16, tensorrt_rtx to 1.4 by @lanluo-nvidia in #4144
- Fixed the problem that cannot get detailed build without default prof… by @cehongwang in #4160
- fix : unwaive skipped/special TRT-RTX tests by @tp5uiuc in #4156
- Add the capture replay feature improvement for 10.16 by @lanluo-nvidia in #4158
- fix typo: rtx 1.4 has the wrong 1.3 lib by @lanluo-nvidia in #4171
- fix: avoid unnecessary GPU tensor copy in prepare_inputs() by @SandSnip3r in #4146
- Fix: run_llm.py reports several errors by @yizhuoz004 in #4163
- chore(deps): bump transformers from 4.53.1 to 5.0.0rc3 in /docker/ngc_test by @dependabot[bot] in #4175
- chore(deps): bump transformers from 4.53.1 to 5.0.0rc3 in /tests/modules by @dependabot[bot] in #4174
- chore(deps): bump transformers from 4.53.0 to 5.0.0rc3 in /tools/perf by @dependabot[bot] in #4173
- chore(deps): bump transformers from 4.53.1 to 5.0.0rc3 in /examples/dynamo by @dependabot[bot] in #4172
- Add dynamic shape support to index_put by @narendasan in #4143
- test: skip cumsum conversion tests on TensorRT-RTX by @tp5uiuc in #4182
- Rather than passing a power value of 1 to the scale op, send None. by @SandSnip3r in #4181
- fix(rtx): reintroduce BF16 support, fall back depthwise conv to PyTorch by @tp5uiuc in #4178
- fix(test): skip cumsum tests on TensorRT-RTX Windows instead of xfail by @tp5uiuc in #4189
- waive(test): skip refit tests that fail on TensorRT-RTX by @tp5uiuc in #4193
- fix(rtx): add WAR to fall back grouped 3D deconvolutions to PyTorch by @tp5uiuc in #4188
- feat: support attn_bias for efficient SDPA by @zewenli98 in #4131
- feat: add runtime cache API for TensorRT-RTX by @tp5uiuc in #4180
- feat: add dynamic shapes kernel specialization strategy for TRT-RTX by @tp5uiuc in #4184
- MD-TRT Support, Compile/Export, C++ and Python by @narendasan in #4183
- Comprehensive Complex Numerics Support by @narendasan in #4119
- 2.12 torch-tensorrt release cut by @lanluo-nvidia in #4206
- Lluo/executorch 2.12 cherry pick by @lanluo-nvidia in #4238
- upgrade torchvision from 0.26.0 to 0.27.0 by @lanluo-nvidia in #4250
- pin to a specific executorch version by @lanluo-nvidia in #4248
- cherry pick 4244 from main to release 2.12 by @lanluo-nvidia in #4259
- cherry pick 4246 from main to release 2.12 by @lanluo-nvidia in #4261
- add release build force run all tests by @lanluo-nvidia in #4267
- cherry pick 4203 from main to release 2.12 by @lanluo-nvidia in #4268
- fix: a few bugs in release/2.12 by @zewenli98 in #4270
- Revert "cherry pick 4203 from main to release 2.12" by @lanluo-nvidia in #4275
- fix the modelopt config issue in python 3.10/11/12 by @lanluo-nvidia in #4276
- fix the windows issue by @lanluo-nvidia in #4281
New Contributors
- @yizhuoz004 made their first contribution in #4163
Full Changelog: v2.11.0...v2.12.0