
Torch-TensorRT v2.12.0


@lanluo-nvidia lanluo-nvidia released this 20 May 19:24
9afefd0

Torch-TensorRT 2.12.0 Linux x86-64 and Windows targets

PyTorch 2.12, CUDA 13.0, TensorRT 10.16, Python 3.10~3.13

Torch-TensorRT wheels are available:

x86-64 Linux and Windows:
CUDA 13.0 + Python 3.10-3.13 is available via PyPI

aarch64 SBSA Linux and Jetson Thor:
CUDA 13.0 + Python 3.10-3.13 + Torch 2.12 + TensorRT 10.16

Jetson Orin

  • There is no torch_tensorrt 2.9/2.10/2.11/2.12 release for Jetson Orin.
  • Please continue using the torch_tensorrt 2.8 release.

Torch-TensorRT-RTX 2.12.0 Linux x86-64 and Windows targets

PyTorch 2.12, CUDA 13.0, TensorRT-RTX 1.4, Python 3.10~3.13

Torch-TensorRT-RTX wheels are available:

x86-64 Linux and Windows:
CUDA 13.0 + Python 3.10-3.13 is available via PyPI

CUDA 13.0 + Python 3.10-3.13 is also available via the PyTorch index

Native Distributed Collectives

In prior versions of Torch-TensorRT, distributed operators were backed by kernels provided by TensorRT-LLM, which the user had to install manually. With Torch-TensorRT 2.12, many of these operations are natively supported, so a deployment only needs Torch-TensorRT installed.

The distributed infrastructure is designed to operate on top of torch.distributed. Once a graph is sharded, traced, and compiled, the torch.distributed device mesh can be passed to Torch-TensorRT compiled modules using the following API:

import os

import torch
import torch.distributed as dist
import torch_tensorrt

trt_model = torch.compile(model, backend="torch_tensorrt", ...)
_ = trt_model(inp)  # warmup: triggers engine build
dist.barrier()  # wait until every rank has built its engine

with torch_tensorrt.distributed.distributed_context(dist.group.WORLD, trt_model) as dmodel:
    output = dmodel(inp)

dist.destroy_process_group()
os._exit(0)

This can be done at compile time as well:

with torch_tensorrt.distributed.distributed_context(tp_group):
    trt_model = torch.compile(model, backend="torch_tensorrt", ...)
    output = trt_model(inp)

Note: use_distributed_trace is no longer necessary to compile multi-device models. Torch-TensorRT automatically recognizes distributed collectives and enables the setting for the user.

torchtrtrun

The distributed operations use the NCCL version distributed by PyTorch, which must be added to LD_PRELOAD before importing torch-tensorrt. As a convenience, we provide torchtrtrun, a tool analogous to torchrun that configures these libraries correctly and also lets users launch models distributed across multiple nodes.

For example:

# Node 0 (rank 0):
torchtrtrun --nproc_per_node=1 --nnodes=2 --node_rank=0 \
  --rdzv_endpoint=<node0-ip>:29500 \
  tensor_parallel_llama_multinode.py

# Node 1 (rank 1):
torchtrtrun --nproc_per_node=1 --nnodes=2 --node_rank=1 \
  --rdzv_endpoint=<node0-ip>:29500 \
  tensor_parallel_llama_multinode.py
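
Where torchtrtrun cannot be used, the LD_PRELOAD requirement described above could in principle be met by hand. The following is only a sketch: it assumes NCCL was installed from the pip wheel that PyTorch depends on, with the layout <site-packages>/nvidia/nccl/lib/libnccl.so*, and the helper names find_nccl_library and prepend_ld_preload are hypothetical, not part of Torch-TensorRT.

```python
import os
import sysconfig
from pathlib import Path
from typing import Iterable, Optional


def find_nccl_library(site_dirs: Iterable[str]) -> Optional[Path]:
    """Return the first libnccl shared object found under the given directories.

    Assumes the pip wheel layout <site-packages>/nvidia/nccl/lib/libnccl.so*;
    conda or system installs may place the library elsewhere.
    """
    for site_dir in site_dirs:
        matches = sorted(Path(site_dir).glob("nvidia/nccl/lib/libnccl.so*"))
        if matches:
            return matches[0]
    return None


def prepend_ld_preload(library: Path, current: str = "") -> str:
    """Put `library` first in an LD_PRELOAD value, keeping existing entries."""
    return f"{library}:{current}" if current else str(library)


if __name__ == "__main__":
    paths = sysconfig.get_paths()
    nccl = find_nccl_library([paths["platlib"], paths["purelib"]])
    if nccl is None:
        print("libnccl not found; adjust the search path for your environment")
    else:
        # The dynamic loader reads LD_PRELOAD at process start, so print a
        # value to export in the launching shell rather than editing os.environ.
        print(prepend_ld_preload(nccl, os.environ.get("LD_PRELOAD", "")))
```

The printed value would be exported as LD_PRELOAD in the shell that launches the Python process. torchtrtrun remains the supported route, and it also handles multi-node launch.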

Serialization and torch.export

Models that are sharded and then exported can be compiled and saved to disk before being loaded on a deployment system. By default, these modules attempt to bind to the default torch.distributed device mesh. If multiple valid device meshes are available, the distributed_context API shown above can be used to select the one that executes the engine.

More information on torch-tensorrt distributed collective support can be found here: https://docs.pytorch.org/TensorRT/tutorials/deployment/distributed_inference.html#multinode-inference

More information on native multi-device collectives can be found here: https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-with-transformers.html#multi-device-attention-preview-feature

ExecuTorch Support

Torch-TensorRT 2.12 introduces initial ExecuTorch integration for exporting and running TensorRT-accelerated models in ExecuTorch .pte format. Users can now
save TensorRT-compiled ExportedProgram / FX models with:

torch_tensorrt.save(model, "model.pte", output_format="executorch")

This release adds a TensorRT ExecuTorch backend, partitioner, and serialization path that embeds TensorRT engine payloads directly into the .pte using the
same engine metadata format as the Torch-TensorRT runtime. The release package also includes a C++ TensorRT ExecuTorch backend source package and a reference
ExecuTorch runner showing how to load .pte files, initialize the TensorRT delegate, bind inputs/outputs, and execute inference without requiring Python at
runtime.

Highlights:

  • New torch_tensorrt.executorch Python APIs: TensorRTBackend, TensorRTPartitioner, and get_edge_compile_config.
  • New output_format="executorch" save path for generating ExecuTorch .pte models.
  • Support for static-shape and TensorRT profile-based dynamic-shape export examples.
  • New native C++ TensorRT ExecuTorch backend and reference runner included in libtorchtrt.tar.gz.
  • Engines that require TensorRT output allocators, such as data-dependent output shape engines, are not supported yet.
  • In Torch-TensorRT 2.12, the ExecuTorch integration still depends on LibTorch in the native runtime path.
    The next release, Torch-TensorRT 2.13, is planned to move this to a pure ExecuTorch backend implementation without the LibTorch runtime dependency.

Known Limitations

  • ExecuTorch support still depends on the Torch/LibTorch C++ libraries used by Torch-TensorRT; this release does not provide a pure
    ExecuTorch-only TensorRT deployment path.
  • TensorRT engine payloads larger than 2 GiB are not supported when embedded in an ExecuTorch .pte file.
  • Selecting a target device during ExecuTorch export is not currently supported. Exported .pte files default to cuda:0.

Comprehensive Attention Support

This release extends the TRT attention converters to support GQA/MQA and decode-phase attention based on TensorRT IAttentionLayer. Specifically, it covers all SDPA kernel variants, MHA/GQA/MQA attention patterns, causal vs non-causal masking, bool/float/broadcast mask shapes, decode-phase attention (seq_q=1), non-power-of-2 head dims, LLM-realistic configs, and multiple dtypes. This feature is enabled by default. If you want to turn it off, please set decompose_attention=True.
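
For readers less familiar with the attention patterns named above, the sketch below shows how query heads are conventionally grouped onto key/value heads in MHA, GQA, and MQA. It is plain Python and illustrative only: the function names are hypothetical, and the grouping convention (consecutive query heads share one KV head) is assumed to match the one PyTorch documents for scaled_dot_product_attention with enable_gqa, not taken from the converter's source.

```python
def attention_pattern(num_q_heads: int, num_kv_heads: int) -> str:
    """Classify the head layout as MHA, MQA, or GQA."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    if num_kv_heads == num_q_heads:
        return "MHA"  # every query head has its own K/V head
    if num_kv_heads == 1:
        return "MQA"  # all query heads share a single K/V head
    return "GQA"      # query heads share K/V heads in equal-sized groups


def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Return the K/V head index that a given query head attends with."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


# 8 query heads over 2 K/V heads: heads 0-3 use K/V head 0, heads 4-7 use K/V head 1.
mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
print(attention_pattern(8, 2), mapping)
```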

Known Limitations

  • TensorRT 10.x (will be resolved in TRT 11.0) and TensorRT-RTX 1.4:
    With large causal k/v sequences (seq >= 512, is_causal=True) in FP16/BF16,
    IAttentionLayer produces roughly 80% element mismatch. As a workaround, FP32 is used for
    the scale factor. To use the exact dtype instead, set decompose_attention=True
    or upgrade to TRT 11.0 or later.

Comprehensive Complex Numerics Support

Torch-TensorRT can now compile models containing complex64/complex128 tensors end-to-end. TensorRT itself has no native complex dtype — a lowering pass intercepts complex subgraphs before partitioning and rewrites them into equivalent real arithmetic on a (..., 2) last-dim layout (real/imag interleaved), so the engine only sees standard float ops and callers don't have to change anything.
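
The (..., 2) layout can be illustrated without any tensor library. The sketch below uses plain Python lists to mimic what torch.view_as_real and torch.view_as_complex do to the last dimension. The helper names are hypothetical and this is not the lowering pass itself.

```python
from typing import List, Sequence


def view_as_real_pairs(values: Sequence[complex]) -> List[List[float]]:
    """Pack complex numbers into a (..., 2) layout: [real, imag] per element."""
    return [[z.real, z.imag] for z in values]


def view_as_complex_values(pairs: Sequence[Sequence[float]]) -> List[complex]:
    """Inverse of view_as_real_pairs: rebuild complex numbers from [real, imag]."""
    return [complex(re, im) for re, im in pairs]


signal = [1 + 2j, -0.5 + 0j, 3j]
packed = view_as_real_pairs(signal)        # shape (3,) complex -> (3, 2) float
restored = view_as_complex_values(packed)  # round trip back to complex
print(packed)
```

After packing, every element is an ordinary float, which is why the engine only needs standard float ops.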

This unlocks compilation of models that use complex arithmetic for rotary position embeddings — Llama 3 (1D RoPE), and video generation transformers like CogVideoX, Wan, and HunyuanVideo (3D RoPE) — including under dynamic shapes and in distributed (multi-GPU) settings.

What's supported

  • Complex inputs (placeholders) and buffers (get_attr) are rewritten to real-valued equivalents. placeholder(complex64) becomes placeholder(float32) with an appended trailing dim of 2; complex buffers are replaced via torch.stack([t.real, t.imag], dim=-1). Dynamic-shape SymInts are preserved across the rewrite.
  • Complex multiply (aten.mul.Tensor between two complex operands) is decomposed to the standard identity (ac − bd) + (ad + bc)i.
  • view_as_complex / view_as_real are erased — they become identities once the layout is already (..., 2).
  • Shape-manipulation ops are handled with the trailing real/imag dim in mind: reshape / view / _unsafe_view, flatten, unsqueeze, squeeze, permute, transpose, t, cat, stack, select, slice, narrow, roll, flip, split, chunk, expand, repeat. Negative dim indices are auto-shifted by −1 so dim=-1 keeps meaning "the original last complex dim."
  • Math ops over complex inputs: mm / bmm / matmul (including complex×real and real×complex mixed forms), abs, angle, real, imag, sin/cos/exp/log on complex, reciprocal (for scalar / complex), sum.dim_IntList / mean.dim / prod.dim_int, ones_like / full_like (correctly initialise as 1+0i / fill+0i).
  • Engine outputs that stay complex — models that return a complex tensor without going through view_as_real are also detected via a forward scan that collects unbounded complex nodes.
  • Runtime I/O — complex inputs can be passed as-is at call time. The runtime modules automatically apply torch.view_as_real(x).contiguous() to complex inputs before handing them to the engine, and rebuild complex outputs on the way back.
  • truncate_double=True lowers complex128 to float32 instead of the default float64.
  • Refit caching — complex buffers via a last-dim slice-matching stage in _save_weight_mapping, plus a tuple-keyed (sd_key, last_dim, idx) lookup in construct_refit_mapping_from_weight_name_map. Verification picks the real-unpacked branch for the reference module when complex inputs are present.
  • Graceful fallback — complex ops the rewriter doesn't know how to handle now fall back to PyTorch execution rather than failing the compile.
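
Two of the rewrites in the list above are easy to check by hand: the complex-multiply identity and the shift applied to negative dim indices. The following is a minimal plain-Python sketch of both. The function names are hypothetical and the actual pass operates on FX graph nodes rather than Python tuples.

```python
from typing import Tuple

RealPair = Tuple[float, float]  # (real, imag), i.e. the trailing dim of size 2


def complex_mul_real(x: RealPair, y: RealPair) -> RealPair:
    """(a + bi)(c + di) = (ac - bd) + (ad + bc)i, using real arithmetic only."""
    a, b = x
    c, d = y
    return (a * c - b * d, a * d + b * c)


def shift_dim_for_real_layout(dim: int) -> int:
    """Map a dim index on the complex tensor to the (..., 2) real tensor.

    Non-negative dims are unchanged. Negative dims move one step further
    from the end because a trailing real/imag dim was appended.
    """
    return dim - 1 if dim < 0 else dim


product = complex_mul_real((1.0, 2.0), (3.0, -4.0))  # (1 + 2i)(3 - 4i)
print(product, shift_dim_for_real_layout(-1))
```

Comparing the result against Python's built-in complex type, as in complex(*product) == (1 + 2j) * (3 - 4j), is a quick way to confirm the identity.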

Known limitations

  • Only view_as_real-anchored subgraphs and forward-scanned complex outputs are detected; complex arithmetic that escapes both paths still fails.
  • Complex convolution / batch norm are not rewritten — only elementwise and shape-manipulation patterns plus the matmul family.
  • Real parameters intentionally shaped (d, 2) that feed complex-arithmetic ops will not be auto-promoted to complex layout; route them through the standard view_as_complex → mul → view_as_real pattern instead.

Reference PR: #4119 — Comprehensive Complex Numerics Support.

Debugger

TensorRT API Capture
In this release, we have improved the TensorRT API Capture and Replay feature.
It allows you to record the engine-building phase of your model and replay the engine-build steps later.

Capture:
The capture feature is disabled by default.
Enable it by setting the environment variable TORCHTRT_ENABLE_TENSORRT_API_CAPTURE=1:
TORCHTRT_ENABLE_TENSORRT_API_CAPTURE=1 python your_model_test.py
Capture and replay files are automatically saved under debuglogs/capture_replay/ (i.e., the capture_replay subdirectory of logging_dir). You should see capture.json and associated .bin files generated there.
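
One way to confirm that a capture run produced output is to list that directory. The sketch below assumes the default logging_dir of debuglogs mentioned above. The helper name list_capture_artifacts is hypothetical and not part of Torch-TensorRT.

```python
from pathlib import Path
from typing import Dict, List


def list_capture_artifacts(logging_dir: str = "debuglogs") -> Dict[str, List[str]]:
    """List the .json and .bin files under <logging_dir>/capture_replay."""
    capture_dir = Path(logging_dir) / "capture_replay"
    if not capture_dir.is_dir():
        # Capture was not enabled, or a different logging_dir was used.
        return {"json": [], "bin": []}
    return {
        "json": sorted(p.name for p in capture_dir.glob("*.json")),
        "bin": sorted(p.name for p in capture_dir.glob("*.bin")),
    }


if __name__ == "__main__":
    print(list_capture_artifacts())
```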

Replay:
Use the tensorrt_player tool to replay the captured TensorRT engine build without the original framework:
tensorrt_player -j /absolute/path/to/shim.json -o /absolute/path/to/output_engine

Limitations:
  • This feature is currently restricted to Linux (x86-64 and aarch64).

You can see more details in
https://docs.pytorch.org/TensorRT/debugging/capture_and_replay.html

Full Changelog: v2.11.0...v2.12.0