feat: TorchTRT Annotation Layer for Cuda generated kernels by bowang007 · Pull Request #4199 · pytorch/TensorRT

bowang007 · 2026-04-21T16:55:24Z

Description

This PR introduces torch_tensorrt.annotation, an experimental module for registering hand-written CUDA C++ kernels as both PyTorch custom ops (for eager execution) and TensorRT Quick Deployable Plugins with AOT support (for torch_tensorrt.compile).

Usage

  import torch, torch_tensorrt
  import torch_tensorrt.annotation as tta                                                                                                                           
   
  CU = """                                                                                                                                                          
  extern "C" __global__ void my_sigmoid(const float* x, int n, float* y) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) y[i] = 1.0f / (1.0f + __expf(-x[i]));                                                                                                              
  }
  """                                                                                                                                                               
                  
  tta.auto_cuda_kernel_plugin(                                                                                                                                      
      "ann_ex::sigmoid",
      tta.KernelSpec(                                                                                                                                               
          kernel_source=CU, kernel_name="my_sigmoid",
          inputs=[tta.InputDecl("x")],                                                                                                                              
          outputs=[tta.OutputDecl("y", shape=tta.SameAs(0))],
          extras=[tta.Numel("x")],                                                                                                                                  
          geometry=tta.Elementwise(block=(256,), layout="flat"),                                                                                                    
      ),
  )

After this call, torch.ops.ann_ex.sigmoid is available in eager and is embedded as a TensorRT plugin during torch_tensorrt.compile. The meta function, eager
launch, AOT implementation, and PyTorch schema are all derived from the KernelSpec.

API Surface

The module exposes two primary entry points, layered by declarativeness:

auto_cuda_kernel_plugin is the recommended default. The caller supplies a KernelSpec dataclass describing the kernel's inputs, outputs (with a shape relation such
as SameAs or ReduceDims), scalar extras (Numel, DimSize), and launch geometry (Elementwise or Reduction). The framework derives the meta function, eager CUDA
launch, TensorRT AOT implementation, and PyTorch schema. This path covers pointwise kernels (1-D flat or N-D grid launches), reductions (with optional keepdim),
multi-input kernels, and scalar (non-tensor) kernel arguments via ScalarInput.

manual_cuda_kernel_plugin is the lower-level alternative for kernels outside the declarative DSL — shape-changing outputs, multi-output kernels, or non-standard
launch geometries. The caller provides eager_fn and aot_fn directly; the decorator still registers the PyTorch op, TRT plugin, AOT implementation, and converter
in a single call.

A Custom(fn=...) geometry is also available for callers who want the declarative path's schema/meta derivation but need to hand-write the TRT KernelLaunchParams.

Type of change

New feature (non-breaking change which adds functionality)

Checklist:

My code follows the style guidelines of this project (You can use the linters)
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas and hacks
I have made corresponding changes to the documentation
I have added tests to verify my fix or my feature
New and existing unit tests pass locally with my changes
I have added the relevant labels to my PR in so that relevant reviewers are notified

narendasan

Cool I think this is getting really close. I think we just have a few naming things to make this more user friendly and I think we should let users provide PTX directly in addition to the cuda apis. Also did you add nvrtc as an optional dependency in the pyproject.toml) (maybe under an a extras called kernels)?

narendasan · 2026-04-30T15:56:26Z

@@ -0,0 +1,156 @@
+.. _torch_tensorrt_annotation_py:
+
+torch_tensorrt.annotation


I think this should be called torch_tensorrt.kernels

Hey @narendasan should we just deprecate the naming related to annotation? I'm a little bit confused here since this annotation is really opaque to the users and there is no annotation module right now

narendasan · 2026-04-30T16:02:23Z

+    return params, extra
+
+
+@tta.manual_cuda_kernel_plugin(


What is the distinction between manual and auto from a user perspective?

To me it just seems like the function bodies are just manually configured here, why not just support additional kwargs in the same api?

I’m planning to present how to manually write a CUDA kernel plugin. We have the option to completely remove this manual approach and rely entirely on automatic generation so users don't have to worry about it.

However, doing so means users lose control over launch functions and configurations. In this specific case, automatic generation doubles the number of launching threads, which could lead to memory inefficiencies and limit our ability to support dynamic cases.

@narendasan – We can drop the manual options for the sake of simplicity, but it comes at the cost of efficiency and flexibility. Which approach do you prefer?

narendasan · 2026-05-01T16:29:07Z

+    )
+
+
+def custom_plugin(


there is now a torch_tensorrt.annotations(kernels).custom_plugin and a torch_tensorrt.dynamo.conversion.plugins.custom_op. Why cant we just centralize on one?

Or make it clear what is used for what by disambiguating the names

@narendasan we could do that by just removing the annotation related stuff for now. I'm just putting that in the annotation in case the TensorRT team wants to develop some annotation related stuff in the future. Do we need to keep that stub for them now?

narendasan · 2026-05-01T16:43:01Z

+# Numel("x")         pass x.numel() to the kernel as an int extra.
+# Elementwise(flat)  1-D launch over the flattened output; any input rank works.
+
+tta.auto_cuda_kernel_plugin(


maybe we can call this something like torch_tensorrt.kernels.cuda_kernel_op

narendasan · 2026-05-01T16:46:49Z

+    aot_fn=_aot_repeat2,
+    supports_dynamic_shapes=True,
+)
+def _repeat2_meta(x: torch.Tensor) -> torch.Tensor:


Why is the meta kernel the one that we decorate? to me the obvious thing to decorate is the jit_impl_fn

I think the manual api as a decorator is somewhat confusing, imo we either do the workflow that we already have with multiple decorators (https://docs.pytorch.org/TensorRT/tutorials/_rendered_examples/dynamo/nvrtc_aot_plugin.html) or we dont decorate anything and just have a function that takes all of kernel source, meta, jit and aot as argument

narendasan · 2026-05-01T16:51:20Z

+            return
+        from torch_tensorrt.annotation._custom_plugin._nvrtc import compile_to_ptx
+
+        _ptx, device, kernel = compile_to_ptx(


Could we also expose a torch_tensorrt.kernels.ptx_op, that just takes externally created valid ptx?

meta-cla Bot added the cla signed label Apr 21, 2026

github-actions Bot added component: tests Issues re: Tests component: api [Python] Issues re: Python API labels Apr 21, 2026

github-actions Bot requested a review from lanluo-nvidia April 21, 2026 16:55