
Subject: Forcing NHWC Layout for Convolutions in Mixed CNN-Transformer Graphs (TensorRT 10.10 - Blackwell) #4677

Description

Environment:

GPU: NVIDIA RTX 5070 (Blackwell)

TensorRT Version: 10.10

Workflow: PyTorch → TensorRT Model Optimizer (Q/DQ) → ONNX → TensorRT Engine

Architecture Overview: My model follows this sequence: RGB Input → Backbone (CNN) → BiFPN (CNN) → Transformer → 3D Deconvolution.

The Problem: Currently, TensorRT defaults the Backbone and BiFPN stages to NCHW (Channel First). However, the Transformer and Deconv3D stages naturally operate in NHWC (Channel Last). This mismatch introduces "Reformat" (transpose) nodes in the final engine, creating significant latency overhead and breaking potential fusions.

Despite using the TensorRT Model Optimizer to maximize fusions with Q/DQ nodes, I haven't been able to force the entire graph into a unified NHWC layout. Given that Blackwell Tensor Cores are highly efficient with interleaved data, maintaining NHWC throughout would be ideal.
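For context, this is how I am quantifying the reformat overhead (a minimal sketch: the engine path is a placeholder, and matching on the layer name is just a heuristic):

```python
import json

import tensorrt as trt

# Deserialize the engine and dump per-layer info (building with
# ProfilingVerbosity.DETAILED gives the richest output).
logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("model.engine", "rb") as f:  # placeholder path
    engine = runtime.deserialize_cuda_engine(f.read())

inspector = engine.create_engine_inspector()
info = json.loads(inspector.get_engine_information(trt.LayerInformationFormat.JSON))

# Depending on verbosity, "Layers" holds dicts or bare name strings.
names = [l["Name"] if isinstance(l, dict) else l for l in info.get("Layers", [])]
reformats = [n for n in names if "Reformat" in n]
print(f"{len(reformats)} reformat layers out of {len(names)} total")
```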

Technical Questions:

Enforcing NHWC for Convolutions: In TensorRT 10.10, is there a way to explicitly constrain the builder to select NHWC tactics for the initial CNN stages? Can this be done via the IAlgorithmSelector or a specific flag in the Model Optimizer?
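For reference, the closest native knob I have found is `ITensor.allowed_formats`, which (as far as I understand) only pins the engine's I/O boundary rather than internal tactic selection. A minimal sketch of what I tried, with a placeholder model path:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)  # explicit batch is the default in TRT 10
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:  # placeholder path
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

# Pin the boundary tensors to an FP16 channel-last format (kHWC8 pads C to a
# multiple of 8). This only fixes the I/O layout; the tactic selector still
# chooses internal layer layouts on its own.
for tensor in (network.get_input(0), network.get_output(0)):
    tensor.dtype = trt.float16
    tensor.allowed_formats = 1 << int(trt.TensorFormat.HWC8)

engine_bytes = builder.build_serialized_network(network, config)
```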

CUTLASS Integration: I am aware CUTLASS offers high-performance NHWC convolution kernels. Does TensorRT 10.10 automatically leverage these for Blackwell when NHWC is preferred, or should I implement a custom plugin to guarantee this layout?

Q/DQ Influence on Layout: Does the placement of Q/DQ nodes by the Model Optimizer restrict the tactic selection to NCHW for standard convolutions? How can I ensure that quantization doesn't "lock" the model into an inefficient layout?

Graph Surgeon vs. Native TRT: Would it be more effective to use onnx-graphsurgeon to manually inject transposes and "trick" the builder into NHWC, or is there a more "native" way to handle this within the TensorRT 10.x API?
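To make the graph-surgeon option concrete, this is roughly what I had in mind: re-declare the graph input as NHWC and transpose back to NCHW internally, hoping the builder folds the Transpose into an NHWC convolution tactic. A sketch, assuming a single static 4-D input and placeholder file names:

```python
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model.onnx"))  # placeholder path

inp = graph.inputs[0]
n, c, h, w = inp.shape  # assumes a static NCHW input
nchw = gs.Variable(name=inp.name + "_nchw", dtype=inp.dtype, shape=(n, c, h, w))

# Rewire every consumer of the original input onto the new internal tensor.
for node in list(inp.outputs):
    for i, t in enumerate(node.inputs):
        if t is inp:
            node.inputs[i] = nchw

# The client-facing input becomes channel-last; a single Transpose restores
# NCHW so the rest of the (logically NCHW) ONNX graph stays valid.
inp.shape = (n, h, w, c)
graph.nodes.append(
    gs.Node(op="Transpose", attrs={"perm": [0, 3, 1, 2]}, inputs=[inp], outputs=[nchw])
)

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "model_nhwc_io.onnx")
```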

I am looking for the most efficient way to achieve a zero-reformat pipeline. If other frameworks (like Torch-TensorRT) offer better layout control for this specific use case, I am open to suggestions.
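On that last point, the Torch-TensorRT route I would compare against looks roughly like this (a sketch with a stand-in module; I have not verified how the dynamo path treats channels_last):

```python
import torch
import torch_tensorrt

# Stand-in for my real backbone + BiFPN + Transformer + Deconv3D module.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU()
).eval().cuda().to(memory_format=torch.channels_last)

trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[torch_tensorrt.Input((1, 3, 512, 512), dtype=torch.float32)],
    enabled_precisions={torch.half},  # let TRT pick FP16 tactics
)
```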
