
Subject: Forcing NHWC Layout for Convolutions in Mixed CNN-Transformer Graphs (TensorRT 10.10 - Blackwell) #4677

Description

Environment:

GPU: NVIDIA RTX 5070 (Blackwell)

TensorRT Version: 10.10

Workflow: PyTorch → TensorRT Model Optimizer (Q/DQ) → ONNX → TensorRT Engine

Architecture Overview: My model follows this sequence: RGB Input → Backbone (CNN) → BiFPN (CNN) → Transformer → 3D Deconvolution.

The Problem: Currently, TensorRT defaults the Backbone and BiFPN stages to NCHW (Channel First). However, the Transformer and Deconv3D stages naturally operate in NHWC (Channel Last). This mismatch introduces "Reformat" (transpose) nodes in the final engine, creating significant latency overhead and breaking potential fusions.

Despite using the TensorRT Model Optimizer to maximize fusions with Q/DQ nodes, I haven't been able to force the entire graph into a unified NHWC layout. Given that Blackwell Tensor Cores are highly efficient with interleaved data, maintaining NHWC throughout would be ideal.
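For context, this is how I am quantifying the reformat overhead (a minimal sketch: the engine path is a placeholder, and matching on the layer name is just a heuristic):

```python
import json

import tensorrt as trt

# Deserialize the engine and dump per-layer info (building with
# ProfilingVerbosity.DETAILED gives the richest output).
logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("model.engine", "rb") as f:  # placeholder path
    engine = runtime.deserialize_cuda_engine(f.read())

inspector = engine.create_engine_inspector()
info = json.loads(inspector.get_engine_information(trt.LayerInformationFormat.JSON))

# Depending on verbosity, "Layers" holds dicts or bare name strings.
names = [l["Name"] if isinstance(l, dict) else l for l in info.get("Layers", [])]
reformats = [n for n in names if "Reformat" in n]
print(f"{len(reformats)} reformat layers out of {len(names)} total")
```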

Technical Questions:

Enforcing NHWC for Convolutions: In TensorRT 10.10, is there a way to explicitly constrain the builder to select NHWC tactics for the initial CNN stages? Can this be done via the IAlgorithmSelector or a specific flag in the Model Optimizer?
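For reference, the closest native knob I have found is `ITensor.allowed_formats`, which (as far as I understand) only pins the engine's I/O boundary rather than internal tactic selection. A minimal sketch of what I tried, with a placeholder model path:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)  # explicit batch is the default in TRT 10
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:  # placeholder path
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

# Pin the boundary tensors to an FP16 channel-last format (kHWC8 pads C to a
# multiple of 8). This only fixes the I/O layout; the tactic selector still
# chooses internal layer layouts on its own.
for tensor in (network.get_input(0), network.get_output(0)):
    tensor.dtype = trt.float16
    tensor.allowed_formats = 1 << int(trt.TensorFormat.HWC8)

engine_bytes = builder.build_serialized_network(network, config)
```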

CUTLASS Integration: I am aware CUTLASS offers high-performance NHWC convolution kernels. Does TensorRT 10.10 automatically leverage these for Blackwell when NHWC is preferred, or should I implement a custom plugin to guarantee this layout?

Q/DQ Influence on Layout: Does the placement of Q/DQ nodes by the Model Optimizer restrict the tactic selection to NCHW for standard convolutions? How can I ensure that quantization doesn't "lock" the model into an inefficient layout?

Graph Surgeon vs. Native TRT: Would it be more effective to use onnx-graphsurgeon to manually inject transposes and "trick" the builder into NHWC, or is there a more "native" way to handle this within the TensorRT 10.x API?
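To make the graph-surgeon option concrete, this is roughly what I had in mind: re-declare the graph input as NHWC and transpose back to NCHW internally, hoping the builder folds the Transpose into an NHWC convolution tactic. A sketch, assuming a single static 4-D input and placeholder file names:

```python
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model.onnx"))  # placeholder path

inp = graph.inputs[0]
n, c, h, w = inp.shape  # assumes a static NCHW input
nchw = gs.Variable(name=inp.name + "_nchw", dtype=inp.dtype, shape=(n, c, h, w))

# Rewire every consumer of the original input onto the new internal tensor.
for node in list(inp.outputs):
    for i, t in enumerate(node.inputs):
        if t is inp:
            node.inputs[i] = nchw

# The client-facing input becomes channel-last; a single Transpose restores
# NCHW so the rest of the (logically NCHW) ONNX graph stays valid.
inp.shape = (n, h, w, c)
graph.nodes.append(
    gs.Node(op="Transpose", attrs={"perm": [0, 3, 1, 2]}, inputs=[inp], outputs=[nchw])
)

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "model_nhwc_io.onnx")
```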

I am looking for the most efficient way to achieve a zero-reformat pipeline. If other frameworks (like Torch-TensorRT) offer better layout control for this specific use case, I am open to suggestions.
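On that last point, the Torch-TensorRT route I would compare against looks roughly like this (a sketch with a stand-in module; I have not verified how the dynamo path treats channels_last):

```python
import torch
import torch_tensorrt

# Stand-in for my real backbone + BiFPN + Transformer + Deconv3D module.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU()
).eval().cuda().to(memory_format=torch.channels_last)

trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[torch_tensorrt.Input((1, 3, 512, 512), dtype=torch.float32)],
    enabled_precisions={torch.half},  # let TRT pick FP16 tactics
)
```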
