
[MLAS] Add an NHWC implementation of convolution to avoid transposes #26834

Open
orlmon01 wants to merge 23 commits into microsoft:main from orlmon01:main

Conversation

@orlmon01 (Contributor)

  • Modification to the CPU EP to specify channels_last when the data format is NHWC
  • Added an NhwcFusedConv kernel
  • Implementation of the kernel in MLAS
  • Added compiler guards so it is only used with KleidiAI (for now; can be removed if needed)
  • Added unit tests

Description

Currently ONNX Runtime uses NCHW as the default data layout. For optimizations and kernels that operate better in NHWC layout, or where the data layout is NHWC in the first place, Transpose nodes are added around the layers. This patch seeks to eliminate them for convolutions where they would cause a performance decrease.
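For readers less familiar with the two layouts, here is an illustrative comparison (not code from this PR) of how the same element is addressed in each. NHWC keeps one pixel's channels contiguous, which is the access pattern channels-last kernels exploit, and why running them on NCHW data otherwise needs Transposes on either side:

```cpp
#include <cstddef>

// NCHW: channel-major; a whole H*W plane per channel sits together in memory.
inline size_t IndexNchw(size_t n, size_t c, size_t h, size_t w,
                        size_t C, size_t H, size_t W) {
  return ((n * C + c) * H + h) * W + w;
}

// NHWC (channels-last): channels are innermost, so one pixel's C values are
// contiguous - the layout channels-last convolution kernels prefer.
inline size_t IndexNhwc(size_t n, size_t c, size_t h, size_t w,
                        size_t C, size_t H, size_t W) {
  return ((n * H + h) * W + w) * C + c;
}
```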

Motivation and Context

KleidiAI-specific implementation of this feature. It only supports regular convolutions; depthwise support is to follow. As a result it is currently somewhat strict about which filters qualify.

…transposes

* Modification to the CPU EP to specify channels_last when the data format is NHWC
* Added an NhwcFusedConv kernel
* Implementation of the kernel in MLAS
* Added compiler guards so it is only used with KleidiAI (for now; can be removed if needed)
* Added unit tests

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
@orlmon01 (Contributor, Author)

@microsoft-github-policy-service agree company="Arm"

orlmon01 marked this pull request as draft December 19, 2025 12:34
orlmon01 marked this pull request as ready for review December 19, 2025 12:35
@orlmon01 (Contributor, Author)

Feedback appreciated as this PR makes quite a lot of changes to the codebase well outside of the normal KleidiAI scope.

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
@Rohanjames1997 (Contributor)

Hi @orlmon01, I imagine that avoiding transposes also improves performance.
Do you have any performance results to share?
TIA!

@orlmon01 (Contributor, Author)

orlmon01 commented Jan 6, 2026

> Hi @orlmon01, I imagine that avoiding transposes also improves performance. Do you have any performance results to share? TIA!

Hiya,
Sorry, just back after the holidays. Yes, there is a performance increase. It depends on the model: ones with multiple consecutive convolutions where transposes can be eliminated will see a larger speedup. Even with the limited range of convolutions it's implemented for, there should still be a performance increase in most cases.

I have some numbers somewhere from a Mobilenet model I was using for testing that I'll add in a bit, once I find / regenerate them. :)

@orlmon01 (Contributor, Author)

orlmon01 commented Jan 6, 2026

mobilenet model without the current patch:

Setting intra_op_num_threads to 1
Overriding dimension with name, N, to 1
Overriding dimension with name, T, to 1000
Overriding dimension with name, cache_T_attn, to 32
Overriding dimension with name, right_context, to 5
Session creation time cost: 0.020627 s
First inference time cost: 10 ms
Total inference time cost: 1.33257 s
Total inference requests: 200
Average inference time cost total: 6.662851 ms
Total inference run time: 1.33266 s
Number of inferences per second: 150.075
Avg CPU usage: 16 %
Peak working set size: 85429583872 bytes
Min Latency: 0.006193 s
Max Latency: 0.007625 s
P50 Latency: 0.00666992 s
P90 Latency: 0.00686983 s
P95 Latency: 0.00694425 s
P99 Latency: 0.00733196 s
P999 Latency: 0.007625 s

Same model with changes:

Setting intra_op_num_threads to 1
Overriding dimension with name, N, to 1
Overriding dimension with name, T, to 1000
Overriding dimension with name, cache_T_attn, to 32
Overriding dimension with name, right_context, to 5
Session creation time cost: 0.0217724 s
First inference time cost: 7 ms
Total inference time cost: 1.12897 s
Total inference requests: 200
Average inference time cost total: 5.644857 ms
Total inference run time: 1.12905 s
Number of inferences per second: 177.14
Avg CPU usage: 16 %
Peak working set size: 80362864640 bytes
Min Latency: 0.00527438 s
Max Latency: 0.006706 s
P50 Latency: 0.00566529 s
P90 Latency: 0.00579958 s
P95 Latency: 0.00585429 s
P99 Latency: 0.00639058 s
P999 Latency: 0.006706 s

@edgchen1 (Contributor)

edgchen1 commented Jan 8, 2026

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Update to the internal_testings_tests helper macros for file expansion so it works on other platforms
Fix for failing ConvDepthwiseFloat test, allows for a small tolerance when running on different hardware
Fix for failing TestSaveAndLoadOrtModel test
Make sure the model being saved / loaded is being done from a writeable location
Fix for undeclared identifier linker error
@hariharans29 (Member)

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
@hariharans29 (Member)

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Copilot AI (Contributor) left a comment

Pull request overview

This PR adds an NHWC (channels-last) implementation of convolution operations to avoid costly transpose operations in the CPU execution provider. The implementation includes KleidiAI-specific optimizations and a fallback path for NHWC convolutions.

Changes:

  • Added NhwcFusedConv kernel for float32 convolutions in NHWC layout (KleidiAI-guarded)
  • Implemented NHWC fast path and fallback path with explicit NHWC↔NCHW conversions in MLAS
  • Extended test infrastructure to resolve paths dynamically and filter NHWC transformers in existing tests

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 7 comments.

Show a summary per file

| File | Description |
| --- | --- |
| onnxruntime/core/providers/cpu/nn/conv.h | Added channels_last_ flag to Conv kernel |
| onnxruntime/core/providers/cpu/nn/conv.cc | Implemented NHWC convolution logic with fast path and fallback |
| onnxruntime/core/optimizer/nhwc_transformer.cc | Added KleidiAI filter and FusedConv sum input handling |
| onnxruntime/core/mlas/lib/convolve.cpp | Added ChannelsLast parameter to MlasConvPrepare |
| onnxruntime/core/mlas/inc/mlas.h | Added ChannelsLast field to MLAS_CONV_PARAMETERS |
| onnxruntime/contrib_ops/cpu/fused_conv.cc | Registered NhwcFusedConv kernel |
| onnxruntime/test/optimizer/nhwc_transformer_test.cc | Added depthwise convolution test case |
| onnxruntime/test/optimizer/fuse_initializers_transformer_test.cc | Filtered NhwcTransformer in tests |
| onnxruntime/test/optimizer/conv_add_act_test.cc | Updated to handle both FusedConv variants |
| onnxruntime/test/internal_testing_ep/internal_testing_tests.cc | Added path resolution utilities |
| onnxruntime/test/framework/ort_model_only_test.cc | Added path resolution with diagnostic output |
| onnxruntime/core/util/math_cpu.cc | Added Im2col instantiation for float NHWC |
| onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp | Updated to support channels-last input |
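To make the conv.cc row concrete: the change reads as a fast path plus a fallback. The toy sketch below shows the shape of that split; every name in it is a hypothetical stand-in, not the PR's actual code.

```cpp
#include <iostream>

enum class Layout { NCHW, NHWC };

// Hypothetical stand-in for the build-time / runtime KleidiAI capability check.
bool MlasNhwcSupported() { return true; }

void RunConv(Layout layout) {
  if (layout == Layout::NCHW) {
    std::cout << "existing NCHW kernel, behaviour untouched\n";
    return;
  }
  if (MlasNhwcSupported()) {
    // Fast path: NHWC consumed directly, so no Transpose nodes are needed.
    std::cout << "NHWC fast path\n";
  } else {
    // Fallback: explicit NHWC -> NCHW -> NHWC conversions around the existing
    // kernel keep NhwcFusedConv functionally correct without KleidiAI.
    std::cout << "convert to NCHW, run NCHW kernel, convert back\n";
  }
}
```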


Comment on lines +341 to +355
```cpp
if (node->OpType() == "Conv" || node->OpType() == "FusedConv") {
  const auto group_opt = node->GetAttributeInt("group");
  if (group_opt.has_value() && group_opt.value() != 1) {
    continue;
  }

  const auto dilations_opt = node->GetAttributeInts("dilations");
  if (dilations_opt.has_value()) {
    const auto& dilations = dilations_opt.value();
    if ((dilations.size() >= 1 && dilations[0] != 1) ||
        (dilations.size() >= 2 && dilations[1] != 1)) {
      continue;
    }
  }
}
```
Copilot AI Jan 16, 2026

This validation logic duplicates the filter logic defined earlier in the same file (lines 187-203 and KleidiFp32NhwcFilter). Consider extracting this validation into a shared helper function to avoid code duplication and ensure consistency.
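A minimal sketch of the shared helper suggested here, reusing the attribute calls already visible in the snippet; the function name is hypothetical, and NodeT stands in for whatever node type the two call sites use.

```cpp
// Hypothetical shared helper: true if the node satisfies the NHWC conv constraints.
template <typename NodeT>
bool SatisfiesNhwcConvConstraints(const NodeT& node) {
  // Grouped convolutions are not handled by the NHWC fast path.
  const auto group_opt = node.GetAttributeInt("group");
  if (group_opt.has_value() && group_opt.value() != 1) {
    return false;
  }
  // Only dilation == 1 is supported in each spatial dimension.
  const auto dilations_opt = node.GetAttributeInts("dilations");
  if (dilations_opt.has_value()) {
    const auto& dilations = dilations_opt.value();
    if ((dilations.size() >= 1 && dilations[0] != 1) ||
        (dilations.size() >= 2 && dilations[1] != 1)) {
      return false;
    }
  }
  return true;
}
```

Both KleidiFp32NhwcFilter and the transformer loop could then call it, keeping the two call sites in sync.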

```cpp
const auto& graph1 = session2->GetGraph();
// model should have all the original nodes and we should be able to execute with the fallback to CPU EP
ASSERT_EQ(graph1.NumberOfNodes(), num_nodes);
// ensure we can execute with the fallback to CPU EP even if additional nodes are introduced during loading
```
Copilot AI Jan 16, 2026

The comment mentions 'additional nodes are introduced during loading', but this change appears unrelated to the NHWC convolution feature. If this is a side effect of the NHWC transformer introducing additional nodes, this should be explicitly documented in the comment.

Suggested change:

```diff
-// ensure we can execute with the fallback to CPU EP even if additional nodes are introduced during loading
+// ensure we can execute with the fallback to CPU EP even if additional nodes are introduced during loading
+// (e.g., by graph transformers such as the NHWC convolution transformer that may add nodes when handling NHWC models)
```

hariharans29 changed the title from "Add an NHWC implementation of convolution to avoid transposes" to "[MLAS] Add an NHWC implementation of convolution to avoid transposes" Jan 21, 2026
orlmon01 and others added 6 commits January 26, 2026 10:07
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 9 comments.

Comments suppressed due to low confidence (1)

onnxruntime/core/providers/cpu/nn/conv.cc:399

  • The fallback path (lines 353-399) for convolutions with kernel_rank > 3 does not handle the channels_last layout. This path uses Im2col with StorageOrder::NCHW (line 367) regardless of the channels_last_ flag. If a channels_last Conv node with kernel_rank > 3 is created, it will produce incorrect results. Either add proper handling for channels_last in this path, or add a validation check earlier in the function to reject channels_last convolutions with kernel_rank > 3. (A minimal sketch of such a check follows the code context below.)
```cpp
} else {
  const int64_t input_image_size = input_shape.Size();
  const int64_t output_image_size = output_shape.Size();
  const int64_t kernel_size = TensorShape(kernel_shape).Size();
  const SafeInt<int64_t> X_offset = SafeInt<int64_t>(C) / conv_attrs_.group * input_image_size;
  const SafeInt<int64_t> Y_offset = SafeInt<int64_t>(Y->Shape().Size()) / Y->Shape()[0] / conv_attrs_.group;
  const SafeInt<int64_t> W_offset = SafeInt<int64_t>(W->Shape().Size()) / conv_attrs_.group;
  const SafeInt<int64_t> kernel_dim = SafeInt<int64_t>(C) / conv_attrs_.group * kernel_size;
  const int64_t col_buffer_size = kernel_dim * output_image_size;

  auto col_data = IAllocator::MakeUniquePtr<float>(alloc, narrow<size_t>(col_buffer_size));
  auto w_data = W->DataAsSpan<float>();
  for (int image_id = 0; image_id < N; ++image_id) {
    for (int group_id = 0; group_id < conv_attrs_.group; ++group_id) {
      math::Im2col<float, StorageOrder::NCHW>()(
          &Xdata[group_id * X_offset],
          input_shape.GetDims().data(),
          output_shape.GetDims().data(),
          kernel_dim,
          kernel_shape.data(),
          strides.data(),
          dilations.data(),
          pads.data(),
          narrow<int>(kernel_shape.size()),
          col_data.get());

      math::Gemm<float>(
          CblasNoTrans,
          CblasNoTrans,
          narrow<ptrdiff_t>(M / conv_attrs_.group),
          narrow<ptrdiff_t>(output_image_size),
          narrow<ptrdiff_t>(kernel_dim),
          1,
          &w_data[group_id * W_offset],
          col_data.get(),
          Beta,
          &Ydata[group_id * Y_offset],
          thread_pool);
    }

    MlasActivation(&activation_, Ydata.data(), Bdata, narrow<size_t>(M),
                   narrow<size_t>(output_image_size), narrow<size_t>(output_image_size));

    Xdata = Xdata.subspan(X_offset * conv_attrs_.group);
    Ydata = Ydata.subspan(Y_offset * conv_attrs_.group);
  }
}
```
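As the suppressed comment above suggests, a convolution flagged channels_last but routed into this NCHW-only path should be rejected up front. A minimal sketch, assuming the kernel's channels_last_ member and an ORT_RETURN_IF-style status macro; the message text and placement are illustrative.

```cpp
// Hypothetical early guard near the top of the Compute function: the
// Im2col/Gemm fallback above only understands NCHW, so refuse channels_last
// inputs whose kernel rank it would otherwise silently mishandle.
ORT_RETURN_IF(channels_last_ && kernel_rank > 3,
              "channels_last convolution is not implemented for kernel rank > 3");
```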


```cpp
    NhwcFusedConv,
    1,
    float,
    KernelDefBuilder()
```
Copilot AI Jan 26, 2026

The NhwcFusedConv kernel registration is missing the MayInplace hint that is present in the FusedConv registration. The FusedConv kernel uses .MayInplace(3, 0) to allow the optional "sum" input (index 3) to be reused as the output buffer (index 0) for efficiency. However, NhwcFusedConv does not include this hint. This means that even though the code in conv.cc handles the Sum input for channels_last mode, the allocation planner cannot optimize memory usage by reusing the Sum buffer for the output when using NhwcFusedConv. Consider adding .MayInplace(3, 0) to the NhwcFusedConv kernel builder to maintain consistency and enable the same memory optimization.

Suggested change:

```diff
-    KernelDefBuilder()
+    KernelDefBuilder()
+        // Allow the optional "sum" input (index 3) to be reused as the output buffer (index 0),
+        // consistent with the FusedConv kernel registration.
+        .MayInplace(3, 0)
```

Comment on lines +29 to +35
```cpp
ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
    NhwcFusedConv,
    1,
    float,
    KernelDefBuilder()
        .TypeConstraint("T", DataTypeImpl::GetTensorType<float>()),
    FusedConvFloat);
```
Copilot AI Jan 26, 2026

The NhwcFusedConv kernel registration (lines 29-35) is not conditionally compiled with USE_KLEIDIAI guards, but the PR description states this is a "KleidiAi specific implementation" that is "only used with KleidiAi (for now)". This is inconsistent with the conditional registration approach used in cpu_contrib_kernels.cc where the declaration is guarded by #ifdef USE_KLEIDIAI. Either this registration should also be conditionally compiled with USE_KLEIDIAI, or if the kernel is meant to work without KleidiAI (using the fallback path), the guards in cpu_contrib_kernels.cc should be removed.

```cpp
      TransformerLevel::Level3);
}

TEST(NhwcTransformerTests, ConvDepthwiseFloat) {
```
Copilot AI Jan 26, 2026

The test name "ConvDepthwiseFloat" and the test setup (group=8 with 8 input channels) indicates this is testing depthwise convolution. However, the test expectation on line 239 expects 0 NhwcFusedConv nodes, and line 240 expects 0 Transpose nodes, suggesting the transformer should NOT apply NHWC transformation to depthwise convolutions. This is inconsistent with the PR description which states "Only supports convolutions, DepthWise to follow" - implying depthwise convolutions are not yet supported. The test appears to be verifying that depthwise convolutions are correctly skipped, but the test name is misleading. Consider renaming to "ConvDepthwiseFloat_NotTransformed" or "ConvDepthwiseFloat_SkipNhwc" to make the intent clearer.

Suggested change:

```diff
-TEST(NhwcTransformerTests, ConvDepthwiseFloat) {
+TEST(NhwcTransformerTests, ConvDepthwiseFloat_SkipNhwc) {
```

Comment on lines +55 to +58
```cpp
const auto inputs = node.Inputs();
if (inputs.size() > 3 && !inputs[3].empty()) {
  return false;
}
```
Copilot AI Jan 26, 2026

The KleidiFp32NhwcFilter function rejects FusedConv nodes that have a Sum input at index 3 (lines 55-58). However, the conv.cc implementation (lines 256-274) includes logic to handle the Sum input in both the fast path and the fallback path for channels_last mode. This means the filter is more restrictive than necessary - it prevents transformation of FusedConv nodes with Sum inputs even though the implementation can handle them. Consider removing this check (lines 55-58) or updating the comment to explain why Sum is not supported in the KleidiAI fast path specifically.
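If the check is kept because only the KleidiAI fast path lacks Sum support, a sketch of the documented variant suggested here; the comment wording is illustrative, not from the PR.

```cpp
const auto inputs = node.Inputs();
if (inputs.size() > 3 && !inputs[3].empty()) {
  // The optional "sum" input is handled by the generic conv.cc paths, but the
  // KleidiAI fast path does not fuse the element-wise add, so keep such nodes
  // on the existing FusedConv route rather than converting them to NHWC here.
  return false;
}
```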

Comment on lines 214 to +355
```cpp
if (transform->has_channels_last_attrib_) {
  node->SetAttributeInt("channels_last", 1);
}

if (node->OpType() == "Conv" || node->OpType() == "FusedConv") {
  const auto group_opt = node->GetAttributeInt("group");
  if (group_opt.has_value() && group_opt.value() != 1) {
    continue;
  }

  const auto dilations_opt = node->GetAttributeInts("dilations");
  if (dilations_opt.has_value()) {
    const auto& dilations = dilations_opt.value();
    if ((dilations.size() >= 1 && dilations[0] != 1) ||
        (dilations.size() >= 2 && dilations[1] != 1)) {
      continue;
    }
  }
}
```
Copilot AI Jan 26, 2026

There is a logic error in the transformer flow. The code sets the "channels_last" attribute on line 338, then continues to check group and dilation constraints (lines 341-355), and only if those checks pass does it proceed with the transformation. However, if the checks fail and the code continues (skips the node), the "channels_last" attribute has already been set but the node hasn't been transformed. This will leave the node in an inconsistent state with channels_last=1 but still in NCHW layout without the necessary transposes. The attribute should only be set after all validation checks have passed, or the attribute should be cleared before continuing.
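A minimal sketch of the reordering this comment calls for, mirroring the snippet above (which sits inside the transformer's node loop): run the constraint checks first, and set the attribute only once the node is known to be transformable.

```cpp
// Sketch: validate first, set the attribute last (names mirror the snippet above).
if (node->OpType() == "Conv" || node->OpType() == "FusedConv") {
  const auto group_opt = node->GetAttributeInt("group");
  if (group_opt.has_value() && group_opt.value() != 1) {
    continue;  // node skipped with no stale channels_last attribute left behind
  }
  const auto dilations_opt = node->GetAttributeInts("dilations");
  if (dilations_opt.has_value()) {
    const auto& dilations = dilations_opt.value();
    if ((dilations.size() >= 1 && dilations[0] != 1) ||
        (dilations.size() >= 2 && dilations[1] != 1)) {
      continue;
    }
  }
}

// All checks passed; it is now safe to mark the node as channels-last.
if (transform->has_channels_last_attrib_) {
  node->SetAttributeInt("channels_last", 1);
}
```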

Comment on lines +217 to +222

```cpp
if (target.OpType() == "Conv") {
  return channels_last ? "NhwcFusedConv" : "FusedConv";
}

return "NhwcFusedConv";
```
Copilot AI Jan 26, 2026

The OpType method has incomplete logic. When the target node is not "Conv" (line 218), it unconditionally returns "NhwcFusedConv" (line 222). However, if the target is already "FusedConv" without channels_last attribute, it should remain as "FusedConv", not become "NhwcFusedConv". The correct logic should check if channels_last is set for all node types, not just for "Conv". Consider changing line 218 to check for both "Conv" and "FusedConv", or restructure to check channels_last first regardless of node type.

Suggested change:

```diff
-  if (target.OpType() == "Conv") {
-    return channels_last ? "NhwcFusedConv" : "FusedConv";
-  }
-
-  return "NhwcFusedConv";
+  const std::string& op_type = target.OpType();
+  // If channels_last is set, use NHWC fused convolution regardless of original op type.
+  if (channels_last) {
+    return "NhwcFusedConv";
+  }
+  // Without channels_last, convert Conv to FusedConv, and leave other op types unchanged.
+  if (op_type == "Conv") {
+    return "FusedConv";
+  }
+  return op_type;
```

Comment on lines 304 to 308
```cpp
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, float, FusedConv)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, float, NhwcFusedConv)>,
```
Copilot AI Jan 26, 2026

The registration of NhwcFusedConv kernel is unconditional in cpu_contrib_kernels.cc (line 308), but the kernel declaration is conditionally compiled with USE_KLEIDIAI guards in the same file (lines 21-23). This creates an inconsistency: when USE_KLEIDIAI is not defined, the declaration is absent but the registration still attempts to register the kernel, which will likely cause a compilation error. The registration on line 308 should also be wrapped with #ifdef USE_KLEIDIAI guards to match the conditional declaration.

Suggested change:

```diff
-    BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, float, FusedConv)>,
-    BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, float, NhwcFusedConv)>,
+    BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, float, FusedConv)>,
+#ifdef USE_KLEIDIAI
+    BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, float, NhwcFusedConv)>,
+#endif
```

Comment on lines +341 to +356
```cpp
if (node->OpType() == "Conv" || node->OpType() == "FusedConv") {
  const auto group_opt = node->GetAttributeInt("group");
  if (group_opt.has_value() && group_opt.value() != 1) {
    continue;
  }

  const auto dilations_opt = node->GetAttributeInts("dilations");
  if (dilations_opt.has_value()) {
    const auto& dilations = dilations_opt.value();
    if ((dilations.size() >= 1 && dilations[0] != 1) ||
        (dilations.size() >= 2 && dilations[1] != 1)) {
      continue;
    }
  }
}
```
Copilot AI Jan 26, 2026

The group and dilation checks in lines 341-355 duplicate the same checks already performed in the KleidiFp32NhwcFilter function (lines 75-87). This duplication is error-prone and makes maintenance harder. Since the KleidiFp32NhwcFilter is already called during NhwcConvLookup (line 319), these constraints should already be enforced for FusedConv/Conv nodes. Consider removing this redundant check, or if it's needed for non-KleidiAI cases, add a comment explaining why the duplication is necessary.

Suggested change:

```diff
-if (node->OpType() == "Conv" || node->OpType() == "FusedConv") {
-  const auto group_opt = node->GetAttributeInt("group");
-  if (group_opt.has_value() && group_opt.value() != 1) {
-    continue;
-  }
-
-  const auto dilations_opt = node->GetAttributeInts("dilations");
-  if (dilations_opt.has_value()) {
-    const auto& dilations = dilations_opt.value();
-    if ((dilations.size() >= 1 && dilations[0] != 1) ||
-        (dilations.size() >= 2 && dilations[1] != 1)) {
-      continue;
-    }
-  }
-}
```
@hariharans29 (Member)

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).
