
[MLAS] Add an NHWC implementation of convolution to avoid transposes #26834

Open
orlmon01 wants to merge 23 commits into microsoft:main from orlmon01:main

Conversation

@orlmon01 (Contributor)

  • Modification to the CPU EP to specify channels_last when the data format is NHWC
  • Added an NhwcFusedConv kernel
  • Implementation of the kernel in MLAS
  • Added compiler guards so it is only used with KleidiAI (for now; can be removed if needed)
  • Added unit tests

Description

Currently ONNX Runtime uses NCHW as the default data layout. For optimizations and kernels that operate better in NHWC layout, or where the data layout is NHWC in the first place, Transpose nodes are added around the layers. This patch seeks to eliminate them for convolutions where they would cause a performance decrease.
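For readers less familiar with the two layouts, here is an illustrative comparison (not code from this PR) of how the same element is addressed in each. NHWC keeps one pixel's channels contiguous, which is the access pattern channels-last kernels exploit, and why running them on NCHW data otherwise needs Transposes on either side:

```cpp
#include <cstddef>

// NCHW: channel-major; a whole H*W plane per channel sits together in memory.
inline size_t IndexNchw(size_t n, size_t c, size_t h, size_t w,
                        size_t C, size_t H, size_t W) {
  return ((n * C + c) * H + h) * W + w;
}

// NHWC (channels-last): channels are innermost, so one pixel's C values are
// contiguous - the layout channels-last convolution kernels prefer.
inline size_t IndexNhwc(size_t n, size_t c, size_t h, size_t w,
                        size_t C, size_t H, size_t W) {
  return ((n * H + h) * W + w) * C + c;
}
```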

Motivation and Context

KleidiAI-specific implementation of this feature. It only supports regular convolutions; depthwise support is to follow. As a result it is currently somewhat strict about which filters qualify.

…transposes

* Modification to the CPU EP to specify channels_last when the data format is NHWC
* Added an NhwcFusedConv kernel
* Implementation of the kernel in MLAS
* Added compiler guards so it is only used with KleidiAI (for now; can be removed if needed)
* Added unit tests

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
@orlmon01 (Contributor, Author)

@microsoft-github-policy-service agree company="Arm"

orlmon01 marked this pull request as draft December 19, 2025 12:34
orlmon01 marked this pull request as ready for review December 19, 2025 12:35
@orlmon01 (Contributor, Author)

Feedback appreciated as this PR makes quite a lot of changes to the codebase well outside of the normal KleidiAI scope.

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
@Rohanjames1997 (Contributor)

Hi @orlmon01, I imagine that avoiding transposes also improves performance.
Do you have any performance results to share?
TIA!

@orlmon01 (Contributor, Author)

orlmon01 commented Jan 6, 2026

> Hi @orlmon01, I imagine that avoiding transposes also improves performance. Do you have any performance results to share? TIA!

Hiya,
Sorry, just back after the holidays. Yes, there is a performance increase. It depends on the model: ones with multiple consecutive convolutions where transposes can be eliminated will see a larger speedup. Even with the limited range of convolutions it's implemented for, there should still be a performance increase in most cases.

I have some numbers somewhere from a Mobilenet model I was using for testing that I'll add in a bit, once I find / regenerate them. :)

@orlmon01 (Contributor, Author)

orlmon01 commented Jan 6, 2026

mobilenet model without the current patch:

Setting intra_op_num_threads to 1
Overriding dimension with name, N, to 1
Overriding dimension with name, T, to 1000
Overriding dimension with name, cache_T_attn, to 32
Overriding dimension with name, right_context, to 5
Session creation time cost: 0.020627 s
First inference time cost: 10 ms
Total inference time cost: 1.33257 s
Total inference requests: 200
Average inference time cost total: 6.662851 ms
Total inference run time: 1.33266 s
Number of inferences per second: 150.075
Avg CPU usage: 16 %
Peak working set size: 85429583872 bytes
Min Latency: 0.006193 s
Max Latency: 0.007625 s
P50 Latency: 0.00666992 s
P90 Latency: 0.00686983 s
P95 Latency: 0.00694425 s
P99 Latency: 0.00733196 s
P999 Latency: 0.007625 s

Same model with changes:

Setting intra_op_num_threads to 1
Overriding dimension with name, N, to 1
Overriding dimension with name, T, to 1000
Overriding dimension with name, cache_T_attn, to 32
Overriding dimension with name, right_context, to 5
Session creation time cost: 0.0217724 s
First inference time cost: 7 ms
Total inference time cost: 1.12897 s
Total inference requests: 200
Average inference time cost total: 5.644857 ms
Total inference run time: 1.12905 s
Number of inferences per second: 177.14
Avg CPU usage: 16 %
Peak working set size: 80362864640 bytes
Min Latency: 0.00527438 s
Max Latency: 0.006706 s
P50 Latency: 0.00566529 s
P90 Latency: 0.00579958 s
P95 Latency: 0.00585429 s
P99 Latency: 0.00639058 s
P999 Latency: 0.006706 s

@edgchen1 (Contributor)

edgchen1 commented Jan 8, 2026

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Update to the internal_testings_tests helper macros for file expansion so it works on other platforms
Fix for failing ConvDepthwiseFloat test, allows for a small tolerance when running on different hardware
Fix for failing TestSaveAndLoadOrtModel test
Make sure the model being saved / loaded is being done from a writeable location
Fix for undeclared identifier linker error
@hariharans29 (Member)

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
@hariharans29 (Member)

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Copilot AI (Contributor) left a comment

Pull request overview

This PR adds an NHWC (channels-last) implementation of convolution operations to avoid costly transpose operations in the CPU execution provider. The implementation includes KleidiAI-specific optimizations and a fallback path for NHWC convolutions.

Changes:

  • Added NhwcFusedConv kernel for float32 convolutions in NHWC layout (KleidiAI-guarded)
  • Implemented NHWC fast path and fallback path with explicit NHWC↔NCHW conversions in MLAS
  • Extended test infrastructure to resolve paths dynamically and filter NHWC transformers in existing tests

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 7 comments.

Show a summary per file

| File | Description |
| --- | --- |
| onnxruntime/core/providers/cpu/nn/conv.h | Added channels_last_ flag to Conv kernel |
| onnxruntime/core/providers/cpu/nn/conv.cc | Implemented NHWC convolution logic with fast path and fallback |
| onnxruntime/core/optimizer/nhwc_transformer.cc | Added KleidiAI filter and FusedConv sum input handling |
| onnxruntime/core/mlas/lib/convolve.cpp | Added ChannelsLast parameter to MlasConvPrepare |
| onnxruntime/core/mlas/inc/mlas.h | Added ChannelsLast field to MLAS_CONV_PARAMETERS |
| onnxruntime/contrib_ops/cpu/fused_conv.cc | Registered NhwcFusedConv kernel |
| onnxruntime/test/optimizer/nhwc_transformer_test.cc | Added depthwise convolution test case |
| onnxruntime/test/optimizer/fuse_initializers_transformer_test.cc | Filtered NhwcTransformer in tests |
| onnxruntime/test/optimizer/conv_add_act_test.cc | Updated to handle both FusedConv variants |
| onnxruntime/test/internal_testing_ep/internal_testing_tests.cc | Added path resolution utilities |
| onnxruntime/test/framework/ort_model_only_test.cc | Added path resolution with diagnostic output |
| onnxruntime/core/util/math_cpu.cc | Added Im2col instantiation for float NHWC |
| onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp | Updated to support channels-last input |
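To make the conv.cc row concrete: the change reads as a fast path plus a fallback. The toy sketch below shows the shape of that split; every name in it is a hypothetical stand-in, not the PR's actual code.

```cpp
#include <iostream>

enum class Layout { NCHW, NHWC };

// Hypothetical stand-in for the build-time / runtime KleidiAI capability check.
bool MlasNhwcSupported() { return true; }

void RunConv(Layout layout) {
  if (layout == Layout::NCHW) {
    std::cout << "existing NCHW kernel, behaviour untouched\n";
    return;
  }
  if (MlasNhwcSupported()) {
    // Fast path: NHWC consumed directly, so no Transpose nodes are needed.
    std::cout << "NHWC fast path\n";
  } else {
    // Fallback: explicit NHWC -> NCHW -> NHWC conversions around the existing
    // kernel keep NhwcFusedConv functionally correct without KleidiAI.
    std::cout << "convert to NCHW, run NCHW kernel, convert back\n";
  }
}
```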


Comment on lines +341 to +355
```cpp
if (node->OpType() == "Conv" || node->OpType() == "FusedConv") {
  const auto group_opt = node->GetAttributeInt("group");
  if (group_opt.has_value() && group_opt.value() != 1) {
    continue;
  }

  const auto dilations_opt = node->GetAttributeInts("dilations");
  if (dilations_opt.has_value()) {
    const auto& dilations = dilations_opt.value();
    if ((dilations.size() >= 1 && dilations[0] != 1) ||
        (dilations.size() >= 2 && dilations[1] != 1)) {
      continue;
    }
  }
}
```
Copilot AI Jan 16, 2026

This validation logic duplicates the filter logic defined earlier in the same file (lines 187-203 and KleidiFp32NhwcFilter). Consider extracting this validation into a shared helper function to avoid code duplication and ensure consistency.
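A minimal sketch of the shared helper suggested here, reusing the attribute calls already visible in the snippet; the function name is hypothetical, and NodeT stands in for whatever node type the two call sites use.

```cpp
// Hypothetical shared helper: true if the node satisfies the NHWC conv constraints.
template <typename NodeT>
bool SatisfiesNhwcConvConstraints(const NodeT& node) {
  // Grouped convolutions are not handled by the NHWC fast path.
  const auto group_opt = node.GetAttributeInt("group");
  if (group_opt.has_value() && group_opt.value() != 1) {
    return false;
  }
  // Only dilation == 1 is supported in each spatial dimension.
  const auto dilations_opt = node.GetAttributeInts("dilations");
  if (dilations_opt.has_value()) {
    const auto& dilations = dilations_opt.value();
    if ((dilations.size() >= 1 && dilations[0] != 1) ||
        (dilations.size() >= 2 && dilations[1] != 1)) {
      return false;
    }
  }
  return true;
}
```

Both KleidiFp32NhwcFilter and the transformer loop could then call it, keeping the two call sites in sync.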

```cpp
const auto& graph1 = session2->GetGraph();
// model should have all the original nodes and we should be able to execute with the fallback to CPU EP
ASSERT_EQ(graph1.NumberOfNodes(), num_nodes);
// ensure we can execute with the fallback to CPU EP even if additional nodes are introduced during loading
```
Copilot AI Jan 16, 2026

The comment mentions 'additional nodes are introduced during loading', but this change appears unrelated to the NHWC convolution feature. If this is a side effect of the NHWC transformer introducing additional nodes, this should be explicitly documented in the comment.

Suggested change:

```diff
-// ensure we can execute with the fallback to CPU EP even if additional nodes are introduced during loading
+// ensure we can execute with the fallback to CPU EP even if additional nodes are introduced during loading
+// (e.g., by graph transformers such as the NHWC convolution transformer that may add nodes when handling NHWC models)
```

hariharans29 changed the title from "Add an NHWC implementation of convolution to avoid transposes" to "[MLAS] Add an NHWC implementation of convolution to avoid transposes" Jan 21, 2026
orlmon01 and others added 6 commits January 26, 2026 10:07
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 9 comments.

Comments suppressed due to low confidence (1)

onnxruntime/core/providers/cpu/nn/conv.cc:399

  • The fallback path (lines 353-399) for convolutions with kernel_rank > 3 does not handle the channels_last layout. This path uses Im2col with StorageOrder::NCHW (line 367) regardless of the channels_last_ flag. If a channels_last Conv node with kernel_rank > 3 is created, it will produce incorrect results. Either add proper handling for channels_last in this path, or add a validation check earlier in the function to reject channels_last convolutions with kernel_rank > 3. (A minimal sketch of such a check follows the code context below.)
```cpp
} else {
  const int64_t input_image_size = input_shape.Size();
  const int64_t output_image_size = output_shape.Size();
  const int64_t kernel_size = TensorShape(kernel_shape).Size();
  const SafeInt<int64_t> X_offset = SafeInt<int64_t>(C) / conv_attrs_.group * input_image_size;
  const SafeInt<int64_t> Y_offset = SafeInt<int64_t>(Y->Shape().Size()) / Y->Shape()[0] / conv_attrs_.group;
  const SafeInt<int64_t> W_offset = SafeInt<int64_t>(W->Shape().Size()) / conv_attrs_.group;
  const SafeInt<int64_t> kernel_dim = SafeInt<int64_t>(C) / conv_attrs_.group * kernel_size;
  const int64_t col_buffer_size = kernel_dim * output_image_size;

  auto col_data = IAllocator::MakeUniquePtr<float>(alloc, narrow<size_t>(col_buffer_size));
  auto w_data = W->DataAsSpan<float>();
  for (int image_id = 0; image_id < N; ++image_id) {
    for (int group_id = 0; group_id < conv_attrs_.group; ++group_id) {
      math::Im2col<float, StorageOrder::NCHW>()(
          &Xdata[group_id * X_offset],
          input_shape.GetDims().data(),
          output_shape.GetDims().data(),
          kernel_dim,
          kernel_shape.data(),
          strides.data(),
          dilations.data(),
          pads.data(),
          narrow<int>(kernel_shape.size()),
          col_data.get());

      math::Gemm<float>(
          CblasNoTrans,
          CblasNoTrans,
          narrow<ptrdiff_t>(M / conv_attrs_.group),
          narrow<ptrdiff_t>(output_image_size),
          narrow<ptrdiff_t>(kernel_dim),
          1,
          &w_data[group_id * W_offset],
          col_data.get(),
          Beta,
          &Ydata[group_id * Y_offset],
          thread_pool);
    }

    MlasActivation(&activation_, Ydata.data(), Bdata, narrow<size_t>(M),
                   narrow<size_t>(output_image_size), narrow<size_t>(output_image_size));

    Xdata = Xdata.subspan(X_offset * conv_attrs_.group);
    Ydata = Ydata.subspan(Y_offset * conv_attrs_.group);
  }
}
```
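As the suppressed comment above suggests, a convolution flagged channels_last but routed into this NCHW-only path should be rejected up front. A minimal sketch, assuming the kernel's channels_last_ member and an ORT_RETURN_IF-style status macro; the message text and placement are illustrative.

```cpp
// Hypothetical early guard near the top of the Compute function: the
// Im2col/Gemm fallback above only understands NCHW, so refuse channels_last
// inputs whose kernel rank it would otherwise silently mishandle.
ORT_RETURN_IF(channels_last_ && kernel_rank > 3,
              "channels_last convolution is not implemented for kernel rank > 3");
```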


```cpp
    NhwcFusedConv,
    1,
    float,
    KernelDefBuilder()
```
Copilot AI Jan 26, 2026

The NhwcFusedConv kernel registration is missing the MayInplace hint that is present in the FusedConv registration. The FusedConv kernel uses .MayInplace(3, 0) to allow the optional "sum" input (index 3) to be reused as the output buffer (index 0) for efficiency. However, NhwcFusedConv does not include this hint. This means that even though the code in conv.cc handles the Sum input for channels_last mode, the allocation planner cannot optimize memory usage by reusing the Sum buffer for the output when using NhwcFusedConv. Consider adding .MayInplace(3, 0) to the NhwcFusedConv kernel builder to maintain consistency and enable the same memory optimization.

Suggested change:

```diff
-    KernelDefBuilder()
+    KernelDefBuilder()
+        // Allow the optional "sum" input (index 3) to be reused as the output buffer (index 0),
+        // consistent with the FusedConv kernel registration.
+        .MayInplace(3, 0)
```

Comment on lines +29 to +35
```cpp
ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
    NhwcFusedConv,
    1,
    float,
    KernelDefBuilder()
        .TypeConstraint("T", DataTypeImpl::GetTensorType<float>()),
    FusedConvFloat);
```
Copilot AI Jan 26, 2026

The NhwcFusedConv kernel registration (lines 29-35) is not conditionally compiled with USE_KLEIDIAI guards, but the PR description states this is a "KleidiAi specific implementation" that is "only used with KleidiAi (for now)". This is inconsistent with the conditional registration approach used in cpu_contrib_kernels.cc where the declaration is guarded by #ifdef USE_KLEIDIAI. Either this registration should also be conditionally compiled with USE_KLEIDIAI, or if the kernel is meant to work without KleidiAI (using the fallback path), the guards in cpu_contrib_kernels.cc should be removed.

```cpp
      TransformerLevel::Level3);
}

TEST(NhwcTransformerTests, ConvDepthwiseFloat) {
```
Copilot AI Jan 26, 2026

The test name "ConvDepthwiseFloat" and the test setup (group=8 with 8 input channels) indicates this is testing depthwise convolution. However, the test expectation on line 239 expects 0 NhwcFusedConv nodes, and line 240 expects 0 Transpose nodes, suggesting the transformer should NOT apply NHWC transformation to depthwise convolutions. This is inconsistent with the PR description which states "Only supports convolutions, DepthWise to follow" - implying depthwise convolutions are not yet supported. The test appears to be verifying that depthwise convolutions are correctly skipped, but the test name is misleading. Consider renaming to "ConvDepthwiseFloat_NotTransformed" or "ConvDepthwiseFloat_SkipNhwc" to make the intent clearer.

Suggested change:

```diff
-TEST(NhwcTransformerTests, ConvDepthwiseFloat) {
+TEST(NhwcTransformerTests, ConvDepthwiseFloat_SkipNhwc) {
```

Comment on lines +55 to +58
```cpp
const auto inputs = node.Inputs();
if (inputs.size() > 3 && !inputs[3].empty()) {
  return false;
}
```
Copilot AI Jan 26, 2026

The KleidiFp32NhwcFilter function rejects FusedConv nodes that have a Sum input at index 3 (lines 55-58). However, the conv.cc implementation (lines 256-274) includes logic to handle the Sum input in both the fast path and the fallback path for channels_last mode. This means the filter is more restrictive than necessary - it prevents transformation of FusedConv nodes with Sum inputs even though the implementation can handle them. Consider removing this check (lines 55-58) or updating the comment to explain why Sum is not supported in the KleidiAI fast path specifically.
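If the check is kept because only the KleidiAI fast path lacks Sum support, a sketch of the documented variant suggested here; the comment wording is illustrative, not from the PR.

```cpp
const auto inputs = node.Inputs();
if (inputs.size() > 3 && !inputs[3].empty()) {
  // The optional "sum" input is handled by the generic conv.cc paths, but the
  // KleidiAI fast path does not fuse the element-wise add, so keep such nodes
  // on the existing FusedConv route rather than converting them to NHWC here.
  return false;
}
```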

Comment on lines 214 to +355
```cpp
if (transform->has_channels_last_attrib_) {
  node->SetAttributeInt("channels_last", 1);
}

if (node->OpType() == "Conv" || node->OpType() == "FusedConv") {
  const auto group_opt = node->GetAttributeInt("group");
  if (group_opt.has_value() && group_opt.value() != 1) {
    continue;
  }

  const auto dilations_opt = node->GetAttributeInts("dilations");
  if (dilations_opt.has_value()) {
    const auto& dilations = dilations_opt.value();
    if ((dilations.size() >= 1 && dilations[0] != 1) ||
        (dilations.size() >= 2 && dilations[1] != 1)) {
      continue;
    }
  }
}
```
Copilot AI Jan 26, 2026

There is a logic error in the transformer flow. The code sets the "channels_last" attribute on line 338, then continues to check group and dilation constraints (lines 341-355), and only if those checks pass does it proceed with the transformation. However, if the checks fail and the code continues (skips the node), the "channels_last" attribute has already been set but the node hasn't been transformed. This will leave the node in an inconsistent state with channels_last=1 but still in NCHW layout without the necessary transposes. The attribute should only be set after all validation checks have passed, or the attribute should be cleared before continuing.
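A minimal sketch of the reordering this comment calls for, mirroring the snippet above (which sits inside the transformer's node loop): run the constraint checks first, and set the attribute only once the node is known to be transformable.

```cpp
// Sketch: validate first, set the attribute last (names mirror the snippet above).
if (node->OpType() == "Conv" || node->OpType() == "FusedConv") {
  const auto group_opt = node->GetAttributeInt("group");
  if (group_opt.has_value() && group_opt.value() != 1) {
    continue;  // node skipped with no stale channels_last attribute left behind
  }
  const auto dilations_opt = node->GetAttributeInts("dilations");
  if (dilations_opt.has_value()) {
    const auto& dilations = dilations_opt.value();
    if ((dilations.size() >= 1 && dilations[0] != 1) ||
        (dilations.size() >= 2 && dilations[1] != 1)) {
      continue;
    }
  }
}

// All checks passed; it is now safe to mark the node as channels-last.
if (transform->has_channels_last_attrib_) {
  node->SetAttributeInt("channels_last", 1);
}
```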

Comment on lines +217 to +222

```cpp
if (target.OpType() == "Conv") {
  return channels_last ? "NhwcFusedConv" : "FusedConv";
}

return "NhwcFusedConv";
```
Copilot AI Jan 26, 2026

The OpType method has incomplete logic. When the target node is not "Conv" (line 218), it unconditionally returns "NhwcFusedConv" (line 222). However, if the target is already "FusedConv" without channels_last attribute, it should remain as "FusedConv", not become "NhwcFusedConv". The correct logic should check if channels_last is set for all node types, not just for "Conv". Consider changing line 218 to check for both "Conv" and "FusedConv", or restructure to check channels_last first regardless of node type.

Suggested change:

```diff
-  if (target.OpType() == "Conv") {
-    return channels_last ? "NhwcFusedConv" : "FusedConv";
-  }
-
-  return "NhwcFusedConv";
+  const std::string& op_type = target.OpType();
+  // If channels_last is set, use NHWC fused convolution regardless of original op type.
+  if (channels_last) {
+    return "NhwcFusedConv";
+  }
+  // Without channels_last, convert Conv to FusedConv, and leave other op types unchanged.
+  if (op_type == "Conv") {
+    return "FusedConv";
+  }
+  return op_type;
```

Comment on lines 304 to 308
```cpp
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, float, FusedConv)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, float, NhwcFusedConv)>,
```
Copilot AI Jan 26, 2026

The registration of NhwcFusedConv kernel is unconditional in cpu_contrib_kernels.cc (line 308), but the kernel declaration is conditionally compiled with USE_KLEIDIAI guards in the same file (lines 21-23). This creates an inconsistency: when USE_KLEIDIAI is not defined, the declaration is absent but the registration still attempts to register the kernel, which will likely cause a compilation error. The registration on line 308 should also be wrapped with #ifdef USE_KLEIDIAI guards to match the conditional declaration.

Suggested change:

```diff
-    BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, float, FusedConv)>,
-    BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, float, NhwcFusedConv)>,
+    BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, float, FusedConv)>,
+#ifdef USE_KLEIDIAI
+    BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, float, NhwcFusedConv)>,
+#endif
```

Comment on lines +341 to +356
```cpp
if (node->OpType() == "Conv" || node->OpType() == "FusedConv") {
  const auto group_opt = node->GetAttributeInt("group");
  if (group_opt.has_value() && group_opt.value() != 1) {
    continue;
  }

  const auto dilations_opt = node->GetAttributeInts("dilations");
  if (dilations_opt.has_value()) {
    const auto& dilations = dilations_opt.value();
    if ((dilations.size() >= 1 && dilations[0] != 1) ||
        (dilations.size() >= 2 && dilations[1] != 1)) {
      continue;
    }
  }
}
```
Copilot AI Jan 26, 2026

The group and dilation checks in lines 341-355 duplicate the same checks already performed in the KleidiFp32NhwcFilter function (lines 75-87). This duplication is error-prone and makes maintenance harder. Since the KleidiFp32NhwcFilter is already called during NhwcConvLookup (line 319), these constraints should already be enforced for FusedConv/Conv nodes. Consider removing this redundant check, or if it's needed for non-KleidiAI cases, add a comment explaining why the duplication is necessary.

Suggested change:

```diff
-if (node->OpType() == "Conv" || node->OpType() == "FusedConv") {
-  const auto group_opt = node->GetAttributeInt("group");
-  if (group_opt.has_value() && group_opt.value() != 1) {
-    continue;
-  }
-
-  const auto dilations_opt = node->GetAttributeInts("dilations");
-  if (dilations_opt.has_value()) {
-    const auto& dilations = dilations_opt.value();
-    if ((dilations.size() >= 1 && dilations[0] != 1) ||
-        (dilations.size() >= 2 && dilations[1] != 1)) {
-      continue;
-    }
-  }
-}
```
@hariharans29 (Member)

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).
