[MLAS] Add an NHWC implementation of convolution to avoid transposes #26834
orlmon01 wants to merge 23 commits into microsoft:main from
Conversation
…transposes
* Modification to the CPU EP to specify channels_last when the data format is NHWC
* Added a FusedNhwcConv kernel
* Implementation of the kernel in MLAS
* Added compiler guards so it is only used with KleidiAI (for now, can be removed if needed)
* Added unit tests

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
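For context on why the transposes matter, here is a minimal sketch (not from this PR) of how the same element is addressed in the two layouts; keeping the tensor in NHWC end to end removes the Transpose nodes the optimizer would otherwise insert around the Conv:

```cpp
#include <cstddef>

// NCHW: channels are the outer spatial dimension; NHWC: channels are innermost.
inline size_t NchwIndex(size_t n, size_t c, size_t h, size_t w,
                        size_t C, size_t H, size_t W) {
  return ((n * C + c) * H + h) * W + w;
}

inline size_t NhwcIndex(size_t n, size_t c, size_t h, size_t w,
                        size_t C, size_t H, size_t W) {
  return ((n * H + h) * W + w) * C + c;
}
```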
|
@microsoft-github-policy-service agree company="Arm" |
|
Feedback appreciated as this PR makes quite a lot of changes to the codebase well outside of the normal KleidiAI scope. |
Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
|
Hi @orlmon01, I imagine that avoiding transposes also improves performance. |
Hiya, I have some numbers somewhere from a Mobilenet model I was using for testing that I'll add in a bit, once I find / regenerate them. :) |
|
mobilenet model without the current patch:
Same model with changes:
|
|
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline |
|
Azure Pipelines successfully started running 4 pipeline(s). |
Update to the internal_testing_tests helper macros for file expansion so it works on other platforms
Fix for failing ConvDepthwiseFloat test: allow a small tolerance when running on different hardware
Fix for failing TestSaveAndLoadOrtModel test: make sure the model being saved / loaded is done from a writeable location
Fix for undeclared identifier linker error
|
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline |
|
Azure Pipelines successfully started running 4 pipeline(s). |
Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
|
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline |
|
Azure Pipelines successfully started running 4 pipeline(s). |
Pull request overview
This PR adds an NHWC (channels-last) implementation of convolution operations to avoid costly transpose operations in the CPU execution provider. The implementation includes KleidiAI-specific optimizations and a fallback path for NHWC convolutions.
Changes:
- Added NhwcFusedConv kernel for float32 convolutions in NHWC layout (KleidiAI-guarded)
- Implemented NHWC fast path and fallback path with explicit NHWC↔NCHW conversions in MLAS
- Extended test infrastructure to resolve paths dynamically and filter NHWC transformers in existing tests
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| onnxruntime/core/providers/cpu/nn/conv.h | Added channels_last_ flag to Conv kernel |
| onnxruntime/core/providers/cpu/nn/conv.cc | Implemented NHWC convolution logic with fast path and fallback |
| onnxruntime/core/optimizer/nhwc_transformer.cc | Added KleidiAI filter and FusedConv sum input handling |
| onnxruntime/core/mlas/lib/convolve.cpp | Added ChannelsLast parameter to MlasConvPrepare |
| onnxruntime/core/mlas/inc/mlas.h | Added ChannelsLast field to MLAS_CONV_PARAMETERS |
| onnxruntime/contrib_ops/cpu/fused_conv.cc | Registered NhwcFusedConv kernel |
| onnxruntime/test/optimizer/nhwc_transformer_test.cc | Added depthwise convolution test case |
| onnxruntime/test/optimizer/fuse_initializers_transformer_test.cc | Filtered NhwcTransformer in tests |
| onnxruntime/test/optimizer/conv_add_act_test.cc | Updated to handle both FusedConv variants |
| onnxruntime/test/internal_testing_ep/internal_testing_tests.cc | Added path resolution utilities |
| onnxruntime/test/framework/ort_model_only_test.cc | Added path resolution with diagnostic output |
| onnxruntime/core/util/math_cpu.cc | Added Im2col instantiation for float NHWC |
| onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp | Updated to support channels-last input |
if (node->OpType() == "Conv" || node->OpType() == "FusedConv") {
  const auto group_opt = node->GetAttributeInt("group");
  if (group_opt.has_value() && group_opt.value() != 1) {
    continue;
  }

  const auto dilations_opt = node->GetAttributeInts("dilations");
  if (dilations_opt.has_value()) {
    const auto& dilations = dilations_opt.value();
    if ((dilations.size() >= 1 && dilations[0] != 1) ||
        (dilations.size() >= 2 && dilations[1] != 1)) {
      continue;
    }
  }
}
This validation logic duplicates the filter logic defined earlier in the same file (lines 187-203 and KleidiFp32NhwcFilter). Consider extracting this validation into a shared helper function to avoid code duplication and ensure consistency.
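A possible shape for that shared helper, as a sketch only (the function name and exact node API are assumptions based on the quoted code):

```cpp
// Hypothetical helper: a single predicate for the NHWC-convertibility constraints
// (group == 1, all dilations == 1) so the filter and the transformer loop stay in sync.
static bool IsNhwcConvertibleConv(const api::NodeRef& node) {
  const auto group_opt = node.GetAttributeInt("group");
  if (group_opt.has_value() && group_opt.value() != 1) {
    return false;
  }
  const auto dilations_opt = node.GetAttributeInts("dilations");
  if (dilations_opt.has_value()) {
    const auto& dilations = dilations_opt.value();
    if ((dilations.size() >= 1 && dilations[0] != 1) ||
        (dilations.size() >= 2 && dilations[1] != 1)) {
      return false;
    }
  }
  return true;
}
```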
const auto& graph1 = session2->GetGraph();
// model should have all the original nodes and we should be able to execute with the fallback to CPU EP
ASSERT_EQ(graph1.NumberOfNodes(), num_nodes);
// ensure we can execute with the fallback to CPU EP even if additional nodes are introduced during loading
The comment mentions 'additional nodes are introduced during loading', but this change appears unrelated to the NHWC convolution feature. If this is a side effect of the NHWC transformer introducing additional nodes, this should be explicitly documented in the comment.
Suggested change:
// ensure we can execute with the fallback to CPU EP even if additional nodes are introduced during loading
// (e.g., by graph transformers such as the NHWC convolution transformer that may add nodes when handling NHWC models)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 24 out of 24 changed files in this pull request and generated 9 comments.
Comments suppressed due to low confidence (1)
onnxruntime/core/providers/cpu/nn/conv.cc:399
- The fallback path (lines 353-399) for convolutions with kernel_rank > 3 does not handle the channels_last layout. This path uses Im2col with StorageOrder::NCHW (line 367) regardless of the channels_last_ flag. If a channels_last Conv node with kernel_rank > 3 is created, it will produce incorrect results. Either add proper handling for channels_last in this path, or add a validation check earlier in the function to reject channels_last convolutions with kernel_rank > 3.
} else {
  const int64_t input_image_size = input_shape.Size();
  const int64_t output_image_size = output_shape.Size();
  const int64_t kernel_size = TensorShape(kernel_shape).Size();
  const SafeInt<int64_t> X_offset = SafeInt<int64_t>(C) / conv_attrs_.group * input_image_size;
  const SafeInt<int64_t> Y_offset = SafeInt<int64_t>(Y->Shape().Size()) / Y->Shape()[0] / conv_attrs_.group;
  const SafeInt<int64_t> W_offset = SafeInt<int64_t>(W->Shape().Size()) / conv_attrs_.group;
  const SafeInt<int64_t> kernel_dim = SafeInt<int64_t>(C) / conv_attrs_.group * kernel_size;
  const int64_t col_buffer_size = kernel_dim * output_image_size;
  auto col_data = IAllocator::MakeUniquePtr<float>(alloc, narrow<size_t>(col_buffer_size));
  auto w_data = W->DataAsSpan<float>();
  for (int image_id = 0; image_id < N; ++image_id) {
    for (int group_id = 0; group_id < conv_attrs_.group; ++group_id) {
      math::Im2col<float, StorageOrder::NCHW>()(
          &Xdata[group_id * X_offset],
          input_shape.GetDims().data(),
          output_shape.GetDims().data(),
          kernel_dim,
          kernel_shape.data(),
          strides.data(),
          dilations.data(),
          pads.data(),
          narrow<int>(kernel_shape.size()),
          col_data.get());
      math::Gemm<float>(
          CblasNoTrans,
          CblasNoTrans,
          narrow<ptrdiff_t>(M / conv_attrs_.group),
          narrow<ptrdiff_t>(output_image_size),
          narrow<ptrdiff_t>(kernel_dim),
          1,
          &w_data[group_id * W_offset],
          col_data.get(),
          Beta,
          &Ydata[group_id * Y_offset],
          thread_pool);
    }
    MlasActivation(&activation_, Ydata.data(), Bdata, narrow<size_t>(M),
                   narrow<size_t>(output_image_size), narrow<size_t>(output_image_size));
    Xdata = Xdata.subspan(X_offset * conv_attrs_.group);
    Ydata = Ydata.subspan(Y_offset * conv_attrs_.group);
  }
}
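One way to address this, sketched under the assumption that channels_last_ and kernel_rank are in scope near the top of the Compute method (ORT_MAKE_STATUS is the standard ONNX Runtime status macro; the exact placement is an assumption):

```cpp
// Sketch: reject layouts the Im2col/GEMM fallback cannot handle before reaching it.
if (channels_last_ && kernel_rank > 3) {
  return ORT_MAKE_STATUS(ONNXRUNTIME, NOT_IMPLEMENTED,
                         "channels_last Conv is not supported for kernel rank > 3");
}
```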
    NhwcFusedConv,
    1,
    float,
    KernelDefBuilder()
The NhwcFusedConv kernel registration is missing the MayInplace hint that is present in the FusedConv registration. The FusedConv kernel uses .MayInplace(3, 0) to allow the optional "sum" input (index 3) to be reused as the output buffer (index 0) for efficiency. However, NhwcFusedConv does not include this hint. This means that even though the code in conv.cc handles the Sum input for channels_last mode, the allocation planner cannot optimize memory usage by reusing the Sum buffer for the output when using NhwcFusedConv. Consider adding .MayInplace(3, 0) to the NhwcFusedConv kernel builder to maintain consistency and enable the same memory optimization.
Suggested change:
    KernelDefBuilder()
        // Allow the optional "sum" input (index 3) to be reused as the output buffer (index 0),
        // consistent with the FusedConv kernel registration.
        .MayInplace(3, 0)
ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
    NhwcFusedConv,
    1,
    float,
    KernelDefBuilder()
        .TypeConstraint("T", DataTypeImpl::GetTensorType<float>()),
    FusedConvFloat);
The NhwcFusedConv kernel registration (lines 29-35) is not conditionally compiled with USE_KLEIDIAI guards, but the PR description states this is a "KleidiAi specific implementation" that is "only used with KleidiAi (for now)". This is inconsistent with the conditional registration approach used in cpu_contrib_kernels.cc where the declaration is guarded by #ifdef USE_KLEIDIAI. Either this registration should also be conditionally compiled with USE_KLEIDIAI, or if the kernel is meant to work without KleidiAI (using the fallback path), the guards in cpu_contrib_kernels.cc should be removed.
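If the first option is taken, the registration quoted above could be guarded the same way as the declaration; a sketch, assuming USE_KLEIDIAI is the relevant macro:

```cpp
#ifdef USE_KLEIDIAI
ONNX_CPU_OPERATOR_TYPED_MS_KERNEL(
    NhwcFusedConv,
    1,
    float,
    KernelDefBuilder()
        .TypeConstraint("T", DataTypeImpl::GetTensorType<float>()),
    FusedConvFloat);
#endif  // USE_KLEIDIAI
```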
    TransformerLevel::Level3);
}

TEST(NhwcTransformerTests, ConvDepthwiseFloat) {
The test name "ConvDepthwiseFloat" and the test setup (group=8 with 8 input channels) indicates this is testing depthwise convolution. However, the test expectation on line 239 expects 0 NhwcFusedConv nodes, and line 240 expects 0 Transpose nodes, suggesting the transformer should NOT apply NHWC transformation to depthwise convolutions. This is inconsistent with the PR description which states "Only supports convolutions, DepthWise to follow" - implying depthwise convolutions are not yet supported. The test appears to be verifying that depthwise convolutions are correctly skipped, but the test name is misleading. Consider renaming to "ConvDepthwiseFloat_NotTransformed" or "ConvDepthwiseFloat_SkipNhwc" to make the intent clearer.
Suggested change:
TEST(NhwcTransformerTests, ConvDepthwiseFloat_SkipNhwc) {
const auto inputs = node.Inputs();
if (inputs.size() > 3 && !inputs[3].empty()) {
  return false;
}
The KleidiFp32NhwcFilter function rejects FusedConv nodes that have a Sum input at index 3 (lines 55-58). However, the conv.cc implementation (lines 256-274) includes logic to handle the Sum input in both the fast path and the fallback path for channels_last mode. This means the filter is more restrictive than necessary - it prevents transformation of FusedConv nodes with Sum inputs even though the implementation can handle them. Consider removing this check (lines 55-58) or updating the comment to explain why Sum is not supported in the KleidiAI fast path specifically.
if (transform->has_channels_last_attrib_) {
  node->SetAttributeInt("channels_last", 1);
}

if (node->OpType() == "Conv" || node->OpType() == "FusedConv") {
  const auto group_opt = node->GetAttributeInt("group");
  if (group_opt.has_value() && group_opt.value() != 1) {
    continue;
  }

  const auto dilations_opt = node->GetAttributeInts("dilations");
  if (dilations_opt.has_value()) {
    const auto& dilations = dilations_opt.value();
    if ((dilations.size() >= 1 && dilations[0] != 1) ||
        (dilations.size() >= 2 && dilations[1] != 1)) {
      continue;
    }
  }
}
There is a logic error in the transformer flow. The code sets the "channels_last" attribute on line 338, then continues to check group and dilation constraints (lines 341-355), and only if those checks pass does it proceed with the transformation. However, if the checks fail and the code continues (skips the node), the "channels_last" attribute has already been set but the node hasn't been transformed. This will leave the node in an inconsistent state with channels_last=1 but still in NCHW layout without the necessary transposes. The attribute should only be set after all validation checks have passed, or the attribute should be cleared before continuing.
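A sketch of the reordering the comment describes, where the checks run before the attribute is set (IsNhwcConvertibleConv is a hypothetical helper wrapping the group/dilation checks quoted above):

```cpp
// Run the constraint checks first so a skipped node is never left with
// channels_last=1 while still in NCHW layout.
if ((node->OpType() == "Conv" || node->OpType() == "FusedConv") &&
    !IsNhwcConvertibleConv(*node)) {
  continue;
}
if (transform->has_channels_last_attrib_) {
  node->SetAttributeInt("channels_last", 1);
}
```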
if (target.OpType() == "Conv") {
  return channels_last ? "NhwcFusedConv" : "FusedConv";
}

return "NhwcFusedConv";
The OpType method has incomplete logic. When the target node is not "Conv" (line 218), it unconditionally returns "NhwcFusedConv" (line 222). However, if the target is already "FusedConv" without channels_last attribute, it should remain as "FusedConv", not become "NhwcFusedConv". The correct logic should check if channels_last is set for all node types, not just for "Conv". Consider changing line 218 to check for both "Conv" and "FusedConv", or restructure to check channels_last first regardless of node type.
Suggested change:
const std::string& op_type = target.OpType();
// If channels_last is set, use NHWC fused convolution regardless of original op type.
if (channels_last) {
  return "NhwcFusedConv";
}
// Without channels_last, convert Conv to FusedConv, and leave other op types unchanged.
if (op_type == "Conv") {
  return "FusedConv";
}
return op_type;
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, float, FusedConv)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, float, NhwcFusedConv)>,
The registration of NhwcFusedConv kernel is unconditional in cpu_contrib_kernels.cc (line 308), but the kernel declaration is conditionally compiled with USE_KLEIDIAI guards in the same file (lines 21-23). This creates an inconsistency: when USE_KLEIDIAI is not defined, the declaration is absent but the registration still attempts to register the kernel, which will likely cause a compilation error. The registration on line 308 should also be wrapped with #ifdef USE_KLEIDIAI guards to match the conditional declaration.
Suggested change:
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, float, FusedConv)>,
#ifdef USE_KLEIDIAI
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, float, NhwcFusedConv)>,
#endif
if (node->OpType() == "Conv" || node->OpType() == "FusedConv") {
  const auto group_opt = node->GetAttributeInt("group");
  if (group_opt.has_value() && group_opt.value() != 1) {
    continue;
  }

  const auto dilations_opt = node->GetAttributeInts("dilations");
  if (dilations_opt.has_value()) {
    const auto& dilations = dilations_opt.value();
    if ((dilations.size() >= 1 && dilations[0] != 1) ||
        (dilations.size() >= 2 && dilations[1] != 1)) {
      continue;
    }
  }
}
The group and dilation checks in lines 341-355 duplicate the same checks already performed in the KleidiFp32NhwcFilter function (lines 75-87). This duplication is error-prone and makes maintenance harder. Since the KleidiFp32NhwcFilter is already called during NhwcConvLookup (line 319), these constraints should already be enforced for FusedConv/Conv nodes. Consider removing this redundant check, or if it's needed for non-KleidiAI cases, add a comment explaining why the duplication is necessary.
Suggested change (remove the duplicated checks):
if (node->OpType() == "Conv" || node->OpType() == "FusedConv") {
  const auto group_opt = node->GetAttributeInt("group");
  if (group_opt.has_value() && group_opt.value() != 1) {
    continue;
  }
  const auto dilations_opt = node->GetAttributeInts("dilations");
  if (dilations_opt.has_value()) {
    const auto& dilations = dilations_opt.value();
    if ((dilations.size() >= 1 && dilations[0] != 1) ||
        (dilations.size() >= 2 && dilations[1] != 1)) {
      continue;
    }
  }
}
Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
|
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline |
|
Azure Pipelines successfully started running 4 pipeline(s). |
Description
ONNX Runtime currently uses NCHW as its default data layout. For optimisations and kernels that perform better in NHWC layout, or where the data is already NHWC, Transpose nodes are inserted around the layers. This patch eliminates those transposes for convolutions where they would otherwise cause a performance decrease.
Motivation and Context
KleidiAI-specific implementation of this feature. Only standard convolutions are supported for now; depthwise support will follow. The filters are currently somewhat strict as a result.