[Core] Expose a session option to disable NCHWc layout optimizations #27248
Open
hariharans29 wants to merge 9 commits into main from hari/intel_vs_amd_takeaway
Conversation
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…soft/onnxruntime into hari/intel_vs_amd_takeaway
Pull request overview
This PR introduces a session configuration option to allow users to disable NCHWc layout transformations, addressing performance issues observed on certain hardware platforms (particularly Intel CPUs) where large kernel convolutions with NCHWc layout perform poorly due to memory bandwidth bottlenecks.
Changes:
- Added a new session configuration key kOrtSessionOptionsDisableNchwcLayoutTransformation to control NCHWc layout transformation (usage sketch below)
- Implemented warning logging when Conv nodes with large kernel sizes (>=7x7) are encountered during NCHWc transformation
- Updated the NCHWc transformer implementation to accept and use a logger for warnings
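For reference, a minimal usage sketch of the new option through the ONNX Runtime C++ API, assuming the constant is passed to AddConfigEntry and that a value of "1" disables the transformation (check onnxruntime_session_options_config_keys.h for the exact key string and accepted values):

```cpp
// Minimal sketch (not from the PR): disable the NCHWc layout transformation via
// the new session option. Assumes "1" disables it; "model.onnx" is a placeholder path.
#include "onnxruntime_cxx_api.h"
#include "onnxruntime_session_options_config_keys.h"

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "nchwc-demo");
  Ort::SessionOptions session_options;

  // Keep the full optimization pipeline enabled ...
  session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);

  // ... but opt out of the NCHWc layout transformation only.
  session_options.AddConfigEntry(kOrtSessionOptionsDisableNchwcLayoutTransformation, "1");

  Ort::Session session(env, "model.onnx", session_options);
  return 0;
}
```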
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h | Adds the new session configuration constant for disabling NCHWc layout transformation |
| onnxruntime/core/optimizer/graph_transformer_utils.cc | Checks the new session option and conditionally registers the NCHWc transformer |
| onnxruntime/core/optimizer/nchwc_transformer.cc | Passes logger to transformer implementation and adds warning for large kernel Conv operations |
| onnxruntime/core/mlas/lib/snchwc.cpp | Adds TODO comment documenting the need for alternative Conv implementations for large kernels |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…soft/onnxruntime into hari/intel_vs_amd_takeaway
Description
Sometimes, a model can have an "outlier" convolution operation with a large kernel size (e.g., 11x11) and a good number of filters (e.g., 64). Since NCHWc layout transformations are turned on by default, the data layout for the Conv operation is changed to NCHWc (if the other Conv node parameters fit the bill). The NCHWc Conv implementations in MLAS use a "direct" convolution algorithm, and large kernel sizes combined with a good number of filters make the entire operation heavily memory-bandwidth bound. With this background, NCHWc data layouts for such models are not a good choice on certain platforms, whereas on other platforms (possibly ones with superior memory bandwidth and/or other memory characteristics like cache sizes), they still work well. The work-around today is for users to drop down to a lower graph optimization level that does not include the NCHWc transformer on platforms where perf is poor for such models. This seems like a bad precedent to set because it means users miss out on other L3 and L4 optimizers.

Short term:
- Expose a session option to disable NCHWc layout transformations
- Log a warning when the NCHWc transformer encounters a Conv node that is an outlier in terms of the kernel size (see the illustrative sketch after this list)

Ultimately, the short term fix gives users the data to help them pick the right data layout for their target model on their target hardware to run it most optimally with ORT.
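For illustration only, a rough sketch of the kind of check such a warning could be based on (the helper name, the 7x7 threshold taken from the review summary above, and the message text are placeholders, not the PR's actual code in nchwc_transformer.cc):

```cpp
// Illustrative sketch only: warn when a Conv about to be converted to NCHWc has a
// "large" kernel (>= 7x7), since the direct NCHWc implementation can become
// memory-bandwidth bound there.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

constexpr int64_t kLargeKernelThreshold = 7;  // assumed threshold from the review summary

void WarnIfLargeKernel(const std::vector<int64_t>& kernel_shape, const std::string& node_name) {
  if (kernel_shape.size() == 2 &&
      kernel_shape[0] >= kLargeKernelThreshold &&
      kernel_shape[1] >= kLargeKernelThreshold) {
    std::cerr << "Conv node '" << node_name << "' has a large kernel ("
              << kernel_shape[0] << "x" << kernel_shape[1]
              << "); the NCHWc direct convolution may be memory-bandwidth bound on some CPUs. "
              << "Consider disabling the NCHWc layout transformation via the session option.\n";
  }
}
```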
Longer term:
- Add complementary convolution implementations in the NCHWc convolution suite in MLAS (like Im2Col + SGemm) to go with the direct convolution implementations, and add heuristics to pick an implementation
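A hedged sketch of what such a heuristic could look like; the thresholds and the Im2Col + SGEMM path are hypothetical (not existing MLAS code) and would need per-architecture tuning:

```cpp
// Hypothetical heuristic sketch: choose between the direct NCHWc convolution and an
// Im2Col + SGEMM fallback based on kernel size and filter count. Thresholds are placeholders.
#include <cstdint>

enum class ConvAlgo { kDirect, kIm2ColSgemm };

ConvAlgo PickConvAlgo(int64_t kernel_h, int64_t kernel_w, int64_t output_channels) {
  // Large kernels with many filters make the direct algorithm re-read activations
  // many times, so an Im2Col + SGEMM path that trades extra memory for
  // GEMM-friendly reuse may win on bandwidth-constrained CPUs.
  const bool large_kernel = (kernel_h >= 7 && kernel_w >= 7);
  const bool many_filters = (output_channels >= 64);
  return (large_kernel && many_filters) ? ConvAlgo::kIm2ColSgemm : ConvAlgo::kDirect;
}
```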
Even longer term:
- Enable online benchmarking infrastructure to help pick the best algo (direct vs. Im2Col + SGemm) for a given data layout and Conv parameters (filter sizes, dilations, strides, etc.) based on data from warm-up runs, and use that for algo selection in future runs
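One possible shape for that selection logic, sketched under the assumption that both candidate algorithms can be timed on real inputs during warm-up and the winner cached per Conv configuration (all names here are illustrative):

```cpp
// Hypothetical online-selection sketch: time each candidate algorithm during warm-up
// runs and reuse the fastest one for later runs of the same Conv configuration.
#include <chrono>
#include <cstdint>
#include <functional>
#include <map>
#include <tuple>

enum class ConvAlgo { kDirect, kIm2ColSgemm };
// Key capturing the Conv parameters that matter for algo choice (simplified here).
using ConvKey = std::tuple<int64_t /*kernel_h*/, int64_t /*kernel_w*/, int64_t /*filters*/>;

class ConvAlgoSelector {
 public:
  ConvAlgo Select(const ConvKey& key,
                  const std::function<void()>& run_direct,
                  const std::function<void()>& run_im2col_sgemm) {
    auto it = cache_.find(key);
    if (it != cache_.end()) return it->second;  // already decided during warm-up

    const double t_direct = Time(run_direct);
    const double t_im2col = Time(run_im2col_sgemm);
    const ConvAlgo best = (t_direct <= t_im2col) ? ConvAlgo::kDirect : ConvAlgo::kIm2ColSgemm;
    cache_[key] = best;
    return best;
  }

 private:
  static double Time(const std::function<void()>& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
  }

  std::map<ConvKey, ConvAlgo> cache_;
};
```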
Motivation and Context
Takeaway from #26992 and possibly #23587 (although the latter needs a repro model to be sure)