Commit 333ad49
[MLAS/NEON] Add dedicated kernel for depthwise convolution for ARM64 using NEON intrinsics (microsoft#26688)
### Description
**Motivation and approach taken:**
Add a dedicated depthwise convolution kernel for the most common
depthwise convolution configuration (3x3 filter, stride = 1, pad <= 1,
dilation = 1) using NEON intrinsics. This performs significantly better
than the current `Im2Col + SGemm` approach. The Im2Col step, which
extracts convolution patches, is wasteful. In addition, for a 3x3 filter
the SGemm has `K` = 9, and Gemms are usually not optimized for such
small `K` values. Hence, a dedicated kernel works much better.
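To make the contrast with `Im2Col + SGemm` concrete, here is a minimal scalar sketch of a direct 3x3 depthwise convolution for a single channel. It is not the kernel added in this PR (that one uses NEON intrinsics). The names `IsDepthwise3x3Eligible` and `DepthwiseConv3x3Channel` are hypothetical, and the sketch assumes a row-major single-channel input with symmetric padding of 0 or 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an MLAS function): true for the configuration
// the description names (3x3 filter, stride 1, pad <= 1, dilation 1).
bool IsDepthwise3x3Eligible(size_t kernel_h, size_t kernel_w, size_t stride,
                            size_t pad, size_t dilation) {
    return kernel_h == 3 && kernel_w == 3 && stride == 1 && pad <= 1 &&
           dilation == 1;
}

// Direct 3x3 depthwise convolution for ONE channel, stride 1, dilation 1.
// Assumption: `input` is row-major H x W and `pad` is applied symmetrically.
// No patch matrix is materialized: each output reads its 9 taps in place.
std::vector<float> DepthwiseConv3x3Channel(const std::vector<float>& input,
                                           size_t H, size_t W,
                                           const float (&filter)[9],
                                           size_t pad) {
    assert(pad <= 1);
    assert(input.size() == H * W);
    assert(H + 2 * pad >= 3 && W + 2 * pad >= 3);

    const size_t out_h = H + 2 * pad - 2;
    const size_t out_w = W + 2 * pad - 2;
    std::vector<float> output(out_h * out_w, 0.0f);

    for (size_t oh = 0; oh < out_h; ++oh) {
        for (size_t ow = 0; ow < out_w; ++ow) {
            float acc = 0.0f;
            for (size_t kh = 0; kh < 3; ++kh) {
                // Signed index so that the padded border can go negative.
                const std::ptrdiff_t ih = static_cast<std::ptrdiff_t>(oh + kh) -
                                          static_cast<std::ptrdiff_t>(pad);
                if (ih < 0 || ih >= static_cast<std::ptrdiff_t>(H)) continue;
                for (size_t kw = 0; kw < 3; ++kw) {
                    const std::ptrdiff_t iw =
                        static_cast<std::ptrdiff_t>(ow + kw) -
                        static_cast<std::ptrdiff_t>(pad);
                    if (iw < 0 || iw >= static_cast<std::ptrdiff_t>(W)) continue;
                    acc += input[static_cast<size_t>(ih) * W +
                                 static_cast<size_t>(iw)] *
                           filter[kh * 3 + kw];
                }
            }
            output[oh * out_w + ow] = acc;
        }
    }
    return output;
}
```

A vectorized version would typically compute several adjacent outputs per iteration and handle the padded border separately from the interior. How the actual kernel does this is not described here, so treat that remark as a general observation rather than a description of the PR.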
Initially, I ported over the Winograd-based, NEON-accelerated depthwise
convolution kernel from PyTorch, but I found that its performance is not
very good. Its poor performance is probably due to applying the
Winograd transformation to the filter repeatedly. A better approach may
be to transform the filter offline, and this approach can be considered
for later (I reverted the PyTorch Winograd implementation in this
commit:
microsoft@2820a84).
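As an illustration of what transforming the filter offline could look like, the sketch below precomputes the Winograd filter transform `U = G g G^T` once per channel so that it does not have to be recomputed on every call. It assumes the standard F(2x2, 3x3) variant; the type and function names are hypothetical and are not taken from PyTorch or MLAS.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Filter3x3 = std::array<std::array<float, 3>, 3>;
using Tile4x4 = std::array<std::array<float, 4>, 4>;

// Winograd F(2x2, 3x3) filter transform: U = G * g * G^T.
// G is the standard 4x3 filter-transform matrix for this variant.
Tile4x4 WinogradTransformFilter(const Filter3x3& g) {
    static const float G[4][3] = {{1.0f, 0.0f, 0.0f},
                                  {0.5f, 0.5f, 0.5f},
                                  {0.5f, -0.5f, 0.5f},
                                  {0.0f, 0.0f, 1.0f}};
    // tmp = G * g  (4x3)
    float tmp[4][3] = {};
    for (size_t i = 0; i < 4; ++i)
        for (size_t j = 0; j < 3; ++j)
            for (size_t k = 0; k < 3; ++k)
                tmp[i][j] += G[i][k] * g[k][j];

    // U = tmp * G^T  (4x4)
    Tile4x4 U{};
    for (size_t i = 0; i < 4; ++i)
        for (size_t j = 0; j < 4; ++j)
            for (size_t k = 0; k < 3; ++k)
                U[i][j] += tmp[i][k] * G[j][k];
    return U;
}

// Hypothetical "offline" step: transform every channel's filter once
// (e.g. at model load time) and reuse the result across inference calls.
std::vector<Tile4x4> PretransformFilters(const std::vector<Filter3x3>& filters) {
    std::vector<Tile4x4> transformed;
    transformed.reserve(filters.size());
    for (const Filter3x3& g : filters) {
        transformed.push_back(WinogradTransformFilter(g));
    }
    assert(transformed.size() == filters.size());
    return transformed;
}
```

The trade-off is memory: each 3x3 filter (9 floats) becomes a 4x4 tile (16 floats) that must be kept alongside the model weights.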
The current depthwise kernel added in this PR was authored by
GPT5.1-Codex. With some minor bug fixes, it seems to be functionally
correct now and also provides the perf boost we are seeking.
**Unit tests:**
Depthwise convolution tests already exist in the codebase. I don't see
a need for new ones at this point.
**Kernel benchmarking:**
This is the kernel-level perf improvement from the MLAS Conv benchmarks
(about a 50% kernel latency improvement):
<img width="1055" height="90" alt="image"
src="https://github.com/user-attachments/assets/ead9eb83-2d62-4157-a065-70c67c8c7517"
/>
### Motivation and Context
A key customer model had a few depthwise convolution operations, and this
change provides a **non-negligible ~3% throughput improvement** using
the customer-provided benchmarking setup.
For those interested,
microsoft#26654 adds support for the
same type of convolution variant, but it leverages SME1/SME2 through
KleidiAI. This PR is conceptually the same but targets NEON-only
platforms.
---------
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

1 parent e74cd86 commit 333ad49
File tree
9 files changed, +447 -18 lines changed
- cmake
- onnxruntime
  - core/mlas
    - inc
    - lib
  - test/mlas/bench