Matmul nbits to optimize memory layout for avx instructions by liqunfu · Pull Request #22203 · microsoft/onnxruntime

liqunfu · 2024-09-24T15:50:37Z

The main purpose of this PR is to remove sqnbit's dependency on sgemm on x86/x64. The benefit is a cleaner memory layout: it requires neither memory alignment nor the rows-by-16-bytes layout that sgemm needs. It also offers a slight performance improvement.
A second improvement in this PR is a smaller memory footprint, achieved by fully packing zero points and scales. These inputs are no longer needed once they are packed with the weights.
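
For readers unfamiliar with the terms used in the tables below, here is a minimal reference sketch of how a block-quantized 4-bit weight column combines with its per-block scale and zero point. It is an illustration only and does not reflect the packed layout or the AVX kernels in this PR. The function name `DotQ4`, the nibble order (low nibble first), and the implied zero point of 8 for the symmetric case are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference, not the MLAS implementation.
// a:           k float activations
// packed:      k/2 bytes, two 4-bit weights per byte (assumed: low nibble = even index)
// scales:      one float per block of blk_len weights
// zero_points: one value per block (asymmetric), or nullptr (symmetric,
//              assumed implied zero point of 8)
float DotQ4(const float* a, const std::uint8_t* packed, const float* scales,
            const std::uint8_t* zero_points, std::size_t k, std::size_t blk_len) {
    assert(blk_len % 2 == 0 && k % blk_len == 0);
    float sum = 0.0f;
    for (std::size_t blk = 0; blk < k / blk_len; ++blk) {
        const int zp = zero_points ? zero_points[blk] : 8;
        float blk_sum = 0.0f;
        for (std::size_t i = 0; i < blk_len; ++i) {
            const std::size_t idx = blk * blk_len + i;
            const std::uint8_t byte = packed[idx / 2];
            // Extract the 4-bit value for this index.
            const int q = (idx % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
            blk_sum += a[idx] * static_cast<float>(q - zp);
        }
        // The scale is applied once per block.
        sum += scales[blk] * blk_sum;
    }
    return sum;
}
```

Passing `nullptr` for `zero_points` corresponds to the "Symmetric" rows in the benchmarks, and passing a per-block array corresponds to the "Asymmetric" rows.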

The following performance data shows that the new code does not degrade performance, and in some cases improves it:

Avx2 M=1, Asymmetric:

blklen	baseline time (ns)	updated time (ns)
32	35514	21530
64	36188	18863
128	30303	21186
256	32058	17880

Avx2 M=1, Symmetric:

blklen	baseline time (ns)	updated time (ns)
32	25863	19933
64	25610	22487
128	27239	20008
256	24154	21795

Avx2 M=128, Asymmetric:

blklen	baseline time (ns)	updated time (ns)
32	1903033	1858414
64	1786323	1819076
128	1884952	1790135
256	1906534	1706993

Avx2 M=128, Symmetric:

blklen	baseline time (ns)	updated time (ns)
32	1777207	1897442
64	1833315	1805860
128	1689521	1735043
256	1685658	1652083

Avx512vnni M=1, Asymmetric:

blklen	baseline time (ns)	updated time (ns)
32	22733	23498
64	22144	23345
128	19368	17810
256	19318	18823

Avx512vnni M=1, Symmetric:

blklen	baseline time (ns)	updated time (ns)
32	22410	28872
64	24994	23917
128	65785	65160
256	20412	20629

Avx512vnni M=128, Asymmetric:

blklen	baseline time (ns)	updated time (ns)
32	1616597	1355684
64	1453165	1464413
128	1116153	1093754
256	959254	989052

Avx512vnni M=128, Symmetric:

blklen	baseline time (ns)	updated time (ns)
32	1603280	1387044
64	1421595	1459699
128	1110027	1061157
256	933319	965465

Avx512 M=1, Asymmetric:

blklen	baseline time (ns)	updated time (ns)
32	23598	24242
64	22564	22820
128	21043	26688
256	22333	21199

Avx512 M=1, Symmetric:

blklen	baseline time (ns)	updated time (ns)
32	23520	25145
64	52621	23752
128	30848	21809
256	20594	21390

Avx512 M=128, Asymmetric:

blklen	baseline time (ns)	updated time (ns)
32	1653963	1598588
64	1635840	1579680
128	1633040	1595919
256	1461328	1464798

Avx512 M=128, Symmetric:

blklen	baseline time (ns)	updated time (ns)
32	1755299	1633517
64	1608648	1569993
128	1648288	1688076
256	1454290	1482201

Avx2vnni M=1, Asymmetric:

blklen	baseline time (ns)	updated time (ns)
32	21642	12166
64	19835	10350
128	20185	9565
256	19356	10586

Avx2vnni M=1, Symmetric:

blklen	baseline time (ns)	updated time (ns)
32	15515	12744
64	15347	10068
128	16598	9409
256	17510	9833

Avx2vnni M=128, Asymmetric:

blklen	baseline time (ns)	updated time (ns)
32	1040664	1105827
64	832389	859634
128	815307	819965
256	809460	823504

Avx2vnni M=128, Symmetric:

blklen	baseline time (ns)	updated time (ns)
32	1039106	1066090
64	874908	860423
128	815173	818668
256	819842	809170

Signed-off-by: liqunfu <liqun.fu@microsoft.com>

github-actions

You can commit the suggested changes from lintrunner.

onnxruntime/test/contrib_ops/matmul_4bits_test.cc

onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx512.cpp

onnxruntime/test/contrib_ops/matmul_4bits_test.cc

+}
+
+TEST(MatMulNBits, LongTestFloat32) {
+  // onnxruntime::profiling::Profiler::Profiler::Instance().StartProfiling<char>("profile.json");


…hus not to implement avx512 Signed-off-by: liqunfu <liqun.fu@microsoft.com>

… to be in a separate loop. defer this work later Signed-off-by: liqunfu <liqun.fu@microsoft.com>

onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx512_int8_blklen32.h

onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx512_int8_blklen64.h