Skip to content

Matmul nbits to optimize memory layout for avx instructions#22203

Closed
liqunfu wants to merge 29 commits intomainfrom
liqun/avx-layout
Closed

Matmul nbits to optimize memory layout for avx instructions#22203
liqunfu wants to merge 29 commits intomainfrom
liqun/avx-layout

Conversation

@liqunfu
Copy link
Contributor

@liqunfu liqunfu commented Sep 24, 2024

The main purpose of this PR is to remove sqnbit's dependency on sgemm in x86/x64 cases. The benefit is a cleaner memory layout not requiring memory alignment, no need for the Rows by 16-bytes memory layout required by sgemm. It also offers slight performance improvement.
A second improvement in the PR is to reduce memory footprint by fully packing zero point and scales. There is no need for these inputs after they are packed with weights.

The following performance data is to show that new code does not downgrade performance (if not improve):

Avx2 M=1, Asymmetric:

blklen baseline time (ns) updated time (ns)
32 35514 21530
64 36188 18863
128 30303 21186
256 32058 17880

Avx2 M=1, Symmetric:

blklen baseline time (ns) updated time (ns)
32 25863 19933
64 25610 22487
128 27239 20008
256 24154 21795

Avx2 M=128, Asymmetric:

blklen baseline time (ns) updated time (ns)
32 1903033 1858414
64 1786323 1819076
128 1884952 1790135
256 1906534 1706993

Avx2 M=128, Symmetric:

blklen baseline time (ns) updated time (ns)
32 1777207 1897442
64 1833315 1805860
128 1689521 1735043
256 1685658 1652083

Avx512vnni M=1, Asymmetric:

blklen baseline time (ns) updated time (ns)
32 22733 23498
64 22144 23345
128 19368 17810
256 19318 18823

Avx512vnni M=1, Symmetric:

blklen baseline time (ns) updated time (ns)
32 22410 28872
64 24994 23917
128 65785 65160
256 20412 20629

Avx512vnni M=128, Asymmetric:

blklen baseline time (ns) updated time (ns)
32 1616597 1355684
64 1453165 1464413
128 1116153 1093754
256 959254 989052

Avx512vnni M=128, Symmetric:

blklen baseline time (ns) updated time (ns)
32 1603280 1387044
64 1421595 1459699
128 1110027 1061157
256 933319 965465

Avx512 M=1, Asymmetric:

blklen baseline time (ns) updated time (ns)
32 23598 24242
64 22564 22820
128 21043 26688
256 22333 21199

Avx512 M=1, Symmetric:

blklen baseline time (ns) updated time (ns)
32 23520 25145
64 52621 23752
128 30848 21809
256 20594 21390

Avx512 M=128, Asymmetric:

blklen baseline time (ns) updated time (ns)
32 1653963 1598588
64 1635840 1579680
128 1633040 1595919
256 1461328 1464798

Avx512 M=128, Symmetric:

blklen baseline time (ns) updated time (ns)
32 1755299 1633517
64 1608648 1569993
128 1648288 1688076
256 1454290 1482201

Avx2vnni M=1, Asymmetric:

blklen baseline time (ns) updated time (ns)
32 21642 12166
64 19835 10350
128 20185 9565
256 19356 10586

Avx2vnni M=1, Symmetric:

blklen baseline time (ns) updated time (ns)
32 15515 12744
64 15347 10068
128 16598 9409
256 17510 9833

Avx2vnni M=128, Asymmetric:

blklen baseline time (ns) updated time (ns)
32 1040664 1105827
64 832389 859634
128 815307 819965
256 809460 823504

Avx2vnni M=128, Symmetric:

blklen baseline time (ns) updated time (ns)
32 1039106 1066090
64 874908 860423
128 815173 818668
256 819842 809170

Signed-off-by: liqunfu <liqun.fu@microsoft.com>
@liqunfu liqunfu requested a review from a team as a code owner September 24, 2024 15:50
@liqunfu liqunfu marked this pull request as draft September 24, 2024 15:50
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can commit the suggested changes from lintrunner.

}

TEST(MatMulNBits, LongTestFloat32) {
// onnxruntime::profiling::Profiler::Profiler::Instance().StartProfiling<char>("profile.json");

Check notice

Code scanning / CodeQL

Commented-out code

This comment appears to contain commented-out code.
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
…hus not to implement avx512

Signed-off-by: liqunfu <liqun.fu@microsoft.com>
… to be in a separate loop. defer this work later

Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
liqunfu and others added 8 commits December 13, 2024 10:09
Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Signed-off-by: Liqun Fu <liqun_fu@hotmail.com>
Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
__m512i sum_16_epi32 = _mm512_madd_epi16(one_32_epi16, sum_32_epi16);
__m512 sum_16_ps = _mm512_cvtepi32_ps(sum_16_epi32);
acc = _mm512_fmadd_ps(sum_16_ps, _mm512_set1_ps(combined_scale), acc);
// acc = _mm512_fmadd_ps(sum_16_ps, load_broadcast_512(combined_scale), acc);

Check notice

Code scanning / CodeQL

Commented-out code

This comment appears to contain commented-out code.
Comment on lines +52 to +56
// folowing 2 lines do the same with close perf (more latency count).
// it requires CPUID Flags: AVX512DQ which is more restricted
// const __m256 scale_b_ps = _mm256_castpd_ps(_mm256_broadcast_sd(combined_scale));
// const __m512 scale_b_16_ps = _mm512_broadcast_f32x8(scale_b_ps);
// return;

Check notice

Code scanning / CodeQL

Commented-out code

This comment appears to contain commented-out code.
Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
…will not compile on Cuda CI

Signed-off-by: liqunfu <liqun.fu@microsoft.com>
@liqunfu liqunfu marked this pull request as ready for review January 16, 2025 17:38
liqunfu added 6 commits March 17, 2025 10:29
Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
…or>(InputIndex::scales);

Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
@liqunfu liqunfu closed this May 13, 2025
@liqunfu liqunfu deleted the liqun/avx-layout branch May 13, 2025 18:38
@liqunfu liqunfu restored the liqun/avx-layout branch May 13, 2025 18:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant