Conversation

@skykongkong8
Member

Dependency of the PR

None.

Summary

This Pull Request introduces a further optimized version of the qsi4cxp_qs4cxs1s0 GEMM first proposed in #3497.
Optimization techniques:

  • neon SIMD
  • openMP based multithreading
  • automatic ukernel selection

As a result, this patch reduces GEMM latency by roughly 3.4x (8.64 ms -> 2.52 ms):

// Tested on Galaxy S25U
[ RUN      ] nntrainer_cpu_backend_standalone.qai8dxp_qsi4cxp_512x768x2048
BEFORE : 8639844 ns 8639 us 8 ms
AFTER : 2524531 ns 2524 us 2 ms

In my measurements, this is also roughly 2.8x faster than the previous Q4_0 GEMM.
FYI) gemm_q4_0: 6986094 ns 6986 us 6 ms

For a practical usage sample, refer to unittest_nntrainer_cpu_backend_fp16.cpp

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>

- armv8.2-a+fp16+dotprod -> armv8.2-a+fp16+dotprod+i8mm
- Adding i8mm enables the use of high-performance int8 matrix-multiply SIMD intrinsics

- Note: the cpu backend interface still needs a fallback function for this.

- Assume the weight is offline-packed in the qs4cxs1s0 manner, with its optimal ukernel idx
- unittest TC reports:
[INFO] sgemm :    387812 ns 387 us 0 ms
[INFO] test_gemm_qai8dxp_qsi4cxp_packed: 16667 ns 16 us 0 ms
[INFO] MSE: 0.554387, COS_SIM: 0.998757, MAX_DIFFER: 3.13451, SUM: 267.005, SUM_GT: 300.489

…s1s0 format

- todo: automatically return the optimal kernel variant idx and feed it to the packed TC

- Multithread with OpenMP along the N direction, coarse-grained
// Tested on Galaxy S23
- BEFORE
test_gemm_qai8dxp_qsi4cxp_packed: 6934427 ns 6934 us 6 ms
- AFTER
test_gemm_qai8dxp_qsi4cxp_packed: 4398489 ns 4398 us 4 ms
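The coarse-grained split along N can be sketched as below. This is an illustrative sketch only: `kernel_block` is a stand-in naive kernel, not the actual packed qsi4cxp ukernel, and `n_block` is a free parameter. Each thread owns a disjoint range of output columns, so the writes never race and no synchronization is needed.

```cpp
#include <cstddef>

// Stand-in for the single-threaded ukernel: computes C[m, n0:n1] only.
static void kernel_block(const float *a, const float *b, float *c,
                         size_t M, size_t K, size_t n0, size_t n1, size_t N) {
  for (size_t m = 0; m < M; ++m)
    for (size_t n = n0; n < n1; ++n) {
      float acc = 0.f;
      for (size_t k = 0; k < K; ++k)
        acc += a[m * K + k] * b[k * N + n];
      c[m * N + n] = acc;
    }
}

// Coarse-grained OpenMP parallelization over N: each iteration of the
// outer loop handles one contiguous block of output columns.
void gemm_coarse_n(const float *a, const float *b, float *c,
                   size_t M, size_t K, size_t N, size_t n_block) {
  long n_blocks = (long)((N + n_block - 1) / n_block);
#pragma omp parallel for schedule(static)
  for (long nb = 0; nb < n_blocks; ++nb) {
    size_t n0 = (size_t)nb * n_block;
    size_t n1 = (n0 + n_block < N) ? n0 + n_block : N; // last block may be short
    kernel_block(a, b, c, M, K, n0, n1, N);            // disjoint columns: no races
  }
}
```

Without `-fopenmp` the pragma is ignored and the code degrades gracefully to the single-threaded loop.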

- The optimal kernel idx is not always consistent across runs.
- As a heuristic, I measured the optimal kernel idx over multiple runs and chose the most frequently occurring one (17 / 20).
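The majority-vote heuristic above can be sketched as follows (a hypothetical helper, not the nntrainer API): time every candidate ukernel in each repetition, record which index wins the round, and return the most frequent winner.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <vector>

// Sketch of the majority-vote heuristic: `candidates` stands in for the
// real ukernel table. Each repetition times all candidates once; the
// index that wins the most repetitions is selected.
int pick_ukernel(const std::vector<std::function<void()>> &candidates,
                 int repeats = 20) {
  std::vector<int> wins(candidates.size(), 0);
  for (int r = 0; r < repeats; ++r) {
    int best = 0;
    double best_ns = 1e30;
    for (size_t i = 0; i < candidates.size(); ++i) {
      auto t0 = std::chrono::steady_clock::now();
      candidates[i](); // run the candidate once on a representative shape
      auto t1 = std::chrono::steady_clock::now();
      double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
      if (ns < best_ns) {
        best_ns = ns;
        best = (int)i;
      }
    }
    ++wins[best]; // one vote per repetition
  }
  return (int)(std::max_element(wins.begin(), wins.end()) - wins.begin());
}
```

Voting over repetitions absorbs run-to-run timing noise that a single measurement would not.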

- todo: support the non-4-divisible N case
[ RUN      ] nntrainer_cpu_backend_standalone.qai8dxp_qsi4cxp_512x768x2048
BEFORE : 8639844 ns 8639 us 8 ms
AFTER : 2524531 ns 2524 us 2 ms

FYI) gemm_q4_0: 6986094 ns 6986 us 6 ms

- nntr_gemm_qai8dxp_qsi4cxp_packed
- nntr_qsi4cxp_qs4cxs1s0_rhs_pack
- nntr_get_rhs_packed_size_qsi4cxp_qs4cxs1s0

- kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp4x8_4x4x32_neon_i8mm.h
- kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp4x8_8x4x32_neon_i8mm.h
- kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp8x8_4x8x32_neon_i8mm.h
- kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp8x8_8x8x32_neon_i8mm.h

- kleidiai-based functions generally have fallback implementations, but some special-purpose functions are only used on ARM.
- Still, for easier maintenance of cpu_backend, those function headers should be declared on the other backends as well. (Needs other opinions, though)
- trivial doxygen tags

Note that this operation expects:
1. The weight is transposed,
2. The weight is quantized channel-wise,
3. The weight is packed with ukernel idx 1 (GEMV) or 5 (GEMM),
4. The activation is FP32 (since it is implemented in float_tensor.cpp)
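For item 2, here is a minimal sketch of symmetric channel-wise int4 quantization: each output channel (one row of the transposed weight) gets its own scale so its values map into [-7, 7]. This only illustrates the scheme; the real qs4cxs1s0 format additionally packs two nibbles per byte and interleaves the scales, which is omitted here.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// One quantized output channel: int4 values (stored one per byte for
// clarity) plus the per-channel scale.
struct QuantChannel {
  std::vector<int8_t> q;
  float scale;
};

// Symmetric per-channel quantization: scale = amax / 7 so that every
// value rounds into the signed-int4 range [-7, 7].
QuantChannel quantize_channel(const std::vector<float> &w) {
  float amax = 0.f;
  for (float v : w)
    amax = std::max(amax, std::fabs(v));
  float scale = amax > 0.f ? amax / 7.f : 1.f;
  QuantChannel out{std::vector<int8_t>(w.size()), scale};
  for (size_t i = 0; i < w.size(); ++i)
    out.q[i] = (int8_t)std::lround(w[i] / scale); // round-to-nearest, in [-7, 7]
  return out;
}
```

Because the scale is chosen per channel rather than per tensor, a channel with small weights is not crushed by an outlier in another channel.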

@github-actions

This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 3 days.

@github-actions github-actions bot added the Stale label Oct 25, 2025