Conversation

@skykongkong8
Member

Dependency of the PR

None.

Summary

This Pull Request introduces a further optimized version of the qsi4cxp_qs4cxs1s0 GEMM first proposed in #3497.
Optimization techniques:

  • neon SIMD
  • openMP based multithreading
  • automatic ukernel selection

As a result, this patch reduces GEMM latency by roughly 3.4x (8.64 ms -> 2.52 ms):

// Tested on Galaxy S25U
[ RUN      ] nntrainer_cpu_backend_standalone.qai8dxp_qsi4cxp_512x768x2048
BEFORE : 8639844 ns 8639 us 8 ms
AFTER : 2524531 ns 2524 us 2 ms

In my measurements, this is also roughly 2.8x faster than the previous Q4_0 GEMM.
FYI) gemm_q4_0: 6986094 ns 6986 us 6 ms

For a practical usage sample, refer to unittest_nntrainer_cpu_backend_fp16.cpp

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>

- armv8.2-a+fp16+dotprod -> armv8.2-a+fp16+dotprod+i8mm
- Adding i8mm enables the use of high-performance int8 matrix-multiply SIMD intrinsics

- Note: the cpu backend interface still needs a fallback function for this.

- Assume the weight is offline-packed in the qs4cxs1s0 manner, with its optimal ukernel idx
- unittest TC reports:
[INFO] sgemm :    387812 ns 387 us 0 ms
[INFO] test_gemm_qai8dxp_qsi4cxp_packed: 16667 ns 16 us 0 ms
[INFO] MSE: 0.554387, COS_SIM: 0.998757, MAX_DIFFER: 3.13451, SUM: 267.005, SUM_GT: 300.489

…s1s0 format

- todo: automatically return the optimal kernel variant idx and feed it to the packed TC

- Multithread with OpenMP along the N direction, coarse-grained
// Tested on Galaxy S23
- BEFORE
test_gemm_qai8dxp_qsi4cxp_packed: 6934427 ns 6934 us 6 ms
- AFTER
test_gemm_qai8dxp_qsi4cxp_packed: 4398489 ns 4398 us 4 ms
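The coarse-grained split along N can be sketched as below. This is an illustrative sketch only: `kernel_block` is a stand-in naive kernel, not the actual packed qsi4cxp ukernel, and `n_block` is a free parameter. Each thread owns a disjoint range of output columns, so the writes never race and no synchronization is needed.

```cpp
#include <cstddef>

// Stand-in for the single-threaded ukernel: computes C[m, n0:n1] only.
static void kernel_block(const float *a, const float *b, float *c,
                         size_t M, size_t K, size_t n0, size_t n1, size_t N) {
  for (size_t m = 0; m < M; ++m)
    for (size_t n = n0; n < n1; ++n) {
      float acc = 0.f;
      for (size_t k = 0; k < K; ++k)
        acc += a[m * K + k] * b[k * N + n];
      c[m * N + n] = acc;
    }
}

// Coarse-grained OpenMP parallelization over N: each iteration of the
// outer loop handles one contiguous block of output columns.
void gemm_coarse_n(const float *a, const float *b, float *c,
                   size_t M, size_t K, size_t N, size_t n_block) {
  long n_blocks = (long)((N + n_block - 1) / n_block);
#pragma omp parallel for schedule(static)
  for (long nb = 0; nb < n_blocks; ++nb) {
    size_t n0 = (size_t)nb * n_block;
    size_t n1 = (n0 + n_block < N) ? n0 + n_block : N; // last block may be short
    kernel_block(a, b, c, M, K, n0, n1, N);            // disjoint columns: no races
  }
}
```

Without `-fopenmp` the pragma is ignored and the code degrades gracefully to the single-threaded loop.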

- The optimal kernel idx is not always consistent across runs.
- As a heuristic, I measured the optimal kernel idx over multiple runs and chose the most frequently occurring one (17 / 20).
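The majority-vote heuristic above can be sketched as follows (a hypothetical helper, not the nntrainer API): time every candidate ukernel in each repetition, record which index wins the round, and return the most frequent winner.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <vector>

// Sketch of the majority-vote heuristic: `candidates` stands in for the
// real ukernel table. Each repetition times all candidates once; the
// index that wins the most repetitions is selected.
int pick_ukernel(const std::vector<std::function<void()>> &candidates,
                 int repeats = 20) {
  std::vector<int> wins(candidates.size(), 0);
  for (int r = 0; r < repeats; ++r) {
    int best = 0;
    double best_ns = 1e30;
    for (size_t i = 0; i < candidates.size(); ++i) {
      auto t0 = std::chrono::steady_clock::now();
      candidates[i](); // run the candidate once on a representative shape
      auto t1 = std::chrono::steady_clock::now();
      double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
      if (ns < best_ns) {
        best_ns = ns;
        best = (int)i;
      }
    }
    ++wins[best]; // one vote per repetition
  }
  return (int)(std::max_element(wins.begin(), wins.end()) - wins.begin());
}
```

Voting over repetitions absorbs run-to-run timing noise that a single measurement would not.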

- todo: support the non-4-divisible N case
[ RUN      ] nntrainer_cpu_backend_standalone.qai8dxp_qsi4cxp_512x768x2048
BEFORE : 8639844 ns 8639 us 8 ms
AFTER : 2524531 ns 2524 us 2 ms

FYI) gemm_q4_0: 6986094 ns 6986 us 6 ms

- nntr_gemm_qai8dxp_qsi4cxp_packed
- nntr_qsi4cxp_qs4cxs1s0_rhs_pack
- nntr_get_rhs_packed_size_qsi4cxp_qs4cxs1s0

- kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp4x8_4x4x32_neon_i8mm.h
- kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp4x8_8x4x32_neon_i8mm.h
- kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp8x8_4x8x32_neon_i8mm.h
- kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp8x8_8x8x32_neon_i8mm.h

- kleidiai-based functions generally have fallback implementations, but some special-purpose functions are only used on ARM.
- Still, for easier maintenance of cpu_backend, those function headers should be declared on the other backends as well. (Needs other opinions, though)
- trivial doxygen tags

Note that this operation expects:
1. The weight is transposed,
2. The weight is quantized channel-wise,
3. The weight is packed with ukernel idx 1 (GEMV) or 5 (GEMM),
4. The activation is FP32 (since it is implemented in float_tensor.cpp)
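For item 2, here is a minimal sketch of symmetric channel-wise int4 quantization: each output channel (one row of the transposed weight) gets its own scale so its values map into [-7, 7]. This only illustrates the scheme; the real qs4cxs1s0 format additionally packs two nibbles per byte and interleaves the scales, which is omitted here.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// One quantized output channel: int4 values (stored one per byte for
// clarity) plus the per-channel scale.
struct QuantChannel {
  std::vector<int8_t> q;
  float scale;
};

// Symmetric per-channel quantization: scale = amax / 7 so that every
// value rounds into the signed-int4 range [-7, 7].
QuantChannel quantize_channel(const std::vector<float> &w) {
  float amax = 0.f;
  for (float v : w)
    amax = std::max(amax, std::fabs(v));
  float scale = amax > 0.f ? amax / 7.f : 1.f;
  QuantChannel out{std::vector<int8_t>(w.size()), scale};
  for (size_t i = 0; i < w.size(); ++i)
    out.q[i] = (int8_t)std::lround(w[i] / scale); // round-to-nearest, in [-7, 7]
  return out;
}
```

Because the scale is chosen per channel rather than per tensor, a channel with small weights is not crushed by an outlier in another channel.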

@github-actions

This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 3 days.

@github-actions github-actions bot added the Stale label Oct 25, 2025