[ cpu_backend ] Enable practical use of qsi4cxp_qs4cxs1s0 GEMM with openMP multithreading and automatic ukernel candidate selection #3519
Open: skykongkong8 wants to merge 15 commits into nnstreamer:main from skykongkong8:poc/arm/kai/i8mmkernel+weightofflinepackingdetached
+4,075
−392
- armv8.2-a+fp16+dotprod -> armv8.2-a+fp16+dotprod+i8mm
- Adding i8mm enables the use of high-performance SIMD intrinsics
**Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
**Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
**Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
- note : for cpu backend interface, need to add fallback function for this... **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
- Assume the weight is offline-packed in the qs4cxs1s0 format, with its optimal ukernel idx
- The unit test reports:
[INFO] sgemm : 387812 ns 387 us 0 ms
[INFO] test_gemm_qai8dxp_qsi4cxp_packed: 16667 ns 16 us 0 ms
[INFO] MSE: 0.554387, COS_SIM: 0.998757, MAX_DIFFER: 3.13451, SUM: 267.005, SUM_GT: 300.489
**Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
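The MSE, COS_SIM, and MAX_DIFFER figures in the log above compare the quantized GEMM output against the FP32 `sgemm` result. The following is a minimal sketch of how such metrics are conventionally computed; `ErrorStats` and `compare_outputs` are hypothetical names for illustration, and the definitions are the textbook ones, not necessarily those used by the nntrainer test utilities.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for the three accuracy figures.
struct ErrorStats {
  double mse;      // mean squared error
  double cos_sim;  // cosine similarity (0 if either vector is all zeros)
  double max_diff; // largest absolute element-wise difference
};

// Compare a reference output (e.g. FP32 GEMM) with a test output
// (e.g. quantized GEMM). Both buffers must have the same, non-zero length.
ErrorStats compare_outputs(const std::vector<float> &ref,
                           const std::vector<float> &out) {
  assert(ref.size() == out.size() && !ref.empty());
  double sq_err = 0.0, dot = 0.0, norm_ref = 0.0, norm_out = 0.0;
  double max_diff = 0.0;
  for (std::size_t i = 0; i < ref.size(); ++i) {
    const double r = ref[i];
    const double o = out[i];
    const double d = r - o;
    sq_err += d * d;
    dot += r * o;
    norm_ref += r * r;
    norm_out += o * o;
    max_diff = std::max(max_diff, std::fabs(d));
  }
  ErrorStats s;
  s.mse = sq_err / static_cast<double>(ref.size());
  const double denom = std::sqrt(norm_ref) * std::sqrt(norm_out);
  s.cos_sim = (denom > 0.0) ? dot / denom : 0.0;
  s.max_diff = max_diff;
  return s;
}
```

A cosine similarity close to 1 together with a small MSE suggests the quantized path tracks the FP32 result in direction and magnitude, while the maximum difference exposes isolated outliers.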
…s1s0 format - todo: automatically return the optimal kernel variant idx and feed it to the packed TC **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
**Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
**Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
- Multithread with OpenMP along the N direction, coarse-grained
// Tested on Galaxy S23
- BEFORE test_gemm_qai8dxp_qsi4cxp_packed: 6934427 ns 6934 us 6 ms
- AFTER test_gemm_qai8dxp_qsi4cxp_packed: 4398489 ns 4398 us 4 ms
**Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
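To illustrate what a coarse-grained split along N can look like, here is a minimal sketch using a plain FP32 GEMM in place of the KleidiAI micro-kernel. The function names `gemm_column_block` and `gemm_parallel_n`, and the step of 4 columns, are assumptions for illustration and are not taken from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a micro-kernel call: computes columns
// [n_begin, n_end) of C for row-major A (M x K) and a transposed
// weight B_t (N x K).
static void gemm_column_block(const float *A, const float *B_t, float *C,
                              std::size_t M, std::size_t N, std::size_t K,
                              std::size_t n_begin, std::size_t n_end) {
  for (std::size_t m = 0; m < M; ++m) {
    for (std::size_t n = n_begin; n < n_end; ++n) {
      float acc = 0.0f;
      for (std::size_t k = 0; k < K; ++k)
        acc += A[m * K + k] * B_t[n * K + k];
      C[m * N + n] = acc;
    }
  }
}

// Coarse-grained split along N: each block is one contiguous column range.
// The block width is rounded up to a multiple of n_step (4 is an assumption)
// so that only the last block can be narrower.
void gemm_parallel_n(const float *A, const float *B_t, float *C,
                     std::size_t M, std::size_t N, std::size_t K,
                     int num_threads) {
  assert(num_threads > 0);
  if (N == 0)
    return;
  const std::size_t nt = static_cast<std::size_t>(num_threads);
  const std::size_t n_step = 4;
  std::size_t per_thread = (N + nt - 1) / nt;
  per_thread = ((per_thread + n_step - 1) / n_step) * n_step;
  const int num_blocks = static_cast<int>((N + per_thread - 1) / per_thread);

  // Without OpenMP enabled the pragma is ignored and the loop runs serially.
#pragma omp parallel for num_threads(num_threads)
  for (int b = 0; b < num_blocks; ++b) {
    const std::size_t n_begin = static_cast<std::size_t>(b) * per_thread;
    const std::size_t n_end = std::min(N, n_begin + per_thread);
    gemm_column_block(A, B_t, C, M, N, K, n_begin, n_end);
  }
}
```

Because each block writes a disjoint set of output columns, no synchronization between threads is needed inside the loop. With a packed weight, each block would additionally need its own offset into the packed buffer, which this sketch does not model.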
- The optimal kernel idx is not always consistent across runs.
- As a heuristic, I ran the selection multiple times and chose the most frequently occurring idx (17 / 20).
**Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
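The voting heuristic described in this commit can be sketched as follows. This is only an illustration under assumptions: the candidates are modeled as generic callables, and `fastest_candidate`, `most_frequent_index`, and `select_kernel_idx` are hypothetical names, not functions from the patch.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Time every candidate once and return the index of the fastest one.
std::size_t
fastest_candidate(const std::vector<std::function<void()>> &candidates) {
  assert(!candidates.empty());
  using clock = std::chrono::steady_clock;
  std::size_t best = 0;
  clock::duration best_time = clock::duration::max();
  for (std::size_t i = 0; i < candidates.size(); ++i) {
    const clock::time_point t0 = clock::now();
    candidates[i]();
    const clock::duration dt = clock::now() - t0;
    if (dt < best_time) {
      best_time = dt;
      best = i;
    }
  }
  return best;
}

// Return the index that appears most often in `winners`.
// Ties are resolved in favor of the smallest index.
std::size_t most_frequent_index(const std::vector<std::size_t> &winners,
                                std::size_t num_candidates) {
  assert(num_candidates > 0);
  std::vector<std::size_t> votes(num_candidates, 0);
  for (std::size_t w : winners) {
    assert(w < num_candidates);
    ++votes[w];
  }
  std::size_t best = 0;
  for (std::size_t i = 1; i < num_candidates; ++i)
    if (votes[i] > votes[best])
      best = i;
  return best;
}

// Repeat the timing `runs` times and keep the index that wins most often.
std::size_t
select_kernel_idx(const std::vector<std::function<void()>> &candidates,
                  int runs) {
  std::vector<std::size_t> winners;
  for (int r = 0; r < runs; ++r)
    winners.push_back(fastest_candidate(candidates));
  return most_frequent_index(winners, candidates.size());
}
```

Voting over several runs smooths out timing noise from scheduling and frequency scaling, at the cost of repeating the measurement.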
- todo: support the case where N is not divisible by 4
[ RUN ] nntrainer_cpu_backend_standalone.qai8dxp_qsi4cxp_512x768x2048
BEFORE : 8639844 ns 8639 us 8 ms
AFTER : 2524531 ns 2524 us 2 ms
FYI) gemm_q4_0: 6986094 ns 6986 us 6 ms
**Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
- nntr_gemm_qai8dxp_qsi4cxp_packed - nntr_qsi4cxp_qs4cxs1s0_rhs_pack - nntr_get_rhs_packed_size_qsi4cxp_qs4cxs1s0 **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
- kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp4x8_4x4x32_neon_i8mm.h - kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp4x8_8x4x32_neon_i8mm.h - kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp8x8_4x8x32_neon_i8mm.h - kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp8x8_8x8x32_neon_i8mm.h **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
- KleidiAI-based functions generally have fallback implementations, but some special-purpose functions are used only on ARM.
- Still, for easier maintenance of cpu_backend, those functions should also be declared on the other architectures' side. (Other opinions are welcome, though.)
- Trivial doxygen tags
**Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
Note that this operation expects:
1. The weight is transposed,
2. The weight is quantized with a channel-wise scheme,
3. The weight is packed with kernel idx 1 (GEMV) or 5 (GEMM),
4. The activation is FP32 (since it is implemented in float_tensor.cpp)
**Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
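As an illustration of expectations 1 and 2, the sketch below quantizes a transposed FP32 weight (N x K) to 4-bit values with one scale per output channel. It assumes a symmetric scheme and leaves the values unpacked, one per byte; the actual qs4cxs1s0 packing layout is not reproduced here, and `ChannelwiseInt4` and `quantize_channelwise_int4` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: unpacked 4-bit values plus per-channel scales.
struct ChannelwiseInt4 {
  std::vector<std::int8_t> values; // N x K, each in [-8, 7], one per byte
  std::vector<float> scales;       // one scale per output channel (N entries)
};

// Symmetric per-channel quantization of a transposed weight (N x K).
// Each row n gets scale = max|w| / 7 so that the largest magnitude maps to 7.
ChannelwiseInt4 quantize_channelwise_int4(const std::vector<float> &weight_t,
                                          std::size_t N, std::size_t K) {
  assert(weight_t.size() == N * K);
  ChannelwiseInt4 q;
  q.values.resize(N * K);
  q.scales.resize(N);
  for (std::size_t n = 0; n < N; ++n) {
    float max_abs = 0.0f;
    for (std::size_t k = 0; k < K; ++k)
      max_abs = std::max(max_abs, std::fabs(weight_t[n * K + k]));
    // Avoid dividing by zero for an all-zero channel.
    const float scale = (max_abs > 0.0f) ? max_abs / 7.0f : 1.0f;
    q.scales[n] = scale;
    for (std::size_t k = 0; k < K; ++k) {
      const long r = std::lround(weight_t[n * K + k] / scale);
      q.values[n * K + k] =
        static_cast<std::int8_t>(std::min(7L, std::max(-8L, r)));
    }
  }
  return q;
}
```

Dequantizing element (n, k) is then `values[n * K + k] * scales[n]`, which is why the scale has to be tracked per output channel rather than per tensor.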
This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 3 days.
Dependency of the PR
None.
Summary
This pull request introduces a further-optimized version of the qsi4cxp_qs4cxs1s0 GEMM first proposed in #3497.
Optimization techniques applied:
As a result, this patch makes GEMM computation approximately 4 times faster:
In my inspection, this is more than 2 to 3 times faster than the previous Q4_0 GEMM.
FYI) gemm_q4_0: 6986094 ns 6986 us 6 ms
For a practical usage sample, refer to unittest_nntrainer_cpu_backend_fp16.cpp