[Aarch64] Paged Attention FP16 precision enablement. #29219
Conversation
@dmitry-gorokhov, kindly review the changes.
@ashwins990 @abhijain1204fujitsu The failed CI job is not related to the PR changes.
Force-pushed 4f28ab0 to 03b4c51
Hi, @dmitry-gorokhov.
Review comment on src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/executor_pa.cpp (outdated, resolved)
Are there any plans to extend the specific tests?
Force-pushed 03b4c51 to 4885881
Please find the line coverage [with the lcov tool] for f16 inference. Please let me know if any further tests are required. The generated output is good (manually tested). @maxnick Is this fine? Can we extend the tests [integrated within OpenVINO] in the next PR, if required?
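A rough sketch of how such an lcov report can be produced on a coverage-instrumented build (the build directory, file filter, and output paths below are assumptions for illustration, not details from this PR):

```python
# Hypothetical helper wrapping the standard lcov/genhtml flow after running
# the f16 Paged Attention tests on a build compiled with --coverage.
# Assumptions: build tree at "build/", lcov and genhtml available on PATH.
import subprocess

def collect_coverage(build_dir: str = "build", out: str = "coverage.info") -> None:
    # Capture counters produced by the instrumented test run.
    subprocess.run(["lcov", "--capture", "--directory", build_dir,
                    "--output-file", out], check=True)
    # Keep only the files of interest (e.g. the scaled_attn kernels).
    subprocess.run(["lcov", "--extract", out, "*scaled_attn*",
                    "--output-file", out], check=True)
    # Render an HTML report for review.
    subprocess.run(["genhtml", out, "--output-directory", "coverage_html"],
                   check=True)

if __name__ == "__main__":
    collect_coverage()
```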
build_jenkins
Review comments on src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/executor_pa.cpp (3 threads; outdated, resolved)
Thank you for the code coverage report. It looks like the newly added code is well covered with tests.

@allnes, Yes, it works as expected. Thanks!
Hi @allnes, |
Review comments on ...common/transformations/src/transformations/common_optimizations/convert_pagedattn_inputs.cpp (2 threads; outdated, resolved)
Force-pushed c2d65fe to 78d52cb
build_jenkins
Force-pushed 78d52cb to 0a1bc22
@maxnick I have resolved the merge conflicts. Thanks!
build_jenkins
This development is related to Feature Request #26422. This PR enables f16 inference precision for the Paged Attention operator, with the key-value cache precision as u8.

Updated: Using Kleidi instead of ACL. Attaching the server benchmarking result on Graviton 3E (64 cores), which compares f16 performance with the f32 (reference) precision.

[Benchmark result image: f16 vs. f32 on Graviton 3E, 64 cores]

Co-authored-by: Nesterov Alexander <[email protected]>
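A minimal sketch of how these precisions could be requested through the OpenVINO Python API (the hint properties `inference_precision` and `kv_cache_precision` and the model path are assumptions about the surrounding API, not code from this PR):

```python
# Minimal sketch, assuming an OpenVINO release whose Python API exposes
# openvino.properties.hint.inference_precision / kv_cache_precision;
# "model.xml" is a hypothetical path to an LLM using Paged Attention.
import openvino as ov
import openvino.properties.hint as hints

core = ov.Core()
compiled = core.compile_model(
    "model.xml",  # hypothetical model path
    "CPU",        # Aarch64 CPU plugin
    {
        hints.inference_precision: ov.Type.f16,  # run Paged Attention in f16
        hints.kv_cache_precision: ov.Type.u8,    # store the KV cache as u8
    },
)
```

The equivalent string keys `INFERENCE_PRECISION_HINT` and `KV_CACHE_PRECISION` should behave the same where the property objects are not available in a given release.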