-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Description
Request Description
PagedAttention operation is already implemented in bounds of CPU plugin using C++ and optimized for x64 using avx2/avx512 instrinsics.
The request is to optimize PA operation for aarch64 using NEON/SVE extensions.
Please refer to SDPA optimization using NEON for reference.
How to build OV on ARM: https://github.com/openvinotoolkit/openvino/blob/master/docs/dev/build.md
Feature Use Case
PagedAttention operation implements attention algo required for workloads like continuous batching or speculative decoding. PagedAttention is used as basic attention block in VLLM OpenVINO backend and under OpenVINO GenAI API (for some use-cases). PA operation might take significant resources for execution (especially for long contexts), so its optimization is crucial for overall LLM based workloads.
Issue submission checklist
- The feature request or improvement must be related to OpenVINO
Metadata
Metadata
Assignees
Labels
Type
Projects
Status