[Feature Request]: Optimize PagedAttention operation on aarch64 HW #26422

Description

@dmitry-gorokhov

Request Description

The PagedAttention operation is already implemented within the CPU plugin in C++ and optimized for x64 using AVX2/AVX-512 intrinsics.
The request is to optimize the PA operation for aarch64 using the NEON/SVE extensions.

Please refer to the SDPA optimization using NEON as a reference.
How to build OV on ARM: https://github.com/openvinotoolkit/openvino/blob/master/docs/dev/build.md
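
As a rough illustration of the kind of inner loop such an optimization targets, the sketch below shows a NEON float32 dot product of the sort a q·k kernel might use on aarch64. The function name and signature are hypothetical and are not part of the plugin's code.

```cpp
#include <arm_neon.h>
#include <cstddef>

// Illustrative only: float32 dot product over a head_size-long vector,
// i.e. the q·k inner loop that a NEON-optimized PagedAttention kernel
// would use in place of the scalar reference implementation.
inline float dot_product_neon(const float* a, const float* b, size_t n) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // Multiply-accumulate four lanes at a time.
        acc = vmlaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    }
    float sum = vaddvq_f32(acc);  // horizontal add of the four lanes (aarch64)
    for (; i < n; ++i)            // scalar tail for the remaining elements
        sum += a[i] * b[i];
    return sum;
}
```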

Feature Use Case

The PagedAttention operation implements the attention algorithm required for workloads such as continuous batching and speculative decoding. PagedAttention is used as the basic attention block in the vLLM OpenVINO backend and under the OpenVINO GenAI API (for some use cases). The PA operation can consume significant execution resources (especially for long contexts), so its optimization is crucial for the overall performance of LLM-based workloads.
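
For context, the sketch below illustrates the block-table indirection that distinguishes PagedAttention from regular SDPA: the keys/values of a sequence are scattered across fixed-size physical blocks of the KV cache and must be gathered per token position. The structure and field names are assumptions for illustration only, not the plugin's actual data layout.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the core indirection in PagedAttention:
// logical token positions are mapped through a per-sequence block table
// to physical blocks of the KV cache, so the attention kernel gathers
// keys/values block by block instead of reading one contiguous buffer.
struct PagedKVCache {
    const float* key_cache;   // layout: [num_blocks, block_size, head_size]
    size_t block_size;
    size_t head_size;

    // Returns a pointer to the key vector for the given logical token position.
    const float* key_at(const std::vector<int32_t>& block_table,
                        size_t token_pos) const {
        const size_t block_id        = block_table[token_pos / block_size];
        const size_t offset_in_block = token_pos % block_size;
        return key_cache + (block_id * block_size + offset_in_block) * head_size;
    }
};
```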

Issue submission checklist

  • The feature request or improvement must be related to OpenVINO
