
[Arm] Enable SDPA Fusion with Sink input on ARM CPUs#33566

Merged
alvoron merged 1 commit into openvinotoolkit:master from MonakaResearch:add-support-for-sink-input-GPT-OSS-ARM
Feb 4, 2026
Conversation

@abhijain1204fujitsu (Contributor) commented Jan 13, 2026

This PR enables SDPA fusion on ARM when attention contains an additional Sink input.
The operation was checked and validated with the GPT-OSS 20B model.

With this change, improved performance has been observed during LLM inference; refer to the table below.

[image: performance table]

** All values are in tokens per second (decoding throughput).
Machine: Graviton4, single socket, 96 cores.

Kindly review the PR and share any feedback.

Thanks!

This work is contributed by @ashwins990 & @abhijain1204fujitsu

@abhijain1204fujitsu abhijain1204fujitsu requested review from a team as code owners January 13, 2026 04:32
@github-actions github-actions bot added the category: CPU OpenVINO CPU plugin label Jan 13, 2026
@sys-openvino-ci sys-openvino-ci added the ExternalPR External contributor label Jan 13, 2026
@maxnick maxnick added the platform: arm OpenVINO on ARM / ARM64 label Jan 13, 2026
@maxnick maxnick requested review from alvoron and Copilot January 13, 2026 08:51
@maxnick (Contributor) commented Jan 13, 2026

@alvoron , could you please review?

Copilot AI left a comment

Pull request overview

This PR enables SDPA (Scaled Dot Product Attention) fusion with Sink input support for ARM CPUs, extending functionality previously limited to x86_64 platforms. The changes remove platform-specific restrictions and implement the necessary ARM-specific kernel modifications to handle sink inputs during attention computation.

Changes:

  • Removed x86_64-only restriction for sink input support in SDPA fusion
  • Integrated sink input processing into ARM's ACL (Arm Compute Library) attention kernel
  • Optimized SVE (Scalable Vector Extension) operations by replacing predicated-zeroing with predicated-merging variants
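As background for the last bullet: SVE predicated instructions come in zeroing (`_z`) and merging (`_m`) forms — zeroing writes 0 to inactive lanes, while merging keeps the lanes of a merge operand. A minimal scalar model in Python (the function names are illustrative, not taken from the patch):

```python
# Scalar model of SVE predication modes (illustrative names, not the patch's code).

def predicated_add_z(pred, a, b):
    # Zeroing form (like svadd_f32_z): inactive lanes become 0.
    return [x + y if p else 0.0 for p, x, y in zip(pred, a, b)]

def predicated_add_m(pred, a, b):
    # Merging form (like svadd_f32_m): inactive lanes keep the first operand.
    return [x + y if p else x for p, x, y in zip(pred, a, b)]

pred = [True, True, False, False]  # e.g. svwhilelt_b32(0, 2) on a 4-lane vector
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 10.0, 10.0, 10.0]
print(predicated_add_z(pred, a, b))  # [11.0, 12.0, 0.0, 0.0]
print(predicated_add_m(pred, a, b))  # [11.0, 12.0, 3.0, 4.0]
```

Merging variants can avoid an extra zeroing of inactive lanes when those lanes are overwritten or ignored anyway, which is the kind of saving the optimization above refers to.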

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

- stateful_sdpa_fusion.cpp — removes the preprocessor directive that restricted sink input support to the x86_64 platform
- scaled_attn.cpp — enables the sink input parameter in the ARM ACL kernel and passes it to the softmax computation
- softmax_kernel.hpp — refactors the SVE loop structure, optimizes predicate usage, and adds sink value processing in softmax normalization
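The sink handling described for softmax_kernel.hpp amounts to adding one extra logit to the softmax denominator without emitting a weight for it. A hedged scalar sketch in Python (the function name and structure are illustrative, not the kernel's actual code):

```python
import math

def softmax_with_sink(scores, sink=None):
    # Include the sink in the running max so exp() stays in range.
    m = max(scores)
    if sink is not None:
        m = max(m, sink)
    denom = sum(math.exp(s - m) for s in scores)
    if sink is not None:
        # The sink contributes to the denominator only;
        # no output weight is emitted for it.
        denom += math.exp(sink - m)
    return [math.exp(s - m) / denom for s in scores]

probs = softmax_with_sink([1.0, 2.0, 3.0], sink=0.0)
print(sum(probs))  # < 1.0: the sink absorbs part of the probability mass
```

Without a sink the weights sum to exactly 1; with a sink they sum to less than 1, which is the behavior attention sinks rely on in models such as GPT-OSS.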


pg_u8 = svwhilelt_b8(0, static_cast<int>(inc));
pg_u16 = svwhilelt_b16(0, static_cast<int>(inc));
}
for (; i + vec_len_f16_sve() < size; i += vec_len_f16_sve()) {
Copilot AI commented Jan 13, 2026

The loop condition i + vec_len_f16_sve() < size prevents processing the last vector when exactly aligned. This should be i + vec_len_f16_sve() <= size to match the pattern used in the exp_reduce_sum_f32 function (line with i + svcnth() <= size), ensuring all complete vectors are processed.

Suggested change
for (; i + vec_len_f16_sve() < size; i += vec_len_f16_sve()) {
for (; i + vec_len_f16_sve() <= size; i += vec_len_f16_sve()) {
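The off-by-one Copilot describes is easy to see with a scalar model — assume a vector length of 4 and a buffer of exactly 8 elements (the numbers are illustrative):

```python
def count_full_vectors(size, vl, inclusive):
    # Mimics `for (; i + vl < size; i += vl)` vs `for (; i + vl <= size; i += vl)`.
    i = iters = 0
    while (i + vl <= size) if inclusive else (i + vl < size):
        i += vl
        iters += 1
    return iters

# size = 8 is exactly two full 4-lane vectors.
print(count_full_vectors(8, 4, inclusive=False))  # 1: last aligned vector is skipped
print(count_full_vectors(8, 4, inclusive=True))   # 2: all complete vectors processed
```

With `<`, the perfectly aligned last vector falls through to the slower predicated tail; `<=` keeps it on the full-vector path.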


ov::float16 sum = 0.0f;
if (sink != nullptr) {
max = max > static_cast<const ov::float16>(*sink) ? max : static_cast<const ov::float16>(*sink);
Copilot AI commented Jan 13, 2026

The ternary operator can be replaced with std::max for better readability: max = std::max(max, static_cast<ov::float16>(*sink));

if (dst_precision == ov::element::f32) {
exp_reduce_sum_f32(a, max, len, sum);
if (sink != nullptr) {
sum += std::exp(*sink - max);
Copilot AI commented Jan 13, 2026

This computation is duplicated in both branches of the if-else statement. Consider computing sink_contrib = std::exp(*sink - max) once before the if-else block and adding it to sum in both branches to reduce code duplication.

Comment on lines +1406 to +1423
if (sink != nullptr) {
max = max > static_cast<const ov::float16>(*sink) ? max : static_cast<const ov::float16>(*sink);
}
if (dst_precision == ov::element::f32) {
exp_reduce_sum_f32(a, max, len, sum);
if (sink != nullptr) {
sum += std::exp(*sink - max);
}
ov::float16 scalar = 1.0f / sum;
multiply_scalar(a, static_cast<float*>(a_dst), scalar, len);
// apply causual mask to final result instead of attn_score
if (total_size > len)
memset(static_cast<float*>(a_dst) + len, 0, sizeof(float) * (total_size - len));
} else {
exp_reduce_sum_f32(a, max, len, sum);
if (sink != nullptr) {
sum += std::exp(*sink - max);
}
Copilot AI commented Jan 13, 2026

This computation is duplicated in both branches of the if-else statement. Consider computing sink_contrib = std::exp(*sink - max) once before the if-else block and adding it to sum in both branches to reduce code duplication.

Suggested change
if (sink != nullptr) {
max = max > static_cast<const ov::float16>(*sink) ? max : static_cast<const ov::float16>(*sink);
}
if (dst_precision == ov::element::f32) {
exp_reduce_sum_f32(a, max, len, sum);
if (sink != nullptr) {
sum += std::exp(*sink - max);
}
ov::float16 scalar = 1.0f / sum;
multiply_scalar(a, static_cast<float*>(a_dst), scalar, len);
// apply causual mask to final result instead of attn_score
if (total_size > len)
memset(static_cast<float*>(a_dst) + len, 0, sizeof(float) * (total_size - len));
} else {
exp_reduce_sum_f32(a, max, len, sum);
if (sink != nullptr) {
sum += std::exp(*sink - max);
}
ov::float16 sink_contrib = 0.0f;
if (sink != nullptr) {
max = max > static_cast<const ov::float16>(*sink) ? max : static_cast<const ov::float16>(*sink);
sink_contrib = std::exp(*sink - max);
}
exp_reduce_sum_f32(a, max, len, sum);
if (sink != nullptr) {
sum += sink_contrib;
}
if (dst_precision == ov::element::f32) {
ov::float16 scalar = 1.0f / sum;
multiply_scalar(a, static_cast<float*>(a_dst), scalar, len);
// apply causual mask to final result instead of attn_score
if (total_size > len)
memset(static_cast<float*>(a_dst) + len, 0, sizeof(float) * (total_size - len));
} else {

Contributor Author commented:

@maxnick, @alvoron — the changes suggested by Copilot have been incorporated. Kindly review the PR.

@@ -1411,15 +1403,20 @@ inline void attn_softmax_kernel<ov::float16>(ov::float16* a,
}

ov::float16 sum = 0.0f;
if (sink != nullptr) {
max = std::max(max, static_cast<const ov::float16>(*sink));
Contributor comment:
sink is fp32, so max could be inf after casting to fp16. Is it safe to pass inf as max into exp_reduce_sum_f32?
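To make the concern concrete: float16's largest finite value is 65504, so a perfectly ordinary float32 sink can overflow to inf when narrowed, and `inf - inf` is NaN inside the exp. A stdlib-only Python sketch (the cast is modeled by a simple overflow check, which is a simplification of real float16 rounding):

```python
import math

FP16_MAX = 65504.0  # largest finite float16 value

def to_fp16_model(x):
    # Simplified model of a float32 -> float16 cast: overflow becomes inf.
    return math.copysign(math.inf, x) if abs(x) > FP16_MAX else x

sink32 = 7.0e4                  # perfectly finite in fp32
sink16 = to_fp16_model(sink32)  # inf after narrowing
max_val = sink16                # the running max picks up the inf

print(math.exp(1.0 - max_val))  # 0.0 for every finite score: all weights collapse
print(sink16 - max_val)         # nan: inf - inf inside exp(sink - max)
```

This is why narrowing the fp32 sink into the fp16 max needs care: a saturated max silently zeroes every exp term and the sink's own term becomes NaN.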

Contributor reply:

It's resolved now with the latest rebase.

@abhijain1204fujitsu abhijain1204fujitsu force-pushed the add-support-for-sink-input-GPT-OSS-ARM branch 2 times, most recently from 8dcdae9 to a1b7ad3 Compare January 27, 2026 03:31
@abhijain1204fujitsu (Contributor Author) commented:

Hi @alvoron, the PR has been rebased to resolve conflicts. Kindly complete your review and support the merge.

@alvoron (Contributor) commented Jan 27, 2026

build_jenkins

@maxnick maxnick added this to the 2026.1 milestone Jan 27, 2026
@alvoron (Contributor) commented Jan 28, 2026

@abhijain1204fujitsu could you please rebase one more time? If you can add me as a collaborator to your fork, I can rebase your branches myself.

@abhijain1204fujitsu abhijain1204fujitsu force-pushed the add-support-for-sink-input-GPT-OSS-ARM branch from a1b7ad3 to c9934c9 Compare January 30, 2026 06:43
@ashwins990 (Contributor) commented:

> @abhijain1204fujitsu could you please rebase one more time? If it's possible to add me as collaborator to your fork, I can do rebase of your branches by myself.
I have rebased it. Please check. Thanks !

@alvoron alvoron added this pull request to the merge queue Feb 4, 2026
Merged via the queue into openvinotoolkit:master with commit f2dbee2 Feb 4, 2026
221 of 223 checks passed
insoow pushed a commit to insoow/openvino that referenced this pull request Feb 9, 2026
Co-authored-by: Ashwin <Ashwin.Sekhar@fujitsu.com>
Naseer-010 pushed a commit to Naseer-010/openvino that referenced this pull request Feb 18, 2026

Labels

- category: CPU (OpenVINO CPU plugin)
- ExternalPR (External contributor)
- platform: arm (OpenVINO on ARM / ARM64)

6 participants