[Arm] Enable SDPA Fusion with Sink input on ARM CPUs#33566
Conversation
@alvoron, could you please review?
Pull request overview
This PR enables SDPA (Scaled Dot Product Attention) fusion with Sink input support for ARM CPUs, extending functionality previously limited to x86_64 platforms. The changes remove platform-specific restrictions and implement the necessary ARM-specific kernel modifications to handle sink inputs during attention computation.
Changes:
- Removed x86_64-only restriction for sink input support in SDPA fusion
- Integrated sink input processing into ARM's ACL (Arm Compute Library) attention kernel
- Optimized SVE (Scalable Vector Extension) operations by replacing predicated-zeroing with predicated-merging variants
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `stateful_sdpa_fusion.cpp` | Removes the preprocessor directive that restricted sink input support to x86_64 |
| `scaled_attn.cpp` | Enables the sink input parameter in the ARM ACL kernel and passes it to the softmax computation |
| `softmax_kernel.hpp` | Refactors the SVE loop structure, optimizes predicate usage, and adds sink value processing in softmax normalization |
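As an aside on the predicate optimization above, the difference between predicated-zeroing (`_z`) and predicated-merging (`_m`) SVE operations can be emulated per lane in scalar code. This is a hypothetical sketch of the semantics only; `add_z` and `add_m` are stand-ins, not the ACLE intrinsics:

```cpp
// Per-lane emulation of SVE predication semantics (hypothetical stand-ins):
//   _z (zeroing): lanes where the predicate is false become 0 in the result
//   _m (merging): lanes where the predicate is false keep the first operand
float add_z(bool active, float a, float b) { return active ? a + b : 0.0f; }
float add_m(bool active, float a, float b) { return active ? a + b : a; }
```

Merging variants avoid writing zeros over lanes that already hold valid partial results, which is why they can replace zeroing variants in loops that accumulate in place.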
```cpp
        pg_u8 = svwhilelt_b8(0, static_cast<int>(inc));
        pg_u16 = svwhilelt_b16(0, static_cast<int>(inc));
    }
    for (; i + vec_len_f16_sve() < size; i += vec_len_f16_sve()) {
```
The loop condition `i + vec_len_f16_sve() < size` prevents processing the last full vector when `size` is an exact multiple of the vector length. It should be `i + vec_len_f16_sve() <= size`, matching the pattern used in `exp_reduce_sum_f32` (`i + svcnth() <= size`), so that all complete vectors are processed.
Suggested change:

```diff
-    for (; i + vec_len_f16_sve() < size; i += vec_len_f16_sve()) {
+    for (; i + vec_len_f16_sve() <= size; i += vec_len_f16_sve()) {
```
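The off-by-one can be checked with a scalar model of the loop bound; `kVecLen` is a stand-in for `vec_len_f16_sve()`:

```cpp
#include <cstddef>

// Stand-in for vec_len_f16_sve(); any fixed vector length shows the issue.
constexpr std::size_t kVecLen = 8;

// Buggy bound: '<' skips the final full vector when size is a multiple of kVecLen.
std::size_t full_vectors_lt(std::size_t size) {
    std::size_t n = 0;
    for (std::size_t i = 0; i + kVecLen < size; i += kVecLen) ++n;
    return n;
}

// Fixed bound: '<=' processes every complete vector.
std::size_t full_vectors_le(std::size_t size) {
    std::size_t n = 0;
    for (std::size_t i = 0; i + kVecLen <= size; i += kVecLen) ++n;
    return n;
}
```

With `size == 16` the `<` bound runs one iteration and leaves a full vector to the tail path; the `<=` bound runs two.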
```cpp
    ov::float16 sum = 0.0f;
    if (sink != nullptr) {
        max = max > static_cast<const ov::float16>(*sink) ? max : static_cast<const ov::float16>(*sink);
```
The ternary operator can be replaced with `std::max` for better readability: `max = std::max(max, static_cast<ov::float16>(*sink));`
```cpp
    if (dst_precision == ov::element::f32) {
        exp_reduce_sum_f32(a, max, len, sum);
        if (sink != nullptr) {
            sum += std::exp(*sink - max);
```
This computation is duplicated in both branches of the if-else statement. Consider computing `sink_contrib = std::exp(*sink - max)` once before the if-else block and adding it to `sum` in both branches to reduce code duplication.
```cpp
    if (sink != nullptr) {
        max = max > static_cast<const ov::float16>(*sink) ? max : static_cast<const ov::float16>(*sink);
    }
    if (dst_precision == ov::element::f32) {
        exp_reduce_sum_f32(a, max, len, sum);
        if (sink != nullptr) {
            sum += std::exp(*sink - max);
        }
        ov::float16 scalar = 1.0f / sum;
        multiply_scalar(a, static_cast<float*>(a_dst), scalar, len);
        // apply causal mask to final result instead of attn_score
        if (total_size > len)
            memset(static_cast<float*>(a_dst) + len, 0, sizeof(float) * (total_size - len));
    } else {
        exp_reduce_sum_f32(a, max, len, sum);
        if (sink != nullptr) {
            sum += std::exp(*sink - max);
        }
```
Suggested change:

```diff
-    if (sink != nullptr) {
-        max = max > static_cast<const ov::float16>(*sink) ? max : static_cast<const ov::float16>(*sink);
-    }
-    if (dst_precision == ov::element::f32) {
-        exp_reduce_sum_f32(a, max, len, sum);
-        if (sink != nullptr) {
-            sum += std::exp(*sink - max);
-        }
-        ov::float16 scalar = 1.0f / sum;
-        multiply_scalar(a, static_cast<float*>(a_dst), scalar, len);
-        // apply causal mask to final result instead of attn_score
-        if (total_size > len)
-            memset(static_cast<float*>(a_dst) + len, 0, sizeof(float) * (total_size - len));
-    } else {
-        exp_reduce_sum_f32(a, max, len, sum);
-        if (sink != nullptr) {
-            sum += std::exp(*sink - max);
-        }
+    ov::float16 sink_contrib = 0.0f;
+    if (sink != nullptr) {
+        max = max > static_cast<const ov::float16>(*sink) ? max : static_cast<const ov::float16>(*sink);
+        sink_contrib = std::exp(*sink - max);
+    }
+    exp_reduce_sum_f32(a, max, len, sum);
+    if (sink != nullptr) {
+        sum += sink_contrib;
+    }
+    if (dst_precision == ov::element::f32) {
+        ov::float16 scalar = 1.0f / sum;
+        multiply_scalar(a, static_cast<float*>(a_dst), scalar, len);
+        // apply causal mask to final result instead of attn_score
+        if (total_size > len)
+            memset(static_cast<float*>(a_dst) + len, 0, sizeof(float) * (total_size - len));
+    } else {
```
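For reference, the sink-aware softmax can be sketched as scalar code. This is a hypothetical model of the kernel's logic with `sink_contrib` hoisted as suggested; `softmax_with_sink` is illustrative, not the kernel's API:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar sketch of softmax with an attention-sink logit: the sink joins the
// running max and the normalizing sum but produces no output element of its
// own. sink_contrib is computed once, mirroring the suggested de-duplication.
std::vector<float> softmax_with_sink(const std::vector<float>& scores, const float* sink) {
    float mx = *std::max_element(scores.begin(), scores.end());
    if (sink != nullptr) {
        mx = std::max(mx, *sink);
    }
    const float sink_contrib = (sink != nullptr) ? std::exp(*sink - mx) : 0.0f;
    float sum = sink_contrib;
    std::vector<float> out(scores.size());
    for (std::size_t i = 0; i < scores.size(); ++i) {
        out[i] = std::exp(scores[i] - mx);
        sum += out[i];
    }
    for (float& v : out) {
        v /= sum;  // with a sink, the outputs sum to less than 1
    }
    return out;
}
```

With two equal scores and a sink of the same value, each output is 1/3 rather than 1/2, since the sink absorbs one share of the probability mass.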
```diff
@@ -1411,15 +1403,20 @@ inline void attn_softmax_kernel<ov::float16>(ov::float16* a,
     }

     ov::float16 sum = 0.0f;
     if (sink != nullptr) {
         max = std::max(max, static_cast<const ov::float16>(*sink));
```
`sink` is fp32, so `max` could be inf after casting to fp16. Is it safe to pass inf as `max` into `exp_reduce_sum_f32`?
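The overflow concern can be modeled in scalar code; `emulate_fp16_cast` is a hypothetical stand-in for the fp32-to-fp16 conversion (fp16's largest finite value is 65504; `ov::float16` itself is not used here):

```cpp
#include <cmath>
#include <limits>

// Emulated fp32 -> fp16 cast: values above fp16's largest finite value
// (65504) overflow to +inf. Rounding is ignored; this only shows the
// overflow path that the comment above is asking about.
float emulate_fp16_cast(float x) {
    const float kFp16Max = 65504.0f;
    return (x > kFp16Max) ? std::numeric_limits<float>::infinity() : x;
}

// One term of the softmax numerator: if max_val is +inf, exp(score - inf)
// underflows to exactly 0 for every finite score.
float exp_term(float score, float max_val) {
    return std::exp(score - max_val);
}
```

If `max` saturates to +inf, every `exp(score - max)` becomes exactly 0, so the softmax denominator is 0 and the subsequent division yields NaN.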
It's resolved now with the latest rebasing.
Force-pushed from 8dcdae9 to a1b7ad3.
Hi @alvoron,

build_jenkins
@abhijain1204fujitsu could you please rebase one more time?
Force-pushed from a1b7ad3 to c9934c9.
I have rebased it. Please check. Thanks!
f2dbee2

[Arm] Enable SDPA Fusion with Sink input on ARM CPUs (#33566)

This PR enables SDPA fusion on ARM when attention contains the additional input **Sink**.

Checked and validated the operation with the GPT-OSS 20B model. With this change, enhanced performance during LLM inference has been observed; refer to the table below.

<img width="604" height="120" alt="image" src="https://github.com/user-attachments/assets/c77b7908-4c8a-4a94-9feb-7f887bb9697b" />

** All values are in tokens per second (decoding throughput).

Machine: Graviton4, single socket, 96 cores.

Kindly review the PR and share feedback if any. Thanks!

This work is contributed by @ashwins990 & @abhijain1204fujitsu

Co-authored-by: Ashwin <Ashwin.Sekhar@fujitsu.com>