Conversation

@wuxun-zhang

The new kernel implements the method below; key points:

  • the number of work groups is fixed to the total number of XeCores
  • the KV sequence length from all sequences is dynamically split across all work groups
  • each XeCore gets a balanced share of work units
[figure: dynamic split of the KV sequence across XeCores; a sketch of the split logic follows]
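For intuition, a minimal sketch of the balanced split, with illustrative names only (not the PR's actual scheduler implementation):

```cpp
// Sketch, assuming total_kv_blocks is the KV tile count summed over all (batch, head)
// pairs and num_xecores is the fixed number of persistent work groups. Each work group
// claims a contiguous range of KV blocks; per-XeCore load differs by at most one block.
struct KvRange { int begin; int end; };

KvRange kv_range_for_workgroup(int wg_id, int num_xecores, int total_kv_blocks) {
  int per_wg    = total_kv_blocks / num_xecores;  // balanced base share
  int remainder = total_kv_blocks % num_xecores;  // first `remainder` WGs take one extra
  int begin = wg_id * per_wg + (wg_id < remainder ? wg_id : remainder);
  int end   = begin + per_wg + (wg_id < remainder ? 1 : 0);
  return {begin, end};
}
```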

As of now there are two limitations:

  • decode support only (seq_len_qo == 1)
  • batch_size * num_heads_q <= total number of XeCores

@pengzhao-intel

Maybe add the limitations of this algorithm in the code as well, especially the one involving atomics.

Comment on lines +353 to +359
// current kernel only supports decode (seq_len_qo == 1)
if (args.kernel.shape.seq_len_qo > 1) {
return false;
}
// current kernel only supports batch * num_heads_q up to the total XeCore count
if (args.kernel.shape.batch * args.kernel.shape.num_heads_q > args.hw_info.sm_count) {
return false;
}
@wuxun-zhang (Author)

@pengzhao-intel Added checks here in can_implement().

@Antonyvance requested a review from Copilot on November 5, 2025 at 07:29

Copilot AI left a comment


Pull Request Overview

This PR introduces a persistent SDPA (Scaled Dot Product Attention) kernel for decode scenarios that implements dynamic load balancing across XeCores. The key innovation is fixing the number of work groups to match total XeCores and dynamically splitting KV sequence length across all work groups for balanced workload distribution.

Key changes:

  • New persistent tile scheduler (XeFHMAIndividualPersistentTileScheduler) that distributes work evenly across fixed XeCore count
  • New kernel implementation (XeFMHAFwdDynamicSplitKernel) with split-K reduction for partial results (see the sketch after this list)
  • Support infrastructure including atomic operations (atomicSub, atomicLoad) for synchronization
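For context, the split-K reduction merges partial attention results from disjoint KV partitions using the standard flash-decoding rescale-and-accumulate pattern. A minimal sketch under that assumption (illustrative names, not the PR's exact code):

```cpp
#include <algorithm>
#include <cmath>

// Merge two partial attention results computed over disjoint KV partitions.
// m: partial row max, l: partial softmax denominator, o: unnormalized partial output.
void merge_partials(float m1, float l1, const float* o1,
                    float m2, float l2, const float* o2,
                    int n, float& m_out, float& l_out, float* o_out) {
  m_out = std::max(m1, m2);
  float a1 = std::exp(m1 - m_out);  // rescale factor for partition 1
  float a2 = std::exp(m2 - m_out);  // rescale factor for partition 2
  l_out = a1 * l1 + a2 * l2;
  for (int i = 0; i < n; ++i)
    o_out[i] = a1 * o1[i] + a2 * o2[i];  // the final output is o_out[i] / l_out
}
```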

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| include/cutlass/gpu_generics.h | Adds atomic operations (atomicSub, atomicLoad) for synchronization primitives |
| examples/06_bmg_flash_attention/xe_fmha_fwd_runner.hpp | Integrates persistent kernel selection and queries the hardware XeCore count |
| examples/06_bmg_flash_attention/CMakeLists.txt | Adds a build target for persistent kernel testing |
| examples/06_bmg_flash_attention/06_xe_fmha_fwd.cpp | Configures the persistent kernel with appropriate tile sizes and subgroup layouts |
| applications/flash_attention_v2/kernel/xe_tile_scheduler.hpp | Implements the persistent tile scheduler with dynamic work distribution |
| applications/flash_attention_v2/kernel/xe_fhma_fwd_kernel.hpp | Implements the dynamic split-K kernel with partial result reduction |
| applications/flash_attention_v2/collective/xe_fmha_fwd_mainloop.hpp | Updates the mainloop to use the total block count for remainder masking |


// Important: make sure multiple of 16 element for each copy
// this is for storing partial results from different KV partitions
static constexpr int num_elem_per_thead = (size(FragA{}.shape()) + 2 * size(FragARow{}.shape()) + 15) / 16 * 16;

Copilot AI Nov 5, 2025


Corrected spelling of 'thead' to 'thread'.

Suggested change
static constexpr int num_elem_per_thead = (size(FragA{}.shape()) + 2 * size(FragARow{}.shape()) + 15) / 16 * 16;
static constexpr int num_elem_per_thread = (size(FragA{}.shape()) + 2 * size(FragARow{}.shape()) + 15) / 16 * 16;

Comment on lines +552 to +557
int offset = batch_head_id * max_num_partitions * num_elem_per_thead * SGPerWG::value * intel::sg_size
+ partition_id * num_elem_per_thead * SGPerWG::value * intel::sg_size
+ sg_id * intel::sg_size * num_elem_per_thead
+ tid_in_sg * num_elem_per_thead;
Tensor tPartial = make_tensor(params.partial_results_ptr + offset, make_shape(Int<num_elem_per_thead>{}));
Tensor merged_res = make_tensor<ElementA>(Int<num_elem_per_thead>{});

Copilot AI Nov 5, 2025


Variable name 'num_elem_per_thead' uses misspelled 'thead' instead of 'thread'. This should be renamed for consistency.

Suggested change
int offset = batch_head_id * max_num_partitions * num_elem_per_thead * SGPerWG::value * intel::sg_size
+ partition_id * num_elem_per_thead * SGPerWG::value * intel::sg_size
+ sg_id * intel::sg_size * num_elem_per_thead
+ tid_in_sg * num_elem_per_thead;
Tensor tPartial = make_tensor(params.partial_results_ptr + offset, make_shape(Int<num_elem_per_thead>{}));
Tensor merged_res = make_tensor<ElementA>(Int<num_elem_per_thead>{});
int offset = batch_head_id * max_num_partitions * num_elem_per_thread * SGPerWG::value * intel::sg_size
+ partition_id * num_elem_per_thread * SGPerWG::value * intel::sg_size
+ sg_id * intel::sg_size * num_elem_per_thread
+ tid_in_sg * num_elem_per_thread;
Tensor tPartial = make_tensor(params.partial_results_ptr + offset, make_shape(Int<num_elem_per_thread>{}));
Tensor merged_res = make_tensor<ElementA>(Int<num_elem_per_thread>{});
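The linearization above is row-major over (batch_head, partition, subgroup, lane), with num_elem_per_thread contiguous elements per lane. An equivalent hypothetical helper (not in the PR) that makes the layout explicit:

```cpp
// Hypothetical refactor of the offset arithmetic above: flatten the index tuple
// (batch_head, partition, subgroup, lane) row-major, with `elems` values per lane.
inline int partial_offset(int batch_head_id, int partition_id, int sg_id, int tid_in_sg,
                          int max_num_partitions, int sgs_per_wg, int sg_size, int elems) {
  return (((batch_head_id * max_num_partitions + partition_id) * sgs_per_wg + sg_id) * sg_size
          + tid_in_sg) * elems;
}
```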

Comment on lines +595 to +600
int offset = wg_id * max_num_partitions * SGPerWG::value * intel::sg_size * num_elem_per_thead
+ i * SGPerWG::value * intel::sg_size * num_elem_per_thead
+ sg_id * intel::sg_size * num_elem_per_thead
+ tid_in_sg * num_elem_per_thead;
Tensor tPartial = make_tensor(params.partial_results_ptr + offset, make_shape(Int<num_elem_per_thead>{}));
Tensor merged_res = make_tensor<ElementA>(Int<num_elem_per_thead>{});

Copilot AI Nov 5, 2025


Variable name 'num_elem_per_thead' uses misspelled 'thead' instead of 'thread'. This should be renamed for consistency.

Suggested change
int offset = wg_id * max_num_partitions * SGPerWG::value * intel::sg_size * num_elem_per_thead
+ i * SGPerWG::value * intel::sg_size * num_elem_per_thead
+ sg_id * intel::sg_size * num_elem_per_thead
+ tid_in_sg * num_elem_per_thead;
Tensor tPartial = make_tensor(params.partial_results_ptr + offset, make_shape(Int<num_elem_per_thead>{}));
Tensor merged_res = make_tensor<ElementA>(Int<num_elem_per_thead>{});
int offset = wg_id * max_num_partitions * SGPerWG::value * intel::sg_size * num_elem_per_thread
+ i * SGPerWG::value * intel::sg_size * num_elem_per_thread
+ sg_id * intel::sg_size * num_elem_per_thread
+ tid_in_sg * num_elem_per_thread;
Tensor tPartial = make_tensor(params.partial_results_ptr + offset, make_shape(Int<num_elem_per_thread>{}));
Tensor merged_res = make_tensor<ElementA>(Int<num_elem_per_thread>{});

CUTLASS_DEVICE int atomicLoad(int *address) {
int result = 0;
#if defined(__SYCL_DEVICE_ONLY__)
auto atm = sycl::atomic_ref<int, sycl::memory_order::relaxed, sycl::memory_scope::device, sycl::access::address_space::generic_space>(address[0]);

Copilot AI Nov 5, 2025


The atomic_ref is constructed with address[0], which dereferences the pointer. This should be *address for clarity and consistency with standard atomic-operation patterns.

Suggested change
auto atm = sycl::atomic_ref<int, sycl::memory_order::relaxed, sycl::memory_scope::device, sycl::access::address_space::generic_space>(address[0]);
auto atm = sycl::atomic_ref<int, sycl::memory_order::relaxed, sycl::memory_scope::device, sycl::access::address_space::generic_space>(*address);
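For reference, a self-contained sketch of both primitives built on sycl::atomic_ref with the same ordering and scope as above (free-standing names chosen here to avoid clashing with the PR's atomicLoad/atomicSub; host-side fallbacks omitted):

```cpp
#include <sycl/sycl.hpp>

// Relaxed, device-scope atomics over a plain int*, mirroring the additions to
// include/cutlass/gpu_generics.h.
inline int atomic_load_relaxed(int* address) {
  sycl::atomic_ref<int, sycl::memory_order::relaxed, sycl::memory_scope::device,
                   sycl::access::address_space::generic_space> atm(*address);
  return atm.load();
}

inline int atomic_sub_relaxed(int* address, int value) {
  sycl::atomic_ref<int, sycl::memory_order::relaxed, sycl::memory_scope::device,
                   sycl::access::address_space::generic_space> atm(*address);
  return atm.fetch_sub(value);  // returns the value held before the subtraction
}
```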

CUTLASS_PRAGMA_UNROLL
for (int i = 0; i < size(FragARow{}.shape()); ++i) {
merged_res(i + size(FragA{}.shape())) = tA_max(i);
merged_res(i + 1 + size(FragA{}.shape())) = tA_sum(i);

Copilot AI Nov 5, 2025


Indexing logic appears incorrect. For tA_sum, the offset should be size(FragA{}.shape()) + size(FragARow{}.shape()), not size(FragA{}.shape()) + 1. This will cause tA_max and tA_sum values to overlap/overwrite.

Suggested change
merged_res(i + 1 + size(FragA{}.shape())) = tA_sum(i);
merged_res(i + size(FragA{}.shape()) + size(FragARow{}.shape())) = tA_sum(i);
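Under this reading, the packed per-thread buffer is laid out as [FragA | tA_max | tA_sum]. A sketch of the non-overlapping store, mirroring the suggestion with names taken from the hunk:

```cpp
// Store both row statistics after the accumulator fragment, each in its own
// size(FragARow{}.shape())-element region, so max and sum never overlap.
int base = size(FragA{}.shape());
CUTLASS_PRAGMA_UNROLL
for (int i = 0; i < size(FragARow{}.shape()); ++i) {
  merged_res(base + i)                            = tA_max(i);
  merged_res(base + size(FragARow{}.shape()) + i) = tA_sum(i);
}
```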

CUTLASS_PRAGMA_UNROLL
for (int i = 0; i < size(FragARow{}.shape()); ++i) {
tA_max(i) = merged_res(i + size(FragA{}.shape()));
tA_sum(i) = merged_res(i + 1 + size(FragA{}.shape()));

Copilot AI Nov 5, 2025


Indexing logic appears incorrect. This should use offset size(FragA{}.shape()) + size(FragARow{}.shape()) to correctly retrieve tA_sum values, matching the storage layout.

CUTLASS_PRAGMA_UNROLL
for (int i = 0; i < size(FragARow{}.shape()); ++i) {
tA_max_2(i) = merged_res(i + size(FragA{}.shape()));
tA_sum_2(i) = merged_res(i + 1 + size(FragA{}.shape()));

Copilot AI Nov 5, 2025


Indexing logic appears incorrect. This should use offset size(FragA{}.shape()) + size(FragARow{}.shape()) to correctly retrieve tA_sum values, matching the storage layout.
