[SYCLTLA] rebase FA2 bwd to latest version #2756
Conversation
Pull request overview
This PR rebases the Flash Attention 2 backward pass implementation to the latest version, introducing significant refactoring to use newer SYCLTLA APIs and simplify the codebase.
Changes:
- Replaced older MMA atom definitions with the simplified `XE_DPAS_TT` architecture
- Removed extensive tile shape static assertions and manual TiledCopy definitions
- Refactored GEMM operations into reusable kernel functions with prefetching support
- Simplified tensor layouts by removing trailing `_1{}` dimensions throughout (see the sketch below)
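As a rough illustration of the last point, here is a minimal CuTe sketch (not from this PR; the shapes are hypothetical and it assumes the CUTLASS CuTe headers) of what dropping a trailing `_1{}` mode looks like:

```cpp
#include <cute/tensor.hpp>  // CuTe, from the CUTLASS repository

using namespace cute;

int main() {
  // Before: a rank-3 shape whose trailing mode is a compile-time 1.
  // It carries no data but forces rank-3 coordinates everywhere
  // the layout is consumed.
  auto old_shape = make_shape(Int<8>{}, Int<16>{}, _1{});

  // After: the equivalent rank-2 shape; partitioning and copy code
  // that consume it become simpler.
  auto new_shape = make_shape(Int<8>{}, Int<16>{});

  print(size(old_shape)); print("\n");  // _128
  print(size(new_shape)); print("\n");  // _128 -- same element count
  return 0;
}
```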
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/ATen/native/transformers/xpu/flash_attn/sycltla/mha_bwd.h | Updated MMA atom architecture, simplified tile shapes, removed manual TiledCopy definitions, and reordered Param constructor initialization |
| src/ATen/native/transformers/xpu/flash_attn/sycltla/mha_bwd.cpp | Major refactoring: replaced specialized GEMM functions with unified kernels, simplified tensor layouts, updated layout computation logic, changed empty tensor to zeros, and adjusted atom layout constants |
```diff
  CUTLASS_PRAGMA_UNROLL
- for (int mi = 0; mi < size<0>(rdO_2d); ++mi) {
+ for (int mi = 0; mi < NumValperCol; ++mi) {
    float accum = 0.0f;
```
Copilot AI (Jan 21, 2026):
Variable shadowing issue: `accum` is declared inside the loop at line 150, but a variable of the same name is already declared in the outer scope at line 147 (before the loop). The outer declaration on line 147 is never used, and the inner declaration on line 150 shadows it. Remove the unused outer declaration, or turn the inner declaration into an assignment as in the suggestion below.
Suggested change:
```diff
- float accum = 0.0f;
+ accum = 0.0f;
```
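For illustration, a self-contained sketch of the shadowing pattern and the suggested fix (the variable names mirror the diff; the loop bound is arbitrary):

```cpp
#include <cstdio>

int main() {
  // Pattern flagged above: the inner declaration shadows the outer one,
  // so the outer `accum` is never used.
  float accum = 0.0f;              // outer declaration (never read)
  for (int mi = 0; mi < 4; ++mi) {
    float accum = 0.0f;            // shadows the outer `accum`
    accum += static_cast<float>(mi);
  }

  // Suggested fix: reuse the outer variable, resetting it per iteration.
  for (int mi = 0; mi < 4; ++mi) {
    accum = 0.0f;                  // assignment instead of re-declaration
    accum += static_cast<float>(mi);
    std::printf("mi=%d accum=%.1f\n", mi, accum);
  }
  return 0;
}
```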
```diff
- int y = m_offset + get<0>(rC_2d(m, n)) + diagonal_offset;
+ int y = m_offset + get<1>(rC_2d(m, n)) + sg_local_id + diagonal_offset;
+ int x = n_offset + get<0>(rC_2d(m, n));
  if (x > y) {
```
Copilot AI (Jan 21, 2026):
The coordinate calculation has been swapped (`x` and `y` switched) compared to the original implementation. The variable names are swapped but the comparison `if (x > y)` remains the same, which effectively inverts the mask logic. Ensure this change is intentional and correctly implements the causal masking for the transposed layout.
Suggested change:
```diff
- if (x > y) {
+ if (y > x) {
```
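To see why swapping the coordinates without swapping the comparison inverts the mask, here is a self-contained sketch with toy sizes (the row/column interpretation is an assumption for illustration):

```cpp
#include <cstdio>

int main() {
  constexpr int kRows = 4, kCols = 4;  // toy score tile
  for (int y = 0; y < kRows; ++y) {    // y: query (row) position
    for (int x = 0; x < kCols; ++x) {  // x: key (column) position
      // Causal mask: a query must not attend to later keys,
      // so positions with x > y are masked ('.').
      std::printf("%c ", (x > y) ? '.' : '#');
    }
    std::printf("\n");
  }
  // If the roles of x and y are swapped in the coordinate computation
  // but the test stays `x > y`, the lower triangle is masked instead,
  // which is exactly the inversion the comment warns about.
  return 0;
}
```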
```diff
- constexpr int AtomLayoutMSdP = 4;
- constexpr int AtomLayoutNdKV = 4;
+ constexpr int AtomLayoutMSdP = 2;
+ constexpr int AtomLayoutNdKV = 2;
```
Copilot AI (Jan 21, 2026):
The `AtomLayoutMSdP` and `AtomLayoutNdKV` values have been changed from 4 to 2. This is a significant configuration change that affects the layout of the matrix multiplication operations. Ensure this change has been validated with comprehensive testing for correctness and performance implications.
Suggested change:
```diff
- constexpr int AtomLayoutNdKV = 2;
+ constexpr int AtomLayoutNdKV = 4;
```
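For intuition on what this constant controls, a toy arithmetic sketch (all numbers are hypothetical and not taken from the kernel): the atom layout sets how many MMA atoms are laid side by side along a tile dimension, so halving it doubles the iterations each atom needs to cover the tile:

```cpp
#include <cstdio>

int main() {
  // Hypothetical tile sizes, for illustration only.
  constexpr int kTileM        = 64;  // workgroup tile rows
  constexpr int kAtomM        = 8;   // rows covered by one DPAS atom
  constexpr int kAtomLayoutOld = 4;  // old AtomLayoutMSdP
  constexpr int kAtomLayoutNew = 2;  // new AtomLayoutMSdP

  // Iterations each atom performs along M to cover the tile.
  std::printf("AtomLayoutM=4 -> %d iterations per atom\n",
              kTileM / (kAtomM * kAtomLayoutOld));  // 2
  std::printf("AtomLayoutM=2 -> %d iterations per atom\n",
              kTileM / (kAtomM * kAtomLayoutNew));  // 4
  return 0;
}
```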
```diff
  int seqlen_kv_pad = (seqlen_kv + kNPad - 1) / kNPad * kNPad;
  auto tensor_odo = at::empty_like(out, opts.dtype(at::kFloat));
- auto tensor_dqaccum = at::empty(
+ auto tensor_dqaccum = at::zeros(
```
Copilot AI (Jan 21, 2026):
Changed from `at::empty` to `at::zeros`, which initializes all values to zero. This adds initialization overhead that may be unnecessary if every value is overwritten before being read. If the tensor is fully populated before use, consider reverting to `at::empty` for better performance.
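A minimal ATen sketch of the tradeoff (the function and tensor names are hypothetical): `at::zeros` pays for an extra fill kernel, which is only needed if some elements might never be written before they are read, for example rows in a padded region:

```cpp
#include <ATen/ATen.h>

at::Tensor make_dq_accum(int64_t batch, int64_t seqlen_q, int64_t head_dim,
                         const at::TensorOptions& opts) {
  // at::empty: no initialization. Correct only if every element is
  // written before it is read (e.g. by the dQ accumulation kernel).
  // at::zeros: adds a device-wide fill, but guarantees that padded
  // or never-touched elements read back as 0.
  bool fully_overwritten = false;  // assumption for this sketch
  if (fully_overwritten) {
    return at::empty({batch, seqlen_q, head_dim}, opts.dtype(at::kFloat));
  }
  return at::zeros({batch, seqlen_q, head_dim}, opts.dtype(at::kFloat));
}
```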