
Conversation

@LuFinch (Contributor) commented Jan 21, 2026

No description provided.

Copilot AI review requested due to automatic review settings January 21, 2026 03:36
Copilot AI left a comment

Pull request overview

This PR rebases the Flash Attention 2 backward pass implementation to the latest version, introducing significant refactoring to use newer SYCLTLA APIs and simplify the codebase.

Changes:

  • Replaced older MMA atom definitions with simplified XE_DPAS_TT architecture
  • Removed extensive tile shape static assertions and manual TiledCopy definitions
  • Refactored GEMM operations into reusable kernel functions with prefetching support
  • Simplified tensor layouts by removing trailing _1{} dimensions throughout

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/ATen/native/transformers/xpu/flash_attn/sycltla/mha_bwd.h | Updated MMA atom architecture, simplified tile shapes, removed manual TiledCopy definitions, and reordered Param constructor initialization |
| src/ATen/native/transformers/xpu/flash_attn/sycltla/mha_bwd.cpp | Major refactoring: replaced specialized GEMM functions with unified kernels, simplified tensor layouts, updated layout computation logic, changed empty tensor to zeros, and adjusted atom layout constants |


```cpp
CUTLASS_PRAGMA_UNROLL
for (int mi = 0; mi < size<0>(rdO_2d); ++mi) {
for (int mi = 0; mi < NumValperCol; ++mi) {
float accum = 0.0f;
```
Copilot AI Jan 21, 2026
Variable shadowing issue: accum is already declared in the outer scope before the loop, but that outer declaration is never used; the new declaration inside the loop shadows it. Remove the redundant inner declaration (or the unused outer one) so only a single accum exists.

Suggested change

```diff
- float accum = 0.0f;
+ accum = 0.0f;
```

```cpp
int y = m_offset + get<0>(rC_2d(m, n)) + diagonal_offset;
int y = m_offset + get<1>(rC_2d(m, n)) + sg_local_id + diagonal_offset;
int x = n_offset + get<0>(rC_2d(m, n));
if (x > y) {
```
Copilot AI Jan 21, 2026
The coordinate calculation has been swapped relative to the original implementation: the roles of x and y are exchanged, but the comparison if (x > y) is unchanged, which effectively inverts the mask logic. Ensure this change is intentional and correctly implements causal masking for the transposed layout.

Suggested change

```diff
- if (x > y) {
+ if (y > x) {
```

```cpp
constexpr int AtomLayoutMSdP = 4;
constexpr int AtomLayoutNdKV = 4;
constexpr int AtomLayoutMSdP = 2;
constexpr int AtomLayoutNdKV = 2;
```
Copilot AI Jan 21, 2026

The AtomLayoutMSdP and AtomLayoutNdKV values have been changed from 4 to 2. This is a significant configuration change that affects the layout of matrix multiplication operations. Ensure this change has been validated with comprehensive testing for correctness and performance implications.

Suggested change

```diff
- constexpr int AtomLayoutNdKV = 2;
+ constexpr int AtomLayoutNdKV = 4;
```

```cpp
int seqlen_kv_pad = (seqlen_kv + kNPad - 1) / kNPad * kNPad;
auto tensor_odo = at::empty_like(out, opts.dtype(at::kFloat));
auto tensor_dqaccum = at::empty(
auto tensor_dqaccum = at::zeros(
```
Copilot AI Jan 21, 2026
Changed from at::empty to at::zeros, which initializes all values to zero. This adds initialization overhead that may be unnecessary if all values will be overwritten. If the tensor is fully populated before use, consider reverting to at::empty for better performance.

@github-actions

Performance outliers, please check!

  • 🔴 [-1, 80%), should be regression

| Category | Model | Target vs. Baseline [Eager] | Target vs. Baseline [Inductor] |
| --- | --- | --- | --- |
| torchbench_bfloat16_training | mnasnet1_0 | 0.948411 | 0.686298 |

@github-actions

Performance outliers, please check!

  • 🔴 [-1, 80%), should be regression

| Category | Model | Target vs. Baseline [Eager] | Target vs. Baseline [Inductor] |
| --- | --- | --- | --- |
| torchbench_bfloat16_training | resnet18 | 0.923192 | 0.774594 |

@github-actions

Performance outliers, please check!

  • 🟡 [80%, 90%), may be fluctuations

| Category | Model | Target vs. Baseline [Eager] | Target vs. Baseline [Inductor] |
| --- | --- | --- | --- |
| torchbench_bfloat16_training | resnet18 | 0.92665 | 0.838444 |
