Fix workgroup barrier deadlock #312
Draft
frost-intel wants to merge 3 commits into vllm-project:main from
Conversation
Signed-off-by: frost-intel <frost.mitchell@intel.com>
e08162f to a05baa8
Author
@YizhouZ Can you review this?
Collaborator
cc @xuechendi
Purpose
Every subgroup in the workgroup should execute the same number of K-loops. The Xe2 FMHA kernel computes k_block0, k_blocks, and k_blocks_causal from seq_coord, which depends on q_offset_sg, a per-subgroup broadcast of the thread's row coordinate. Under causal masking, different subgroups within the same workgroup therefore compute different loop bounds and execute different iteration counts, leaving some subgroups stuck at barrier_wait forever.

As a concrete example, for (seq_q=129, seq_k=463) with causal masking, TileQ=128, and tile_k=64:

SG0 (row 0) -> k_blocks = 6
SG7 (row 112) -> k_blocks = 8

I'm fairly confident that this is the root cause of the hang in the PVC flash_attn kernels. I'm not sure why this hasn't been an issue on BMG. However, I'm not an expert, so I'd welcome any feedback on this solution.
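For illustration only, here is a standalone sketch (not the kernel's actual code) of how causal loop bounds can diverge across subgroups. It assumes 8 subgroups covering the TileQ=128 rows (16 rows per SG) and a simple ceil-divide bound, and it reproduces the 6 vs. 8 counts above:

```cpp
// Illustrative sketch, not the kernel's exact bound computation.
// Assumes 8 subgroups, 16 query rows per subgroup, and a causal bound of
// (seq_k - seq_q) + q_row + rows_per_sg visible keys per subgroup.
#include <cstdio>

int main() {
  const int seq_q = 129, seq_k = 463;
  const int tile_q = 128, tile_k = 64, num_sgs = 8;
  const int rows_per_sg = tile_q / num_sgs;   // 16 rows per subgroup (assumption)
  const int causal_offset = seq_k - seq_q;    // keys visible to query row 0

  for (int sg = 0; sg < num_sgs; ++sg) {
    int q_row = sg * rows_per_sg;                                 // per-SG row coordinate
    int visible_keys = causal_offset + q_row + rows_per_sg;       // last key this SG may touch
    int k_blocks = (visible_keys + tile_k - 1) / tile_k;          // ceil-divide into K blocks
    std::printf("SG%d (row %d) -> k_blocks = %d\n", sg, q_row, k_blocks);
  }
  return 0;
}
// Prints SG0 (row 0) -> k_blocks = 6 ... SG7 (row 112) -> k_blocks = 8,
// so subgroups in the same workgroup would run different K-loop counts.
```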
This fix computes tight per-SG bounds as before, then reduces across the workgroup to obtain uniform bounds for the loop. This change resolves the hang in the xe_2 FMHA chunk-prefill kernel that occurred under causal masking with short variable-length q, and under sliding-window (local) masking. These are the cases currently guarded by SKIP_HANG_KERNEL in tests/flash_attn/test_flash_attn_varlen_func.py.
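As a rough sketch of the idea, assuming a SYCL-style kernel (the actual implementation may use the kernel's own reduction primitives and different variable names), the per-SG bounds can be reduced to workgroup-uniform bounds like this:

```cpp
// Minimal sketch of the approach: keep the tight per-subgroup bounds, then
// agree on workgroup-uniform bounds so every subgroup runs the same number of
// K-loop iterations and all of them reach each barrier. The function name and
// calling context are hypothetical; variable names follow the PR description.
#include <sycl/sycl.hpp>

template <typename Group>
void make_k_bounds_uniform(Group wg, int &k_block0, int &k_blocks) {
  // Earliest start and latest end across all subgroups in the workgroup.
  k_block0 = sycl::reduce_over_group(wg, k_block0, sycl::minimum<int>());
  k_blocks = sycl::reduce_over_group(wg, k_blocks, sycl::maximum<int>());
  // Iterations outside a subgroup's tight bounds are fully masked, so the
  // extra passes cost some redundant work but keep the barriers collective.
}
```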
Test Plan
pytest -v -s tests/flash_attn/

Note: this no longer requires SKIP_HANG_KERNEL=1.
Test Result
All tests pass, no hang.