
Add inductor_flex_attention_bwd operator #940

Open

OmarPavel wants to merge 2 commits into main from export-D95461827

Conversation

@OmarPavel
Contributor

Summary:
Add a TritonBench operator to benchmark the flex attention backward-pass
inductor kernel (triton_tem_fused_flex_attention_backward_zeros_1).

Uses FWD_ONLY=True but manually times the backward pass via
output.backward(dy, retain_graph=True). Compares aten (eager) vs. inductor
(torch.compile). The backward FLOP count uses a 2.5x multiplier (2.0x for
the backward pass plus 0.5x for recompute).

Default config: B=8, H=16, D=128, bf16, requires_grad=True on q/k/v.

Reviewed By: stashuk-olek

Differential Revision: D95461827
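
A minimal, standalone sketch of the backward-only timing and 2.5x FLOP accounting described above, assuming the default shapes (seq_len is illustrative since the description does not pin it). This is not the operator's actual TritonBench code; only the inductor path is shown, and the eager baseline would be timed the same way without torch.compile.

```python
# Sketch of manual backward timing with retain_graph=True (illustrative only).
import torch
from torch.nn.attention.flex_attention import flex_attention

B, H, S, D = 8, 16, 4096, 128            # default config; S=4096 is illustrative
dtype = torch.bfloat16

q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=dtype, requires_grad=True)
           for _ in range(3))

compiled_flex = torch.compile(flex_attention)   # inductor path; eager is analogous
out = compiled_flex(q, k, v)                    # run forward once, keep the graph
dy = torch.randn_like(out)
out.backward(dy, retain_graph=True)             # warm-up backward (triggers compile)
q.grad = k.grad = v.grad = None

def time_backward(iters: int = 10) -> float:
    """Time only the backward pass, re-using the retained forward graph."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        # retain_graph=True lets backward re-run without re-running forward
        out.backward(dy, retain_graph=True)
        q.grad = k.grad = v.grad = None         # drop accumulated grads
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters      # ms per backward pass

ms = time_backward()
# Forward attention FLOPs: two matmuls (QK^T and PV), 2 FLOPs per MAC.
fwd_flops = 4 * B * H * S * S * D
# Backward multiplier from the description: 2.0x backward + 0.5x recompute.
bwd_flops = 2.5 * fwd_flops
print(f"backward: {ms:.3f} ms, {bwd_flops / (ms * 1e-3) / 1e12:.1f} TFLOPS")
```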

Summary (companion forward-pass operator):
Add a TritonBench operator to benchmark the flex attention forward-pass
inductor kernel (triton_tem_fused_flex_attention_0).

Compares aten (eager flex_attention) vs. inductor (torch.compile) with a
causal mask, sweeping seq_len from 128 to 16384. Reports latency, speedup,
and TFLOPS (adjusted for block sparsity).

Default config: B=8, H=16, D=128, bf16.

Reviewed By: stashuk-olek

Differential Revision: D95461825
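
A minimal, standalone sketch of the forward comparison described above (causal block mask, seq_len sweep, eager vs. torch.compile). It does not use the TritonBench harness, and the 0.5 block-sparsity factor is an approximation rather than the operator's exact accounting; the eager baseline may run out of memory at the largest sequence lengths on smaller GPUs.

```python
# Sketch of the causal forward benchmark: eager vs. inductor (illustrative only).
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask
from triton.testing import do_bench

B, H, D = 8, 16, 128                       # default config
dtype = torch.bfloat16

def causal(b, h, q_idx, kv_idx):
    # mask_mod: a query position may only attend to earlier (or equal) keys
    return q_idx >= kv_idx

compiled_flex = torch.compile(flex_attention)       # inductor path

for S in [2 ** i for i in range(7, 15)]:            # seq_len 128 .. 16384
    q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=dtype)
               for _ in range(3))
    # The mask does not depend on batch/head, so broadcast with B=None, H=None.
    block_mask = create_block_mask(causal, None, None, S, S, device="cuda")

    # NOTE: the eager baseline may exhaust GPU memory at large seq_len.
    eager_ms = do_bench(lambda: flex_attention(q, k, v, block_mask=block_mask))
    inductor_ms = do_bench(lambda: compiled_flex(q, k, v, block_mask=block_mask))

    # Two matmuls (QK^T, PV) at 2 FLOPs/MAC; causal keeps roughly half of the
    # score matrix, so scale by ~0.5 as a block-sparsity adjustment.
    flops = 4 * B * H * S * S * D * 0.5
    tflops = flops / (inductor_ms * 1e-3) / 1e12
    print(f"seq_len={S:6d}  eager {eager_ms:8.3f} ms  "
          f"inductor {inductor_ms:8.3f} ms  "
          f"speedup {eager_ms / inductor_ms:5.2f}x  {tflops:7.1f} TFLOPS")
```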

meta-codesync Bot commented Mar 10, 2026

@OmarPavel has exported this pull request. If you are a Meta employee, you can view the originating Diff in D95461827.

