
Added new API for low precision fp8 attention using FA3#3857

Merged
howardzhang-cv merged 45 commits into main from gh/howardzhang-cv/16/head
Mar 9, 2026

Conversation

howardzhang-cv (Contributor) commented Feb 11, 2026

Stack from ghstack (oldest at bottom):


Summary

  • Added RoPE fusion compile path for FA3 FP8 low-precision attention (fuse_rope=True)
  • New elementary block: fp8_fa3_rope_sdpa — fused RoPE + FP8 quantization + low-precision SDPA
  • New Triton kernel for fused RoPE + QKV quantization with layout transpose ([B,S,H,D] → [B,H,S,D])
  • RoPE fusion method: Custom Inductor backend that traces the FX graph, detects RoPE + SDPA patterns (NeoX half-split and FLUX interleaved formats), and replaces them with fp8_fa3_rope_sdpa custom ops. Falls back to fp8_fa3_sdpa for SDPA nodes without RoPE.
  • Causal mask detection: Pre-flight forward pass identifies HuggingFace-style materialized causal masks so the fusion pass can strip them and use is_causal=True instead.
  • Added compiled model wrapper (_FP8FlashAttentionCompiledWrapper) with @torch._dynamo.disable to prevent re-tracing.
  • Added RoPE SDPA numerical accuracy tests and fuse_rope parametrization on model-level tests.
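At its core, the fusion pass above is a pattern-match-and-replace over a traced graph. The following torch-free sketch illustrates the replacement logic only; the toy `Node` type, `fuse_rope_sdpa` name, and op strings are illustrative stand-ins, not the actual FX API used in the PR:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                       # e.g. "rope", "sdpa", "linear"
    inputs: list = field(default_factory=list)

def fuse_rope_sdpa(graph):
    """Replace rope -> sdpa chains with a single fused node; SDPA nodes
    without RoPE producers fall back to a plain low-precision SDPA node.
    `graph` is a list of Nodes in topological order."""
    fused = []
    consumed = set()
    for node in graph:
        if id(node) in consumed:
            continue
        if node.op == "sdpa" and all(inp.op == "rope" for inp in node.inputs[:2]):
            # q and k come straight from RoPE: emit the fused op and
            # drop the now-dead rope nodes from the output graph
            for inp in node.inputs[:2]:
                consumed.add(id(inp))
            fused = [n for n in fused if id(n) not in consumed]
            fused.append(Node("fp8_fa3_rope_sdpa"))
        elif node.op == "sdpa":
            fused.append(Node("fp8_fa3_sdpa"))
        else:
            fused.append(node)
    return fused
```

The real pass additionally has to unwrap transposes between RoPE and SDPA and distinguish the two RoPE layouts, but the shape of the surgery is the same.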

New Files

  • shared_utils/fusion_utils.py: Shared FX graph fusion pass — RoPE pattern detection, SDPA detection, transpose unwrapping, parameterized graph surgery
  • shared_utils/custom_ops.py: Factory functions to register backend-specific custom ops with register_fake, and helpers to build fusion passes and compile functions
  • fp8_fa3/fusion_pass.py: FA3-specific custom op registration, rope_sdpa_fusion_pass, and compile_with_fp8_fusion entry point
  • quantization/triton_rope_qkv_quantization.py: Fused RoPE + QKV FP8 quantization Triton kernel
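For reference, the two RoPE variants the pass recognizes differ only in how rotation pairs are indexed along the head dimension. A minimal numpy sketch (illustrative function names, not the Triton kernel code):

```python
import numpy as np

def rope_half_split(x, theta):
    # NeoX style: element i is paired with element i + d/2
    d = x.shape[-1]
    x1, x2 = x[..., : d // 2], x[..., d // 2 :]
    c, s = np.cos(theta), np.sin(theta)
    return np.concatenate([x1 * c - x2 * s, x1 * s + x2 * c], axis=-1)

def rope_interleaved(x, theta):
    # FLUX-style interleaved: adjacent elements (2i, 2i+1) form a pair
    x1, x2 = x[..., 0::2], x[..., 1::2]
    c, s = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * c - x2 * s
    out[..., 1::2] = x1 * s + x2 * c
    return out
```

The two layouts agree up to a fixed permutation of the head dimension, which is why a single fused kernel with a layout flag can serve both.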

Modified Files

  • shared_utils/attention.py: Added _fp8_rope_sdpa shared implementation
  • shared_utils/wrapper.py: Added _FP8FlashAttentionCompiledWrapper
  • shared_utils/setup.py: Added compile path routing via compile_fn parameter, moved detect_causal_mask to fusion_utils.py
  • quantization/quantization.py: Added _fp8_rope_sdpa_quantize
  • fp8_fa3/attention.py: Added fp8_fa3_rope_sdpa elementary block
  • fp8_fa3/setup.py: Passes compile_with_fp8_fusion as compile_fn
  • test_fp8_attention.py: Added TestFP8RopeSDPANumericalAccuracy, fuse_rope parametrization on model test

Test Plan

python -m pytest test/prototype/attention/test_fp8_attention.py -v

Example Usage

  from torchao.prototype.attention import (
      AttentionBackend,
      LowPrecisionAttentionConfig,
      apply_low_precision_attention,
  )

  model = MyModel()

  # Compile path with RoPE fusion
  config = LowPrecisionAttentionConfig(
      backend=AttentionBackend.FP8_FA3,
      fuse_rope=True,
  )
  model = apply_low_precision_attention(model, config)

  # Flash activation is handled internally by the wrapper
  output = model(inputs)

Results

Single-Layer Results

Results directly comparing FA3 SDPA versus FA3 fp8 SDPA (including quantization time):

Llama3 Model Results

Results comparing Llama3 model with FA3 SDPA versus Llama3 using the FA3 fp8 wrapper. Uses RoPE fusion.
Perplexity: 6.19 -> 6.24

howardzhang-cv added a commit that referenced this pull request Feb 11, 2026
Summary: Added new folder for low precision attention APIs in torchao/attention

Test Plan: python test/attention/test_fp8_fa3.py

ghstack-source-id: a86eedf
Pull-Request: #3857
pytorch-bot commented Feb 11, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3857

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 94d9200 with merge base aad1018:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 11, 2026
@howardzhang-cv howardzhang-cv added the topic: new feature Use this tag if this PR adds a new feature label Feb 11, 2026
howardzhang-cv (Contributor Author):
Assuming this is going to take some back and forth to land since it's a user-facing change. @drisspg please take a look and let me know if the API looks okay.

original_forward = model.forward

def wrapped_forward(*args, **kwargs):
    with _fp8_fa3_attention_context(config):
Contributor:
Not sure if we talked about this before; what are your thoughts on this vs. following other torchao APIs, e.g.

q = Float8Tensor.from_hp(q)
k = Float8Tensor.from_hp(k)
v = Float8Tensor.from_hp(v)
# dispatch to `_fp8_fa3_sdpa` in Float8Tensor implementation
F.scaled_dot_product_attention(q, k, v, is_causal=True)

?

Contributor Author:

Yeah, ideally we follow other torchao APIs, but it's a bit difficult for attention. This first backend/recipe works well because it is a simple replacement of F.scaled_dot_product_attention, but as we add other features such as RoPE fusion or RoPE + Hadamard it becomes much more difficult, since it becomes model specific (i.e. if there is RoPE followed by SDPA, replace it with a fused RoPE + quantization kernel feeding fp8 SDPA). To future-proof other attention backends and recipes, I think it's better to have it as its own separate API. What are your thoughts?

Contributor:

I thought the fused kernel replacement should happen in Inductor? Or is there a reason why we have to hand-replace these in eager mode?

Contributor Author:

Inductor unfortunately does not fuse RoPE with the quantization kernel

Contributor:

Oh, I mean: is this something that's not possible to do in Inductor, or just something that does not exist right now?

Contributor Author:

Not too sure; I feel like it should be possible in Inductor, but it definitely does not exist right now. Could look into it in the future, but for now I think it would be simpler to move this into prototype and have it be available through that. Will work on moving this over.

Contributor Author:

@jerryzh168 I've gotten the RoPE fusion to work using the Inductor path. I've moved everything to a prototype folder. Due to the way attention works, I think it's better as its own API, since it doesn't fit cleanly with the others for now (especially with RoPE fusion and Hadamard requiring the Inductor path). What do you think?

if q.shape[3] != k.shape[3]:
    raise ValueError(f"Head dim mismatch: {q.shape[3]} vs {k.shape[3]}")

if torch.compiler.is_compiling():
Contributor:

Is there a test comparing the numerics between these two paths?

Contributor Author:

The numerics are exactly the same between the two paths. I tested the runtime, and the Triton implementation was much faster (~100 ms difference on a Llama3 model with 124k sequence length).
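The parity check being asked about can be sketched as follows, with numpy standing in for the real kernels; `rope_then_quantize` and `fused_rope_quantize` are hypothetical names, not functions from this PR:

```python
import numpy as np

F8_MAX = 448.0  # max magnitude representable in float8_e4m3fn

def rope(x, theta):
    # half-split rotary embedding on the last dimension
    d = x.shape[-1]
    x1, x2 = x[..., : d // 2], x[..., d // 2 :]
    c, s = np.cos(theta), np.sin(theta)
    return np.concatenate([x1 * c - x2 * s, x1 * s + x2 * c], axis=-1)

def rope_then_quantize(x, theta):
    # reference path: RoPE kernel followed by a separate quantization kernel
    y = rope(x, theta)
    scale = F8_MAX / np.abs(y).max()
    return np.round(y * scale), scale  # round() stands in for the fp8 cast

def fused_rope_quantize(x, theta):
    # "fused" path: same math performed in one kernel
    y = rope(x, theta)
    scale = F8_MAX / np.abs(y).max()
    return np.round(y * scale), scale
```

Since both paths perform the same floating-point operations in the same order, such a test can assert exactly equal outputs rather than approximate closeness.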


from torchao.prototype.attention import apply_low_precision_attention

model = MyTransformer()
Contributor:

We should specify what kind of syntax is automatically converted to low precision here. F.scaled_dot_product_attention? Something else?

Contributor Author:

Added

)

def fp8_attention_backend(gm, example_inputs):
    """Custom Inductor backend that applies the RoPE + FP8 fusion pass."""
Contributor:

This should specify what kind of user syntax counts as "attention".

Contributor Author:

Added
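For context, a custom torch.compile backend has roughly the following shape. The body here is deliberately a placeholder; in the real pass, only aten scaled_dot_product_attention calls (optionally fed by a RoPE subgraph) count as "attention" and get rewritten:

```python
import torch

def fp8_attention_backend_sketch(gm: torch.fx.GraphModule, example_inputs):
    # A real pass would scan gm.graph here for scaled_dot_product_attention
    # nodes and splice in the fused fp8 custom ops; this sketch just
    # returns the traced graph unchanged.
    for node in gm.graph.nodes:
        pass  # pattern matching / graph surgery would go here
    return gm.forward  # a backend returns a callable for the compiled region

# the backend is plugged in via torch.compile(..., backend=...)
fn = torch.compile(lambda x: x + 1, backend=fp8_attention_backend_sketch)
```

Because the backend receives the whole FX graph, it can see RoPE and SDPA together, which is what makes the cross-op fusion possible at all.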

Different backends have different hardware requirements and capabilities.
"""

FP8_FA3 = "fa3"
Contributor:

make the string match the enum value name

Contributor Author:

Fixed

restore_flash_attention_impl()


def apply_low_precision_attention(
Contributor:

I think this should be explicit that parts of torch.compile are used to do the logic swap.

Contributor Author:

Added; I also added a warning to be even more explicit.

howardzhang-cv added a commit to howardzhang-cv/ao that referenced this pull request Feb 23, 2026
Summary: Added new folder for low precision attention APIs in torchao/prototype/attention

Test Plan: python test/prototype/attention/test_fp8_fa3.py

ghstack-source-id: 25b6a97
Pull-Request: pytorch#3857
howardzhang-cv added a commit to howardzhang-cv/ao that referenced this pull request Feb 28, 2026
Adds the compile path (fuse_rope=True) which compiles the model with a
custom Inductor backend that fuses RoPE + FP8 quantization + SDPA into
optimized kernels via FX graph pattern matching.

Key additions:
- shared_utils/fusion_utils.py: FX graph RoPE/SDPA pattern detection and
  parameterized graph surgery (NeoX + FLUX interleaved RoPE variants)
- shared_utils/custom_ops.py: custom op registration factory with
  register_fake for torch.compile traceability
- fp8_fa3/fusion_pass.py: FA3-specific custom ops and compile helper
- quantization/triton_rope_qkv_quantization.py: fused RoPE + FP8
  quantization Triton kernels with layout transpose
- _FP8FlashAttentionCompiledWrapper with @dynamo.disable boundary
- _fp8_rope_sdpa shared implementation + fp8_fa3_rope_sdpa entry point
- Tests parametrized over fuse_rope={True, False}

ghstack-source-id: 4e11b16
Pull-Request: pytorch#3857
@howardzhang-cv howardzhang-cv requested a review from vkuzo March 2, 2026 19:28
@howardzhang-cv howardzhang-cv added the module: inference quantize_ api inference flow label Mar 6, 2026
@howardzhang-cv howardzhang-cv changed the base branch from gh/howardzhang-cv/16/base to main March 9, 2026 17:30
@howardzhang-cv howardzhang-cv merged commit 2ec82b3 into main Mar 9, 2026
36 of 40 checks passed
@howardzhang-cv howardzhang-cv deleted the gh/howardzhang-cv/16/head branch March 9, 2026 22:03