Added benchmark for LLaMA 3 model for attention tests #3930
howardzhang-cv merged 34 commits into main from
Conversation
[ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3930
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit b6f072f with merge base 42bcdc4. This comment was automatically generated by Dr. CI and updates every 15 minutes.
def load_wikitext2_tokens(tokenizer, seq_len: int):
We don't have to define a wikitext tokenizer here. Instead, we can run lm-eval directly.
Thanks! I changed it to use this instead.
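For reference, a typical lm-eval invocation for WikiText perplexity might look like the following. This is a sketch, not taken from this PR; the model id matches the benchmark's default, but the exact flags and task name are assumptions about lm-eval's CLI:

```shell
# Hypothetical lm-eval invocation; evaluates word-level perplexity on WikiText.
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16 \
  --tasks wikitext \
  --batch_size 8
```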
def compute_perplexity(model, chunks, device: str, backend_name: str) -> float:
This is also not needed. See the above comment.
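For context, the perplexity computation the helper performs reduces to exponentiating the average per-token cross-entropy loss. A minimal stdlib-only sketch (hypothetical helper, not the PR's actual code):

```python
import math

def perplexity_from_losses(token_losses):
    # Perplexity = exp(mean per-token negative log-likelihood).
    # token_losses holds the cross-entropy loss for each evaluated chunk/token.
    avg_loss = sum(token_losses) / len(token_losses)
    return math.exp(avg_loss)
```

A loss of 0 everywhere yields a perplexity of exactly 1 (the model is certain), and higher average loss yields higher perplexity.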
def benchmark_runtime(
Q. Can we compute forward pass latency using vLLM directly, similar to e2e?
Unfortunately not. Unlike the other quantization APIs in TorchAO, the low precision attention path requires replacing F.scaled_dot_product_attention with a specific attention backend capable of low precision attention (e.g. FA3/4). So we need to ensure that our model calls F.SDPA.
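For context, a forward-pass latency measurement like the one discussed typically follows a warmup-then-measure pattern. Below is a stdlib-only sketch of that pattern (a hypothetical helper, not the PR's code; a real GPU benchmark would additionally need to synchronize the device, e.g. with torch.cuda.synchronize(), around each timed call):

```python
import statistics
import time

def median_latency_ms(fn, warmup=3, iters=10):
    # Warm up to exclude one-time costs (compilation, caches, lazy init).
    for _ in range(warmup):
        fn()
    # Time several iterations and report the median, which is robust to outliers.
    times_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1e3)
    return statistics.median(times_ms)
```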
RANDOM_SEED = 42
DEFAULT_MODEL_ID = "meta-llama/Llama-3.1-8B"
What does DEFAULT_MODEL_ID do? Should it be used directly as the default for an args flag (--model_id)?
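One way to resolve this, as the question suggests, is to wire the constant in as the argparse default. A minimal sketch (hypothetical, assuming the script parses its CLI with argparse):

```python
import argparse

DEFAULT_MODEL_ID = "meta-llama/Llama-3.1-8B"

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="FP8 attention benchmark")
    # The constant becomes the default, so --model_id is optional on the CLI.
    parser.add_argument(
        "--model_id",
        type=str,
        default=DEFAULT_MODEL_ID,
        help="Hugging Face model id to benchmark",
    )
    return parser.parse_args(argv)
```

Running the script without --model_id then falls back to the LLaMA 3.1 8B default, while still allowing an override.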
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 528c2ec Pull-Request: pytorch#3930
Benchmark script for evaluating FP8 attention on LLaMA 3 models. Measures perplexity on WikiText-2 and runtime performance across sequence lengths with and without RoPE fusion. ghstack-source-id: 859f523 Pull-Request: pytorch#3930
namgyu-youn left a comment
LGTM, thanks for addressing all the comments!
Stack from ghstack (oldest at bottom):
Summary
Example Run
python benchmarks/prototype/attention/eval_llama3_model.py --baseline fa3 --test fa3_fp8