[llama32_1b] int4-AWQ RMS + Q/K/V GEMV + RoPE multi-launch decode ELF by erwei-xilinx · Pull Request #1633 · Xilinx/mlir-air

erwei-xilinx · 2026-05-29T21:14:41Z

Summary

Adds rms_qkv_int4_rope_multi.py, an int4-AWQ sibling of rms_gemv_rope_multi.py for the Llama-3.2-1B decode pipeline. Q/K/V projections use the packed int4 GEMV that landed in #1632 ([programming_examples] AWQ int4 matvec examples (GEMV + GEMV+R)). RMSNorm and RoPE remain bf16: HF AWQ quantizes only Linear weights, so RoPE flows through the same rotate_half path as in the bf16 model.
Same 6-launch layout and 13 func args as the bf16 sibling, with arg3/arg5/arg7 retyped from bf16 weight matrices to packed [Q|S|Z] uint8 BOs. Reuses kernel_builder.stitching unchanged; adds a small _extract_air_channels helper so the explicit channel decls emitted by the int4 GEMV survive func-body extraction.
Adds compile_mv_int4_bf16() in kernel_builder/external_kernels.py for the standalone test to compile mv_int4_bf16.o on demand.
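For readers unfamiliar with the packed weight format, here is a minimal NumPy sketch of a group-wise int4 dequantizing GEMV over a single [Q|S|Z] uint8 buffer. The PR text only states that the buffer is a packed [Q|S|Z] uint8 BO; everything else here is an assumption for illustration: a group size of 128 along K, low-nibble-first packing, float16 standing in for bf16 scales (NumPy has no bf16), one uint8 zero point per group, and the hypothetical names `pack_qsz` and `gemv_int4_ref`. It is not the kernel from #1632.

```python
import numpy as np

GROUP = 128  # assumed group size along the input (K) dimension


def pack_qsz(q, scales, zeros):
    """Concatenate packed nibbles, per-group scales, and zero points as uint8."""
    # Two uint4 values per byte, low nibble first (assumed ordering).
    packed_q = (q[:, 0::2] | (q[:, 1::2] << 4)).astype(np.uint8)
    # float16 is a stand-in for bf16 scales.
    s_bytes = np.ascontiguousarray(scales, dtype=np.float16).view(np.uint8)
    return np.concatenate(
        [packed_q.ravel(), s_bytes.ravel(), zeros.astype(np.uint8).ravel()]
    )


def gemv_int4_ref(buf, x, m, k):
    """Reference y = W @ x with W dequantized as (q - zero) * scale per group."""
    n_groups = k // GROUP
    q_bytes = m * k // 2
    s_bytes = m * n_groups * 2
    packed_q = buf[:q_bytes].reshape(m, k // 2)
    scales = buf[q_bytes:q_bytes + s_bytes].view(np.float16).reshape(m, n_groups)
    zeros = buf[q_bytes + s_bytes:].reshape(m, n_groups)

    # Unpack the nibbles back into one uint4 value per weight.
    q = np.empty((m, k), dtype=np.uint8)
    q[:, 0::2] = packed_q & 0x0F
    q[:, 1::2] = packed_q >> 4

    # Broadcast each group's scale and zero point across its GROUP columns.
    w = (q.astype(np.float32) - np.repeat(zeros, GROUP, axis=1)) * np.repeat(
        scales.astype(np.float32), GROUP, axis=1
    )
    return w @ x.astype(np.float32)
```

A CPU reference of this shape is the kind of thing a standalone test can compare the NPU output against.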

Results

Standalone validation on NPU2:

module parses (13 args, 6 launches)
compile-only emits a single ELF
compile-and-run matches CPU reference (q_roped / k_roped correlation 0.99996 / 0.99996; tolerances rtol=0.2 atol=0.5 corr=0.99)
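The pass criterion above can be sketched as follows. This is an illustration under assumptions, not the test's actual code: it assumes both the elementwise tolerance and the correlation threshold must hold, and `matches_reference` is a hypothetical name.

```python
import numpy as np


def matches_reference(out, ref, rtol=0.2, atol=0.5, min_corr=0.99):
    """Return (passed, corr) for an NPU output compared against a CPU reference."""
    out = np.asarray(out, dtype=np.float32).ravel()
    ref = np.asarray(ref, dtype=np.float32).ravel()
    # Pearson correlation between the flattened tensors.
    corr = float(np.corrcoef(out, ref)[0, 1])
    # Elementwise check: |out - ref| <= atol + rtol * |ref|.
    close = bool(np.allclose(out, ref, rtol=rtol, atol=atol))
    return close and corr >= min_corr, corr
```

The loose rtol/atol absorb quantization and bf16 rounding error, while the correlation threshold catches outputs that are within tolerance by magnitude but structurally wrong.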

Profiling vs bf16 baseline (100 iters, 30 warmup, NPU2 end-to-end XRT run.start()→run.wait2()):

variant	avg µs	min µs	wQ bytes	wK/V bytes
bf16	724.6	719	8.39 MB	2.10 MB
int4 AWQ	579.7	573	2.20 MB	0.55 MB
speedup	1.25×	1.26×	3.82× less	3.82× less
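The byte counts in the table can be reproduced with simple arithmetic. The sketch below assumes the Llama-3.2-1B projection shapes (hidden size 2048, K/V output width 512) and a per-group overhead of 3 bytes per 128 weights (for example a 2-byte scale plus a 1-byte zero point). The group size and the scale/zero split are assumptions chosen because they match the reported figures; the PR text does not state them.

```python
def bf16_bytes(rows, cols):
    """Size of a dense bf16 weight matrix."""
    return rows * cols * 2


def int4_awq_bytes(rows, cols, group=128, scale_bytes=2, zero_bytes=1):
    """Size of packed int4 weights plus assumed per-group scale and zero point."""
    groups = rows * cols // group
    return rows * cols // 2 + groups * (scale_bytes + zero_bytes)


HIDDEN, KV_DIM = 2048, 512  # Llama-3.2-1B: hidden size, K/V projection width

for name, rows in (("wQ", HIDDEN), ("wK/V", KV_DIM)):
    dense = bf16_bytes(rows, HIDDEN)
    packed = int4_awq_bytes(rows, HIDDEN)
    print(f"{name}: {dense / 1e6:.2f} MB -> {packed / 1e6:.2f} MB ({dense / packed:.2f}x)")
```

Under these assumptions the script reports 8.39 MB to 2.20 MB for wQ and 2.10 MB to 0.55 MB for wK/V, both a 3.82x reduction. The ratio falls short of 4x because of the per-group scale and zero-point overhead.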

Weight DMA volume drops by 3.82×, as expected. The end-to-end speedup is Amdahl-limited because RMSNorm and both RoPE launches are unchanged bf16 work in both variants; a larger win is expected when the same approach is applied to ELF 2 (O + Gate + Up + Down), whose weights dominate the per-layer footprint.
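As a rough reading of the Amdahl argument, one can ask what fraction of the bf16 run time would have to scale with weight bytes to turn a 3.82× reduction into the measured 1.25× overall. The model below is a deliberate simplification and an assumption: it treats the run as one part that speeds up by exactly 3.82× and one part that does not change, and it ignores the extra dequantization work in the int4 GEMV.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the time is accelerated by s."""
    return 1.0 / ((1.0 - p) + p / s)


def implied_fraction(overall, s):
    """Invert Amdahl's law: the fraction p that yields `overall` for a given s."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / s)


overall = 724.6 / 579.7               # measured average-latency ratio
p = implied_fraction(overall, 3.82)   # fraction of bf16 time that scales with weights
ceiling = 1.0 / (1.0 - p)             # limit if that fraction went to zero

print(f"implied weight-bound fraction: {p:.2f}")
print(f"speedup ceiling for this ELF:  {ceiling:.2f}x")
```

Under this model roughly a quarter of the bf16 latency is weight-bound, which is consistent with the expectation that an ELF dominated by weight traffic would benefit more.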

Out of scope (follow-up)

Wiring into kernel_cache / backend_presets / llama32_1b_inference.py (the standalone test currently feeds synthetic uint4 weights).
HF AWQ → packed-BO repacker.
ELF 2 (o_gemv_ffn) int4-AWQ variant — 4 more GEMVs including the two GEMV+R fusions.

Test plan

python3 rms_qkv_int4_rope_multi.py -p → module parses (13 args, 6 launches)
python3 rms_qkv_int4_rope_multi.py --compile-mode compile-only → produces ELF
python3 rms_qkv_int4_rope_multi.py → PASS on NPU2 (correlation 0.99996/0.99996)
bf16 sibling unchanged; no regression to rms_gemv_rope_multi.py

🤖 Generated with Claude Code

Adds rms_qkv_int4_rope_multi.py: an int4-AWQ sibling of rms_gemv_rope_multi.py for the Llama-3.2-1B decode pipeline. Q/K/V projections use the packed int4 GEMV from int4_awq/, while RMSNorm and RoPE remain bf16 (HF AWQ quantizes only Linear weights; RoPE flows through the same rotate_half path as the bf16 model).

Same 6-launch layout and 13 func args as the bf16 sibling, with arg3/5/7 retyped from bf16 weight matrices to packed [Q|S|Z] uint8 BOs. Reuses kernel_builder.stitching unchanged; adds a small _extract_air_channels helper so the explicit channel decls emitted by the int4 GEMV survive func-body extraction.

Also adds compile_mv_int4_bf16() in kernel_builder/external_kernels.py for the standalone test to compile mv_int4_bf16.o on demand.

Validated standalone on NPU2:
- module parses (13 args, 6 launches)
- compile-only emits a single ELF
- compile-and-run matches CPU reference (q_roped/k_roped correlation 0.99996 / 0.99996; tolerances rtol=0.2 atol=0.5 corr=0.99)

Profiling vs bf16 baseline (100 iters, 30 warmup, NPU2):
  bf16  724.6 us avg / 719 us min
  int4  579.7 us avg / 573 us min  (~1.25x end-to-end)

Weight DMA drops 3.82x; end-to-end speedup is Amdahl-limited because RMSNorm + RoPE Q + RoPE K stay bf16 in both variants.

Out of scope (follow-up): wiring into kernel_cache/backend_presets/inference flow and an HF AWQ -> packed-BO repacker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds an int4-AWQ variant of the Llama-3.2-1B decode pipeline that fuses RMSNorm + Q/K/V GEMV (packed int4 weights) + RoPE Q/K into a single 6-launch ELF. RMSNorm and RoPE remain bf16; only the linear projections use the packed int4 GEMV from PR #1632. A small helper is added for compiling the new micro-kernel.

Changes:

New rms_qkv_int4_rope_multi.py mirroring the bf16 rms_gemv_rope_multi.py structure (13 args, 6 launches) with arg3/5/7 retyped to packed [Q|S|Z] uint8 BOs and a _extract_air_channels helper to preserve explicit air.channel @... decls during stitching.
New compile_mv_int4_bf16() in kernel_builder/external_kernels.py for building mv_int4_bf16.o on demand.
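To illustrate why a channel-extraction step is needed: if stitching copies only function bodies, module-level `air.channel @...` declarations are left behind and the stitched module refers to undefined symbols. The sketch below shows one way such declarations could be collected from MLIR text. It is a hypothetical illustration, not the PR's `_extract_air_channels` implementation; the name `collect_channel_decls` and the regex-over-text approach are assumptions.

```python
import re

# Matches module-level declarations such as `air.channel @chan [1, 1]`,
# but not uses such as `air.channel.put @chan[...]`.
_CHANNEL_DECL = re.compile(r"^[ \t]*air\.channel\s+@[\w.$-]+.*$", re.MULTILINE)


def collect_channel_decls(module_text):
    """Return the unique air.channel declaration lines, in source order."""
    seen, decls = set(), []
    for match in _CHANNEL_DECL.finditer(module_text):
        line = match.group(0).strip()
        if line not in seen:
            seen.add(line)
            decls.append(line)
    return decls
```

The collected lines could then be re-emitted at the top of the stitched module, ahead of the combined function body.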

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
`programming_examples/llama32_1b/multi_launch_builder/rms_qkv_int4_rope_multi.py`	New 6-launch int4-AWQ decode module + standalone test.
`programming_examples/llama32_1b/kernel_builder/external_kernels.py`	Adds `compile_mv_int4_bf16` helper for the int4 GEMV micro-kernel.

Copilot AI review requested due to automatic review settings May 29, 2026 21:14

erwei-xilinx requested a review from jgmelber as a code owner May 29, 2026 21:14

Copilot started reviewing on behalf of erwei-xilinx May 29, 2026 21:14

Copilot AI reviewed May 29, 2026

erwei-xilinx and others added 2 commits May 29, 2026 21:09

[llama32_1b] black format rms_qkv_int4_rope_multi.py

4e1ff8f

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into int4-rms-qkv-rope-elf1

c932e8d

# Conflicts:
#	programming_examples/llama32_1b/kernel_builder/external_kernels.py

erwei-xilinx added this pull request to the merge queue May 30, 2026

Merged via the queue into Xilinx:main with commit 306226e May 30, 2026
27 checks passed

erwei-xilinx deleted the int4-rms-qkv-rope-elf1 branch May 30, 2026 05:38

erwei-xilinx mentioned this pull request May 30, 2026

[llama32_1b] int4-AWQ end-to-end decode with HF AutoAWQ checkpoint #1638

Open

5 tasks

erwei-xilinx merged 3 commits into Xilinx:main from erwei-xilinx:int4-rms-qkv-rope-elf1
