[llama32_1b] int4-AWQ RMS + Q/K/V GEMV + RoPE multi-launch decode ELF #1633
Merged
Adds rms_qkv_int4_rope_multi.py: an int4-AWQ sibling of
rms_gemv_rope_multi.py for the Llama-3.2-1B decode pipeline. Q/K/V
projections use the packed int4 GEMV from int4_awq/, while RMSNorm and
RoPE remain bf16 (HF AWQ quantizes only Linear weights; RoPE flows
through the same rotate_half path as the bf16 model).
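As a rough illustration of what a group-quantized int4 GEMV computes, the sketch below dequantizes each weight as `(q - z) * s` with one scale and zero point per group, then accumulates against the input vector. It is a plain-Python reference under stated assumptions, not the NPU kernel: the nibble order (low nibble first), the tiny group size, and the names `pack_nibbles`, `unpack_nibbles`, and `gemv_int4` are all hypothetical.

```python
# Hypothetical CPU reference for a group-quantized int4 GEMV.
# Assumptions (not taken from the PR): low nibble first, per-row/per-group
# scale and zero point, and a toy group size of 4.
GROUP = 4

def pack_nibbles(values):
    """Pack an even-length list of 4-bit ints, two per byte (low nibble first)."""
    assert len(values) % 2 == 0 and all(0 <= v < 16 for v in values)
    return bytes(values[i] | (values[i + 1] << 4) for i in range(0, len(values), 2))

def unpack_nibbles(packed):
    """Inverse of pack_nibbles: expand each byte into two 4-bit ints."""
    out = []
    for b in packed:
        out.append(b & 0x0F)
        out.append(b >> 4)
    return out

def gemv_int4(packed_rows, scales, zeros, x, group=GROUP):
    """y[r] = sum_k (q[r][k] - z[r][k // group]) * s[r][k // group] * x[k]."""
    y = []
    for row, s_row, z_row in zip(packed_rows, scales, zeros):
        acc = 0.0
        for k, qk in enumerate(unpack_nibbles(row)):
            g = k // group
            acc += (qk - z_row[g]) * s_row[g] * x[k]
        y.append(acc)
    return y

q = [[1, 2, 3, 4, 8, 8, 9, 7], [15, 0, 8, 8, 0, 0, 0, 0]]
packed = [pack_nibbles(r) for r in q]
scales = [[0.5, 2.0], [1.0, 0.25]]
zeros = [[0, 8], [8, 0]]
x = [1.0] * 8
y = gemv_int4(packed, scales, zeros, x)
```

RMSNorm and RoPE do not appear in this sketch because, as described above, they stay in bf16 in both variants.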
Same 6-launch layout and 13 func args as the bf16 sibling, with
arg3/5/7 retyped from bf16 weight matrices to packed [Q|S|Z] uint8
BOs. Reuses kernel_builder.stitching unchanged; adds a small
_extract_air_channels helper so the explicit channel decls emitted by
the int4 GEMV survive func-body extraction.
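The idea behind keeping channel declarations through func-body extraction can be sketched as a text scan that collects module-level `air.channel @...` lines so they can be re-emitted next to the stitched body. This is a minimal sketch, assuming declarations sit one per line in the printed MLIR; `extract_channel_decls` and the sample module are hypothetical and are not the PR's `_extract_air_channels` implementation.

```python
import re

# Assumed declaration shape: `air.channel @name [dims]` on its own line.
# `air.channel.put` / `air.channel.get` ops are not matched, because the
# pattern requires whitespace right after `air.channel`.
_CHANNEL_RE = re.compile(r"^\s*air\.channel\s+@[\w$.]+.*$", re.MULTILINE)

def extract_channel_decls(mlir_text):
    """Return `air.channel @...` declaration lines, de-duplicated, in order."""
    seen, decls = set(), []
    for m in _CHANNEL_RE.finditer(mlir_text):
        line = m.group(0).strip()
        if line not in seen:
            seen.add(line)
            decls.append(line)
    return decls

sample = """module {
  air.channel @chan_w [1, 1]
  air.channel @chan_x [1, 1]
  func.func @gemv(%a: memref<8xi8>) {
    air.channel.put @chan_w[] (%a[] [] []) : (memref<8xi8>)
    return
  }
}"""
decls = extract_channel_decls(sample)
```

A regex scan is only one possible approach; walking the parsed module would be more robust if the declarations ever span multiple lines.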
Also adds compile_mv_int4_bf16() in kernel_builder/external_kernels.py
for the standalone test to compile mv_int4_bf16.o on demand.
Validated standalone on NPU2:
- module parses (13 args, 6 launches)
- compile-only emits a single ELF
- compile-and-run matches CPU reference (q_roped/k_roped correlation
0.99996 / 0.99996; tolerances rtol=0.2 atol=0.5 corr=0.99)
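One way to read the three tolerances is as an element-wise closeness test (`|a - e| <= atol + rtol * |e|`) combined with a Pearson-correlation floor. The sketch below shows that combination in plain Python; how the standalone test actually combines `rtol`, `atol`, and `corr` is not stated here, so treat `check_output` and `pearson` as hypothetical helpers.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def check_output(actual, expected, rtol=0.2, atol=0.5, min_corr=0.99):
    """Assumed pass rule: every element is close AND correlation >= min_corr."""
    close = all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))
    corr = pearson(actual, expected)
    return close and corr >= min_corr, corr

expected = [0.0, 1.0, -2.0, 3.5]
good = [0.1, 0.9, -2.2, 3.4]      # small perturbation of the reference
bad = [3.5, -2.0, 1.0, 0.0]       # same values, wrong order
ok_good, corr_good = check_output(good, expected)
ok_bad, corr_bad = check_output(bad, expected)
```

The correlation floor catches outputs that are individually within the loose element-wise bounds but structurally wrong, which matters when `rtol` and `atol` are as permissive as they are for quantized weights.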
Profiling vs bf16 baseline (100 iters, 30 warmup, NPU2):
bf16 724.6 us avg / 719 us min
int4 579.7 us avg / 573 us min (~1.25x end-to-end)
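A measurement of this kind can be sketched as a warmup phase followed by timed iterations, reporting the average and minimum. The helper below is a minimal sketch, assuming 30 untimed warmup runs followed by 100 timed runs (the PR text does not say whether the warmup is counted inside the 100); `profile` and `run_once` are hypothetical names, with `run_once` standing in for one launch-and-wait of the ELF.

```python
import time

def profile(run_once, iters=100, warmup=30):
    """Time `run_once` and return (average_us, min_us) over `iters` runs."""
    for _ in range(warmup):
        run_once()                      # untimed: warms caches and clocks
    samples_us = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run_once()                      # stand-in for one device run
        samples_us.append((time.perf_counter() - t0) * 1e6)
    return sum(samples_us) / len(samples_us), min(samples_us)

# Toy usage: a no-op workload that only records how often it was called.
calls = []
avg_us, min_us = profile(lambda: calls.append(1), iters=100, warmup=30)
```

Reporting the minimum alongside the average helps separate the kernel's steady-state cost from host-side jitter.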
Weight DMA drops 3.82x; end-to-end speedup is Amdahl-limited because
RMSNorm + RoPE Q + RoPE K stay bf16 in both variants.
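The Amdahl argument can be made concrete with a few lines of arithmetic. The sketch below assumes, purely for illustration, that the 3.82x weight-DMA reduction translates into a 3.82x speedup of one serial fraction `p` of the bf16 runtime, and solves for `p`. It also checks one parameter choice that would be consistent with the 3.82x figure (4-bit weights plus a 16-bit scale and 8-bit zero point per group of 128); that group size and those widths are assumptions, not values stated in this PR.

```python
# Measured averages from the profiling table above (microseconds).
bf16_avg_us, int4_avg_us = 724.6, 579.7
speedup = bf16_avg_us / int4_avg_us            # about 1.25

# Amdahl's law: 1/S = (1 - p) + p / r, solved for the accelerated fraction p.
# Assumption: the accelerated part speeds up by exactly r = 3.82.
r = 3.82
p = (1 - 1 / speedup) / (1 - 1 / r)            # about 0.27

# Assumed quantization layout: one 16-bit scale and one 8-bit zero per 128 weights.
group, scale_bits, zero_bits = 128, 16, 8
bits_per_weight = 4 + (scale_bits + zero_bits) / group   # 4.1875
weight_byte_ratio = 16 / bits_per_weight                  # bf16 bits / int4 bits

print(f"speedup={speedup:.3f} p={p:.3f} ratio={weight_byte_ratio:.3f}")
```

Under these assumptions only about a quarter of the bf16 runtime benefits from the smaller weights, which is why a 3.82x reduction in weight traffic yields roughly 1.25x end to end.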
Out of scope (follow-up): wiring into kernel_cache/backend_presets/
inference flow and an HF AWQ -> packed-BO repacker.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Adds an int4-AWQ variant of the Llama-3.2-1B decode pipeline that fuses RMSNorm + Q/K/V GEMV (packed int4 weights) + RoPE Q/K into a single 6-launch ELF. RMSNorm and RoPE remain bf16; only the linear projections use the packed int4 GEMV from PR #1632. A small helper is added for compiling the new micro-kernel.
Changes:
- New `rms_qkv_int4_rope_multi.py` mirroring the bf16 `rms_gemv_rope_multi.py` structure (13 args, 6 launches), with `arg3/5/7` retyped to packed `[Q|S|Z]` `uint8` BOs and a `_extract_air_channels` helper to preserve explicit `air.channel @...` decls during stitching.
- New `compile_mv_int4_bf16()` in `kernel_builder/external_kernels.py` for building `mv_int4_bf16.o` on demand.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `programming_examples/llama32_1b/multi_launch_builder/rms_qkv_int4_rope_multi.py` | New 6-launch int4-AWQ decode module + standalone test. |
| `programming_examples/llama32_1b/kernel_builder/external_kernels.py` | Adds `compile_mv_int4_bf16` helper for the int4 GEMV micro-kernel. |
# Conflicts: # programming_examples/llama32_1b/kernel_builder/external_kernels.py
Summary
- Adds `rms_qkv_int4_rope_multi.py`, an int4-AWQ sibling of `rms_gemv_rope_multi.py` for the Llama-3.2-1B decode pipeline. Q/K/V projections use the packed int4 GEMV landed in "[programming_examples] AWQ int4 matvec examples (GEMV + GEMV+R)" #1632; RMSNorm and RoPE remain bf16 (HF AWQ quantizes only Linear weights; RoPE flows through the same `rotate_half` path as the bf16 model).
- Same 6-launch layout and 13 func args as the bf16 sibling, with `arg3`/`arg5`/`arg7` retyped from bf16 weight matrices to packed `[Q|S|Z]` `uint8` BOs. Reuses `kernel_builder.stitching` unchanged; adds a small `_extract_air_channels` helper so the explicit channel decls emitted by the int4 GEMV survive func-body extraction.
- Adds `compile_mv_int4_bf16()` in `kernel_builder/external_kernels.py` for the standalone test to compile `mv_int4_bf16.o` on demand.

Results
Standalone validation on NPU2:
- module parses (13 args, 6 launches)
- compile-only emits a single ELF
- compile-and-run matches CPU reference (`q_roped`/`k_roped` correlation 0.99996 / 0.99996; tolerances `rtol=0.2 atol=0.5 corr=0.99`)

Profiling vs bf16 baseline (100 iters, 30 warmup, NPU2 end-to-end XRT `run.start()` → `run.wait2()`):

| Variant | Avg (us) | Min (us) |
|---|---|---|
| bf16 | 724.6 | 719 |
| int4 | 579.7 | 573 |

The int4 variant is ~1.25x faster end-to-end. Weight DMA drops 3.82× as expected. End-to-end speedup is Amdahl-limited because RMSNorm and both RoPE launches are unchanged bf16 work in both variants; a larger win is expected when the same approach is applied to ELF 2 (O + Gate + Up + Down), whose weights dominate the per-layer footprint.

Out of scope (follow-up)
- Wiring into `kernel_cache` / `backend_presets` / `llama32_1b_inference.py` (the standalone test currently feeds synthetic uint4 weights).
- An HF AWQ -> packed-BO repacker.
- An int4-AWQ variant of `o_gemv_ffn`: 4 more GEMVs, including the two GEMV+R fusions.

Test plan
- `python3 rms_qkv_int4_rope_multi.py -p` → module parses (13 args, 6 launches)
- `python3 rms_qkv_int4_rope_multi.py --compile-mode compile-only` → produces ELF
- `python3 rms_qkv_int4_rope_multi.py` → PASS on NPU2 (correlation 0.99996/0.99996)
- … `rms_gemv_rope_multi.py`

🤖 Generated with Claude Code