
[llama32_1b] int4-AWQ RMS + Q/K/V GEMV + RoPE multi-launch decode ELF #1633

Merged
erwei-xilinx merged 3 commits into Xilinx:main from erwei-xilinx:int4-rms-qkv-rope-elf1
May 30, 2026

@erwei-xilinx
Collaborator

Summary

  • Adds rms_qkv_int4_rope_multi.py, an int4-AWQ sibling of rms_gemv_rope_multi.py for the Llama-3.2-1B decode pipeline. Q/K/V projections use the packed int4 GEMV that landed in #1632 ([programming_examples] AWQ int4 matvec examples (GEMV + GEMV+R)). RMSNorm and RoPE remain bf16, because HF AWQ quantizes only Linear weights; RoPE flows through the same rotate_half path as the bf16 model.
  • Same 6-launch layout and 13 func args as the bf16 sibling, with arg3/arg5/arg7 retyped from bf16 weight matrices to packed [Q|S|Z] uint8 BOs. Reuses kernel_builder.stitching unchanged; adds a small _extract_air_channels helper so the explicit channel decls emitted by the int4 GEMV survive func-body extraction.
  • Adds compile_mv_int4_bf16() in kernel_builder/external_kernels.py for the standalone test to compile mv_int4_bf16.o on demand.
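
To make the packed [Q|S|Z] operand concrete, here is a minimal NumPy sketch of one way such a buffer could be assembled and then dequantized on the CPU. The layout details are assumptions and are not taken from this PR: two 4-bit values per byte with the low nibble first, a group size of 128 along the input dimension, 2-byte scales (float16 here, since NumPy has no native bf16), and one uint8 zero point per group. The names pack_qsz and dequant_matvec are hypothetical.

```python
import numpy as np

GROUP = 128  # assumed quantization group size along the input (K) dimension


def pack_qsz(q, scales, zeros):
    """Concatenate nibble-packed weights, scales, and zero points into one uint8 buffer.

    q:      (N, K) uint8 with values in [0, 15]
    scales: (N, K // GROUP) float16 (stand-in for bf16)
    zeros:  (N, K // GROUP) uint8
    """
    lo, hi = q[:, 0::2], q[:, 1::2]
    packed_q = (lo | (hi << 4)).astype(np.uint8)  # two weights per byte, low nibble first
    return np.concatenate([
        packed_q.ravel(),                                   # Q section
        scales.astype(np.float16).view(np.uint8).ravel(),   # S section, raw bytes
        zeros.astype(np.uint8).ravel(),                     # Z section
    ])


def dequant_matvec(buf, x, n, k):
    """CPU reference: unpack the buffer and compute y = W_dequant @ x in float32."""
    g = k // GROUP
    q_bytes, s_bytes = n * k // 2, n * g * 2
    packed_q = buf[:q_bytes].reshape(n, k // 2)
    scales = buf[q_bytes:q_bytes + s_bytes].view(np.float16).reshape(n, g)
    zeros = buf[q_bytes + s_bytes:].reshape(n, g)

    q = np.empty((n, k), dtype=np.uint8)
    q[:, 0::2] = packed_q & 0x0F
    q[:, 1::2] = packed_q >> 4

    # w = (q - zero) * scale, with per-group parameters broadcast across each group
    w = (q.astype(np.float32) - np.repeat(zeros, GROUP, axis=1)) \
        * np.repeat(scales, GROUP, axis=1).astype(np.float32)
    return w @ x.astype(np.float32)
```

Keeping all three sections in a single uint8 buffer means each projection still occupies exactly one function argument, which is consistent with the 13-argument signature being unchanged from the bf16 sibling.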

Results

Standalone validation on NPU2:

  • module parses (13 args, 6 launches)
  • compile-only emits a single ELF
  • compile-and-run matches CPU reference (q_roped / k_roped correlation 0.99996 / 0.99996; tolerances rtol=0.2 atol=0.5 corr=0.99)
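
The pass criterion above can be sketched with NumPy as an elementwise tolerance check combined with a Pearson correlation threshold. The tolerances come from the PR text; the function name check_against_reference and the exact way the two checks are combined are assumptions, since the test's implementation is not shown here.

```python
import numpy as np


def check_against_reference(npu_out, cpu_ref, rtol=0.2, atol=0.5, min_corr=0.99):
    """Return (passed, corr) for a tolerance check plus a correlation check.

    passed is True only if every element satisfies
    |npu - ref| <= atol + rtol * |ref| AND the Pearson correlation is >= min_corr.
    """
    a = np.asarray(npu_out, dtype=np.float32).ravel()
    b = np.asarray(cpu_ref, dtype=np.float32).ravel()
    within_tol = np.allclose(a, b, rtol=rtol, atol=atol)
    corr = float(np.corrcoef(a, b)[0, 1])
    return bool(within_tol and corr >= min_corr), corr
```

The loose rtol/atol absorb bf16 rounding and quantization noise on individual elements, while the correlation threshold catches structural errors (for example a wrong head ordering) that could still pass a loose elementwise test.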

Profiling vs bf16 baseline (100 iters, 30 warmup, NPU2 end-to-end, timed from XRT run.start() to run.wait2()):

| variant | avg µs | min µs | wQ bytes | wK/V bytes |
|---|---|---|---|---|
| bf16 | 724.6 | 719 | 8.39 MB | 2.10 MB |
| int4 AWQ | 579.7 | 573 | 2.20 MB | 0.55 MB |
| speedup | 1.25× | 1.26× | 3.82× less | 3.82× less |

Weight DMA drops 3.82× as expected. End-to-end speedup is Amdahl-limited because RMSNorm and both RoPE launches are unchanged bf16 work in both variants; a larger win is expected when the same approach is applied to ELF 2 (O + Gate + Up + Down) whose weights dominate the per-layer footprint.
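
The arithmetic behind these figures can be checked with a short script. The projection shapes follow from the Llama-3.2-1B configuration (hidden size 2048, 32 query heads, 8 KV heads, head dim 64). The quantization parameters are assumptions that this PR does not state: a group size of 128 with a 2-byte scale and a 1-byte zero point per group, chosen here because they reproduce the table's byte counts. The Amdahl estimate is likewise a simplified model that treats one fraction of the bf16 runtime as scaling linearly with weight bytes.

```python
HIDDEN = 2048                     # Llama-3.2-1B hidden size (input dim of Q/K/V)
Q_OUT, KV_OUT = 2048, 512         # 32 query heads / 8 KV heads, head dim 64
GROUP = 128                       # ASSUMED AWQ group size
SCALE_BYTES, ZERO_BYTES = 2, 1    # ASSUMED per-group scale and zero-point storage


def bf16_bytes(n_out, n_in):
    return n_out * n_in * 2


def int4_bytes(n_out, n_in):
    groups = n_out * n_in // GROUP
    return n_out * n_in // 2 + groups * (SCALE_BYTES + ZERO_BYTES)


def amdahl_fraction(overall, local):
    """Fraction p of baseline time accelerated by `local`, given the overall speedup.

    Solves 1/overall = (1 - p) + p/local for p.
    """
    return (1 - 1 / overall) / (1 - 1 / local)


ratio = bf16_bytes(Q_OUT, HIDDEN) / int4_bytes(Q_OUT, HIDDEN)
overall = 724.6 / 579.7           # measured average latencies from the table
p = amdahl_fraction(overall, ratio)

print(f"wQ:   bf16 {bf16_bytes(Q_OUT, HIDDEN) / 1e6:.2f} MB, int4 {int4_bytes(Q_OUT, HIDDEN) / 1e6:.2f} MB")
print(f"wK/V: bf16 {bf16_bytes(KV_OUT, HIDDEN) / 1e6:.2f} MB, int4 {int4_bytes(KV_OUT, HIDDEN) / 1e6:.2f} MB")
print(f"weight ratio {ratio:.2f}x, overall {overall:.2f}x, implied weight-bound fraction {p:.2f}")
```

Under these assumptions the weight-bound share of the bf16 ELF comes out at roughly 27% of the runtime, which is why a 3.82× reduction in weight traffic yields only about 1.25× end to end.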

Out of scope (follow-up)

  • Wiring into kernel_cache / backend_presets / llama32_1b_inference.py (the standalone test currently feeds synthetic uint4 weights).
  • HF AWQ → packed-BO repacker.
  • ELF 2 (o_gemv_ffn) int4-AWQ variant — 4 more GEMVs including the two GEMV+R fusions.

Test plan

  • python3 rms_qkv_int4_rope_multi.py -p → module parses (13 args, 6 launches)
  • python3 rms_qkv_int4_rope_multi.py --compile-mode compile-only → produces ELF
  • python3 rms_qkv_int4_rope_multi.py → PASS on NPU2 (correlation 0.99996/0.99996)
  • bf16 sibling unchanged; no regression to rms_gemv_rope_multi.py

🤖 Generated with Claude Code

Adds rms_qkv_int4_rope_multi.py: an int4-AWQ sibling of
rms_gemv_rope_multi.py for the Llama-3.2-1B decode pipeline. Q/K/V
projections use the packed int4 GEMV from int4_awq/, while RMSNorm and
RoPE remain bf16 (HF AWQ quantizes only Linear weights; RoPE flows
through the same rotate_half path as the bf16 model).

Same 6-launch layout and 13 func args as the bf16 sibling, with
arg3/5/7 retyped from bf16 weight matrices to packed [Q|S|Z] uint8
BOs. Reuses kernel_builder.stitching unchanged; adds a small
_extract_air_channels helper so the explicit channel decls emitted by
the int4 GEMV survive func-body extraction.

Also adds compile_mv_int4_bf16() in kernel_builder/external_kernels.py
for the standalone test to compile mv_int4_bf16.o on demand.

Validated standalone on NPU2:
  - module parses (13 args, 6 launches)
  - compile-only emits a single ELF
  - compile-and-run matches CPU reference (q_roped/k_roped correlation
    0.99996 / 0.99996; tolerances rtol=0.2 atol=0.5 corr=0.99)

Profiling vs bf16 baseline (100 iters, 30 warmup, NPU2):
  bf16   724.6 us avg / 719 us min
  int4   579.7 us avg / 573 us min   (~1.25x end-to-end)
Weight DMA drops 3.82x; end-to-end speedup is Amdahl-limited because
RMSNorm + RoPE Q + RoPE K stay bf16 in both variants.

Out of scope (follow-up): wiring into kernel_cache/backend_presets/
inference flow and an HF AWQ -> packed-BO repacker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 29, 2026 21:14
@erwei-xilinx erwei-xilinx requested a review from jgmelber as a code owner May 29, 2026 21:14
Contributor

Copilot AI left a comment

Pull request overview

Adds an int4-AWQ variant of the Llama-3.2-1B decode pipeline that fuses RMSNorm + Q/K/V GEMV (packed int4 weights) + RoPE Q/K into a single 6-launch ELF. RMSNorm and RoPE remain bf16; only the linear projections use the packed int4 GEMV from PR #1632. A small helper is added for compiling the new micro-kernel.

Changes:

  • New rms_qkv_int4_rope_multi.py mirroring the bf16 rms_gemv_rope_multi.py structure (13 args, 6 launches) with arg3/5/7 retyped to packed [Q|S|Z] uint8 BOs and a _extract_air_channels helper to preserve explicit air.channel @... decls during stitching.
  • New compile_mv_int4_bf16() in kernel_builder/external_kernels.py for building mv_int4_bf16.o on demand.
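
As an illustration of why such a helper is needed: when only a function body is lifted out of a module for stitching, module-level channel declarations are left behind and must be collected separately. The following is a hypothetical regex-based sketch, not the PR's _extract_air_channels (whose implementation is not shown here). It assumes declarations appear one per line as `air.channel @name ...` and that channel operations such as `air.channel.put` must not be matched.

```python
import re

# Matches a declaration line like `air.channel @chan_a [1, 1]`.
# Requiring whitespace after `air.channel` excludes `air.channel.put` / `.get`.
_CHANNEL_DECL = re.compile(r"^[ \t]*air\.channel[ \t]+@[\w.$-]+.*$", re.MULTILINE)


def extract_channel_decls(mlir_text):
    """Return `air.channel @...` declaration lines in order, without duplicates."""
    seen, decls = set(), []
    for match in _CHANNEL_DECL.finditer(mlir_text):
        line = match.group(0).strip()
        if line not in seen:
            seen.add(line)
            decls.append(line)
    return decls
```

A text-level pass like this is fragile compared with walking the parsed module, so it is best read as a description of the problem rather than a recommended implementation.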

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| programming_examples/llama32_1b/multi_launch_builder/rms_qkv_int4_rope_multi.py | New 6-launch int4-AWQ decode module + standalone test. |
| programming_examples/llama32_1b/kernel_builder/external_kernels.py | Adds compile_mv_int4_bf16 helper for the int4 GEMV micro-kernel. |

erwei-xilinx and others added 2 commits May 29, 2026 21:09
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	programming_examples/llama32_1b/kernel_builder/external_kernels.py
@erwei-xilinx erwei-xilinx added this pull request to the merge queue May 30, 2026
Merged via the queue into Xilinx:main with commit 306226e May 30, 2026
27 checks passed
@erwei-xilinx erwei-xilinx deleted the int4-rms-qkv-rope-elf1 branch May 30, 2026 05:38

2 participants