[llama32_1b] int4-AWQ end-to-end decode with HF AutoAWQ checkpoint by erwei-xilinx · Pull Request #1638 · Xilinx/mlir-air

erwei-xilinx · 2026-05-30T18:27:57Z

Summary

Wires the int4-AWQ decode ELFs from #1633 (rms_qkv_int4_rope) and #1637 (o_gemv_ffn_int4) into the existing inference / chat-REPL pipeline so the user can run a HuggingFace AutoAWQ-quantized Llama-3.2-1B end-to-end on NPU2 with one flag:

```
python3 llama32_1b_inference.py \
    --quant awq --run-only --n-tokens 30 \
    --model-path amd/Llama-3.2-1B-Instruct-awq-uint4-asym-g128-bf16-lmhead \
    --prompt "Once upon a time"
# -> "Once upon a time, in a small village nestled in the rolling hills
#     of a far-off land, there lived a young girl named Sophia. Sophia was a"
# 12.4 tok/s decode (~81 ms/tok), coherent continuation.
```

Prefill stays on CPU as a placeholder for this PR (no int4 GEMM kernel / prefill ELFs yet — that's a separate project). The placeholder reuses `llama32_1b_reference.transformer_block` over dequantized-to-bf16 AWQ weights to populate the KV cache, then hands off to the NPU int4 decode loop. Replacing it with int4 NPU prefill later doesn't touch any of the decode wiring landed here.
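As a rough illustration of the handoff described above, here is a toy single-head sketch in NumPy: prefill writes K/V for the whole prompt into a cache, and each decode step appends one row and attends over everything cached so far. It is not the PR's code. `ToyKVCache`, `toy_prefill`, and `toy_decode_step` are hypothetical names, and RoPE, multi-head layout, and quantization are all left out.

```python
import numpy as np


def attend(q, k, v):
    """Softmax attention of one query vector q (d,) over cached k, v (t, d)."""
    scores = k @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v


class ToyKVCache:
    """Hypothetical fixed-capacity KV cache for a single layer and head."""

    def __init__(self, max_len, d):
        self.k = np.zeros((max_len, d), dtype=np.float32)
        self.v = np.zeros((max_len, d), dtype=np.float32)
        self.len = 0

    def append(self, k, v):
        n = k.shape[0]
        self.k[self.len:self.len + n] = k
        self.v[self.len:self.len + n] = v
        self.len += n


def toy_prefill(x, wk, wv, cache):
    """Prefill: project every prompt position at once and fill the cache."""
    cache.append(x @ wk, x @ wv)


def toy_decode_step(x_t, wq, wk, wv, cache):
    """Decode: append K/V for one new token, then attend over the cache."""
    cache.append((x_t @ wk)[None], (x_t @ wv)[None])
    return attend(x_t @ wq, cache.k[:cache.len], cache.v[:cache.len])
```

The decode step only reads the cache contents, so in this toy model it does not matter which device or kernel produced them. That is the property that lets a CPU prefill be swapped for an NPU one later.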

New files

`awq_repacker.py`
- Unpacks AutoAWQ int32-packed nibbles via `AWQ_PACK_ORDER = [0,2,4,6,1,3,5,7]`, composes with `matvec_int4_packed.pack_inputs` to produce the per-tile packed-BO layout the int4 decode ELFs consume.
- `dequant_to_bf16` (fp16→bf16 scales, asymmetric uint4) for the CPU prefill path.
- Built-in synthetic round-trip self-test (≥ 0.9999 correlation vs dense dequant); passes at K/N up to 2048.
`cpu_prefill.py`
- Drop-in replacement with the same signature as `run_npu_prefill`. Harvests per-layer `k_roped`/`v` from `transformer_block` intermediates into the KV cache layout that `run_decode_block` expects. Takes ~165 s for a 40-token prompt: fine for validation, not for production.
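For readers unfamiliar with the AutoAWQ packing, the following is a minimal NumPy sketch of the unpack and dequant steps. It is not the contents of `awq_repacker.py`. It assumes the usual AutoAWQ GEMM convention: nibble slot `i` of each int32 holds logical column `AWQ_PACK_ORDER[i]` of its 8-column group, `qweight` is `(K, N/8)`, `qzeros` is `(K/G, N/8)`, and `scales` is `(K/G, N)` with group size `G = 128`. The function names are hypothetical, and the result is float32 because NumPy has no native bf16.

```python
import numpy as np

AWQ_PACK_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]


def pack_int4(w):
    """Pack (K, N) uint4 values into (K, N // 8) int32 words (assumed layout)."""
    k, n = w.shape
    grouped = w.reshape(k, n // 8, 8).astype(np.uint32)
    packed = np.zeros((k, n // 8), dtype=np.uint32)
    for slot, src in enumerate(AWQ_PACK_ORDER):
        # Nibble `slot` of each word stores logical column `src` of the group.
        packed |= grouped[:, :, src] << np.uint32(4 * slot)
    return packed.view(np.int32)


def unpack_int4(packed):
    """Inverse of pack_int4: (K, N // 8) int32 -> (K, N) uint8 in [0, 15]."""
    u = packed.view(np.uint32)
    k, cols = u.shape
    out = np.empty((k, cols, 8), dtype=np.uint8)
    for slot, dst in enumerate(AWQ_PACK_ORDER):
        out[:, :, dst] = (u >> np.uint32(4 * slot)) & np.uint32(0xF)
    return out.reshape(k, cols * 8)


def dequant_asym_uint4(qweight, qzeros, scales, group_size=128):
    """Asymmetric dequant: (w - zero) * scale, one zero/scale per K-group."""
    w = unpack_int4(qweight).astype(np.float32)
    z = np.repeat(unpack_int4(qzeros).astype(np.float32), group_size, axis=0)
    s = np.repeat(scales.astype(np.float32), group_size, axis=0)
    return (w - z) * s
```

A synthetic round trip in the spirit of the self-test above would pack random uint4 weights and zeros with `pack_int4`, call `dequant_asym_uint4`, and compare against the dense `(w - z) * s` computed directly.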

Modified

`llama32_1b_weights.py` — new `load_weights_awq` populates both bf16 dequant (existing fields) and packed-BO attrs (`_wq_packed`/.../`_wgateup_packed`/`_wdown_packed`). gate+up are interleaved at the nibble level so the int4 FFN ELF consumes them in one arg slot.
`kernel_builder/backend_presets.py` — `RGR_INT4_BACKEND`, `OGF_INT4_BACKEND` (distinct `instance_name` so cache files don't collide).
`kernel_builder/external_kernels.py` — `compile_all_external_kernels(quant=)` builds `mv_int4_bf16.o` when `quant="awq"`.
`kernel_builder/cache.py` — `prepare_air_project(quant=)` stages `mv_int4_bf16.o` into `air_project/`. `compile_and_cache` detects int4 ELFs from the kernel name and pipes the right quant through — existing call sites don't change.
`llama32_1b_decode.py` — `compile_decode_kernels(..., quant=)` and `run_decode_block(..., quant=)` branch between bf16 and int4 ELFs.
`llama32_1b_inference.py` — `--quant {bf16,awq}` flag, `--model-path` AWQ override. AWQ mode skips prefill compile / bf16 transpose / NPU prefill preload, and dispatches `cpu_prefill.run_cpu_prefill` instead.
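A hedged sketch of how the flag handling might look with `argparse`. The flag names come from this PR; the defaults, the `parse_args` / `prefill_backend` helpers, and the exact error text are assumptions for illustration, not the code in `llama32_1b_inference.py`.

```python
import argparse


def build_parser():
    """Hypothetical parser; only the flag names are taken from the PR."""
    p = argparse.ArgumentParser(description="Llama-3.2-1B inference (sketch)")
    p.add_argument("--quant", choices=["bf16", "awq"], default="bf16")  # default assumed
    p.add_argument("--model-path", default=None)
    p.add_argument("--synthetic-weights", action="store_true")
    p.add_argument("--n-tokens", type=int, default=30)  # default assumed
    p.add_argument("--prompt", default="Once upon a time")  # default assumed
    return p


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # The commit message states that AWQ mode cannot use synthetic weights.
    if args.quant == "awq" and args.synthetic_weights:
        parser.error("--quant=awq is incompatible with --synthetic-weights")
    return args


def prefill_backend(args):
    """AWQ mode runs the CPU prefill placeholder; bf16 keeps NPU prefill."""
    return "cpu" if args.quant == "awq" else "npu"
```

Validating the flag combination at parse time means an unsupported run fails before any kernel compile starts.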

Test plan

- AWQ repacker self-test (synthetic round-trip) passes at K=2048, N∈{512, 2048}
- `--quant=awq --compile-only` builds all 3 decode ELFs (`rms_qkv_int4_rope`, `o_gemv_ffn_int4`, `lm_head_gemv`) cleanly
- `--quant=awq --run-only` against the AMD AWQ checkpoint produces coherent text and stable ~12.4 tok/s decode
- `--quant=bf16` path unchanged (no regression in the bf16 NPU prefill + NPU decode pipeline)
- Token-level match vs HF AutoAWQ greedy reference (requires `pip install autoawq`; follow-up)
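For the token-level follow-up, a comparison helper along these lines could report where the NPU output first diverges from the reference. This is a sketch under assumptions: `first_mismatch` is a hypothetical name, and both inputs are taken to be plain lists of greedy-decoded token IDs, however they are produced.

```python
def first_mismatch(candidate, reference):
    """Return the index of the first differing token, or None if identical.

    If one sequence is a strict prefix of the other, the index just past
    the shorter sequence is returned.
    """
    for i, (a, b) in enumerate(zip(candidate, reference)):
        if a != b:
            return i
    if len(candidate) != len(reference):
        return min(len(candidate), len(reference))
    return None
```

Reporting the first divergent position, rather than a bare pass/fail, makes it easier to tell an early numerical drift from a late one.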

Out of scope (deferred)

Int4 prefill ELFs and the int4 GEMM kernel they would need — a separate, substantially larger project. CPU prefill is the temporary scaffolding until that lands.
Other quantization formats (GPTQ, EXL2). AWQ-only.

🤖 Generated with Claude Code

Wires the int4-AWQ decode ELFs from PR Xilinx#1633 (rms_qkv_int4_rope) and PR Xilinx#1637 (o_gemv_ffn_int4) into the full inference pipeline so the existing chat_repl / llama32_1b_inference can drive them against a real HuggingFace AutoAWQ-quantized Llama-3.2-1B checkpoint.

    python3 llama32_1b_inference.py \
        --quant awq --run-only --n-tokens 30 \
        --model-path amd/Llama-3.2-1B-Instruct-awq-uint4-asym-g128-bf16-lmhead \
        --prompt "Once upon a time"
    # -> "Once upon a time, in a small village nestled in the rolling hills
    #     of a far-off land, there lived a young girl named Sophia. Sophia was a"
    # 12.4 tok/s decode (~81 ms/tok), coherent continuation.

Prefill stays on CPU as a placeholder for this PR (no int4 GEMM kernel / prefill ELFs yet; that's a separate project). The placeholder runs the existing llama32_1b_reference.transformer_block over dequantized-to-bf16 AWQ weights to populate the KV cache, then hands off to the NPU int4 decode loop. Replacing it with int4 NPU prefill later doesn't touch any of the decode wiring landed here.

New files:

awq_repacker.py
- Unpacks AutoAWQ int32-packed nibbles via AWQ_PACK_ORDER [0,2,4,6,1,3,5,7], composes with matvec_int4_packed.pack_inputs to produce the per-tile packed-BO layout the int4 decode ELFs consume.
- dequant_to_bf16 (fp16->bf16 scales, asymmetric uint4) for CPU prefill.
- Built-in synthetic round-trip self-test (>= 0.9999 correlation vs dense dequant); passes at K/N up to 2048.

cpu_prefill.py
- Drop-in replacement for run_npu_prefill signature, harvests per-layer k_roped/v from transformer_block intermediates into the KV cache layout expected by run_decode_block. ~165 s for a 40-token prompt; fine for validation, not for production.

Modified:

llama32_1b_weights.py
- load_weights_awq(model_id, config): loads HF AutoAWQ tensors, attaches both bf16 dequant (existing LayerWeights fields, for CPU prefill) and per-tile packed BOs (_wq_packed / .../ _wgateup_packed / _wdown_packed, for NPU decode). Gate+up are interleaved at the nibble level so the int4 FFN ELF consumes them in one arg slot.

kernel_builder/backend_presets.py
- RGR_INT4_BACKEND, OGF_INT4_BACKEND (same shape as the bf16 ones; distinct instance_name so kernel-cache files don't collide).

kernel_builder/external_kernels.py
- compile_all_external_kernels(quant=) builds mv_int4_bf16.o when quant="awq" (object already had compile_mv_int4_bf16 from Xilinx#1633).

kernel_builder/cache.py
- prepare_air_project(quant=) stages mv_int4_bf16.o into air_project/. compile_and_cache detects int4 ELFs from the name and pipes the right quant through, so existing call sites don't need to change.

llama32_1b_decode.py
- compile_decode_kernels(cache, config, quant=) builds either the bf16 ELFs or the int4 ELFs.
- run_decode_block(..., quant=) reads either bf16 transposed weights or packed-i8 BOs from the same arg slots.

llama32_1b_inference.py
- --quant {bf16,awq} flag, --model-path AWQ checkpoint override.
- awq mode: no prefill compile, no bf16 transpose, no bf16 prefill preload; CPU prefill replaces run_npu_prefill.
- --quant=awq is incompatible with --synthetic-weights.

Verification on NPU2 with amd/Llama-3.2-1B-Instruct-awq-uint4-asym-g128-bf16-lmhead:
- compile-only: rms_qkv_int4_rope (3 s) + o_gemv_ffn_int4 (35 s) + lm_head_gemv (10 s)
- 30-token greedy generation produces coherent English, 12.4 tok/s decode
- Decode latency (~81 ms/tok) tracks the standalone-ELF PR Xilinx#1637 result

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Wires the int4-AWQ decode ELFs (`rms_qkv_int4_rope`, `o_gemv_ffn_int4`) into the existing Llama-3.2-1B inference pipeline behind a new `--quant awq` flag. AWQ mode loads HF AutoAWQ checkpoints (both as bf16 dequant for a CPU prefill placeholder and as per-tile packed-BO weights for the int4 decode ELFs) while leaving the bf16 NPU prefill + NPU decode path intact for `--quant bf16`.

Changes:

- New `awq_repacker.py` / `cpu_prefill.py` / `load_weights_awq` to bridge HF AutoAWQ → mlir-air packed-BO layout and to run prefill on CPU as a temporary scaffold.
- Plumb `quant=` through prepare/compile/runtime (`backend_presets`, `external_kernels`, `cache`, `llama32_1b_decode`, `llama32_1b_inference`) to dispatch int4 vs bf16 ELFs and weights.
- Skip prefill compile/preload/transpose and EOS-padding in AWQ mode; dispatch `run_cpu_prefill` instead of `run_npu_prefill`.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| programming_examples/llama32_1b/awq_repacker.py | New HF AutoAWQ → packed-BO repacker with synthetic self-test |
| programming_examples/llama32_1b/cpu_prefill.py | CPU-only prefill placeholder mirroring `run_npu_prefill` contract |
| programming_examples/llama32_1b/llama32_1b_weights.py | New `load_weights_awq` populating bf16 dequant + per-tile packed-BO attrs |
| programming_examples/llama32_1b/llama32_1b_decode.py | `compile_decode_kernels` / `run_decode_block` branch by `quant` |
| programming_examples/llama32_1b/llama32_1b_inference.py | `--quant` / `--model-path` flags; AWQ skips prefill compile/preload/pad |
| programming_examples/llama32_1b/kernel_builder/backend_presets.py | Adds `RGR_INT4_BACKEND` / `OGF_INT4_BACKEND` presets |
| programming_examples/llama32_1b/kernel_builder/external_kernels.py | Replaces `compile_mv_bf16` with `compile_mv_k8192`; adds AWQ branch |
| programming_examples/llama32_1b/kernel_builder/cache.py | `prepare_air_project(quant=)` stages `mv_int4_bf16.o` for int4 ELFs |


erwei-xilinx requested a review from jgmelber as a code owner May 30, 2026 18:27

Copilot AI review requested due to automatic review settings May 30, 2026 18:27

Copilot started reviewing on behalf of erwei-xilinx May 30, 2026 18:28

Copilot AI reviewed May 30, 2026

Comment thread programming_examples/llama32_1b/kernel_builder/external_kernels.py

Comment on lines 220 to +223

    compile_mv()
-    compile_mv_bf16()
+    compile_mv_k8192()
+    if quant == "awq":
+        compile_mv_int4_bf16()

erwei-xilinx wants to merge 1 commit into Xilinx:main from erwei-xilinx:int4-llama-e2e-awq
