[llama32_1b] int4-AWQ end-to-end decode with HF AutoAWQ checkpoint #1638
Open
erwei-xilinx wants to merge 1 commit into
Wires the int4-AWQ decode ELFs from PR Xilinx#1633 (rms_qkv_int4_rope) and PR Xilinx#1637 (o_gemv_ffn_int4) into the full inference pipeline so the existing chat_repl / llama32_1b_inference can drive them against a real HuggingFace AutoAWQ-quantized Llama-3.2-1B checkpoint.

    python3 llama32_1b_inference.py \
        --quant awq --run-only --n-tokens 30 \
        --model-path amd/Llama-3.2-1B-Instruct-awq-uint4-asym-g128-bf16-lmhead \
        --prompt "Once upon a time"
    # -> "Once upon a time, in a small village nestled in the rolling hills
    #     of a far-off land, there lived a young girl named Sophia. Sophia was a"
    # 12.4 tok/s decode (~81 ms/tok), coherent continuation.

Prefill stays on CPU as a placeholder for this PR (no int4 GEMM kernel / prefill ELFs yet; that's a separate project). The placeholder runs the existing llama32_1b_reference.transformer_block over dequantized-to-bf16 AWQ weights to populate the KV cache, then hands off to the NPU int4 decode loop. Replacing it with int4 NPU prefill later doesn't touch any of the decode wiring landed here.

New files:

awq_repacker.py
- Unpacks AutoAWQ int32-packed nibbles via AWQ_PACK_ORDER [0,2,4,6,1,3,5,7], composes with matvec_int4_packed.pack_inputs to produce the per-tile packed-BO layout the int4 decode ELFs consume.
- dequant_to_bf16 (fp16->bf16 scales, asymmetric uint4) for CPU prefill.
- Built-in synthetic round-trip self-test (>= 0.9999 correlation vs dense dequant); passes at K/N up to 2048.

cpu_prefill.py
- Drop-in replacement for run_npu_prefill signature, harvests per-layer k_roped/v from transformer_block intermediates into the KV cache layout expected by run_decode_block. ~165 s for a 40-token prompt; fine for validation, not for production.

Modified:

llama32_1b_weights.py
- load_weights_awq(model_id, config): loads HF AutoAWQ tensors, attaches both bf16 dequant (existing LayerWeights fields, for CPU prefill) and per-tile packed BOs (_wq_packed / .../ _wgateup_packed / _wdown_packed, for NPU decode). Gate+up are interleaved at the nibble level so the int4 FFN ELF consumes them in one arg slot.

kernel_builder/backend_presets.py
- RGR_INT4_BACKEND, OGF_INT4_BACKEND (same shape as the bf16 ones; distinct instance_name so kernel-cache files don't collide).

kernel_builder/external_kernels.py
- compile_all_external_kernels(quant=) builds mv_int4_bf16.o when quant="awq" (object already had compile_mv_int4_bf16 from Xilinx#1633).

kernel_builder/cache.py
- prepare_air_project(quant=) stages mv_int4_bf16.o into air_project/. compile_and_cache detects int4 ELFs from the name and pipes the right quant through, so existing call sites don't need to change.

llama32_1b_decode.py
- compile_decode_kernels(cache, config, quant=) builds either the bf16 ELFs or the int4 ELFs.
- run_decode_block(..., quant=) reads either bf16 transposed weights or packed-i8 BOs from the same arg slots.

llama32_1b_inference.py
- --quant {bf16,awq} flag, --model-path AWQ checkpoint override.
- awq mode: no prefill compile, no bf16 transpose, no bf16 prefill preload; CPU prefill replaces run_npu_prefill.
- --quant=awq is incompatible with --synthetic-weights.

Verification on NPU2 with amd/Llama-3.2-1B-Instruct-awq-uint4-asym-g128-bf16-lmhead:
- compile-only: rms_qkv_int4_rope (3 s) + o_gemv_ffn_int4 (35 s) + lm_head_gemv (10 s)
- 30-token greedy generation produces coherent English, 12.4 tok/s decode
- Decode latency (~81 ms/tok) tracks the standalone-ELF PR Xilinx#1637 result

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
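
For readers unfamiliar with the AWQ_PACK_ORDER permutation mentioned above, the following is a minimal pure-Python sketch of packing and unpacking a single 32-bit word. It assumes nibble position i of the word holds logical column AWQ_PACK_ORDER[i], which is how the AutoAWQ GEMM packing is commonly described. The helper names `pack_awq_word` and `unpack_awq_word` are hypothetical and are not taken from awq_repacker.py, which is not shown here.

```python
# Sketch only: illustrates the nibble permutation, not the PR's awq_repacker.py.
AWQ_PACK_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]


def pack_awq_word(values):
    """Pack 8 uint4 values into one 32-bit word.

    Assumption: nibble i (bits 4*i .. 4*i+3) stores values[AWQ_PACK_ORDER[i]].
    """
    assert len(values) == 8 and all(0 <= v < 16 for v in values)
    word = 0
    for i, src in enumerate(AWQ_PACK_ORDER):
        word |= values[src] << (4 * i)
    return word


def unpack_awq_word(word):
    """Inverse of pack_awq_word: recover the 8 logical uint4 values."""
    word &= 0xFFFFFFFF  # int32 checkpoints may store negative words
    out = [0] * 8
    for i, dst in enumerate(AWQ_PACK_ORDER):
        out[dst] = (word >> (4 * i)) & 0xF
    return out


if __name__ == "__main__":
    vals = [1, 2, 3, 4, 5, 6, 7, 8]
    word = pack_awq_word(vals)
    print(hex(word))
    print(unpack_awq_word(word) == vals)
```

A round trip through both helpers returns the original eight values, which is the per-word analogue of the synthetic round-trip self-test the commit message describes.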
Pull request overview
Wires the int4-AWQ decode ELFs (rms_qkv_int4_rope, o_gemv_ffn_int4) into the existing Llama-3.2-1B inference pipeline behind a new --quant awq flag. AWQ mode loads HF AutoAWQ checkpoints (both as bf16 dequant for a CPU prefill placeholder and as per-tile packed-BO weights for the int4 decode ELFs) while leaving the bf16 NPU prefill + NPU decode path intact for --quant bf16.
Changes:
- New `awq_repacker.py` / `cpu_prefill.py` / `load_weights_awq` to bridge HF AutoAWQ → mlir-air packed-BO layout and to run prefill on CPU as a temporary scaffold.
- Plumb `quant=` through prepare/compile/runtime (`backend_presets`, `external_kernels`, `cache`, `llama32_1b_decode`, `llama32_1b_inference`) to dispatch int4 vs bf16 ELFs and weights.
- Skip prefill compile/preload/transpose and EOS-padding in AWQ mode; dispatch `run_cpu_prefill` instead of `run_npu_prefill`.
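
The flag handling and prefill dispatch summarized in the list above could look roughly like the sketch below. Only the flag names (`--quant`, `--model-path`, `--synthetic-weights`) and the two prefill function names come from the PR text. The helper names `build_parser`, `parse_args`, and `prefill_name`, as well as the `bf16` default, are assumptions made for illustration and may not match llama32_1b_inference.py.

```python
import argparse


def build_parser():
    """Hypothetical parser covering only the flags named in the PR text."""
    parser = argparse.ArgumentParser(description="Llama-3.2-1B inference (sketch)")
    # Default of "bf16" is an assumption; the PR only says the bf16 path stays intact.
    parser.add_argument("--quant", choices=["bf16", "awq"], default="bf16")
    parser.add_argument("--model-path", default=None)
    parser.add_argument("--synthetic-weights", action="store_true")
    return parser


def parse_args(argv):
    """Parse argv and reject the combination the PR calls incompatible."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.quant == "awq" and args.synthetic_weights:
        parser.error("--quant=awq is incompatible with --synthetic-weights")
    return args


def prefill_name(args):
    """Return which prefill routine would be dispatched for this mode."""
    return "run_cpu_prefill" if args.quant == "awq" else "run_npu_prefill"


if __name__ == "__main__":
    args = parse_args(["--quant", "awq"])
    print(prefill_name(args))
```

Rejecting the incompatible combination at parse time keeps the failure close to the command line instead of surfacing later during weight loading.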
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| programming_examples/llama32_1b/awq_repacker.py | New HF AutoAWQ → packed-BO repacker with synthetic self-test |
| programming_examples/llama32_1b/cpu_prefill.py | CPU-only prefill placeholder mirroring run_npu_prefill contract |
| programming_examples/llama32_1b/llama32_1b_weights.py | New load_weights_awq populating bf16 dequant + per-tile packed-BO attrs |
| programming_examples/llama32_1b/llama32_1b_decode.py | compile_decode_kernels / run_decode_block branch by quant |
| programming_examples/llama32_1b/llama32_1b_inference.py | --quant / --model-path flags; AWQ skips prefill compile/preload/pad |
| programming_examples/llama32_1b/kernel_builder/backend_presets.py | Adds RGR_INT4_BACKEND / OGF_INT4_BACKEND presets |
| programming_examples/llama32_1b/kernel_builder/external_kernels.py | Replaces compile_mv_bf16 with compile_mv_k8192; adds AWQ branch |
| programming_examples/llama32_1b/kernel_builder/cache.py | prepare_air_project(quant=) stages mv_int4_bf16.o for int4 ELFs |
Comment on lines 220 to +223

    compile_mv()
    compile_mv_bf16()
    compile_mv_k8192()
    if quant == "awq":
        compile_mv_int4_bf16()
Summary
Wires the int4-AWQ decode ELFs from #1633 (`rms_qkv_int4_rope`) and #1637 (`o_gemv_ffn_int4`) into the existing inference / chat-REPL pipeline so the user can run a HuggingFace AutoAWQ-quantized Llama-3.2-1B end-to-end on NPU2 with one flag:

```
python3 llama32_1b_inference.py \
--quant awq --run-only --n-tokens 30 \
--model-path amd/Llama-3.2-1B-Instruct-awq-uint4-asym-g128-bf16-lmhead \
--prompt "Once upon a time"
# -> "Once upon a time, in a small village nestled in the rolling hills of a far-off land,
#     there lived a young girl named Sophia. Sophia was a"
# 12.4 tok/s decode (~81 ms/tok), coherent continuation.
```
Prefill stays on CPU as a placeholder for this PR (no int4 GEMM kernel / prefill ELFs yet — that's a separate project). The placeholder reuses `llama32_1b_reference.transformer_block` over dequantized-to-bf16 AWQ weights to populate the KV cache, then hands off to the NPU int4 decode loop. Replacing it with int4 NPU prefill later doesn't touch any of the decode wiring landed here.
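
As a rough illustration of the dequantized weights the CPU placeholder consumes, the sketch below applies asymmetric uint4 dequantization with one scale and one zero point per group along K (the checkpoint name indicates a group size of 128). It assumes the usual `(q - zero) * scale` convention. The function name `dequant_group_asym` is hypothetical, and the final rounding to bf16 is omitted, so this is not the PR's `dequant_to_bf16`.

```python
def dequant_group_asym(q, scales, zeros, group_size=128):
    """Dequantize one output channel of asymmetric uint4 weights.

    q      : uint4 values along K (ints in 0..15)
    scales : one float scale per group of `group_size` values
    zeros  : one uint4 zero point per group
    Assumption: w = (q - zero) * scale; bf16 rounding is left out of this sketch.
    """
    assert len(q) % group_size == 0
    n_groups = len(q) // group_size
    assert len(scales) == n_groups and len(zeros) == n_groups
    return [
        (qv - zeros[k // group_size]) * scales[k // group_size]
        for k, qv in enumerate(q)
    ]


if __name__ == "__main__":
    # Tiny example with group_size=2 so the arithmetic is easy to follow.
    print(dequant_group_asym([0, 15, 8, 8], [0.5, 2.0], [8, 8], group_size=2))
```

With a zero point of 8, the uint4 range 0..15 maps to -8..7 before scaling, which is what makes the scheme asymmetric around zero.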
New files
Modified
Test plan
Out of scope (deferred)
🤖 Generated with Claude Code