
[llama32_1b] int4-AWQ end-to-end decode with HF AutoAWQ checkpoint #1638

Open: erwei-xilinx wants to merge 1 commit into Xilinx:main from erwei-xilinx:int4-llama-e2e-awq


@erwei-xilinx

Summary

Wires the int4-AWQ decode ELFs from #1633 (rms_qkv_int4_rope) and #1637 (o_gemv_ffn_int4) into the existing inference / chat-REPL pipeline so the user can run a HuggingFace AutoAWQ-quantized Llama-3.2-1B end-to-end on NPU2 with one flag:

```
python3 llama32_1b_inference.py \
    --quant awq --run-only --n-tokens 30 \
    --model-path amd/Llama-3.2-1B-Instruct-awq-uint4-asym-g128-bf16-lmhead \
    --prompt "Once upon a time"
# -> "Once upon a time, in a small village nestled in the rolling hills
#     of a far-off land, there lived a young girl named Sophia. Sophia was a"
# 12.4 tok/s decode (~81 ms/tok), coherent continuation.
```

Prefill stays on CPU as a placeholder for this PR (no int4 GEMM kernel / prefill ELFs yet — that's a separate project). The placeholder reuses `llama32_1b_reference.transformer_block` over dequantized-to-bf16 AWQ weights to populate the KV cache, then hands off to the NPU int4 decode loop. Replacing it with int4 NPU prefill later doesn't touch any of the decode wiring landed here.
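
For readers unfamiliar with the dequantization step mentioned above: asymmetric uint4 weights with per-group scales and zero points are conventionally recovered as `(q - zero) * scale`. The following is a minimal pure-Python sketch of that formula, not the `dequant_to_bf16` code from this PR. The function name, the argument layout, and the tiny group size used in the example are assumptions for illustration; the checkpoint name suggests a group size of 128.

```python
def dequant_asym_uint4(q, scales, zeros, group_size=128):
    """Dequantize one column of asymmetric uint4 weights (illustrative only).

    q:      list of K ints in [0, 15], one per input channel
    scales: list of K // group_size floats, one per group
    zeros:  list of K // group_size ints in [0, 15], one per group
    """
    if len(q) % group_size != 0:
        raise ValueError("K must be a multiple of group_size")
    out = []
    for k, qk in enumerate(q):
        g = k // group_size  # group that channel k belongs to
        out.append((qk - zeros[g]) * scales[g])
    return out


# Tiny example with group_size=2 so the values can be checked by hand.
weights = dequant_asym_uint4(
    q=[0, 15, 8, 9], scales=[0.5, 2.0], zeros=[8, 8], group_size=2
)
print(weights)  # [-4.0, 3.5, 0.0, 2.0]
```

A real implementation would operate on whole tensors and cast the result to bf16; the scalar loop here only shows the arithmetic.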

New files

  • `awq_repacker.py`
    • Unpacks AutoAWQ int32-packed nibbles via `AWQ_PACK_ORDER = [0,2,4,6,1,3,5,7]`, composes with `matvec_int4_packed.pack_inputs` to produce the per-tile packed-BO layout the int4 decode ELFs consume.
    • `dequant_to_bf16` (fp16→bf16 scales, asymmetric uint4) for the CPU prefill path.
    • Built-in synthetic round-trip self-test (≥ 0.9999 correlation vs dense dequant); passes at K/N up to 2048.
  • `cpu_prefill.py`
    • Drop-in replacement with the same signature as `run_npu_prefill`. Harvests per-layer `k_roped`/`v` from `transformer_block` intermediates into the KV cache layout that `run_decode_block` expects. Takes ~165 s for a 40-token prompt: fine for validation, not for production.
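
The nibble reordering described for `awq_repacker.py` can be sketched as follows. This is a hedged illustration, not the repacker itself: it assumes that nibble position `i` of each 32-bit word holds the logical value at index `AWQ_PACK_ORDER[i]`, and the helper names `pack_awq_int32` and `unpack_awq_int32` are hypothetical.

```python
AWQ_PACK_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]


def pack_awq_int32(vals):
    """Pack 8 uint4 values into one 32-bit word.

    Assumed convention: nibble i holds vals[AWQ_PACK_ORDER[i]].
    """
    assert len(vals) == 8 and all(0 <= v <= 15 for v in vals)
    word = 0
    for i, src in enumerate(AWQ_PACK_ORDER):
        word |= vals[src] << (4 * i)
    return word


def unpack_awq_int32(word):
    """Inverse of pack_awq_int32: recover the 8 values in logical order."""
    word &= 0xFFFFFFFF  # treat a signed int32 as its unsigned bit pattern
    vals = [0] * 8
    for i, dst in enumerate(AWQ_PACK_ORDER):
        vals[dst] = (word >> (4 * i)) & 0xF
    return vals


logical = [1, 2, 3, 4, 5, 6, 7, 8]
packed = pack_awq_int32(logical)
print(hex(packed))               # 0x86427531
print(unpack_awq_int32(packed))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The round trip above mirrors the idea behind the synthetic self-test: pack known values, unpack them, and check that nothing moved.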

Modified

  • `llama32_1b_weights.py` — new `load_weights_awq` populates both bf16 dequant (existing fields) and packed-BO attrs (`_wq_packed`/.../`_wgateup_packed`/`_wdown_packed`). gate+up are interleaved at the nibble level so the int4 FFN ELF consumes them in one arg slot.
  • `kernel_builder/backend_presets.py` — `RGR_INT4_BACKEND`, `OGF_INT4_BACKEND` (distinct `instance_name` so cache files don't collide).
  • `kernel_builder/external_kernels.py` — `compile_all_external_kernels(quant=)` builds `mv_int4_bf16.o` when `quant="awq"`.
  • `kernel_builder/cache.py` — `prepare_air_project(quant=)` stages `mv_int4_bf16.o` into `air_project/`. `compile_and_cache` detects int4 ELFs from the kernel name and pipes the right quant through — existing call sites don't change.
  • `llama32_1b_decode.py` — `compile_decode_kernels(..., quant=)` and `run_decode_block(..., quant=)` branch between bf16 and int4 ELFs.
  • `llama32_1b_inference.py` — `--quant {bf16,awq}` flag, `--model-path` AWQ override. AWQ mode skips prefill compile / bf16 transpose / NPU prefill preload, and dispatches `cpu_prefill.run_cpu_prefill` instead.
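
As an illustration of how `compile_and_cache` could infer the quant mode from a kernel name without changing existing call sites, here is a minimal sketch. The helper `infer_quant` and its substring rule are assumptions made for this example; the actual detection logic in `kernel_builder/cache.py` may differ.

```python
def infer_quant(kernel_name: str) -> str:
    """Return "awq" for int4 kernels and "bf16" otherwise (hypothetical rule)."""
    return "awq" if "int4" in kernel_name else "bf16"


# Kernel names taken from this PR's description.
for name in ("rms_qkv_int4_rope", "o_gemv_ffn_int4", "lm_head_gemv"):
    print(name, "->", infer_quant(name))
```

Deriving the mode from the name keeps the `quant=` argument optional for callers that only ever build bf16 kernels.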

Test plan

  • AWQ repacker self-test (synthetic round-trip) passes at K=2048, N∈{512, 2048}
  • `--quant=awq --compile-only` builds all 3 int4 decode ELFs cleanly
  • `--quant=awq --run-only` against the AMD AWQ checkpoint produces coherent text and stable ~12.4 tok/s decode
  • `--quant=bf16` path unchanged (no regression in the bf16 NPU prefill + NPU decode pipeline)
  • Token-level match vs HF AutoAWQ greedy reference (requires `pip install autoawq` — follow-up)
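
For the token-level match listed as a follow-up, one simple way to compare the NPU output against a greedy reference is to report the first position where the two token sequences diverge. The sketch below is an assumption about how such a check could look; `first_mismatch` is a hypothetical helper and the token ids are made up.

```python
def first_mismatch(npu_ids, ref_ids):
    """Return the index of the first differing token, or None if the
    two sequences are identical (same length, same tokens)."""
    for i, (a, b) in enumerate(zip(npu_ids, ref_ids)):
        if a != b:
            return i
    if len(npu_ids) != len(ref_ids):
        # One sequence is a strict prefix of the other.
        return min(len(npu_ids), len(ref_ids))
    return None


# Made-up token ids, for illustration only.
print(first_mismatch([11, 22, 33, 44], [11, 22, 33, 44]))  # None
print(first_mismatch([11, 22, 99, 44], [11, 22, 33, 44]))  # 2
print(first_mismatch([11, 22], [11, 22, 33]))              # 2
```

Reporting the divergence index rather than a plain pass/fail makes it easier to tell an early numerical drift from a late one.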

Out of scope (deferred)

  • Int4 prefill ELFs and the int4 GEMM kernel they would need — a separate, substantially larger project. CPU prefill is the temporary scaffolding until that lands.
  • Other quantization formats (GPTQ, EXL2). AWQ-only.

🤖 Generated with Claude Code

Wires the int4-AWQ decode ELFs from PR Xilinx#1633 (rms_qkv_int4_rope) and
PR Xilinx#1637 (o_gemv_ffn_int4) into the full inference pipeline so the
existing chat_repl / llama32_1b_inference can drive them against a real
HuggingFace AutoAWQ-quantized Llama-3.2-1B checkpoint.

  python3 llama32_1b_inference.py \
      --quant awq --run-only --n-tokens 30 \
      --model-path amd/Llama-3.2-1B-Instruct-awq-uint4-asym-g128-bf16-lmhead \
      --prompt "Once upon a time"
  # -> "Once upon a time, in a small village nestled in the rolling hills
  #     of a far-off land, there lived a young girl named Sophia. Sophia was a"
  # 12.4 tok/s decode (~81 ms/tok), coherent continuation.

Prefill stays on CPU as a placeholder for this PR (no int4 GEMM kernel /
prefill ELFs yet — that's a separate project). The placeholder runs the
existing llama32_1b_reference.transformer_block over dequantized-to-bf16
AWQ weights to populate the KV cache, then hands off to the NPU int4
decode loop. Replacing it with int4 NPU prefill later doesn't touch any
of the decode wiring landed here.

New files:
  awq_repacker.py
    - Unpacks AutoAWQ int32-packed nibbles via AWQ_PACK_ORDER
      [0,2,4,6,1,3,5,7], composes with matvec_int4_packed.pack_inputs to
      produce the per-tile packed-BO layout the int4 decode ELFs consume.
    - dequant_to_bf16 (fp16->bf16 scales, asymmetric uint4) for CPU prefill.
    - Built-in synthetic round-trip self-test (>= 0.9999 correlation vs
      dense dequant); passes at K/N up to 2048.
  cpu_prefill.py
    - Drop-in replacement for run_npu_prefill signature, harvests
      per-layer k_roped/v from transformer_block intermediates into the
      KV cache layout expected by run_decode_block. ~165 s for a
      40-token prompt; fine for validation, not for production.

Modified:
  llama32_1b_weights.py
    - load_weights_awq(model_id, config): loads HF AutoAWQ tensors,
      attaches both bf16 dequant (existing LayerWeights fields, for CPU
      prefill) and per-tile packed BOs (_wq_packed / .../ _wgateup_packed /
      _wdown_packed, for NPU decode). Gate+up are interleaved at the
      nibble level so the int4 FFN ELF consumes them in one arg slot.
  kernel_builder/backend_presets.py
    - RGR_INT4_BACKEND, OGF_INT4_BACKEND (same shape as the bf16 ones;
      distinct instance_name so kernel-cache files don't collide).
  kernel_builder/external_kernels.py
    - compile_all_external_kernels(quant=) builds mv_int4_bf16.o when
      quant="awq" (object already had compile_mv_int4_bf16 from Xilinx#1633).
  kernel_builder/cache.py
    - prepare_air_project(quant=) stages mv_int4_bf16.o into air_project/.
      compile_and_cache detects int4 ELFs from the name and pipes the
      right quant through, so existing call sites don't need to change.
  llama32_1b_decode.py
    - compile_decode_kernels(cache, config, quant=) builds either the
      bf16 ELFs or the int4 ELFs.
    - run_decode_block(..., quant=) reads either bf16 transposed weights
      or packed-i8 BOs from the same arg slots.
  llama32_1b_inference.py
    - --quant {bf16,awq} flag, --model-path AWQ checkpoint override.
    - awq mode: no prefill compile, no bf16 transpose, no bf16 prefill
      preload; CPU prefill replaces run_npu_prefill.
    - --quant=awq is incompatible with --synthetic-weights.
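
The flag handling described above could be wired up roughly as follows. This is a sketch under assumptions, not the code in `llama32_1b_inference.py`: the `bf16` default, the `parse_args` helper, and the error wording are guesses; only the flag names and the incompatibility come from this commit message.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Sketch of the quant-related flags")
    # Default of "bf16" is an assumption, chosen so existing invocations keep working.
    parser.add_argument("--quant", choices=["bf16", "awq"], default="bf16")
    parser.add_argument("--model-path", default=None)
    parser.add_argument("--synthetic-weights", action="store_true")
    args = parser.parse_args(argv)
    if args.quant == "awq" and args.synthetic_weights:
        # parser.error prints a usage message and exits with status 2.
        parser.error("--quant=awq is incompatible with --synthetic-weights")
    return args


args = parse_args(["--quant", "awq", "--model-path", "some/awq-checkpoint"])
print(args.quant, args.model_path)  # awq some/awq-checkpoint
```

Rejecting the invalid combination at parse time fails fast, before any weights are loaded or kernels compiled.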

Verification on NPU2 with amd/Llama-3.2-1B-Instruct-awq-uint4-asym-g128-bf16-lmhead:
- compile-only: rms_qkv_int4_rope (3 s) + o_gemv_ffn_int4 (35 s) + lm_head_gemv (10 s)
- 30-token greedy generation produces coherent English, 12.4 tok/s decode
- Decode latency (~81 ms/tok) tracks the standalone-ELF PR Xilinx#1637 result
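
The throughput and latency figures quoted above are consistent with each other, and they also put the CPU prefill cost in perspective. A quick check of the arithmetic, using only numbers stated in this commit message:

```python
tok_per_s = 12.4
ms_per_tok = 1000.0 / tok_per_s
print(round(ms_per_tok, 1))  # 80.6, quoted as ~81 ms/tok

prefill_s, prompt_tokens = 165.0, 40
print(round(prefill_s / prompt_tokens, 1))  # 4.1 seconds per prompt token on CPU

# Number of decode steps that fit in the time one 40-token CPU prefill takes.
print(round(prefill_s * tok_per_s))  # 2046
```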

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx requested a review from jgmelber as a code owner May 30, 2026 18:27
Copilot AI review requested due to automatic review settings May 30, 2026 18:27

Copilot AI left a comment


Pull request overview

Wires the int4-AWQ decode ELFs (rms_qkv_int4_rope, o_gemv_ffn_int4) into the existing Llama-3.2-1B inference pipeline behind a new --quant awq flag. AWQ mode loads HF AutoAWQ checkpoints (both as bf16 dequant for a CPU prefill placeholder and as per-tile packed-BO weights for the int4 decode ELFs) while leaving the bf16 NPU prefill + NPU decode path intact for --quant bf16.

Changes:

  • New awq_repacker.py / cpu_prefill.py / load_weights_awq to bridge HF AutoAWQ → mlir-air packed-BO layout and to run prefill on CPU as a temporary scaffold.
  • Plumb quant= through prepare/compile/runtime (backend_presets, external_kernels, cache, llama32_1b_decode, llama32_1b_inference) to dispatch int4 vs bf16 ELFs and weights.
  • Skip prefill compile/preload/transpose and EOS-padding in AWQ mode; dispatch run_cpu_prefill instead of run_npu_prefill.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `programming_examples/llama32_1b/awq_repacker.py` | New HF AutoAWQ → packed-BO repacker with synthetic self-test |
| `programming_examples/llama32_1b/cpu_prefill.py` | CPU-only prefill placeholder mirroring `run_npu_prefill` contract |
| `programming_examples/llama32_1b/llama32_1b_weights.py` | New `load_weights_awq` populating bf16 dequant + per-tile packed-BO attrs |
| `programming_examples/llama32_1b/llama32_1b_decode.py` | `compile_decode_kernels` / `run_decode_block` branch by quant |
| `programming_examples/llama32_1b/llama32_1b_inference.py` | `--quant` / `--model-path` flags; AWQ skips prefill compile/preload/pad |
| `programming_examples/llama32_1b/kernel_builder/backend_presets.py` | Adds `RGR_INT4_BACKEND` / `OGF_INT4_BACKEND` presets |
| `programming_examples/llama32_1b/kernel_builder/external_kernels.py` | Replaces `compile_mv_bf16` with `compile_mv_k8192`; adds AWQ branch |
| `programming_examples/llama32_1b/kernel_builder/cache.py` | `prepare_air_project(quant=)` stages `mv_int4_bf16.o` for int4 ELFs |


Comment on lines 220 to +223

    compile_mv()
    compile_mv_bf16()
    compile_mv_k8192()
    if quant == "awq":
        compile_mv_int4_bf16()