[programming_examples] llama32_1b: verify subsystem + docs by tonyjie · Pull Request #1636 · Xilinx/mlir-air

tonyjie · 2026-05-29T21:51:29Z

Re-targeted from #1635 onto llama-verify so a maintainer can workflow_dispatch buildAndTestRyzenAI.yml against this branch and exercise HF_TOKEN-gated tests end-to-end before merging to main.

Summary

Builds on the existing programming_examples/llama32_1b/ example (PR #1590 + #1610) with two additions plus CI wiring to run the new verification test:

Verify subsystem (verify/) — end-to-end HF parity gate that runs the production NPU prefill+decode path and compares per-position logits against the HuggingFace bf16 reference using top-k token inclusion (vLLM-style check_logprobs_close method).
- make verify — fast CI gate, 2 prompts × 32 greedy tokens, k=5, ~2 min.
- make verify-full — exhaustive local sweep over the full prompt file (currently 8 prompts), ~6 min.
- make diagnosis — per-layer ffn_out cosine + max_abs vs HF bf16 for one prompt (informational microscope).
- Includes production-loop refinements (llama32_1b_{inference,prefill,decode,cpu_helpers}.py, kernel_builder/cache.py, kernel_builder/external_kernels.py) needed to surface intermediates cleanly, plus a profiling redesign (end-to-end dataflow timing, TTFT reporting, per-token trend).
Standalone docs (docs/detail/) — IMPLEMENTATION_GUIDE.html, VERIFICATION.html, PROFILE.html walkthroughs (HTML kept under docs/detail/ to stay separate from the markdown reference docs). Plus updates to docs/{usage,profile,explain}.md describing the new verify/profile workflows.
CI wiring — exposes the existing HF_TOKEN repo secret to the check-programming-examples-peano step in .github/workflows/buildAndTestRyzenAI.yml so the new run_npu2_verify.lit (which needs gated meta-llama weights) can authenticate.
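The top-k token inclusion check described above can be sketched in a few lines. This is a minimal, one-directional illustration, not the comparator shipped in verify/. The function names, the toy logits, and the choice to check only the candidate's greedy token against the reference top-k are all assumptions, and vLLM's check_logprobs_close may differ in detail.

```python
from __future__ import annotations

from typing import Sequence


def top_k_ids(logits: Sequence[float], k: int) -> set[int]:
    """Return the indices of the k largest logits (ties go to the lower index)."""
    order = sorted(range(len(logits)), key=lambda i: (-logits[i], i))
    return set(order[:k])


def topk_inclusion_failures(
    candidate_tokens: Sequence[int],
    reference_logits: Sequence[Sequence[float]],
    k: int = 5,
) -> list[int]:
    """Return positions where the candidate token is outside the reference top-k."""
    failures = []
    for pos, (token, logits) in enumerate(zip(candidate_tokens, reference_logits)):
        if token not in top_k_ids(logits, k):
            failures.append(pos)
    return failures


# Toy data (hypothetical): vocabulary of 6 tokens, 3 positions, k=2.
reference = [
    [0.1, 2.0, 1.9, -1.0, 0.0, 0.3],  # reference top-2 = {1, 2}
    [3.0, 0.2, 0.1, 2.5, -0.5, 0.0],  # reference top-2 = {0, 3}
    [0.0, 0.1, 0.2, 0.3, 0.4, 5.0],   # reference top-2 = {4, 5}
]
npu_tokens = [2, 3, 0]  # the last token falls outside the reference top-2

failures = topk_inclusion_failures(npu_tokens, reference, k=2)
print(failures)  # [2]
```

A check like this tolerates near-tie reorderings caused by bf16 rounding while still catching a token that the reference considers unlikely.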

Three commits, one per logical area, all confined to programming_examples/llama32_1b/ (+ the hf_token LIT feature in programming_examples/lit.cfg.py and the CI line in .github/workflows/buildAndTestRyzenAI.yml). Also removes docs/issues.md (added by #1590) — its content is either resolved (BF16 RoPE bug now fixed) or covered by the new HTML walkthroughs.

Scope: 33 files, +6522 / −1132.

How to validate before merge

Because PRs from forks can't access repo secrets (GitHub policy), automated CI on this PR skips run_npu2_verify.lit with UNSUPPORTED. To exercise it for real:

1. Merge this PR into Xilinx:llama-verify (or otherwise land its commits there).
2. Trigger workflow_dispatch on Build and Test with AIE tools on Ryzen AI against the llama-verify branch directly (the branch, not the PR). That run has full secret access.
3. Confirm in the lit output:
   - HF_TOKEN found in environment; hf_token feature enabled.
   - PASS: AIR_TEST :: programming_examples/llama32_1b/run_npu2_verify.lit
4. Once green on llama-verify, fast-forward into main.
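The confirmation step could be automated with a small script that scans a saved lit log for the two markers quoted above. This is only a sketch: the helper name and the sample log strings are hypothetical stand-ins, not captured CI output.

```python
from __future__ import annotations

# Marker strings copied from the validation steps in this PR description.
EXPECTED_MARKERS = [
    "HF_TOKEN found in environment; hf_token feature enabled",
    "PASS: AIR_TEST :: programming_examples/llama32_1b/run_npu2_verify.lit",
]


def missing_markers(lit_output: str) -> list[str]:
    """Return the expected markers that do not appear anywhere in the log text."""
    return [marker for marker in EXPECTED_MARKERS if marker not in lit_output]


# Hand-written stand-ins for a log where the gated test ran and one where it was skipped.
ran_log = (
    "HF_TOKEN found in environment; hf_token feature enabled.\n"
    "PASS: AIR_TEST :: programming_examples/llama32_1b/run_npu2_verify.lit\n"
)
skipped_log = (
    "UNSUPPORTED: AIR_TEST :: programming_examples/llama32_1b/run_npu2_verify.lit\n"
)

print(missing_markers(ran_log))           # []
print(len(missing_markers(skipped_log)))  # 2
```

An empty list means both markers were found. A non-empty list on the llama-verify run would suggest the secret did not reach the lit step.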

hf_token LIT feature

run_npu2_verify.lit requires gated meta-llama/Llama-3.2-1B weights, so it carries REQUIRES: ryzen_ai_npu2, peano, hf_token. The feature is registered in programming_examples/lit.cfg.py (~lines 128–135) and only available when HF_TOKEN is set in the environment. Tests skip cleanly with UNSUPPORTED when it's unset.
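For readers unfamiliar with lit features, the gating described here can be sketched roughly as follows. This is an illustration, not the code in programming_examples/lit.cfg.py: the helper name and the stub config object are hypothetical, and forwarding the token into the test environment is an assumption about what the real config does. In an actual lit.cfg.py, lit injects the `config` object, and `available_features` and `environment` are its real attributes.

```python
import os
from types import SimpleNamespace


def register_hf_token_feature(config, environ=os.environ) -> bool:
    """Enable the hf_token feature only when HF_TOKEN is set and non-empty."""
    token = environ.get("HF_TOKEN")
    if not token:
        # Feature stays off, so tests with `REQUIRES: hf_token` report UNSUPPORTED.
        return False
    config.available_features.add("hf_token")
    # Assumption: the token is forwarded so test subprocesses can authenticate.
    config.environment["HF_TOKEN"] = token
    return True


def make_stub_config():
    """Hypothetical stand-in for the config object that lit normally provides."""
    return SimpleNamespace(available_features=set(), environment={})


with_token = make_stub_config()
register_hf_token_feature(with_token, {"HF_TOKEN": "hf_dummy_value"})
print(sorted(with_token.available_features))  # ['hf_token']

without_token = make_stub_config()
register_hf_token_feature(without_token, {})
print(sorted(without_token.available_features))  # []
```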

Rendered docs preview

The three HTML walkthroughs render in-browser via raw.githack.com (a free GitHub-content proxy):

IMPLEMENTATION_GUIDE.html — model architecture, per-kernel building blocks, NPU mapping
VERIFICATION.html — HF parity gate methodology and results
PROFILE.html — end-to-end dataflow and per-step timing

If you'd prefer these published to the official Pages site (xilinx.github.io/mlir-air/), happy to follow up with a small change to .github/workflows/generateDocs.yml to copy them into the publish directory — kept out of this PR so it can be reviewed separately.

Test plan

3 representative XRT tests pass under flock /tmp/mlir-air-npu.lock on NPU2 (02_mul_shim_1x1, 06_add_shim_bf16, 12_matmul_transform_1x4_bf16)
make compile succeeds end-to-end on programming_examples/llama32_1b/ against latest upstream/main toolchain
make verify PASS 2/2 prompts on NPU2 against real meta-llama/Llama-3.2-1B-Instruct weights (top-k inclusion, k=5)
make run (N_TOKENS=5): TTFT 1.41 s, 10.6 tok/s, sensible output
make profile (N_TOKENS=3): all 5 redesigned profiling sections appear (Wall-Time Attribution, Per-Layer Execution, NPU XRT Call Breakdown, CPU Op Breakdown, Fine-Grained NPU Breakdown), plus END-TO-END DATAFLOW
make diagnosis: per-layer probe across 16 layers, report written
CI side: pending maintainer workflow_dispatch against llama-verify to validate secret wiring end-to-end
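The two metrics that make diagnosis reports per layer, cosine similarity and maximum absolute difference, can be illustrated with a short pure-Python sketch. The function names and the toy activation vectors are hypothetical. The real probe operates on ffn_out tensors from the NPU and the HF bf16 reference and is likely implemented with array libraries instead.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


# Toy per-layer outputs (hypothetical values): layer index -> (candidate, reference).
layers = {
    0: ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),
    1: ([1.0, 2.0, 3.0], [1.0, 2.0, 3.5]),
}

report = {
    idx: (cosine_similarity(cand, ref), max_abs_diff(cand, ref))
    for idx, (cand, ref) in layers.items()
}
for idx, (cos, mad) in report.items():
    print(f"layer {idx}: cosine={cos:.4f} max_abs={mad:.4f}")
```

Cosine similarity tracks drift in direction, while max_abs flags individual outlier elements. Reporting both per layer helps localize where a divergence first appears.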

Supersedes #1635.

…-loop refinements

Adds an end-to-end verification subsystem (verify/) for the LLAMA-3.2-1B example: an HF parity gate that runs the production NPU prefill+decode path and compares per-position logits against the HuggingFace bf16 reference using top-k token inclusion.

Includes:
- verify/ package: runners (HF, NPU, CPU), comparators, report, prompts
- Production loop refinements in llama32_1b_{inference,prefill,decode}.py to expose intermediates needed by the verify path
- kernel_builder/ cache + external-kernel handling updates
- Profiling redesign: end-to-end dataflow timing, TTFT reporting, tokenize/pad accounting, per-token trend
- Two Make targets: `make verify` (2 prompts × 32 tokens, ~2 min, the fast CI gate) and `make verify-full` (all prompts in the file, currently 8, for exhaustive local validation)
- run_npu2_verify.lit: REQUIRES hf_token (gated meta-llama download)
- lit.cfg.py: hf_token feature, available only when the HF_TOKEN env var is set, so REQUIRES: hf_token tests skip cleanly on machines without it
- cpu_helpers.py: extracted helpers (renamed from the older reference.py)
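To make the TTFT and per-token figures concrete, here is a small sketch of how such numbers could be derived from per-token timestamps. The PR does not state how its tok/s figure is defined. This sketch assumes decode throughput excludes the first token, and the function name and sample timestamps are hypothetical (in practice they might come from `time.perf_counter()`).

```python
from __future__ import annotations


def summarize_latency(start: float, token_times: list[float]) -> dict[str, float]:
    """Compute time-to-first-token and decode throughput from timestamps in seconds."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - start
    # Gaps between consecutive tokens cover the decode phase only.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    decode_tok_per_s = len(gaps) / sum(gaps) if gaps else float("nan")
    return {"ttft_s": ttft, "decode_tok_per_s": decode_tok_per_s}


# Hypothetical timestamps: request starts at t=0.0, five tokens are emitted.
stats = summarize_latency(0.0, [1.5, 1.625, 1.75, 1.875, 2.0])
print(stats["ttft_s"])            # 1.5
print(stats["decode_tok_per_s"])  # 8.0
```

Keeping TTFT separate from decode throughput avoids letting the prefill cost skew the per-token rate, which matters when only a handful of tokens are generated.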

…nd profile docs

Adds standalone HTML walkthroughs under docs/detail/:
- IMPLEMENTATION_GUIDE.html: model architecture, per-kernel building blocks, NPU mapping decisions, SVG diagrams
- VERIFICATION.html: HF parity gate methodology, threshold tables, per-layer diagnosis flow
- PROFILE.html: end-to-end dataflow + per-step timing visualization

The HTML files live in docs/detail/ to keep them separate from the original markdown reference docs. Markdown updates (usage.md, profile.md, explain.md) describe the new verify/profile workflows and cross-link to the HTML walkthroughs in detail/.

Also removes docs/issues.md (added by PR Xilinx#1590), whose content is either resolved (BF16 RoPE bug now fixed) or covered by the new HTML walkthroughs.

Wires the existing HF_TOKEN repository secret into the check-programming-examples-peano invocation in .github/workflows/buildAndTestRyzenAI.yml, so lit tests with REQUIRES: hf_token (currently programming_examples/llama32_1b/run_npu2_verify.lit) can authenticate against Hugging Face Hub for gated model downloads. Tests without REQUIRES: hf_token are unaffected — they continue to run as before. When the secret is unset (e.g. on fork-originating PR builds, where GitHub doesn't expose secrets by policy), the lit feature stays disabled and gated tests skip cleanly with UNSUPPORTED.

tonyjie requested review from eddierichter-amd, erwei-xilinx, fifield and jgmelber as code owners May 29, 2026 21:51

tonyjie mentioned this pull request May 29, 2026

[programming_examples] llama32_1b: verify subsystem, ablation studies, and docs #1635

Closed

5 tasks

tonyjie added 3 commits May 29, 2026 23:14

tonyjie force-pushed the llama32_1b-verify-pr branch from 8db6734 to 183421a Compare May 30, 2026 03:22

tonyjie changed the title ~~[programming_examples] llama32_1b: verify subsystem, ablation studies, and docs~~ [programming_examples] llama32_1b: verify subsystem + docs May 30, 2026


tonyjie wants to merge 3 commits into Xilinx:llama-verify from tonyjie:llama32_1b-verify-pr
