[programming_examples] llama32_1b: verify subsystem + docs #1636

Open
tonyjie wants to merge 3 commits into Xilinx:llama-verify from tonyjie:llama32_1b-verify-pr

tonyjie commented May 29, 2026

Re-targeted from #1635 onto llama-verify so a maintainer can workflow_dispatch buildAndTestRyzenAI.yml against this branch and exercise HF_TOKEN-gated tests end-to-end before merging to main.

Summary

Builds on the existing programming_examples/llama32_1b/ example (PR #1590 + #1610) with two additions plus CI wiring to run the new verification test:

  • Verify subsystem (verify/) — end-to-end HF parity gate that runs the production NPU prefill+decode path and compares per-position logits against the HuggingFace bf16 reference using top-k token inclusion (vLLM-style check_logprobs_close method).
    • make verify — fast CI gate, 2 prompts × 32 greedy tokens, k=5, ~2 min.
    • make verify-full — exhaustive local sweep over the full prompt file (currently 8 prompts), ~6 min.
    • make diagnosis — per-layer ffn_out cosine + max_abs vs HF bf16 for one prompt (informational microscope).
    • Includes production-loop refinements (llama32_1b_{inference,prefill,decode,cpu_helpers}.py, kernel_builder/cache.py, kernel_builder/external_kernels.py) needed to surface intermediates cleanly, plus a profiling redesign (end-to-end dataflow timing, TTFT reporting, per-token trend).
  • Standalone docs (docs/detail/) — IMPLEMENTATION_GUIDE.html, VERIFICATION.html, PROFILE.html walkthroughs (HTML kept under docs/detail/ to stay separate from the markdown reference docs). Plus updates to docs/{usage,profile,explain}.md describing the new verify/profile workflows.
  • CI wiring — exposes the existing HF_TOKEN repo secret to the check-programming-examples-peano step in .github/workflows/buildAndTestRyzenAI.yml so the new run_npu2_verify.lit (which needs gated meta-llama weights) can authenticate.
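The top-k token inclusion check mentioned for the verify subsystem can be sketched roughly as follows. This is a minimal illustration in the spirit of vLLM's check_logprobs_close, not the code in verify/: the function names (`top_k_ids`, `tokens_close`), the plain-list logits layout, and the choice to stop at the first divergence are all assumptions made for the example.

```python
from typing import List, Optional, Sequence, Set, Tuple


def top_k_ids(logits: Sequence[float], k: int) -> Set[int]:
    """Return the token ids of the k largest logits at one position."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(order[:k])


def tokens_close(
    npu_tokens: Sequence[int],
    npu_logits: Sequence[Sequence[float]],
    ref_tokens: Sequence[int],
    ref_logits: Sequence[Sequence[float]],
    k: int = 5,
) -> Tuple[bool, Optional[int]]:
    """Compare two greedy generations position by position.

    Identical tokens pass trivially. At the first position where the
    tokens differ, each side's token must appear in the other side's
    top-k. Comparison stops there (an assumption of this sketch),
    because the two sequences no longer share a prefix afterwards.
    Returns (passed, first_divergent_position_or_None).
    """
    for pos, (npu_tok, ref_tok) in enumerate(zip(npu_tokens, ref_tokens)):
        if npu_tok == ref_tok:
            continue
        passed = (
            npu_tok in top_k_ids(ref_logits[pos], k)
            and ref_tok in top_k_ids(npu_logits[pos], k)
        )
        return passed, pos
    return True, None
```

The appeal of this style of check is that it tolerates small bf16 rounding differences that flip near-tied tokens, while still failing when the NPU path picks a token the reference considers unlikely.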

Three commits, one per logical area, all confined to programming_examples/llama32_1b/ (+ the hf_token LIT feature in programming_examples/lit.cfg.py and the CI line in .github/workflows/buildAndTestRyzenAI.yml). Also removes docs/issues.md (added by #1590) — its content is either resolved (BF16 RoPE bug now fixed) or covered by the new HTML walkthroughs.

Scope: 33 files, +6522 / −1132.

How to validate before merge

Because PRs from forks can't access repo secrets (GitHub policy), automated CI on this PR skips run_npu2_verify.lit with UNSUPPORTED. To exercise it for real:

  1. Merge this PR into Xilinx:llama-verify (or otherwise land its commits there).
  2. Trigger workflow_dispatch on the "Build and Test with AIE tools on Ryzen AI" workflow against the llama-verify branch directly (the branch, not the PR). That run has full secret access.
  3. Confirm in the lit output:
    • HF_TOKEN found in environment; hf_token feature enabled.
    • PASS: AIR_TEST :: programming_examples/llama32_1b/run_npu2_verify.lit
  4. Once green on llama-verify, fast-forward into main.

hf_token LIT feature

run_npu2_verify.lit requires gated meta-llama/Llama-3.2-1B weights, so it carries REQUIRES: ryzen_ai_npu2, peano, hf_token. The feature is registered in programming_examples/lit.cfg.py (~lines 128–135) and is only available when HF_TOKEN is set in the environment. Tests skip cleanly with UNSUPPORTED when it's unset.
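For readers unfamiliar with environment-gated LIT features, the registration logic described above might look roughly like the sketch below. The actual lines in lit.cfg.py are not reproduced in this description, so the helper name `register_hf_token_feature` and its signature are hypothetical; in a real lit config the first argument would be `config.available_features`.

```python
import os
from typing import Mapping, MutableSet


def register_hf_token_feature(
    available_features: MutableSet[str],
    environ: Mapping[str, str] = os.environ,
) -> bool:
    """Enable the hf_token feature only when HF_TOKEN is set and non-empty.

    Hypothetical helper for illustration. Tests that carry
    `REQUIRES: hf_token` are reported as UNSUPPORTED by lit whenever the
    feature is absent from the available feature set.
    """
    if environ.get("HF_TOKEN"):
        available_features.add("hf_token")
        print("HF_TOKEN found in environment; hf_token feature enabled.")
        return True
    return False
```

Treating an empty HF_TOKEN as unset is a choice made in this sketch; it keeps fork builds, where the secret can expand to an empty string, on the UNSUPPORTED path rather than failing at download time.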

Rendered docs preview

The three HTML walkthroughs render in-browser via raw.githack.com (a free GitHub-content proxy):

If you'd prefer these published to the official Pages site (xilinx.github.io/mlir-air/), happy to follow up with a small change to .github/workflows/generateDocs.yml to copy them into the publish directory — kept out of this PR so it can be reviewed separately.

Test plan

  • 3 representative XRT tests pass under flock /tmp/mlir-air-npu.lock on NPU2 (02_mul_shim_1x1, 06_add_shim_bf16, 12_matmul_transform_1x4_bf16)
  • make compile succeeds end-to-end on programming_examples/llama32_1b/ against latest upstream/main toolchain
  • make verify PASS 2/2 prompts on NPU2 against real meta-llama/Llama-3.2-1B-Instruct weights (top-k inclusion, k=5)
  • make run (N_TOKENS=5): TTFT 1.41 s, 10.6 tok/s, sensible output
  • make profile (N_TOKENS=3): all 5 redesigned profiling sections (Wall-Time Attribution, Per-Layer Execution, NPU XRT Call Breakdown, CPU Op Breakdown, Fine-Grained NPU Breakdown) + END-TO-END DATAFLOW
  • make diagnosis: per-layer probe across 16 layers, report written
  • CI side: pending maintainer workflow_dispatch against llama-verify to validate secret wiring end-to-end
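The per-layer metrics reported by make diagnosis (cosine and max_abs against the HF bf16 reference) can be illustrated with the sketch below. It assumes max_abs means the largest element-wise absolute difference and that each layer's ffn_out has been flattened to a list of floats; the function names are hypothetical and the real probe in verify/ may differ.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two flattened activation vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Two all-zero vectors count as identical; one zero vector does not.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference (assumed meaning of max_abs)."""
    return max(abs(x - y) for x, y in zip(a, b))


def per_layer_report(
    npu_layers: Sequence[Sequence[float]],
    ref_layers: Sequence[Sequence[float]],
) -> List[Tuple[int, float, float]]:
    """Return (layer_index, cosine, max_abs) for each pair of layer outputs."""
    return [
        (i, cosine_similarity(n, r), max_abs_diff(n, r))
        for i, (n, r) in enumerate(zip(npu_layers, ref_layers))
    ]
```

Cosine tracks directional drift and is insensitive to scale, while max_abs catches isolated outliers, so reading the two together per layer helps locate where a divergence first appears.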

Supersedes #1635.

tonyjie added 3 commits May 29, 2026 23:14
…-loop refinements

Adds an end-to-end verification subsystem (verify/) for the LLAMA-3.2-1B
example: an HF parity gate that runs the production NPU prefill+decode
path and compares per-position logits against HuggingFace bf16 reference
using top-k token inclusion.

Includes:
- verify/ package: runners (HF, NPU, CPU), comparators, report, prompts
- Production loop refinements in llama32_1b_{inference,prefill,decode}.py
  to expose intermediates needed by the verify path
- kernel_builder/ cache + external-kernel handling updates
- Profiling redesign: end-to-end dataflow timing, TTFT reporting,
  tokenize/pad accounting, per-token trend
- Two Make targets: `make verify` (2 prompts × 32 tokens, ~2 min, the
  fast CI gate) and `make verify-full` (all prompts in the file,
  currently 8, for exhaustive local validation)
- run_npu2_verify.lit: REQUIRES hf_token (gated meta-llama download)
- lit.cfg.py: hf_token feature, available only when HF_TOKEN env var
  set, so REQUIRES: hf_token tests skip cleanly on machines without it
- cpu_helpers.py: extracted helpers (renamed from older reference.py)

…nd profile docs

Adds standalone HTML walkthroughs under docs/detail/:
- IMPLEMENTATION_GUIDE.html: model architecture, per-kernel building
  blocks, NPU mapping decisions, SVG diagrams
- VERIFICATION.html: HF parity gate methodology, threshold tables,
  per-layer diagnosis flow
- PROFILE.html: end-to-end dataflow + per-step timing visualization

HTMLs live in docs/detail/ to keep them separate from the original
markdown reference docs. Markdown updates (usage.md, profile.md,
explain.md) describe the new verify/profile workflows and cross-link
to the HTML walkthroughs in detail/.

Also removes docs/issues.md (added by PR Xilinx#1590), whose content is
either resolved (BF16 RoPE bug now fixed) or covered by the new
HTML walkthroughs.

Wires the existing HF_TOKEN repository secret into the
check-programming-examples-peano invocation in
.github/workflows/buildAndTestRyzenAI.yml, so lit tests with
REQUIRES: hf_token (currently
programming_examples/llama32_1b/run_npu2_verify.lit) can authenticate
against Hugging Face Hub for gated model downloads.

Tests without REQUIRES: hf_token are unaffected — they continue to
run as before. When the secret is unset (e.g. on fork-originating PR
builds, where GitHub doesn't expose secrets by policy), the lit
feature stays disabled and gated tests skip cleanly with UNSUPPORTED.
@tonyjie tonyjie force-pushed the llama32_1b-verify-pr branch from 8db6734 to 183421a Compare May 30, 2026 03:22
@tonyjie tonyjie changed the title [programming_examples] llama32_1b: verify subsystem, ablation studies, and docs [programming_examples] llama32_1b: verify subsystem + docs May 30, 2026