[programming_examples] llama32_1b: verify subsystem + docs #1636
Open
tonyjie wants to merge 3 commits into
…-loop refinements
Adds an end-to-end verification subsystem (verify/) for the LLAMA-3.2-1B
example: an HF parity gate that runs the production NPU prefill+decode
path and compares per-position logits against HuggingFace bf16 reference
using top-k token inclusion.
Includes:
- verify/ package: runners (HF, NPU, CPU), comparators, report, prompts
- Production loop refinements in llama32_1b_{inference,prefill,decode}.py
to expose intermediates needed by the verify path
- kernel_builder/ cache + external-kernel handling updates
- Profiling redesign: end-to-end dataflow timing, TTFT reporting,
tokenize/pad accounting, per-token trend
- Two Make targets: `make verify` (2 prompts × 32 tokens, ~2 min, the
fast CI gate) and `make verify-full` (all prompts in the file,
currently 8, for exhaustive local validation)
- run_npu2_verify.lit: REQUIRES hf_token (gated meta-llama download)
- lit.cfg.py: hf_token feature, available only when HF_TOKEN env var
set, so REQUIRES: hf_token tests skip cleanly on machines without it
- cpu_helpers.py: extracted helpers (renamed from older reference.py)
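
The top-k token inclusion comparison mentioned above can be sketched as follows. This is a minimal illustration, not the code in verify/: the names `top_k_ids` and `topk_inclusion_failures`, the plain-list logits layout, and the symmetric form of the check are all assumptions made here, and the real comparators may differ.

```python
from typing import List, Sequence


def top_k_ids(logits: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k largest logits, best first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def topk_inclusion_failures(
    ref_logits: Sequence[Sequence[float]],
    test_logits: Sequence[Sequence[float]],
    k: int = 5,
) -> List[int]:
    """Return the positions that fail a symmetric top-k inclusion check.

    A position passes when each side's argmax token appears in the other
    side's top-k set. An empty list means every position passed.
    """
    if len(ref_logits) != len(test_logits):
        raise ValueError("reference and test must cover the same positions")
    failures = []
    for pos, (ref, test) in enumerate(zip(ref_logits, test_logits)):
        ref_top = top_k_ids(ref, k)
        test_top = top_k_ids(test, k)
        # Hypothetical pass rule: greedy tokens must be mutually "plausible".
        if ref_top[0] not in test_top or test_top[0] not in ref_top:
            failures.append(pos)
    return failures
```

A check of this shape tolerates small bf16 rounding differences that swap two near-tied tokens, while still flagging positions where the two runs disagree about which tokens are likely at all.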
…nd profile docs
Adds standalone HTML walkthroughs under docs/detail/:
- IMPLEMENTATION_GUIDE.html: model architecture, per-kernel building
blocks, NPU mapping decisions, SVG diagrams
- VERIFICATION.html: HF parity gate methodology, threshold tables,
per-layer diagnosis flow
- PROFILE.html: end-to-end dataflow + per-step timing visualization
HTMLs live in docs/detail/ to keep them separate from the original
markdown reference docs.
Markdown updates (usage.md, profile.md, explain.md) describe the new
verify/profile workflows and cross-link to the HTML walkthroughs in
detail/.
Also removes docs/issues.md (added by PR Xilinx#1590), whose content is
either resolved (BF16 RoPE bug now fixed) or covered by the new HTML
walkthroughs.
Wires the existing HF_TOKEN repository secret into the check-programming-examples-peano invocation in .github/workflows/buildAndTestRyzenAI.yml, so lit tests with REQUIRES: hf_token (currently programming_examples/llama32_1b/run_npu2_verify.lit) can authenticate against Hugging Face Hub for gated model downloads. Tests without REQUIRES: hf_token are unaffected — they continue to run as before. When the secret is unset (e.g. on fork-originating PR builds, where GitHub doesn't expose secrets by policy), the lit feature stays disabled and gated tests skip cleanly with UNSUPPORTED.
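
The gating behaviour described above (feature present only when the secret reaches the environment) can be sketched in Python. This is a hypothetical stand-alone illustration, not the contents of programming_examples/lit.cfg.py: `register_hf_token_feature` is a name invented here, a plain `set` stands in for lit's `config.available_features`, and treating an empty value as unset is an assumption.

```python
import os
from typing import Mapping, MutableSet, Optional


def register_hf_token_feature(
    available_features: MutableSet[str],
    environ: Optional[Mapping[str, str]] = None,
) -> bool:
    """Add the "hf_token" feature only when HF_TOKEN is set and non-empty.

    Returns True when the feature was enabled. `available_features` stands
    in for lit's `config.available_features` (assumption for this sketch).
    """
    env = os.environ if environ is None else environ
    if env.get("HF_TOKEN"):
        available_features.add("hf_token")
        return True
    return False


# Usage: a fork-originating build without the secret leaves the set unchanged.
features = {"ryzen_ai_npu2", "peano"}
register_hf_token_feature(features, {})
```

With the feature absent, a test carrying `REQUIRES: hf_token` is reported as UNSUPPORTED rather than failing, which is the skip behaviour the commit message describes.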
Re-targeted from #1635 onto `llama-verify` so a maintainer can `workflow_dispatch` `buildAndTestRyzenAI.yml` against this branch and exercise `HF_TOKEN`-gated tests end-to-end before merging to `main`.

Summary
Builds on the existing `programming_examples/llama32_1b/` example (PR #1590 + #1610) with two additions plus CI wiring to run the new verification test:

- Verification subsystem (`verify/`): end-to-end HF parity gate that runs the production NPU prefill+decode path and compares per-position logits against the HuggingFace bf16 reference using top-k token inclusion (vLLM-style `check_logprobs_close` method).
  - `make verify`: fast CI gate, 2 prompts × 32 greedy tokens, k=5, ~2 min.
  - `make verify-full`: exhaustive local sweep over the full prompt file (currently 8 prompts), ~6 min.
  - `make diagnosis`: per-layer `ffn_out` cosine + max_abs vs HF bf16 for one prompt (informational microscope).
  - Production-loop refinements (`llama32_1b_{inference,prefill,decode,cpu_helpers}.py`, `kernel_builder/cache.py`, `kernel_builder/external_kernels.py`) needed to surface intermediates cleanly, plus a profiling redesign (end-to-end dataflow timing, TTFT reporting, per-token trend).
- Docs (`docs/detail/`): `IMPLEMENTATION_GUIDE.html`, `VERIFICATION.html`, `PROFILE.html` walkthroughs (HTML kept under `docs/detail/` to stay separate from the markdown reference docs). Plus updates to `docs/{usage,profile,explain}.md` describing the new verify/profile workflows.
- CI wiring: passes the `HF_TOKEN` repo secret to the `check-programming-examples-peano` step in `.github/workflows/buildAndTestRyzenAI.yml` so the new `run_npu2_verify.lit` (which needs gated `meta-llama` weights) can authenticate.

Three commits, one per logical area, all confined to `programming_examples/llama32_1b/` (+ the `hf_token` LIT feature in `programming_examples/lit.cfg.py` and the CI line in `.github/workflows/buildAndTestRyzenAI.yml`). Also removes `docs/issues.md` (added by #1590): its content is either resolved (BF16 RoPE bug now fixed) or covered by the new HTML walkthroughs.

Scope: 33 files, +6522 / −1132.
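
The two per-layer metrics named for `make diagnosis` (cosine similarity and max absolute difference) can be sketched as below. This is illustrative only: `cosine_similarity`, `max_abs_diff`, and the flattened-list tensor layout are assumptions made here, not the `verify/` implementation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equally sized flattened tensors."""
    if len(a) != len(b):
        raise ValueError("tensors must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two tensors."""
    if len(a) != len(b):
        raise ValueError("tensors must have the same number of elements")
    return max(abs(x - y) for x, y in zip(a, b))
```

The two numbers are complementary: cosine similarity summarizes overall direction and is insensitive to a single bad element, while the max absolute difference exposes exactly that kind of isolated outlier.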
How to validate before merge
Because PRs from forks can't access repo secrets (GitHub policy), automated CI on this PR skips `run_npu2_verify.lit` with `UNSUPPORTED`. To exercise it for real:

1. Merge this PR into `Xilinx:llama-verify` (or otherwise land its commits there).
2. Trigger `workflow_dispatch` on `Build and Test with AIE tools on Ryzen AI` against the `llama-verify` branch directly (the branch, not the PR). That run has full secret access.
3. Confirm the log shows:
   - `HF_TOKEN found in environment; hf_token feature enabled.`
   - `PASS: AIR_TEST :: programming_examples/llama32_1b/run_npu2_verify.lit`
4. Once green on `llama-verify`, fast-forward into `main`.

hf_token LIT feature
`run_npu2_verify.lit` requires gated `meta-llama/Llama-3.2-1B` weights, so it carries `REQUIRES: ryzen_ai_npu2, peano, hf_token`. The feature is registered in `programming_examples/lit.cfg.py` (~lines 128–135) and is only available when `HF_TOKEN` is set in the environment. Tests skip cleanly with `UNSUPPORTED` when it's unset.

Rendered docs preview
The three HTML walkthroughs render in-browser via raw.githack.com (a free GitHub-content proxy):
If you'd prefer these published to the official Pages site (`xilinx.github.io/mlir-air/`), happy to follow up with a small change to `.github/workflows/generateDocs.yml` to copy them into the publish directory. That change is kept out of this PR so it can be reviewed separately.

Test plan
- `flock /tmp/mlir-air-npu.lock` on NPU2 (`02_mul_shim_1x1`, `06_add_shim_bf16`, `12_matmul_transform_1x4_bf16`)
- `make compile` succeeds end-to-end on `programming_examples/llama32_1b/` against latest upstream/main toolchain
- `make verify` PASS 2/2 prompts on NPU2 against real `meta-llama/Llama-3.2-1B-Instruct` weights (top-k inclusion, k=5)
- `make run` (N_TOKENS=5): TTFT 1.41 s, 10.6 tok/s, sensible output
- `make profile` (N_TOKENS=3): all 5 redesigned profiling sections (Wall-Time Attribution, Per-Layer Execution, NPU XRT Call Breakdown, CPU Op Breakdown, Fine-Grained NPU Breakdown) + END-TO-END DATAFLOW
- `make diagnosis`: per-layer probe across 16 layers, report written
- `workflow_dispatch` against `llama-verify` to validate secret wiring end-to-end

Supersedes #1635.