[programming_examples] llama32_1b: verify subsystem, ablation studies, and docs#1635
tonyjie wants to merge 4 commits into
…-loop refinements
Adds an end-to-end verification subsystem (verify/) for the LLAMA-3.2-1B
example: an HF parity gate that runs the production NPU prefill+decode
path and compares per-position logits against HuggingFace bf16 reference
using top-k token inclusion across 8 prompts × 32 greedy tokens.
Includes:
- verify/ package: runners (HF, NPU, CPU), comparators, report, prompts
- Production loop refinements in llama32_1b_{inference,prefill,decode}.py
to expose intermediates needed by the verify path
- kernel_builder/ cache + external-kernel handling updates
- Profiling redesign: end-to-end dataflow timing, TTFT reporting,
tokenize/pad accounting, per-token trend
- run_npu2_verify.lit: REQUIRES hf_token (gated meta-llama download)
- run_npu2_makefile_peano_synthetic_verify.lit: no HF download needed
- lit.cfg.py: hf_token feature, available only when HF_TOKEN env var set,
so REQUIRES: hf_token tests skip cleanly on machines without it
- cpu_helpers.py: extracted helpers (renamed from older reference.py)
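To make the "top-k token inclusion" comparison concrete, here is a minimal sketch of one plausible reading of it: at each position, the token the NPU would pick greedily must appear among the reference's top-k tokens. The function names, the plain-list logits layout, and the default `k` are assumptions for illustration; the actual comparators and thresholds live in `verify/` and may differ.

```python
from typing import List, Sequence


def top_k_ids(logits: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k largest logits, highest first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def top_k_inclusion(
    npu_logits: Sequence[Sequence[float]],
    ref_logits: Sequence[Sequence[float]],
    k: int = 5,  # hypothetical default, not the PR's threshold
) -> List[bool]:
    """Per position: is the NPU's greedy token inside the reference top-k?"""
    results = []
    for npu_row, ref_row in zip(npu_logits, ref_logits):
        npu_token = top_k_ids(npu_row, 1)[0]  # greedy (argmax) token
        results.append(npu_token in top_k_ids(ref_row, k))
    return results


if __name__ == "__main__":
    # Toy vocabulary of 3 tokens, 2 positions (made-up values).
    npu = [[0.1, 2.0, 1.9], [3.0, 0.0, 0.1]]
    ref = [[0.0, 1.9, 2.0], [0.0, 1.0, 2.0]]
    print(top_k_inclusion(npu, ref, k=2))
```

A gate of this kind tolerates small bf16 rounding differences that reorder near-tied logits, while still flagging positions where the NPU's choice is far from the reference ranking.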
…ill)
Adds two ablation studies under programming_examples/llama32_1b/ablation/:
- decode/: full-decode ablation comparing 4 cells (A naive → D production
merged) on rms_gemv_rope and o_gemv_ffn kernel groups, with KV-cache
baton-pass and per-token-loop wrappers. Measured A→D = 2.83× speedup on
the configured hardware.
- prefill/: prefill ablation across the same 4 cells × rms_gemms_rope and
o_ffn kernel groups, with FA invariant integration and 16-layer
per-layer threading.
Each study includes: KernelGroupSpec / SubLaunchSpec / BatonLink
dataclasses, standalone-compile harnesses, golden fixtures with bit-exact
validation gates, pytest test suites, orchestrator scripts, and a markdown
report generator with profile.md comparison.
Top-level ablation/README.md indexes both studies.
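As an illustration of what a bit-exact validation gate against a golden fixture checks, the sketch below compares raw bytes rather than numeric values. The helper names are hypothetical and the fixtures here are in-memory byte strings; the PR's actual gates and fixture format are not shown in this conversation.

```python
import struct
from typing import Optional


def first_mismatch(actual: bytes, golden: bytes) -> Optional[int]:
    """Return the byte offset of the first difference, or None if identical."""
    for offset, (a, g) in enumerate(zip(actual, golden)):
        if a != g:
            return offset
    if len(actual) != len(golden):
        # One buffer is a strict prefix of the other.
        return min(len(actual), len(golden))
    return None


def bit_exact(actual: bytes, golden: bytes) -> bool:
    """Pass only when every byte matches the golden fixture."""
    return first_mismatch(actual, golden) is None


if __name__ == "__main__":
    # +0.0 and -0.0 compare equal numerically but differ in the sign bit,
    # so a bit-exact gate rejects what a tolerance-based check would accept.
    pos_zero = struct.pack("<f", 0.0)
    neg_zero = struct.pack("<f", -0.0)
    print(0.0 == -0.0, bit_exact(pos_zero, neg_zero))
```

Bit-exactness is a reasonable bar for an ablation because the cells are meant to restructure how kernels are launched, not change what they compute.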
…blation, and profile docs
Adds standalone HTML walkthroughs under llama32_1b/docs/:
- IMPLEMENTATION_GUIDE.html: model architecture, per-kernel building
blocks, NPU mapping decisions, SVG diagrams
- VERIFICATION.html: HF parity gate methodology, threshold tables,
per-layer diagnosis flow
- ABLATION_STUDY.html: decode + prefill ablation results, per-kernel
speedup tables
- PROFILE.html: end-to-end dataflow + per-step timing visualization
Plus markdown supplements (explain.md, profile.md, usage.md) covering
usage, per-kernel timing reference, and model walkthrough.
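For readers unfamiliar with the timing terms used in the profile docs, the following sketch shows one common way to measure time-to-first-token (TTFT) and a per-token decode trend. The `prefill` and `decode_step` callables are hypothetical stand-ins, not the example's real entry points, and the real profiler also accounts for tokenize/pad time, which this sketch folds into whatever `prefill` does.

```python
import time
from typing import Callable, Dict, List, Union


def profile_generation(
    prefill: Callable[[], int],
    decode_step: Callable[[int], int],
    num_new_tokens: int,
) -> Dict[str, Union[float, List[float]]]:
    """Time prefill (TTFT) and each subsequent decode step separately."""
    start = time.perf_counter()
    token = prefill()  # prefill yields the first generated token
    ttft = time.perf_counter() - start

    per_token: List[float] = []
    for _ in range(num_new_tokens - 1):
        t0 = time.perf_counter()
        token = decode_step(token)
        per_token.append(time.perf_counter() - t0)

    total = time.perf_counter() - start
    return {"ttft_s": ttft, "decode_s": per_token, "total_s": total}


if __name__ == "__main__":
    # Dummy model: prefill returns token 1, each decode step increments it.
    stats = profile_generation(lambda: 1, lambda t: t + 1, num_new_tokens=4)
    print(len(stats["decode_s"]), stats["ttft_s"] <= stats["total_s"])
```

Keeping the per-step list instead of only an average is what makes a per-token trend visible, for example decode steps slowing as the KV cache grows.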
Wires the existing `HF_TOKEN` repository secret into the `check-programming-examples-peano` invocation in `.github/workflows/buildAndTestRyzenAI.yml`, so lit tests with `REQUIRES: hf_token` (currently `programming_examples/llama32_1b/run_npu2_verify.lit`) can authenticate against Hugging Face Hub for gated model downloads. Tests without `REQUIRES: hf_token` are unaffected — they continue to run as before. When the secret is unset (e.g. on fork-originating PR builds, where GitHub doesn't expose secrets by policy), the lit feature stays disabled and gated tests skip cleanly with UNSUPPORTED.
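The gating described above can be sketched as follows. This is not the code in `programming_examples/lit.cfg.py`; it is a standalone approximation in which a plain `set` stands in for lit's `config.available_features`, and the helper name is hypothetical.

```python
import os
from typing import Mapping, MutableSet


def register_hf_token_feature(
    available_features: MutableSet[str],
    environ: Mapping[str, str] = os.environ,
) -> bool:
    """Add the 'hf_token' feature only when HF_TOKEN is set and non-empty."""
    if environ.get("HF_TOKEN"):
        available_features.add("hf_token")
        return True
    return False


if __name__ == "__main__":
    features = set()  # stand-in for config.available_features
    enabled = register_hf_token_feature(features, {"HF_TOKEN": "dummy-value"})
    print(enabled, sorted(features))
```

Inside a real lit config the equivalent call would be `config.available_features.add("hf_token")`; a test carrying `REQUIRES: hf_token` is then reported as UNSUPPORTED whenever the feature is absent, rather than failing.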
7f37dbf to 8db6734
Update: re-targeting to
Summary
Builds on the existing `programming_examples/llama32_1b/` example (PR #1590 + #1610) with three additions plus the CI wiring to run the new verification test:
- Verification subsystem (`verify/`): end-to-end HF parity gate that runs the production NPU prefill+decode path and compares per-position logits against the HuggingFace bf16 reference using top-k token inclusion (8 prompts × 32 greedy tokens). Includes the production-loop refinements (`llama32_1b_{inference,prefill,decode,cpu_helpers,reference}.py`, `kernel_builder/`) needed to surface intermediates, plus a profiling redesign (end-to-end dataflow timing, TTFT, per-token trend).
- Ablation studies (`ablation/decode/`, `ablation/prefill/`): four-cell experiments (A naive → D production-merged) over `rms_gemv_rope` / `o_gemv_ffn` (decode) and `rms_gemms_rope` / `o_ffn` (prefill), with bit-exact validation gates and per-cell speedup measurements (decode A→D: 2.83× on the configured hardware).
- Docs (`docs/`): `IMPLEMENTATION_GUIDE.html`, `VERIFICATION.html`, `ABLATION_STUDY.html`, `PROFILE.html` plus `explain.md` / `usage.md` / `profile.md`.
- CI wiring: wires the `HF_TOKEN` repo secret to the `check-programming-examples-peano` step so the new `run_npu2_verify.lit` (which needs gated `meta-llama` weights) can authenticate.
Four commits, one per logical area, all confined to `programming_examples/llama32_1b/` (+ the hf_token feature in `programming_examples/lit.cfg.py` and the CI wiring in `.github/workflows/buildAndTestRyzenAI.yml`).
hf_token LIT feature
`run_npu2_verify.lit` requires gated `meta-llama/Llama-3.2-1B` weights, so it carries `REQUIRES: ryzen_ai_npu2, peano, hf_token`. The feature is registered in `programming_examples/lit.cfg.py` (~lines 128–135) and is only available when `HF_TOKEN` is set in the environment. Tests skip cleanly with `UNSUPPORTED` when it's unset.
GitHub does not expose repository secrets to fork-originated PR builds (security policy), so this PR's own CI run will continue to skip the verify test. Post-merge runs on `main` will pick up the secret and exercise it.
Rendered docs preview
The four HTML walkthroughs render in-browser via raw.githack.com (a free GitHub-content proxy):
If you'd prefer these published to the official Pages site (`xilinx.github.io/mlir-air/`), happy to follow up with a small change to `.github/workflows/generateDocs.yml` to copy them into the publish directory; that is kept out of this PR so it can be reviewed separately.
Test plan
- `flock /tmp/mlir-air-npu.lock` on NPU2 (`02_mul_shim_1x1`, `06_add_shim_bf16`, `12_matmul_transform_1x4_bf16`)
- `make compile` succeeds end-to-end on `programming_examples/llama32_1b/` against latest upstream/main toolchain
- `make run` with `HF_TOKEN` set verified locally on NPU2 (verify gate passes top-k inclusion across the 8 prompts)
- `ablation/decode/results_*.json` and `report_*.md` are gitignored as run artifacts, so the measured numbers in the docs reference reproducible outputs rather than checked-in data