[programming_examples] llama32_1b: verify subsystem, ablation studies, and docs by tonyjie · Pull Request #1635 · Xilinx/mlir-air

tonyjie · 2026-05-29T21:20:07Z

Summary

Builds on the existing programming_examples/llama32_1b/ example (PR #1590 + #1610) with three additions plus the CI wiring to run the new verification test:

Verify subsystem (verify/) — end-to-end HF parity gate that runs the production NPU prefill+decode path and compares per-position logits against the HuggingFace bf16 reference using top-k token inclusion (8 prompts × 32 greedy tokens). Includes the production-loop refinements (llama32_1b_{inference,prefill,decode,cpu_helpers,reference}.py, kernel_builder/) needed to surface intermediates, plus a profiling redesign (end-to-end dataflow timing, TTFT, per-token trend).
Ablation studies (ablation/decode/, ablation/prefill/) — four-cell experiments (A naive → D production-merged) over rms_gemv_rope/o_gemv_ffn (decode) and rms_gemms_rope/o_ffn (prefill), with bit-exact validation gates and per-cell speedup measurements (decode A→D: 2.83× on the configured hardware).
Standalone docs (docs/) — IMPLEMENTATION_GUIDE.html, VERIFICATION.html, ABLATION_STUDY.html, PROFILE.html plus explain.md/usage.md/profile.md.
CI wiring — exposes the existing HF_TOKEN repo secret to the check-programming-examples-peano step so the new run_npu2_verify.lit (which needs gated meta-llama weights) can authenticate.
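
As a rough illustration of the "top-k token inclusion" comparison described above, the sketch below checks, at each position, whether the reference model's greedy token appears among the NPU's k highest logits. The function names, the direction of the check (reference argmax inside NPU top-k), and the value of `k` are assumptions for illustration; the actual comparators in verify/ may differ.

```python
import heapq


def top_k_ids(logits, k):
    """Return the set of indices holding the k largest logits."""
    return set(heapq.nlargest(k, range(len(logits)), key=logits.__getitem__))


def top_k_inclusion_failures(npu_logits, ref_logits, k=5):
    """Return the positions where the reference argmax is NOT in the NPU top-k.

    npu_logits / ref_logits: one row of vocabulary logits per position.
    The default k is a placeholder, not the PR's threshold.
    """
    failures = []
    for pos, (npu_row, ref_row) in enumerate(zip(npu_logits, ref_logits)):
        ref_token = max(range(len(ref_row)), key=ref_row.__getitem__)
        if ref_token not in top_k_ids(npu_row, k):
            failures.append(pos)
    return failures


# Toy example with a 3-token vocabulary and 2 positions.
npu = [[0.1, 2.0, 1.9], [3.0, 0.2, 0.1]]
ref = [[0.0, 1.8, 2.1], [0.1, 0.2, 2.5]]
print(top_k_inclusion_failures(npu, ref, k=2))
```

A check of this kind tolerates small bf16 rounding differences that reorder near-tied logits, which a strict argmax match would flag.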

Four commits, one per logical area, all confined to programming_examples/llama32_1b/ (+ the hf_token feature in programming_examples/lit.cfg.py and the CI wiring in .github/workflows/buildAndTestRyzenAI.yml).

hf_token LIT feature

run_npu2_verify.lit requires gated meta-llama/Llama-3.2-1B weights, so it carries REQUIRES: ryzen_ai_npu2, peano, hf_token. The feature is registered in programming_examples/lit.cfg.py (~lines 128–135) and only available when HF_TOKEN is set in the environment. Tests skip cleanly with UNSUPPORTED when it's unset.
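
A minimal sketch of how an environment-gated lit feature like this can be registered. The helper name `register_hf_token_feature` is hypothetical, and the real code in programming_examples/lit.cfg.py may be structured differently; the log messages are the ones quoted elsewhere in this PR.

```python
import os


def register_hf_token_feature(available_features, environ=os.environ, note=print):
    """Add the 'hf_token' feature only when HF_TOKEN is set and non-empty.

    available_features: a set, e.g. lit's config.available_features.
    note: a logging callback, e.g. lit_config.note inside a lit.cfg.py.
    Returns True when the feature was enabled.
    """
    if environ.get("HF_TOKEN"):
        available_features.add("hf_token")
        note("HF_TOKEN found in environment; hf_token feature enabled.")
        return True
    note("HF_TOKEN not set; hf_token feature disabled")
    return False


# Stand-alone demonstration with a plain set and a fake environment.
features = set()
register_hf_token_feature(features, environ={"HF_TOKEN": "dummy"})
print(sorted(features))
```

Inside a lit.cfg.py, the call would presumably be `register_hf_token_feature(config.available_features, note=lit_config.note)`. Because lit runs tests with a restricted environment, the token would likely also need to be forwarded through `config.environment` for the test process to see it.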

GitHub does not expose repository secrets to fork-originated PR builds (security policy), so this PR's own CI run will continue to skip the verify test. Post-merge runs on main will pick up the secret and exercise it.

Rendered docs preview

The four HTML walkthroughs render in-browser via raw.githack.com (a free GitHub-content proxy):

IMPLEMENTATION_GUIDE.html — model architecture, per-kernel building blocks, NPU mapping
VERIFICATION.html — HF parity gate methodology and results
ABLATION_STUDY.html — decode + prefill ablation cells
PROFILE.html — end-to-end dataflow and per-step timing

If you'd prefer these published to the official Pages site (xilinx.github.io/mlir-air/), happy to follow up with a small change to .github/workflows/generateDocs.yml to copy them into the publish directory — kept out of this PR so it can be reviewed separately.

Test plan

3 representative XRT tests pass under flock /tmp/mlir-air-npu.lock on NPU2 (02_mul_shim_1x1, 06_add_shim_bf16, 12_matmul_transform_1x4_bf16)
make compile succeeds end-to-end on programming_examples/llama32_1b/ against latest upstream/main toolchain
make run with HF_TOKEN set verified locally on NPU2 (verify gate passes top-k inclusion across the 8 prompts)
CI side: gated by repo-secret availability post-merge (verified secret + workflow wiring are consistent; only postsubmit will exercise it)
Reviewer: skim ablation reports — note that ablation/decode/results_*.json and report_*.md are gitignored as run artifacts, so the measured numbers in the docs reference reproducible outputs rather than checked-in data

tonyjie · 2026-05-29T21:39:54Z

⚠️ Pre-merge action needed: please trigger CI manually

The verify test added in this PR (programming_examples/llama32_1b/run_npu2_verify.lit) needs the HF_TOKEN repository secret to authenticate gated meta-llama/* model downloads. The automated CI run on this PR cannot exercise it because GitHub does not expose repository secrets to fork-originated PR builds (security policy). In the lit output you'll see "HF_TOKEN not set; hf_token feature disabled" and "UNSUPPORTED: run_npu2_verify.lit". That skip is expected and does not indicate a failure.

To validate the wiring end-to-end before merge, could a maintainer please trigger Build and Test with AIE tools on Ryzen AI manually via workflow_dispatch against this PR branch (tonyjie:llama32_1b-verify-pr)? That run has full secret access and would confirm:

The HF_TOKEN secret is reachable from the workflow (no typos / scope issues)
The verify gate actually passes on NPU2 against real meta-llama/Llama-3.2-1B weights

I'd recommend doing this before approving merge, so any HF/auth/CI surprises surface while the PR is still iterable. Happy to follow up on anything that turns up. Thanks!

…-loop refinements

Adds an end-to-end verification subsystem (verify/) for the LLAMA-3.2-1B example: an HF parity gate that runs the production NPU prefill+decode path and compares per-position logits against HuggingFace bf16 reference using top-k token inclusion across 8 prompts × 32 greedy tokens.

Includes:
- verify/ package: runners (HF, NPU, CPU), comparators, report, prompts
- Production loop refinements in llama32_1b_{inference,prefill,decode}.py to expose intermediates needed by the verify path
- kernel_builder/ cache + external-kernel handling updates
- Profiling redesign: end-to-end dataflow timing, TTFT reporting, tokenize/pad accounting, per-token trend
- run_npu2_verify.lit: REQUIRES hf_token (gated meta-llama download)
- run_npu2_makefile_peano_synthetic_verify.lit: no HF download needed
- lit.cfg.py: hf_token feature, available only when HF_TOKEN env var set, so REQUIRES: hf_token tests skip cleanly on machines without it
- cpu_helpers.py: extracted helpers (renamed from older reference.py)
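
To make the profiling terms concrete, here is a small sketch of how TTFT (time to first token) and a per-token trend could be derived from wall-clock timestamps. The function name, its inputs, and the use of a least-squares slope as the "trend" are all assumptions for illustration, not the PR's profiling code.

```python
from statistics import mean


def summarize_latency(t_start, token_times):
    """Summarize one generation run.

    t_start: time the request began, before tokenize/pad.
    token_times: time at which each generated token became available.
    """
    ttft = token_times[0] - t_start  # includes tokenize/pad + prefill
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    slope = 0.0
    if len(gaps) >= 2:
        # Least-squares slope of per-token latency against token index:
        # a positive value means decode steps slow down as context grows.
        xs = list(range(len(gaps)))
        x_bar, y_bar = mean(xs), mean(gaps)
        slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, gaps)) / sum(
            (x - x_bar) ** 2 for x in xs
        )
    return {
        "ttft": ttft,
        "mean_decode": mean(gaps) if gaps else 0.0,
        "slope": slope,
    }


stats = summarize_latency(0.0, [1.0, 1.5, 2.0, 2.5])
print(stats)
```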

…ill)

Adds two ablation studies under programming_examples/llama32_1b/ablation/:
- decode/: full-decode ablation comparing 4 cells (A naive → D production merged) on rms_gemv_rope and o_gemv_ffn kernel groups, with KV-cache baton-pass and per-token-loop wrappers. Measured A→D = 2.83× speedup on the configured hardware.
- prefill/: prefill ablation across same 4 cells × rms_gemms_rope and o_ffn kernel groups, with FA invariant integration and 16-layer per-layer threading.

Each study includes: KernelGroupSpec / SubLaunchSpec / BatonLink dataclasses, standalone-compile harnesses, golden fixtures with bit-exact validation gates, pytest test suites, orchestrator scripts, and a markdown report generator with profile.md comparison.

Top-level ablation/README.md indexes both studies.
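
For readers unfamiliar with a "bit-exact validation gate", the sketch below compares an output buffer against a golden fixture at the byte level instead of with a numeric tolerance. The function name and the assumption that buffers are raw little-endian bf16 (2 bytes per element) are illustrative; the actual fixtures and harness in ablation/ may use a different layout.

```python
def first_bf16_mismatch(actual: bytes, golden: bytes):
    """Return the index of the first differing bf16 element, or None.

    Comparing raw bytes is stricter than comparing decoded floats:
    it distinguishes +0.0 from -0.0 and treats identical NaN
    bit patterns as equal.
    """
    if len(actual) != len(golden):
        raise ValueError("buffers differ in length")
    if len(actual) % 2:
        raise ValueError("bf16 buffers must have an even byte count")
    for offset in range(0, len(actual), 2):
        if actual[offset:offset + 2] != golden[offset:offset + 2]:
            return offset // 2
    return None


# 1.0 is 0x3F80 and 2.0 is 0x4000 in bf16 (stored little-endian here).
golden = bytes([0x80, 0x3F, 0x00, 0x40])
off_by_one_ulp = bytes([0x80, 0x3F, 0x01, 0x40])
print(first_bf16_mismatch(golden, golden))
print(first_bf16_mismatch(off_by_one_ulp, golden))
```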

…blation, and profile docs

Adds standalone HTML walkthroughs under llama32_1b/docs/:
- IMPLEMENTATION_GUIDE.html: model architecture, per-kernel building blocks, NPU mapping decisions, SVG diagrams
- VERIFICATION.html: HF parity gate methodology, threshold tables, per-layer diagnosis flow
- ABLATION_STUDY.html: decode + prefill ablation results, per-kernel speedup tables
- PROFILE.html: end-to-end dataflow + per-step timing visualization

Plus markdown supplements (explain.md, profile.md, usage.md) covering usage, per-kernel timing reference, and model walkthrough.

Wires the existing `HF_TOKEN` repository secret into the `check-programming-examples-peano` invocation in `.github/workflows/buildAndTestRyzenAI.yml`, so lit tests with `REQUIRES: hf_token` (currently `programming_examples/llama32_1b/run_npu2_verify.lit`) can authenticate against Hugging Face Hub for gated model downloads. Tests without `REQUIRES: hf_token` are unaffected — they continue to run as before. When the secret is unset (e.g. on fork-originating PR builds, where GitHub doesn't expose secrets by policy), the lit feature stays disabled and gated tests skip cleanly with UNSUPPORTED.

tonyjie · 2026-05-29T21:49:31Z

Update: re-targeting to `Xilinx:llama-verify` for CI validation

Per maintainer feedback, workflow_dispatch doesn't expose fork PR branches in the picker — so my previous request to dispatch CI against tonyjie:llama32_1b-verify-pr isn't actionable from the maintainer side.

To unblock pre-merge CI validation, the proposed path is to land the changes on the existing Xilinx:llama-verify branch first (which workflow_dispatch can target), exercise the verify gate there with the real HF_TOKEN, and only merge to main once green.

Current state

I've rebased my 4 commits onto Xilinx:llama-verify HEAD (was 7151ac09 AWQ matvec, no conflicts) and force-pushed to my fork branch. Latest commits:

8db67343  [ci] Expose HF_TOKEN to programming-examples-peano test step
e9751e86  [programming_examples/llama32_1b] Add ... docs
b54777b5  [programming_examples/llama32_1b] Add ablation studies (decode + prefill)
eb6777cd  [programming_examples/llama32_1b] Add verify subsystem and production-loop refinements

Since I don't have push access to Xilinx/mlir-air, could a maintainer please pull these into Xilinx:llama-verify? The simplest way is these two commands, run from a clone whose origin remote is Xilinx/mlir-air:

git fetch https://github.com/tonyjie/mlir-air.git llama32_1b-verify-pr
git push origin FETCH_HEAD:llama-verify

After that:

Trigger workflow_dispatch on Build and Test with AIE tools on Ryzen AI against llama-verify. That run has full secret access, so HF_TOKEN will be exposed and run_npu2_verify.lit will actually execute against meta-llama/Llama-3.2-1B.
Confirm in the log:
- HF_TOKEN found in environment; hf_token feature enabled. (proves wiring works)
- PASS: AIR_TEST :: programming_examples/llama32_1b/run_npu2_verify.lit (proves the gate passes)
Once green, fast-forward Xilinx:llama-verify into main (or merge this PR, whichever workflow you prefer — I'll leave the call to you).
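
The log confirmation in the steps above could be scripted along these lines. This is a sketch only: the function name is hypothetical, and it simply searches the downloaded CI log text for the two marker strings quoted in the steps.

```python
EXPECTED_MARKERS = {
    # Proves the secret reached the lit configuration.
    "wiring": "HF_TOKEN found in environment; hf_token feature enabled.",
    # Proves the verify gate itself ran and passed.
    "gate": "PASS: AIR_TEST :: programming_examples/llama32_1b/run_npu2_verify.lit",
}


def check_ci_log(log_text):
    """Map each marker name to True if some log line contains that marker."""
    lines = log_text.splitlines()
    return {
        name: any(marker in line for line in lines)
        for name, marker in EXPECTED_MARKERS.items()
    }


sample_log = """\
lit: note: HF_TOKEN found in environment; hf_token feature enabled.
PASS: AIR_TEST :: programming_examples/llama32_1b/run_npu2_verify.lit (3 of 40)
"""
print(check_ci_log(sample_log))
```

Substring matching is used so that prefixes and suffixes added by lit or the CI runner (timestamps, test counters) do not defeat the check.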

Status of this PR

Leaving #1635 open for visibility / discussion; happy to close it in favor of a new PR with llama-verify as the head once the validation flow is settled. Just let me know which you prefer.

Thanks for the help coordinating this!

tonyjie · 2026-05-29T21:51:38Z

Superseded by #1636, which targets Xilinx:llama-verify instead of main so a maintainer can workflow_dispatch CI against the branch and exercise the HF_TOKEN-gated tests before merging. Closing this PR; please review #1636 instead.

Copilot AI review requested due to automatic review settings May 29, 2026 21:20

tonyjie requested review from eddierichter-amd, erwei-xilinx, fifield and jgmelber as code owners May 29, 2026 21:20

tonyjie added 4 commits May 29, 2026 17:49

tonyjie force-pushed the llama32_1b-verify-pr branch from 7f37dbf to 8db6734 May 29, 2026 21:49

tonyjie mentioned this pull request May 29, 2026

[programming_examples] llama32_1b: verify subsystem + docs #1636 (Open, 7 tasks)

tonyjie closed this May 29, 2026

tonyjie wants to merge 4 commits into Xilinx:main from tonyjie:llama32_1b-verify-pr
