
[programming_examples] llama32_1b: verify subsystem, ablation studies, and docs#1635

Closed
tonyjie wants to merge 4 commits into Xilinx:main from tonyjie:llama32_1b-verify-pr

tonyjie commented May 29, 2026

Summary

Builds on the existing programming_examples/llama32_1b/ example (PR #1590 + #1610) with three additions plus the CI wiring to run the new verification test:

  • Verify subsystem (verify/) — end-to-end HF parity gate that runs the production NPU prefill+decode path and compares per-position logits against the HuggingFace bf16 reference using top-k token inclusion (8 prompts × 32 greedy tokens). Includes the production-loop refinements (llama32_1b_{inference,prefill,decode,cpu_helpers,reference}.py, kernel_builder/) needed to surface intermediates, plus a profiling redesign (end-to-end dataflow timing, TTFT, per-token trend).
  • Ablation studies (ablation/decode/, ablation/prefill/) — four-cell experiments (A naive → D production-merged) over rms_gemv_rope/o_gemv_ffn (decode) and rms_gemms_rope/o_ffn (prefill), with bit-exact validation gates and per-cell speedup measurements (decode A→D: 2.83× on the configured hardware).
  • Standalone docs (docs/) — IMPLEMENTATION_GUIDE.html, VERIFICATION.html, ABLATION_STUDY.html, PROFILE.html plus explain.md/usage.md/profile.md.
  • CI wiring — exposes the existing HF_TOKEN repo secret to the check-programming-examples-peano step so the new run_npu2_verify.lit (which needs gated meta-llama weights) can authenticate.

Four commits, one per logical area, all confined to programming_examples/llama32_1b/ (+ the hf_token feature in programming_examples/lit.cfg.py and the CI wiring in .github/workflows/buildAndTestRyzenAI.yml).
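For readers unfamiliar with a top-k inclusion gate, here is a minimal sketch of one plausible form of the check: at each position, the NPU's greedy token must appear among the reference's k highest-scoring tokens. The function names, the value of k, and the toy logits are assumptions for illustration only; the actual comparators in verify/ may define inclusion differently.

```python
from typing import List, Sequence


def top_k_ids(logits: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k largest logits, highest first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def top_k_inclusion(
    npu_logits: Sequence[Sequence[float]],
    ref_logits: Sequence[Sequence[float]],
    k: int = 5,  # hypothetical default; the PR does not state the k it uses
) -> List[bool]:
    """Per position: is the NPU's greedy token in the reference's top-k set?"""
    if len(npu_logits) != len(ref_logits):
        raise ValueError("position counts differ")
    verdicts = []
    for npu_row, ref_row in zip(npu_logits, ref_logits):
        greedy = top_k_ids(npu_row, 1)[0]
        verdicts.append(greedy in top_k_ids(ref_row, k))
    return verdicts


# Toy 4-token vocabulary, two positions (illustrative numbers, not measurements).
npu = [[0.1, 2.0, 1.9, -1.0], [3.0, 0.0, 0.5, 0.2]]
ref = [[0.0, 1.8, 2.1, -0.9], [0.1, 0.2, 0.3, 4.0]]
verdicts = top_k_inclusion(npu, ref, k=2)
print(verdicts)
```

A gate of this shape tolerates small bf16 rounding differences that reorder near-tied logits, while still flagging positions where the two implementations genuinely disagree.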

hf_token LIT feature

run_npu2_verify.lit requires gated meta-llama/Llama-3.2-1B weights, so it carries REQUIRES: ryzen_ai_npu2, peano, hf_token. The feature is registered in programming_examples/lit.cfg.py (~lines 128–135) and only available when HF_TOKEN is set in the environment. Tests skip cleanly with UNSUPPORTED when it's unset.

GitHub does not expose repository secrets to fork-originated PR builds (security policy), so this PR's own CI run will continue to skip the verify test. Post-merge runs on main will pick up the secret and exercise it.
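The gating logic described above can be sketched as follows. This is an approximation of the behaviour, not the code in programming_examples/lit.cfg.py: the helper name register_hf_token_feature is hypothetical, and the two messages are the ones quoted elsewhere in this thread.

```python
import os
from typing import Mapping, MutableMapping, MutableSet


def register_hf_token_feature(
    available_features: MutableSet[str],
    test_environment: MutableMapping[str, str],
    environ: Mapping[str, str] = os.environ,
) -> str:
    """Enable the hf_token feature only when HF_TOKEN is set and non-empty.

    Returns the note that would be printed in the lit output.
    """
    token = environ.get("HF_TOKEN", "")
    if token:
        available_features.add("hf_token")
        # Forward the token so the test process itself can authenticate.
        test_environment["HF_TOKEN"] = token
        return "HF_TOKEN found in environment; hf_token feature enabled."
    return "HF_TOKEN not set; hf_token feature disabled"


# Inside a lit config file, where lit injects `config` and `lit_config`,
# the call would look roughly like:
#   lit_config.note(register_hf_token_feature(
#       config.available_features, config.environment))
```

With the feature absent, lit reports any test carrying REQUIRES: hf_token as UNSUPPORTED instead of running it, which is the clean skip described above.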

Rendered docs preview

The four HTML walkthroughs render in-browser via raw.githack.com (a free GitHub-content proxy):

If you'd prefer these published to the official Pages site (xilinx.github.io/mlir-air/), happy to follow up with a small change to .github/workflows/generateDocs.yml to copy them into the publish directory — kept out of this PR so it can be reviewed separately.

Test plan

  • 3 representative XRT tests pass under flock /tmp/mlir-air-npu.lock on NPU2 (02_mul_shim_1x1, 06_add_shim_bf16, 12_matmul_transform_1x4_bf16)
  • make compile succeeds end-to-end on programming_examples/llama32_1b/ against latest upstream/main toolchain
  • make run with HF_TOKEN set verified locally on NPU2 (verify gate passes top-k inclusion across the 8 prompts)
  • CI side: gated by repo-secret availability post-merge (verified secret + workflow wiring are consistent; only postsubmit will exercise it)
  • Reviewer: skim ablation reports — note that ablation/decode/results_*.json and report_*.md are gitignored as run artifacts, so the measured numbers in the docs reference reproducible outputs rather than checked-in data
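As background for the bit-exact validation gates that the ablation cells rely on, the sketch below shows how a bit-exact comparison differs from ordinary numeric equality. It is a simplified illustration using float32 packing from the standard library; the helper names are hypothetical, and the real harnesses compare kernel outputs against golden fixtures in their own buffer format.

```python
import struct
from typing import Sequence


def to_bits(values: Sequence[float]) -> bytes:
    """Pack values as little-endian float32 so the raw bit patterns can be compared."""
    return struct.pack(f"<{len(values)}f", *values)


def bit_exact(candidate: Sequence[float], golden: Sequence[float]) -> bool:
    """True only if both buffers have the same length and identical bits."""
    return to_bits(candidate) == to_bits(golden)


# Numeric equality treats 0.0 and -0.0 as equal; a bit-exact gate does not.
print(0.0 == -0.0)               # numeric comparison
print(bit_exact([0.0], [-0.0]))  # bit-pattern comparison
```

A gate this strict is appropriate when two cells are expected to run the same arithmetic in the same order, so that any difference at all points to a real change in behaviour.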

Copilot AI review requested due to automatic review settings May 29, 2026 21:20

tonyjie commented May 29, 2026

⚠️ Pre-merge action needed: please trigger CI manually

The verify test added in this PR (programming_examples/llama32_1b/run_npu2_verify.lit) needs the HF_TOKEN repository secret to authenticate gated meta-llama/* model downloads. The automated CI run on this PR cannot exercise it because GitHub does not expose repository secrets to fork-originated PR builds (security policy). In the lit output you'll see "HF_TOKEN not set; hf_token feature disabled" and "UNSUPPORTED: run_npu2_verify.lit". That skip is expected and does not indicate a failure.

To validate the wiring end-to-end before merge, could a maintainer please trigger Build and Test with AIE tools on Ryzen AI manually via workflow_dispatch against this PR branch (tonyjie:llama32_1b-verify-pr)? That run has full secret access and would confirm:

  1. The HF_TOKEN secret is reachable from the workflow (no typos / scope issues)
  2. The verify gate actually passes on NPU2 against real meta-llama/Llama-3.2-1B weights

I'd recommend doing this before approving merge, so any HF/auth/CI surprises surface while the PR is still iterable. Happy to follow up on anything that turns up. Thanks!

tonyjie added 4 commits May 29, 2026 17:49
…-loop refinements

Adds an end-to-end verification subsystem (verify/) for the LLAMA-3.2-1B
example: an HF parity gate that runs the production NPU prefill+decode
path and compares per-position logits against HuggingFace bf16 reference
using top-k token inclusion across 8 prompts × 32 greedy tokens.

Includes:
- verify/ package: runners (HF, NPU, CPU), comparators, report, prompts
- Production loop refinements in llama32_1b_{inference,prefill,decode}.py
  to expose intermediates needed by the verify path
- kernel_builder/ cache + external-kernel handling updates
- Profiling redesign: end-to-end dataflow timing, TTFT reporting,
  tokenize/pad accounting, per-token trend
- run_npu2_verify.lit: REQUIRES hf_token (gated meta-llama download)
- run_npu2_makefile_peano_synthetic_verify.lit: no HF download needed
- lit.cfg.py: hf_token feature, available only when HF_TOKEN env var set,
  so REQUIRES: hf_token tests skip cleanly on machines without it
- cpu_helpers.py: extracted helpers (renamed from older reference.py)
…ill)

Adds two ablation studies under programming_examples/llama32_1b/ablation/:

- decode/: full-decode ablation comparing 4 cells (A naive → D production
  merged) on rms_gemv_rope and o_gemv_ffn kernel groups, with KV-cache
  baton-pass and per-token-loop wrappers. Measured A→D = 2.83× speedup
  on the configured hardware.

- prefill/: prefill ablation across same 4 cells × rms_gemms_rope and
  o_ffn kernel groups, with FA invariant integration and 16-layer
  per-layer threading.

Each study includes: KernelGroupSpec / SubLaunchSpec / BatonLink
dataclasses, standalone-compile harnesses, golden fixtures with
bit-exact validation gates, pytest test suites, orchestrator
scripts, and a markdown report generator with profile.md comparison.

Top-level ablation/README.md indexes both studies.
…blation, and profile docs

Adds standalone HTML walkthroughs under llama32_1b/docs/:
- IMPLEMENTATION_GUIDE.html: model architecture, per-kernel building
  blocks, NPU mapping decisions, SVG diagrams
- VERIFICATION.html: HF parity gate methodology, threshold tables,
  per-layer diagnosis flow
- ABLATION_STUDY.html: decode + prefill ablation results, per-kernel
  speedup tables
- PROFILE.html: end-to-end dataflow + per-step timing visualization

Plus markdown supplements (explain.md, profile.md, usage.md) covering
usage, per-kernel timing reference, and model walkthrough.
Wires the existing `HF_TOKEN` repository secret into the
`check-programming-examples-peano` invocation in
`.github/workflows/buildAndTestRyzenAI.yml`, so lit tests with
`REQUIRES: hf_token` (currently `programming_examples/llama32_1b/run_npu2_verify.lit`)
can authenticate against Hugging Face Hub for gated model downloads.

Tests without `REQUIRES: hf_token` are unaffected — they continue to
run as before. When the secret is unset (e.g. on fork-originating PR
builds, where GitHub doesn't expose secrets by policy), the lit
feature stays disabled and gated tests skip cleanly with UNSUPPORTED.
@tonyjie tonyjie force-pushed the llama32_1b-verify-pr branch from 7f37dbf to 8db6734 Compare May 29, 2026 21:49

tonyjie commented May 29, 2026

Update: re-targeting to Xilinx:llama-verify for CI validation

Per maintainer feedback, workflow_dispatch doesn't expose fork PR branches in the picker — so my previous request to dispatch CI against tonyjie:llama32_1b-verify-pr isn't actionable from the maintainer side.

To unblock pre-merge CI validation, the proposed path is to land the changes on the existing Xilinx:llama-verify branch first (which workflow_dispatch can target), exercise the verify gate there with the real HF_TOKEN, and only merge to main once green.

Current state

I've rebased my 4 commits onto the HEAD of Xilinx:llama-verify (at 7151ac09, "AWQ matvec"; no conflicts) and force-pushed to my fork branch. Latest commits:

8db67343  [ci] Expose HF_TOKEN to programming-examples-peano test step
e9751e86  [programming_examples/llama32_1b] Add ... docs
b54777b5  [programming_examples/llama32_1b] Add ablation studies (decode + prefill)
eb6777cd  [programming_examples/llama32_1b] Add verify subsystem and production-loop refinements

Since I don't have push access to Xilinx/mlir-air, could a maintainer please pull these into Xilinx:llama-verify? The simplest way is these two commands:

git fetch https://github.com/tonyjie/mlir-air.git llama32_1b-verify-pr
git push origin FETCH_HEAD:llama-verify

After that:

  1. Trigger workflow_dispatch on Build and Test with AIE tools on Ryzen AI against llama-verify. That run has full secret access, so HF_TOKEN will be exposed and run_npu2_verify.lit will actually execute against meta-llama/Llama-3.2-1B.
  2. Confirm in the log:
    • HF_TOKEN found in environment; hf_token feature enabled. (proves wiring works)
    • PASS: AIR_TEST :: programming_examples/llama32_1b/run_npu2_verify.lit (proves the gate passes)
  3. Once green, fast-forward Xilinx:llama-verify into main (or merge this PR, whichever workflow you prefer — I'll leave the call to you).
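The log check in step 2 can be scripted. The sketch below scans a saved CI log for the two lines named above; the function name check_verify_log is hypothetical, and substring matching is used on the assumption that the CI system may prefix each line with a timestamp.

```python
from typing import Dict

FEATURE_LINE = "HF_TOKEN found in environment; hf_token feature enabled."
PASS_LINE = "PASS: AIR_TEST :: programming_examples/llama32_1b/run_npu2_verify.lit"


def check_verify_log(log_text: str) -> Dict[str, bool]:
    """Report whether the secret wiring worked and whether the verify gate passed."""
    lines = log_text.splitlines()
    return {
        "feature_enabled": any(FEATURE_LINE in line for line in lines),
        "verify_passed": any(PASS_LINE in line for line in lines),
    }


# Sample built only from strings quoted in this thread: the fork-PR case,
# where the secret is missing and the test is skipped.
fork_log = (
    "HF_TOKEN not set; hf_token feature disabled\n"
    "UNSUPPORTED: run_npu2_verify.lit\n"
)
print(check_verify_log(fork_log))
```

Both flags should be true for the dispatched run on llama-verify; if only the first is true, the secret wiring works but the gate itself failed.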

Status of this PR

Leaving #1635 open for visibility / discussion; happy to close it in favor of a new PR with llama-verify as the head once the validation flow is settled. Just let me know which you prefer.

Thanks for the help coordinating this!


tonyjie commented May 29, 2026

Superseded by #1636, which targets Xilinx:llama-verify instead of main so a maintainer can workflow_dispatch CI against the branch and exercise the HF_TOKEN-gated tests before merging. Closing this PR; please review #1636 instead.

@tonyjie tonyjie closed this May 29, 2026