Skip to content

Commit 6f7c864

Browse files
noahgiftclaude
andauthored
chore(release): 0.55.0 — convert/export runnable models + GPU parity reconciled + autograd training proven (PMAT-918..921) (#2218)
Version bump 0.54.0 → 0.55.0 across all workspace Cargo.toml (101 ecosystem version pins) + Cargo.lock regen + CHANGELOG. The v0.55.0 wave = 6 merged beats: - #2208 — degate the duckdb competitive bench behind a feature; cold merge_group builds were intermittently failing the merge queue while PR heads were green. - #2209 (PMAT-918) — apr convert --quantize q4k now synthesizes the tied lm_head for tied-embedding models; the Q4K save path produced a non-runnable .apr that failed at load with "tensor not found: lm_head.weight". - #2210 (PMAT-919) — GPU/CPU parity gate reconciled against ground truth (llama.cpp, per-position): fp32-Mwv is the correct Blackwell default; HwDp4a is genuinely degraded. F2 gate now checks per-position argmax-match + min-cosine over positions >=1, replacing the last-token-only check. - #2211 — gate the coop_gemm_bench example behind opt-in cooperative-matrix; wgpu 27 dropped the Vulkan cooperative-matrix path, breaking --all-targets. - #2212 (PMAT-920) — apr export --format gguf now uses explicit head_dim for exact num_heads, and hard-fails with an actionable error instead of silently stamping a wrong num_heads into a valid-looking GGUF. - #2213 (PMAT-921) — fix the transformer FFN gelu severing the autograd graph (functional::gelu builds output via Tensor::from_vec with no grad_fn), plus a new end-to-end train-to-loss proof that catches the severed-graph class. Version-bump PR only. crates.io publish + git tag + GH release are deferred to a separate human-gated step (do NOT run make publish / cargo publish from this PR). Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 53c9b54 commit 6f7c864

35 files changed

Lines changed: 234 additions & 179 deletions

File tree

CHANGELOG.md

Lines changed: 55 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,61 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
77

88
## [Unreleased]
99

10+
## [0.55.0] - 2026-06-24
11+
12+
User-facing correctness + a reconciled GPU-parity gate + the autograd training story *proven*, not
13+
just asserted. The headline pair: (1) `apr convert`/`apr export` now produce **runnable** models for
14+
tied-embedding architectures (a converted `.apr` was missing its `lm_head`; an exported GGUF
15+
mis-stamped `num_heads`); (2) an **end-to-end training proof** caught that the transformer FFN was
16+
*still* severing the autograd graph (`functional::gelu`) after the v0.53/v0.54 "complete" sweep —
17+
per-layer gradchecks never saw it; a real train-to-loss test did. Plus the Blackwell GPU/CPU parity
18+
gate, reconciled against ground truth (llama.cpp, per-position). Each ships a named proof-obligation +
19+
a mutation-verified RED-on-bug / GREEN-on-fix falsifier + a `pv`-validated contract.
20+
21+
### Fixed
22+
23+
- **`apr convert --quantize q4k` produced a non-runnable `.apr` for tied-embedding models** (PMAT-918,
24+
Pillar-4, `OBLIG-CONVERT-TIED-EMBEDDING-LMHEAD`) — the Q4K save path iterated the raw tensor map and
25+
never synthesized the tied `lm_head` (the f32/int8 path did), so a converted model failed at load with
26+
"tensor not found: lm_head.weight". Tie-synthesis is now hoisted before the quant dispatch; verified
27+
end-to-end on Blackwell (`def add(a,b): return a+b`).
28+
- **`apr export --format gguf` silently mis-inferred `num_heads` (or hard-failed) on metadata-light
29+
`.apr`** (PMAT-920, Pillar-4, `OBLIG-APR-GGUF-EXPORT-INFER-METADATA`) — a `[64,128,96,80]`
30+
first-divisor guess would stamp e.g. Qwen2-1.5B as 24 heads (true 12) into a valid-looking GGUF. Now
31+
uses the explicit `head_dim` for exact `num_heads = q_dim/head_dim`, and hard-fails with an actionable
32+
error (no GGUF written) when it's genuinely absent — never a silently-wrong model.
33+
- **GPU/CPU parity gate falsely rejected the correct Blackwell kernel (and used a known-insufficient
34+
metric)** (PMAT-919, Pillar-4, `gpu-cpu-parity-gate-v2`) — reconciled against ground truth
35+
(llama.cpp + CPU-Q4K, per-position, on 1.5B/7B/8B): fp32-`Mwv` is the **correct** Blackwell default
36+
(matches token-for-token); `HwDp4a` is genuinely degraded (INT8-activation quant → mid-context argmax
37+
errors). The F2 gate now checks **per-position** argmax-match + min-cosine over positions ≥1 (excluding
38+
the benign BOS near-tie), replacing the last-token-only check that let a 0.94-min kernel pass. Verified
39+
on-device on lambda (Ada) + gx10 (Blackwell): accepts fp32-Mwv, rejects HwDp4a.
40+
- **Autograd: the transformer FFN `gelu` severed the graph** (PMAT-921, Pillar-2,
41+
`OBLIG-TRANSFORMER-END-TO-END-TRAINABLE`) — `TransformerEncoderLayer` (and the decoder twin) called
42+
`nn::functional::gelu`, which builds output via `Tensor::from_vec` (no `grad_fn`), **freezing
43+
`ffn.linear1` + `norm2` γ/β in every real training run** while the isolated per-layer gradchecks
44+
stayed green. Routed through autograd-aware `Tensor::gelu` (identical forward). **Caught by a new
45+
end-to-end train-to-loss test** (loss 3.565 → 1.4e-5, every param group updates) — the proof that
46+
per-layer gradchecks can't give.
47+
48+
### Added
49+
50+
- **End-to-end tiny-transformer training proof** (PMAT-921) — a fast, seeded, CI-stable test that
51+
trains a real transformer (embedding → norm+MHA+norm+FFN → lm_head) to decreasing loss and asserts
52+
**every** trainable param-group received a non-zero gradient and moved. Falsifier is non-tautological:
53+
severing any one edge (gelu, attention, a norm) freezes the corresponding params and trips it. This is
54+
the end-to-end guard the severed-graph class needed.
55+
56+
### Build / CI
57+
58+
- **Gate the duckdb competitive bench behind a feature** (#2208) — `merge_group` builds are cold and
59+
recompiled bundled duckdb C++ under load, intermittently failing the merge queue while PR heads were
60+
green. The bench is now `required-features`-gated; the cold-build flake is eliminated.
61+
- **Gate the `coop_gemm_bench` example behind opt-in `cooperative-matrix`** (#2211, PMAT) — wgpu 27
62+
dropped the Vulkan cooperative-matrix path the example used, breaking `--all-targets` builds (found by
63+
the Apple-Metal verification on `mini`). The dead example is now opt-in; `--all-targets` is clean.
64+
1065
## [0.54.0] - 2026-06-24
1166

1267
Correctness-beat wave (PMAT-913..917) — the headline **completes the autograd severed-graph sweep**:

0 commit comments

Comments
 (0)