Commit dcd131f

noahgift and claude authored
chore(release): 0.51.0 — hotfix-driven (P0 .apr fix) + PMAT-877..888 wave (#2178)
Version 0.50.0 → 0.51.0 (root + member crates + lock). CHANGELOG [0.51.0]: headline P0 fix (PMAT-888 non-Gemma2 .apr inference garbage, regressed in 0.50.0) + the wave: BatchNorm/Linear/LoRA/GQA correctness, the cuda-oxide attention kernel (first pure-Rust #[kernel] to beat hand-PTX, 1.7-2.9x on Blackwell), the Blackwell CUDA-graph replay fix (+16%), and the QA-gate hardening (Gate 11 publish-dry-run + Gate 18 .apr inference parity) that closes the two QA gaps 0.50.0 exposed.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent a637aa6 commit dcd131f

37 files changed

Lines changed: 226 additions & 183 deletions

CHANGELOG.md

Lines changed: 45 additions & 0 deletions
@@ -7,6 +7,51 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [0.51.0] - 2026-06-21

Hotfix-driven release (brought forward from the Friday cadence by a P0). Each fix ships a named
proof obligation, a RED-on-bug / GREEN-on-fix falsifier, and a `pv`-validated contract.

### Fixed

- **P0: non-Gemma2 `.apr` inference produced garbage** (PMAT-888, regressed in 0.50.0 via PMAT-810b).
  Every non-Gemma2 `.apr` (qwen2/llama/mistral/phi/deepseek/qwen3, the majority of models) generated
  garbage on inference (CPU **and** GPU) while the same model as GGUF was coherent. PMAT-810b added a
  Gemma2 post-attention-norm load keyed on the HF name `post_attention_layernorm.weight`, which is the
  **FFN norm** for all those architectures, **un-gated by architecture**, so a spurious extra RMSNorm
  was applied. The load is now gated on `config.is_gemma2()`, mirroring the GGUF loader. GGUF was never affected.
- **`BatchNorm1d` never updated `running_mean`/`running_var`** (PMAT-877, Pillar-2). They stayed at
  init (0/1) forever, so eval-mode normalization was wrong vs PyTorch. Now EMA-updated each training
  forward (`running = (1-momentum)·running + momentum·batch`).
- **`Linear` bias initialized to zeros** (PMAT-878, Pillar-2). PyTorch uses `U(±1/√fan_in)`; now matches
  (seed-deterministic).
- **LoRA dropout never applied** (PMAT-879, Pillar-3). `LoRALayer::forward` ignored the configured
  dropout, so fine-tuning trained with zero regularization. Now applies dropout to the input
  (`y = Wx + s·B(A(dropout(x)))`, train-only), matching HF PEFT.
- **Batched-GPU GQA now fails closed** (PMAT-880, Pillar-4). `attention_with_cache_gqa` did not validate
  `kv_dim == num_kv_heads·head_dim` or cache consistency, so it silently read wrong memory on a corrupt
  config. It now returns a clear error (zero false positives on valid models), whereas llama.cpp/Ollama
  run on and produce garbage.
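
Three of the fixes above reduce to small, checkable rules. The sketch below illustrates them under stated assumptions: `Arch`, `loads_post_attention_norm`, `ema_update`, and `validate_gqa` are hypothetical names chosen for illustration, not aprender's actual API. Only the gating condition, the EMA formula, and the `kv_dim` identity are taken from the entries above.

```rust
/// Hypothetical architecture tag (assumption: the real config type is not shown here).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Arch {
    Gemma2,
    Llama,
    Qwen2,
}

/// PMAT-888 rule: only Gemma2 loads a separate post-attention norm.
/// For the other architectures `post_attention_layernorm.weight` is the FFN norm,
/// so loading it again would apply a spurious extra RMSNorm.
pub fn loads_post_attention_norm(arch: Arch) -> bool {
    arch == Arch::Gemma2
}

/// PMAT-877 rule: running = (1 - momentum) * running + momentum * batch,
/// applied element-wise on each training forward pass.
pub fn ema_update(running: &mut [f32], batch: &[f32], momentum: f32) {
    assert_eq!(running.len(), batch.len(), "statistic lengths must match");
    for (r, b) in running.iter_mut().zip(batch) {
        *r = (1.0 - momentum) * *r + momentum * *b;
    }
}

/// PMAT-880 rule: reject a config whose kv_dim is not num_kv_heads * head_dim
/// instead of indexing the cache with inconsistent shapes.
pub fn validate_gqa(kv_dim: usize, num_kv_heads: usize, head_dim: usize) -> Result<(), String> {
    match num_kv_heads.checked_mul(head_dim) {
        Some(expected) if expected == kv_dim => Ok(()),
        Some(expected) => Err(format!(
            "kv_dim {kv_dim} != num_kv_heads {num_kv_heads} * head_dim {head_dim} (= {expected})"
        )),
        None => Err("num_kv_heads * head_dim overflows usize".to_string()),
    }
}
```

With `momentum = 0.1`, a running mean of 0.0 and a batch mean of 2.0 move the running mean to 0.2, which is the behavior the 0.50.0 code never exhibited because the statistics stayed at their initial values.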

### Performance: GPU (Blackwell / GB10)

- **First pure-Rust cuda-oxide `#[kernel]` to BEAT hand-PTX** (PMAT-882). The incremental KV-cache
  attention kernel is bit-exact (cos = 1.0) and **1.7–2.9× faster** than the production hand-PTX kernel on
  GB10 (true on-device A/B). FMA/softmax kernels are not DP4A-bound, so pure-Rust competes and wins.
- **Blackwell CUDA-graph replay fixed + re-enabled** (PMAT-886a). The default sm_121 Q4K GEMV variant
  was not recorded into the manual graph, so graph replay dropped ~6 GEMVs/layer → stale buffers →
  garbage (cosine 0.53). Now recorded; parity 0.53→0.9934 (== eager, token-for-token), graph decode
  re-defaulted ON for Blackwell, **+16% decode** (96→112 tok/s).
- **Blackwell decode throughput-floor guard** (PMAT-885). A stale-binary / F2-false-fallback that
  silently drops the GPU path to ~10 tok/s on CPU now violates a falsifiable invariant (≥100 tok/s on GB10).
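
The parity and throughput figures above rest on two simple measurements. As a minimal sketch, assuming outputs are compared as flat `f32` slices and throughput is tokens divided by wall-clock seconds, they could be computed as follows. `cosine_similarity` and `meets_throughput_floor` are hypothetical helper names, not functions from this repository.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns None for mismatched lengths, empty input, or a zero-norm vector.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    // Accumulate in f64 to limit rounding error on long vectors.
    let (mut dot, mut norm_a, mut norm_b) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some((dot / (norm_a.sqrt() * norm_b.sqrt())) as f32)
}

/// True when the measured decode rate reaches the floor (tokens per second).
pub fn meets_throughput_floor(tokens: u64, elapsed_secs: f64, floor_tok_per_s: f64) -> bool {
    elapsed_secs > 0.0 && (tokens as f64 / elapsed_secs) >= floor_tok_per_s
}
```

Under this definition a run at 112 tok/s clears a 100 tok/s floor while a ~10 tok/s CPU fallback does not, and 96→112 tok/s works out to a gain of roughly 16.7%, consistent with the reported +16%.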

### Infrastructure

- **Pre-release Gate 11** (`cargo publish -p aprender --dry-run`): catches the two classes of failure that
  broke the 0.50.0 cascade mid-publish (sibling path-deps missing a `version`; version-pinned sibling
  dev-deps forming publish cycles), which `cargo metadata` does not detect.
- **Dogfood Gate 18** (fresh-convert `.apr` inference parity vs GGUF, CPU+GPU): catches the PMAT-888
  class of bug, which `inspect`/`validate`/`tensors` and a stale pre-existing `.apr` all let through.
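
For illustration, Gate 11 can be driven from Rust with `std::process::Command`. This is a sketch under assumptions: the changelog gives only the cargo invocation, so `publish_dry_run` and `gate_passes` are hypothetical wrappers, and how the project actually wires the gate into its release script is not shown here.

```rust
use std::process::Command;

/// Builds (but does not run) the Gate 11 command quoted in the changelog:
/// `cargo publish -p <package> --dry-run`.
pub fn publish_dry_run(package: &str) -> Command {
    let mut cmd = Command::new("cargo");
    cmd.args(["publish", "-p", package, "--dry-run"]);
    cmd
}

/// Runs the command and reports whether the gate passed.
/// A failure to spawn cargo is treated as a failed gate (fail closed).
pub fn gate_passes(mut cmd: Command) -> bool {
    cmd.status().map(|status| status.success()).unwrap_or(false)
}
```

Calling `gate_passes(publish_dry_run("aprender"))` from the workspace root would then block a release whenever the dry run exits non-zero.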

## [0.50.0] - 2026-06-21

### Fixed
