Skip to content

v0.55.0 — runnable models + reconciled GPU parity + autograd training proven

Latest

Choose a tag to compare

@noahgift noahgift released this 24 Jun 15:24
· 22 commits to main since this release
6f7c864

v0.55.0 — runnable models, reconciled GPU parity, and the autograd training story proven

A correctness + portability wave. The headline pair: (1) apr convert/apr export now produce runnable models for tied-embedding architectures; (2) an end-to-end training proof caught that the transformer FFN was still severing the autograd graph after the v0.53/v0.54 "complete" sweep — per-layer gradchecks never saw it; a real train-to-loss test did. Plus the Blackwell GPU/CPU parity gate reconciled against ground truth, an Ollama HTTP drop-in, and a real cross-silicon portability crash fix. Every fix ships a named proof-obligation + a mutation-verified RED-on-bug/GREEN-on-fix falsifier + a pv-validated contract.

Correctness

  • apr convert --quantize q4k produced a non-runnable .apr for tied-embedding models (PMAT-918, #2209) — the Q4K save path never synthesized the tied lm_head (the f32 path did), so the model failed to load. Tie-synthesis hoisted before quant dispatch; verified runnable end-to-end on Blackwell.
  • apr export --format gguf silently mis-inferred num_heads on metadata-light .apr (PMAT-920, #2212) — a first-divisor guess would stamp e.g. Qwen2-1.5B as 24 heads (true 12). Now uses the explicit head_dim for exact num_heads, and hard-fails with an actionable error (no GGUF written) when genuinely absent — never a silently-wrong model.
  • GPU/CPU parity gate falsely rejected the correct Blackwell kernel (PMAT-919, #2210) — reconciled against ground truth (llama.cpp + CPU-Q4K, per-position, on 1.5B/7B/8B): fp32-Mwv is the correct Blackwell default; HwDp4a is genuinely degraded (INT8-activation quant). The F2 gate now checks per-position argmax-match + min-cosine over positions ≥1, replacing the last-token-only check that let a degraded kernel pass. On-device verified on Ada (4090) + Blackwell; 7B serves coherently on GPU.
  • Autograd: the transformer FFN gelu severed the graph (PMAT-921, #2213) — TransformerEncoderLayer's FFN built output via Tensor::from_vec (no grad_fn), freezing ffn.linear1 + norm2 γ/β in every real training run while isolated gradchecks stayed green. Caught by a new end-to-end train-to-loss test (loss 3.565 → 1.4e-5, every param group updates). The proof per-layer gradchecks can't give.

Performance / GPU

  • cuda-oxide RoPE kernel (#2215) — adjacent-pair RoPE ported to a pure-Rust #[kernel]; on-device A/B on GB10 Blackwell (sm_121): bit-exact (cos=1.0) and a clean tie with hand-PTX at the DRAM-bandwidth roofline ⟹ migrate-for-free off the hand-PTX + Blackwell-JIT path. Fourth GO kernel (attention, RMSNorm, SwiGLU, RoPE).

Compatibility / Portability

  • Ollama /api/chat + /api/generate (PMAT-923, #2216) — wired into every apr serve router (APR/GGUF/SafeTensors/WGPU), making apr serve a drop-in Ollama HTTP target for non-streaming clients (NDJSON streaming is a documented follow-up).
  • wgpu adapter-enumeration SIGABRT on Linux/AMD-RADV (PMAT-925, #2217) — enumerate_adapters(Backends::all()) instantiated a GLES/EGL adapter whose Drop panics → process abort. Constrained enumeration to Backends::PRIMARY (never GLES); cross-silicon verified on AMD-Vulkan (RADV) + Apple-Metal, no regression. Found by the 4-corner silicon-matrix verification.

Build / CI

  • Gated the duckdb competitive bench (#2208) and the coop_gemm_bench wgpu-27 example (#2211) behind features — eliminates the merge_group cold-build flake and the --all-targets break.

Crates.io publish is handled separately.