v0.55.0 — runnable models, reconciled GPU parity, and the autograd training story proven
A correctness + portability wave. The headline pair: (1) apr convert/apr export now produce runnable models for tied-embedding architectures; (2) an end-to-end training proof caught that the transformer FFN was still severing the autograd graph after the v0.53/v0.54 "complete" sweep — per-layer gradchecks never saw it; a real train-to-loss test did. Plus the Blackwell GPU/CPU parity gate reconciled against ground truth, an Ollama HTTP drop-in, and a real cross-silicon portability crash fix. Every fix ships a named proof-obligation + a mutation-verified RED-on-bug/GREEN-on-fix falsifier + a pv-validated contract.
Correctness
apr convert --quantize q4kproduced a non-runnable.aprfor tied-embedding models (PMAT-918, #2209) — the Q4K save path never synthesized the tiedlm_head(the f32 path did), so the model failed to load. Tie-synthesis hoisted before quant dispatch; verified runnable end-to-end on Blackwell.apr export --format ggufsilently mis-inferrednum_headson metadata-light.apr(PMAT-920, #2212) — a first-divisor guess would stamp e.g. Qwen2-1.5B as 24 heads (true 12). Now uses the explicithead_dimfor exactnum_heads, and hard-fails with an actionable error (no GGUF written) when genuinely absent — never a silently-wrong model.- GPU/CPU parity gate falsely rejected the correct Blackwell kernel (PMAT-919, #2210) — reconciled against ground truth (llama.cpp + CPU-Q4K, per-position, on 1.5B/7B/8B): fp32-
Mwvis the correct Blackwell default;HwDp4ais genuinely degraded (INT8-activation quant). The F2 gate now checks per-position argmax-match + min-cosine over positions ≥1, replacing the last-token-only check that let a degraded kernel pass. On-device verified on Ada (4090) + Blackwell; 7B serves coherently on GPU. - Autograd: the transformer FFN
gelusevered the graph (PMAT-921, #2213) —TransformerEncoderLayer's FFN built output viaTensor::from_vec(nograd_fn), freezingffn.linear1+norm2γ/β in every real training run while isolated gradchecks stayed green. Caught by a new end-to-end train-to-loss test (loss 3.565 → 1.4e-5, every param group updates). The proof per-layer gradchecks can't give.
Performance / GPU
- cuda-oxide RoPE kernel (#2215) — adjacent-pair RoPE ported to a pure-Rust
#[kernel]; on-device A/B on GB10 Blackwell (sm_121): bit-exact (cos=1.0) and a clean tie with hand-PTX at the DRAM-bandwidth roofline ⟹ migrate-for-free off the hand-PTX + Blackwell-JIT path. Fourth GO kernel (attention, RMSNorm, SwiGLU, RoPE).
Compatibility / Portability
- Ollama
/api/chat+/api/generate(PMAT-923, #2216) — wired into everyapr serverouter (APR/GGUF/SafeTensors/WGPU), makingapr servea drop-in Ollama HTTP target for non-streaming clients (NDJSON streaming is a documented follow-up). - wgpu adapter-enumeration SIGABRT on Linux/AMD-RADV (PMAT-925, #2217) —
enumerate_adapters(Backends::all())instantiated a GLES/EGL adapter whoseDroppanics → process abort. Constrained enumeration toBackends::PRIMARY(never GLES); cross-silicon verified on AMD-Vulkan (RADV) + Apple-Metal, no regression. Found by the 4-corner silicon-matrix verification.
Build / CI
- Gated the duckdb competitive bench (#2208) and the
coop_gemm_benchwgpu-27 example (#2211) behind features — eliminates the merge_group cold-build flake and the--all-targetsbreak.
Crates.io publish is handled separately.