Commit fc43023
perf: Ultra-optimize BitNet inference backend with SIMD dispatch, fused SwiGLU, and zero-alloc paths

- Wire AVX2 TL1 GEMV SIMD dispatch into backend hot path via tl1_avx2 module
  with scalar LUT fallback for non-x86_64 platforms
- Add ScratchPool with 17 pre-allocated FP32 buffers for zero-alloc forward pass
- Fuse SwiGLU gate+up projections with 4-wide unrolled loop and unsafe indexing
- Optimize RMSNorm with 4-way parallel accumulator and fused scale pass
- Optimize softmax with reciprocal multiply instead of per-element division
- Optimize fp32_matvec_transposed with 4-wide unrolled dot product
- Optimize GQA attention with 4-wide unrolled score computation and skip for
  negligible weights
- Add routing history tracking via Mutex<Vec<Vec<usize>>> for expert prediction
  (interior mutability preserves LlmBackend Send+Sync trait compatibility)
- Pre-allocate KV caches (512 positions) in load_gguf()
- Add tl1_gemv_into() for zero-allocation GEMV into caller-provided buffers
- All 203 bitnet tests pass
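
For readers unfamiliar with the RMSNorm and softmax items above, the following is a minimal sketch of the two techniques, not the code from this commit. The function names `rms_norm_into` and `softmax_in_place`, the `eps` parameter, and the use of safe slice iteration (rather than the unsafe indexing the commit mentions for SwiGLU) are all assumptions made for illustration.

```rust
/// RMSNorm written into a caller-provided buffer (no allocation).
/// Hypothetical name and signature; `eps` is an assumed stabilizer term.
pub fn rms_norm_into(x: &[f32], weight: &[f32], eps: f32, out: &mut [f32]) {
    assert_eq!(x.len(), weight.len());
    assert_eq!(x.len(), out.len());
    if x.is_empty() {
        return;
    }

    // Four independent accumulators break the serial dependency chain of a
    // single running sum, which lets the CPU overlap the multiply-adds.
    let mut acc = [0.0f32; 4];
    let chunks = x.chunks_exact(4);
    let tail = chunks.remainder();
    for c in chunks {
        acc[0] += c[0] * c[0];
        acc[1] += c[1] * c[1];
        acc[2] += c[2] * c[2];
        acc[3] += c[3] * c[3];
    }
    let mut sum_sq = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for &v in tail {
        sum_sq += v * v;
    }

    // Fused scale pass: normalization and the learned weight are applied
    // in one sweep over the data.
    let inv_rms = 1.0 / (sum_sq / x.len() as f32 + eps).sqrt();
    for ((o, &v), &w) in out.iter_mut().zip(x).zip(weight) {
        *o = v * inv_rms * w;
    }
}

/// In-place softmax that divides once and multiplies per element.
/// Hypothetical name; the max subtraction is the usual overflow guard.
pub fn softmax_in_place(x: &mut [f32]) {
    if x.is_empty() {
        return;
    }
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0f32;
    for v in x.iter_mut() {
        *v = (*v - max).exp();
        sum += *v;
    }
    // One division, then a multiply per element instead of a divide.
    let inv_sum = 1.0 / sum;
    for v in x.iter_mut() {
        *v *= inv_sum;
    }
}
```

Splitting the sum across accumulators changes the floating-point summation order, so results may differ from a single-accumulator loop in the last few bits.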
https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK1
Parent: 9266fea
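
The routing-history item relies on interior mutability so that history can be recorded through a shared reference. Below is a minimal sketch of that pattern, assuming a hypothetical `RoutingHistory` wrapper with `record` and `snapshot` methods; only the `Mutex<Vec<Vec<usize>>>` type comes from the commit message, and the real `LlmBackend` trait is not reproduced here.

```rust
use std::sync::Mutex;

/// Hypothetical wrapper around the history type named in the commit message.
/// Each inner Vec holds the expert indices chosen for one routing decision.
pub struct RoutingHistory {
    entries: Mutex<Vec<Vec<usize>>>,
}

impl RoutingHistory {
    pub fn new() -> Self {
        Self {
            entries: Mutex::new(Vec::new()),
        }
    }

    /// Records one routing decision through `&self`, so a caller that only
    /// holds a shared reference (as in a `&self` trait method) can still log.
    pub fn record(&self, experts: &[usize]) {
        // Recover the data even if another thread panicked while holding
        // the lock; for a best-effort history this is an assumed policy.
        let mut guard = self.entries.lock().unwrap_or_else(|e| e.into_inner());
        guard.push(experts.to_vec());
    }

    /// Returns a copy of the history for use by a predictor.
    pub fn snapshot(&self) -> Vec<Vec<usize>> {
        self.entries
            .lock()
            .unwrap_or_else(|e| e.into_inner())
            .clone()
    }
}

/// Compile-time check: this only type-checks for `T: Send + Sync`.
pub fn assert_send_sync<T: Send + Sync>() {}
```

A `RefCell` would also allow mutation through `&self`, but `RefCell<T>` is not `Sync`, so a struct containing one could not satisfy a `Send + Sync` bound; `Mutex<T>` is `Sync` whenever `T` is `Send`.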
1 file changed, 465 additions and 129 deletions