Commit fc43023

committed

perf: Ultra-optimize BitNet inference backend with SIMD dispatch, fused SwiGLU, and zero-alloc paths

- Wire AVX2 TL1 GEMV SIMD dispatch into backend hot path via tl1_avx2 module with scalar LUT fallback for non-x86_64 platforms - Add ScratchPool with 17 pre-allocated FP32 buffers for zero-alloc forward pass - Fuse SwiGLU gate+up projections with 4-wide unrolled loop and unsafe indexing - Optimize RMSNorm with 4-way parallel accumulator and fused scale pass - Optimize softmax with reciprocal multiply instead of per-element division - Optimize fp32_matvec_transposed with 4-wide unrolled dot product - Optimize GQA attention with 4-wide unrolled score computation and skip for negligible weights - Add routing history tracking via Mutex<Vec<Vec<usize>>> for expert prediction (interior mutability preserves LlmBackend Send+Sync trait compatibility) - Pre-allocate KV caches (512 positions) in load_gguf() - Add tl1_gemv_into() for zero-allocation GEMV into caller-provided buffers - All 203 bitnet tests pass https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK

1 parent 9266fea commit fc43023Copy full SHA for fc43023

1 file changed

crates/ruvllm/src/bitnet
- backend.rs

Comments

(0)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Commit fc43023

File tree

0 commit comments