Description
llama.cpp achieves superior CPU performance through thread-optimized kernels that compute directly on GGUF's native weight layouts. Candle should follow this approach to match llama.cpp's CPU efficiency and support diverse GGUF architectures seamlessly.
GGUF has become the standard format for quantized models, with various internal layouts optimized for different architectures. llama.cpp's success comes from having kernels that compute directly on these native layouts, leveraging thread-optimized parallelization across attention heads and cache-friendly access patterns for interleaved data.
SmolLM3 (see Add SmolLM3: Full and Quantized Implementation #3180) is the first model in Candle requiring layout-aware kernels. Its GGUF files use an interleaved Q/K weight layout: heads are stored in an even/odd pattern across 128-row blocks, split between the first and second halves of the weight matrix.
GGUF layout (optimized for llama.cpp's threading):

```
Block 0 (rows 0-127):    [H0_even, H1_odd, H2_even, H3_odd, ...]
Block 1 (rows 128-255):  [H4_even, H5_odd, H6_even, H7_odd, ...]
...
Second half (rows 768+): repeat pattern for remaining heads
```

Candle expected layout (sequential):

```
[H0_complete, H1_complete, H2_complete, H3_complete, ...]
```
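To make the idea concrete, here is a minimal sketch of what a layout-aware, thread-parallel kernel could look like: instead of repacking the tensor, the matvec reads weight rows through a row-mapping closure that translates Candle's sequential head order into the physical GGUF row order. The names (`matvec_with_layout`, `row_map`) are hypothetical and not existing Candle APIs, and the exact SmolLM3 mapping would need to be derived from the GGUF files in #3180.

```rust
use rayon::prelude::*;

/// y[i] = dot(W[row_map(i)], x) for each logical output row i.
/// `w` is the raw weight data (row-major, `n_rows` x `n_cols`);
/// `row_map` translates a logical (sequential) row index into the
/// physical row index used by the interleaved GGUF layout.
fn matvec_with_layout(
    w: &[f32],
    x: &[f32],
    n_rows: usize,
    n_cols: usize,
    row_map: impl Fn(usize) -> usize + Sync,
) -> Vec<f32> {
    let mut y = vec![0f32; n_rows];
    // Parallelize over output rows; each task reads one contiguous
    // physical row, keeping the access pattern cache-friendly.
    y.par_iter_mut().enumerate().for_each(|(i, out)| {
        let src = row_map(i) * n_cols;
        *out = w[src..src + n_cols]
            .iter()
            .zip(x)
            .map(|(a, b)| a * b)
            .sum();
    });
    y
}
```

The same pattern extends to parallelizing across attention heads rather than individual rows, which is closer to how llama.cpp distributes work.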
I would like to implement CPU kernels with thread-optimized matmul for interleaved weight layouts (F16, Q8_0, and Q4_K formats).
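For the quantized path, the kernels would dequantize on the fly inside the dot product rather than materializing an F32 copy. Below is a rough sketch for Q8_0 (GGUF stores 32 int8 weights plus one f16 scale per block); the `BlockQ8_0` struct and `vec_dot_q8_0` function are illustrative stand-ins, not Candle's existing types, and Q4_K would follow the same pattern with its super-block scales.

```rust
use half::f16;

#[repr(C)]
struct BlockQ8_0 {
    d: f16,       // per-block scale
    qs: [i8; 32], // 32 quantized weights
}

/// dot(row, x) where `row` is one weight row stored as Q8_0 blocks
/// and `x` is the dense activation vector (len = 32 * row.len()).
fn vec_dot_q8_0(row: &[BlockQ8_0], x: &[f32]) -> f32 {
    row.iter()
        .zip(x.chunks_exact(32))
        .map(|(block, xs)| {
            // Dequantize on the fly: w = d * q.
            let partial: f32 = block
                .qs
                .iter()
                .zip(xs)
                .map(|(&q, &xv)| q as f32 * xv)
                .sum();
            block.d.to_f32() * partial
        })
        .sum()
}
```

In a layout-aware kernel, each worker thread would run this per-row dot product over the physical rows selected by the head mapping sketched above.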
Question for Maintainers
Is CPU-optimized threading for GGUF layouts a priority?