
CPU-Optimized Kernels for Interleaved GGUF Weights (Following llama.cpp) #3183

@DrJesseGlass

Description


llama.cpp achieves superior CPU performance through thread-optimized kernels that compute directly on GGUF's native weight layouts. Candle should follow this approach to match llama.cpp's CPU efficiency and support diverse GGUF architectures seamlessly.

GGUF has become the standard format for quantized models, with various internal layouts optimized for different architectures. llama.cpp's success comes from having kernels that compute directly on these native layouts, leveraging thread-optimized parallelization across attention heads and cache-friendly access patterns for interleaved data.

SmolLM3 (see Add SmolLM3: Full and Quantized Implementation #3180) is the first model in Candle requiring layout-aware kernels. Its GGUF files use an interleaved Q/K weight layout: heads are stored in an even/odd pattern across 128-row blocks, split between the first and second halves of the weight matrix.

GGUF Layout (optimized for llama.cpp's threading):

Block 0 (rows 0-127):    [H0_even, H1_odd, H2_even, H3_odd, ...]
Block 1 (rows 128-255):  [H4_even, H5_odd, H6_even, H7_odd, ...]
...
Second Half (rows 768+): Repeat pattern for remaining heads

Candle Expected Layout (sequential):

[H0_complete, H1_complete, H2_complete, H3_complete, ...]
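
To make the contrast concrete: going from the GGUF layout above to Candle's sequential layout is just a row permutation. The sketch below (plain Rust, a hypothetical helper, not Candle API) repacks rows through a caller-supplied logical-to-physical row map, since the exact even/odd mapping depends on the model's head count and block size. This copy is exactly the repacking cost that layout-aware kernels would avoid.

// Hypothetical repack step: permute interleaved GGUF rows into sequential
// per-head order before building the tensor. `row_map[dst_row] = src_row`
// must be derived from the model's actual head count / block size; it is
// deliberately not hard-coded here.
fn repack_rows(src: &[f32], row_map: &[usize], cols: usize) -> Vec<f32> {
    debug_assert_eq!(src.len(), row_map.len() * cols);
    let mut dst = vec![0.0f32; src.len()];
    for (dst_row, &src_row) in row_map.iter().enumerate() {
        let s = &src[src_row * cols..(src_row + 1) * cols];
        dst[dst_row * cols..(dst_row + 1) * cols].copy_from_slice(s);
    }
    dst
}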

I would like to implement CPU kernels with thread-optimized matmul for interleaved weight layouts (F16, Q8_0, and Q4_K formats).
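
As a rough sketch of what such a kernel could look like, the snippet below parallelizes a matvec per attention head with rayon and reads each row from its interleaved physical position through the same row map, so no repack is needed. It uses f32 to keep the sketch short; a real kernel would dequantize Q8_0 / Q4_K blocks in the inner loop. All names here are illustrative assumptions, not existing Candle or llama.cpp APIs.

use rayon::prelude::*;

/// Minimal sketch of a layout-aware matvec: each worker handles one head's
/// output rows, reading them from their interleaved physical positions via
/// `row_map` instead of repacking the weight matrix first.
fn matvec_interleaved(
    weights: &[f32],   // interleaved rows, `cols` values per row
    x: &[f32],         // input vector, length `cols`
    row_map: &[usize], // logical row -> physical row in `weights`
    cols: usize,
    head_dim: usize,
) -> Vec<f32> {
    let n_rows = row_map.len();
    let mut y = vec![0.0f32; n_rows];
    // Parallelize over heads: each task writes a contiguous logical slice of
    // the output while reading the physically interleaved rows it needs.
    y.par_chunks_mut(head_dim)
        .enumerate()
        .for_each(|(head, out)| {
            for (i, o) in out.iter_mut().enumerate() {
                let phys = row_map[head * head_dim + i];
                let row = &weights[phys * cols..(phys + 1) * cols];
                *o = row.iter().zip(x).map(|(w, v)| w * v).sum();
            }
        });
    y
}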

Question for Maintainers

Are thread-optimized CPU kernels that compute directly on native GGUF layouts a priority for Candle?
