Description
llama.cpp achieves superior CPU performance through thread-optimized kernels that compute directly on GGUF's native weight layouts. Candle should follow this approach to match llama.cpp's CPU efficiency and support diverse GGUF architectures seamlessly.
GGUF has become the standard format for quantized models, with various internal layouts optimized for different architectures. llama.cpp's success comes from having kernels that compute directly on these native layouts, leveraging thread-optimized parallelization across attention heads and cache-friendly access patterns for interleaved data.
SmolLM3 (see Add SmolLM3: Full and Quantized Implementation #3180) is the first model in Candle requiring layout-aware kernels. Its GGUF files use an interleaved Q/K weight layout: heads are stored in an even/odd pattern across 128-row blocks, split between the first and second halves of the weight matrix.
GGUF layout (optimized for llama.cpp's threading):

```
Block 0 (rows 0-127):    [H0_even, H1_odd, H2_even, H3_odd, ...]
Block 1 (rows 128-255):  [H4_even, H5_odd, H6_even, H7_odd, ...]
...
Second half (rows 768+): repeat pattern for remaining heads
```

Candle expected layout (sequential):

```
[H0_complete, H1_complete, H2_complete, H3_complete, ...]
```
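To make the idea concrete, here is a minimal sketch of what a layout-aware, thread-parallel kernel could look like: instead of repacking the tensor, the matvec reads weight rows through a row-mapping closure that translates Candle's sequential head order into the physical GGUF row order. The names (`matvec_with_layout`, `row_map`) are hypothetical and not existing Candle APIs, and the exact SmolLM3 mapping would need to be derived from the GGUF files in #3180.

```rust
use rayon::prelude::*;

/// y[i] = dot(W[row_map(i)], x) for each logical output row i.
/// `w` is the raw weight data (row-major, `n_rows` x `n_cols`);
/// `row_map` translates a logical (sequential) row index into the
/// physical row index used by the interleaved GGUF layout.
fn matvec_with_layout(
    w: &[f32],
    x: &[f32],
    n_rows: usize,
    n_cols: usize,
    row_map: impl Fn(usize) -> usize + Sync,
) -> Vec<f32> {
    let mut y = vec![0f32; n_rows];
    // Parallelize over output rows; each task reads one contiguous
    // physical row, keeping the access pattern cache-friendly.
    y.par_iter_mut().enumerate().for_each(|(i, out)| {
        let src = row_map(i) * n_cols;
        *out = w[src..src + n_cols]
            .iter()
            .zip(x)
            .map(|(a, b)| a * b)
            .sum();
    });
    y
}
```

The same pattern extends to parallelizing across attention heads rather than individual rows, which is closer to how llama.cpp distributes work.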
I would like to implement CPU kernels with thread-optimized matmul for interleaved weight layouts (F16, Q8_0, and Q4_K formats).
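For the quantized path, the kernels would dequantize on the fly inside the dot product rather than materializing an F32 copy. Below is a rough sketch for Q8_0 (GGUF stores 32 int8 weights plus one f16 scale per block); the `BlockQ8_0` struct and `vec_dot_q8_0` function are illustrative stand-ins, not Candle's existing types, and Q4_K would follow the same pattern with its super-block scales.

```rust
use half::f16;

#[repr(C)]
struct BlockQ8_0 {
    d: f16,       // per-block scale
    qs: [i8; 32], // 32 quantized weights
}

/// dot(row, x) where `row` is one weight row stored as Q8_0 blocks
/// and `x` is the dense activation vector (len = 32 * row.len()).
fn vec_dot_q8_0(row: &[BlockQ8_0], x: &[f32]) -> f32 {
    row.iter()
        .zip(x.chunks_exact(32))
        .map(|(block, xs)| {
            // Dequantize on the fly: w = d * q.
            let partial: f32 = block
                .qs
                .iter()
                .zip(xs)
                .map(|(&q, &xv)| q as f32 * xv)
                .sum();
            block.d.to_f32() * partial
        })
        .sum()
}
```

In a layout-aware kernel, each worker thread would run this per-row dot product over the physical rows selected by the head mapping sketched above.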
Question for Maintainers
Is CPU-optimized threading for GGUF layouts a priority?