ggml-cpu: add repack GEMM and GEMV for floating-point by taimur-10x · Pull Request #17791 · ggml-org/llama.cpp

taimur-10x · 2025-12-05T11:22:55Z

Summary

This PR adds repacking and GEMM/GEMV kernels for floating-point types (FP16 and FP32) on RVV (with the zvfh extension).

Key Changes

- Added RVV kernels for GEMM with tiling:
  - 7 x {16, 32, 64, 128} (selected based on VLEN)
- Added RVV kernels for GEMV with tiling:
  - 1 x {16, 32, 64, 128} (selected based on VLEN)
- Added scalar functions for repacking. They support arbitrary tile sizes.
- Added generic scalar fallbacks for GEMM/GEMV operations.
- Refactored ggml_quantize_mat_t to ggml_repack_mat_t to allow a common interface for both quantization and floating-point repacking.
- Added a template parameter NB_ROWS to select the number of rows to interleave for repacking. Previously, this was fixed at 4.

Tile Sizes

The repack operation interleaves N rows of activations with an interleave size of K, and M columns of weights with an interleave size of K.
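As a rough illustration of what interleaving means here, the sketch below packs groups of `n_rows` rows so that blocks of `k` consecutive elements from each row alternate in the output. The helper name `interleave_rows` and the list-of-lists layout are assumptions made for this example; the scalar repack functions in the PR operate on raw buffers and may order data differently.

```python
from typing import List, Sequence


def interleave_rows(rows: Sequence[Sequence[float]], n_rows: int, k: int) -> List[List[float]]:
    """Interleave groups of `n_rows` rows in blocks of `k` elements.

    Hypothetical helper for illustration only, not the PR's repack code.
    """
    if len(rows) % n_rows != 0:
        raise ValueError("row count must be a multiple of n_rows")
    width = len(rows[0])
    if width % k != 0:
        raise ValueError("row length must be a multiple of k")

    packed = []
    for base in range(0, len(rows), n_rows):
        group = rows[base:base + n_rows]
        out = []
        # Walk along the shared dimension one block at a time and emit
        # the same block from every row of the group before moving on.
        for col in range(0, width, k):
            for row in group:
                out.extend(row[col:col + k])
        packed.append(out)
    return packed


# Two rows interleaved element by element (k = 1).
rows = [[0, 1, 2], [10, 11, 12]]
packed = interleave_rows(rows, n_rows=2, k=1)
print(packed)
```

With `k=1`, elements that share the same position along the reduction dimension end up adjacent in memory, which is what lets a kernel fetch one value per row (or column) of the tile with a single contiguous load.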

NxK is fixed at 7x1. This introduces 7 accumulators with LMUL=4 (7 x 4 = 28 registers), each accumulating M results.
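A scalar reference for this tile shape may help. With K=1, each step along the shared dimension takes one activation value per row, multiplies it against M contiguous weight values, and adds the products to that row's accumulator. The names `gemm_tile`, `act`, and `wgt` are assumptions for this sketch; the RVV kernels in the PR keep the accumulators in vector registers rather than Python lists.

```python
def gemm_tile(act, wgt, n=7, m=32):
    """Compute one n x m output tile.

    act: n rows, each holding `depth` activation values.
    wgt: `depth` rows, each holding the m weights for that depth step.
    Illustrative reference only, not the PR's kernel.
    """
    depth = len(wgt)
    # One accumulator of m lanes per activation row (7 accumulators for n = 7).
    acc = [[0.0] * m for _ in range(n)]
    for d in range(depth):
        w_row = wgt[d]  # m weights for this depth step
        for r in range(n):
            a = act[r][d]  # one activation scalar, applied to all m lanes
            for c in range(m):
                acc[r][c] += a * w_row[c]
    return acc


# Tiny example: depth of 2, all rows identical.
act = [[1.0, 2.0] for _ in range(7)]
wgt = [[1.0] * 32, [0.5] * 32]
tile = gemm_tile(act, wgt)
print(len(tile), len(tile[0]), tile[0][0])
```

Since RVV provides 32 vector registers, 7 accumulators at LMUL=4 occupy 28 of them, leaving one LMUL=4 group for the weight load.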

M is varied based on the available VLEN:

VLEN	Tile Size (N x M x K)
128	7 x 16 x 1
256	7 x 32 x 1
512	7 x 64 x 1
1024	7 x 128 x 1

M is the maximum number of values that can be loaded in a single vector register group (LMUL=2 for F16, LMUL=4 for F32).
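The table follows from the usual RVV relation between register width, register grouping, and element width: a register group holds VLEN x LMUL / SEW elements. The short sketch below reproduces the M column under that relation; `tile_m` is a name chosen for this example.

```python
def tile_m(vlen_bits: int, lmul: int, sew_bits: int) -> int:
    """Number of elements that fit in one vector register group."""
    return vlen_bits * lmul // sew_bits


for vlen in (128, 256, 512, 1024):
    m_f16 = tile_m(vlen, lmul=2, sew_bits=16)  # F16 loaded with LMUL=2
    m_f32 = tile_m(vlen, lmul=4, sew_bits=32)  # F32 loaded with LMUL=4
    print(f"VLEN={vlen:>4}: M(F16)={m_f16}, M(F32)={m_f32}")
```

Both element types give the same M for a given VLEN, which is why a single tile width per VLEN covers F16 and F32.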

Testing

Kernels were functionally tested on QEMU for VLENs (128-bit, 256-bit, 512-bit and 1024-bit) for a range of input sizes.

Benchmarking Results

End-to-end benchmarking on BananaPI-BPI-F3 (VLEN=256) with llama-bench (Threads=8).

Prefill / Prompt Processing (GEMM)

Tokens / Second

Model	Prompt Size	Repack GEMM (7x32)	Vec Dot
Tinyllama F16 1.1B	28	24.72	8.31
Tinyllama F16 1.1B	32	16.72	8.42
Tinyllama F16 1.1B	64	22.55	8.57
Tinyllama F16 1.1B	128	22.78	8.78
Tinyllama F16 1.1B	256	21.82	8.57
Tinyllama F16 1.1B	512	21.81	8.68

Model	Prompt Size	Repack GEMM (7x32)	Vec Dot
Tinyllama F32 1.1B	28	11.45	3.72
Tinyllama F32 1.1B	32	7.13	3.75
Tinyllama F32 1.1B	64	10.76	3.74
Tinyllama F32 1.1B	128	10.86	3.73
Tinyllama F32 1.1B	256	10.94	3.68
Tinyllama F32 1.1B	512	11.12	3.79

Result: ~2x-3x speedup over vec_dot
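The per-row ratios behind this summary can be recomputed from the two tables. The snippet below only divides the reported numbers; the dictionary layout is an arbitrary choice for this example.

```python
# Tokens/second copied from the prefill tables: (repack GEMM, vec_dot).
prefill = {
    ("F16", 28): (24.72, 8.31),
    ("F16", 32): (16.72, 8.42),
    ("F16", 64): (22.55, 8.57),
    ("F16", 128): (22.78, 8.78),
    ("F16", 256): (21.82, 8.57),
    ("F16", 512): (21.81, 8.68),
    ("F32", 28): (11.45, 3.72),
    ("F32", 32): (7.13, 3.75),
    ("F32", 64): (10.76, 3.74),
    ("F32", 128): (10.86, 3.73),
    ("F32", 256): (10.94, 3.68),
    ("F32", 512): (11.12, 3.79),
}

speedups = {key: repack / vec_dot for key, (repack, vec_dot) in prefill.items()}
slowest = min(speedups, key=speedups.get)
fastest = max(speedups, key=speedups.get)

for (dtype, prompt), ratio in sorted(speedups.items()):
    print(f"{dtype} prompt={prompt:>3}: {ratio:.2f}x")
print("lowest:", slowest, "highest:", fastest)
```

The lowest ratios fall on prompt size 32 for both types, and the highest on prompt size 28.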

Decode (GEMV)

Tokens / Second

Model	Decode Size (Prompt=32)	Repack GEMV (1x32)	Vec Dot
Tinyllama F16 1.1B	10	3.37	3.11
Tinyllama F16 1.1B	16	3.29	3.45
Tinyllama F16 1.1B	32	3.12	3.25
Tinyllama F16 1.1B	64	3.23	3.27
Tinyllama F16 1.1B	100	3.04	3.15
Tinyllama F16 1.1B	128	3.09	3.2
Tinyllama F16 1.1B	256	3.15	3.19

Model	Decode Size (Prompt=32)	Repack GEMV (1x32)	Vec Dot
Tinyllama F32 1.1B	10	1.66	1.74
Tinyllama F32 1.1B	16	1.73	1.63
Tinyllama F32 1.1B	32	1.81	1.68
Tinyllama F32 1.1B	64	1.61	1.69
Tinyllama F32 1.1B	100	1.72	1.75
Tinyllama F32 1.1B	128	1.76	1.72
Tinyllama F32 1.1B	256	1.75	1.69

Result: No noticeable improvement, as decode remains memory-bound.
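A back-of-the-envelope check is consistent with the memory-bound reading. Assuming every weight is read once per generated token, and taking the nominal 1.1B parameter count from the model name, tokens/second times model size gives an implied effective weight bandwidth. Both assumptions are simplifications (KV cache and activation traffic are ignored), so the figures are rough estimates only.

```python
# Repack GEMV tokens/second copied from the decode tables.
decode_f16 = [3.37, 3.29, 3.12, 3.23, 3.04, 3.09, 3.15]
decode_f32 = [1.66, 1.73, 1.81, 1.61, 1.72, 1.76, 1.75]

N_PARAMS = 1.1e9  # nominal parameter count from the model name (assumption)


def implied_bandwidth_gbs(tokens_per_s, bytes_per_weight):
    """Effective GB/s if each weight is streamed once per generated token."""
    mean_tps = sum(tokens_per_s) / len(tokens_per_s)
    return mean_tps * N_PARAMS * bytes_per_weight / 1e9


bw_f16 = implied_bandwidth_gbs(decode_f16, bytes_per_weight=2)
bw_f32 = implied_bandwidth_gbs(decode_f32, bytes_per_weight=4)
print(f"F16: ~{bw_f16:.1f} GB/s, F32: ~{bw_f32:.1f} GB/s")
```

F32 weights are twice the size of F16 weights and decode at roughly half the rate, so the implied bandwidth is similar for both, which is the pattern expected when weight streaming rather than arithmetic sets the pace.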

Perplexity

Calculated on BananaPI-BPI-F3 with VLEN=256

./build/bin/llama-perplexity -f /path/to/wiki.test.raw -m models/tinyllama-1.1B-f16.gguf

Model	Master	Repack
Tinyllama 1.1B F16	8.6971 +/- 0.05412	8.6969 +/- 0.05412

Additional Notes

The current fallback model requires every architecture to have a scalar fallback for each implementation. This clutters arch-fallback.h, as 7xMx1 is a very RVV-specific tiling and should not be used by other architectures.
GEMM reaches peak performance when the prompt length is a multiple of 7 (for example, prompt=28). Leftover tokens currently fall back to GEMV, which hurts performance. Ideally, there should be leftover Nx32 kernels, one for each case of 2-6 leftover tokens.
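The leftover effect can be made concrete by splitting each benchmarked prompt size into full 7-row tiles and leftover rows. This sketch treats the whole prompt as one batch, which is an assumption; the actual batch splitting inside llama.cpp may differ. `split_prompt` is a name chosen for this example.

```python
def split_prompt(n_tokens: int, tile_rows: int = 7):
    """Return (full GEMM tiles, leftover rows handled by GEMV)."""
    return divmod(n_tokens, tile_rows)


for n in (28, 32, 64, 128, 256, 512):
    tiles, leftover = split_prompt(n)
    share = 100.0 * leftover / n
    print(f"prompt={n:>3}: {tiles:>2} full tiles, {leftover} leftover rows ({share:.1f}% of rows)")
```

Prompt size 28 has no leftover rows, while prompt size 32 leaves 4 of its 32 rows to the GEMV path, which lines up with 32 being the slowest repack row in the prefill tables.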

Future Work

Subsequent PRs plan to add RVV kernels for quantization types, as well as extend existing quantization support to other VLENs.

References

The selection of the 7x32x1 tiling is based on the mmt4d kernel in IREE for RVV: iree-org/iree#20263

taimur-10x · 2026-01-14T15:52:57Z

@ggerganov, could this be reviewed please? Thank you.

ixgbe · 2026-01-21T01:13:21Z

@taimur-10x, could you please provide the download links for the Tinyllama F16 1.1B and Tinyllama F32 1.1B models? Also, it would be really helpful if you could share the exact commands or script used to obtain these performance numbers (especially for the Prefill / Prompt Processing benchmarks with GEMM). I'd like to reproduce the results.
Thanks in advance! 🙌

taimur-10x · 2026-01-21T10:36:08Z

Sure, here's the link for the model: https://huggingface.co/TinyLlama/TinyLlama_v1.1
I believe this is in fp32. This can be converted to fp16 through llama-quantize.

All benchmarking was done through llama-bench on BananaPI-BPI-F3.

llama-bench -m models/tinyllama-1.1B-f16.gguf -p 32,64,128,256,512 -n 10,16,32,64,100,128,256 -t 8 -r 10

ixgbe · 2026-02-06T02:12:44Z

Could you please share the PPL comparison against the master branch? I personally use the following command for verification:
./build/bin/llama-perplexity -f /path/to/wiki.test.raw -m models/tinyllama-1.1B-f16.gguf
Download link: https://cosmo.zip/pub/datasets/wikitext-2-raw/

taimur-10x · 2026-02-14T22:56:31Z

@ixgbe, here are the perplexity numbers for fp16 (on BananaPI-BPI-F3 with VLEN=256):

./build/bin/llama-perplexity -f /path/to/wiki.test.raw -m models/tinyllama-1.1B-f16.gguf

Model	Master	Repack
Tinyllama 1.1B F16	8.6971 +/- 0.05412	8.6969 +/- 0.05412

taimur-10x · 2026-06-03T11:47:05Z

@ggml-org/ggml-riscv

taimur-10x requested a review from ggerganov as a code owner December 5, 2025 11:22

taimur-10x marked this pull request as draft December 5, 2025 11:24

loci-dev mentioned this pull request Dec 5, 2025

UPSTREAM PR #17791: ggml-cpu: add repack GEMM and GEMV for floating-point auroralabs-loci/llama.cpp#453

Open

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Dec 5, 2025

taimur-10x force-pushed the 10x-repack-fp branch 2 times, most recently from 299d1fc to de577c0 Compare December 23, 2025 12:38

taimur-10x marked this pull request as ready for review December 23, 2025 15:18

taimur-10x force-pushed the 10x-repack-fp branch from de577c0 to d3e2d79 Compare January 9, 2026 09:01

taimur-10x force-pushed the 10x-repack-fp branch from d3e2d79 to 28e07aa Compare January 27, 2026 11:59

loci-dev mentioned this pull request Feb 6, 2026

UPSTREAM PR #17791: ggml-cpu: add repack GEMM and GEMV for floating-point auroralabs-loci/llama.cpp#1155

Open

taimur-10x force-pushed the 10x-repack-fp branch 2 times, most recently from 84a71b3 to e8dd1b4 Compare February 14, 2026 17:22

taimur-10x force-pushed the 10x-repack-fp branch 4 times, most recently from b28b4c5 to 2db2e9f Compare March 4, 2026 17:50

taimur-10x force-pushed the 10x-repack-fp branch from 2db2e9f to fd94e4c Compare April 2, 2026 17:17

taimur-10x force-pushed the 10x-repack-fp branch from fd94e4c to 8856f8a Compare April 24, 2026 12:09

taimur-10x force-pushed the 10x-repack-fp branch 2 times, most recently from e897076 to d89139f Compare May 26, 2026 23:08

taimur-10x force-pushed the 10x-repack-fp branch from d89139f to 2f769a9 Compare June 3, 2026 11:44

taimur-10x force-pushed the 10x-repack-fp branch from 2f769a9 to d1954e1 Compare June 10, 2026 12:58

taimur-10x added 2 commits June 16, 2026 15:51

ggml-cpu: add repack GEMM and GEMV for floating-point

d2c2492

ggml-cpu: add repack GEMM and GEMV for floating-point (#4)

34a1bb0

taimur-10x force-pushed the 10x-repack-fp branch from d1954e1 to 34a1bb0 Compare June 16, 2026 10:51

taimur-10x wants to merge 2 commits into ggml-org:master from riseproject-dev:10x-repack-fp
