ggml-cpu: add repack GEMM and GEMV for floating-point #17791
@ggerganov, could this be reviewed please? Thank you.
@taimur-10x, could you please provide the download links for the TinyLlama F16 1.1B and TinyLlama F32 1.1B models? It would also be really helpful if you could share the exact commands or script used to obtain these performance numbers (especially for the Prefill / Prompt Processing benchmarks with GEMM). I'd like to reproduce the results.
Sure, here's the link for the model: https://huggingface.co/TinyLlama/TinyLlama_v1.1 All benchmarking was done through |
Could you please share the PPL comparison against the master branch? I personally use the following command for verification:
@ixgbe, here are the perplexity numbers for
@ggml-org/ggml-riscv

Summary
This PR adds repacking and GEMM/GEMV kernels for floating-point types (FP16 and FP32) for RVV (with the zvfh extension).
Key Changes
- GEMM tiling: 7 x {16, 32, 64, 128} (selected based on VLEN)
- GEMV tiling: 1 x {16, 32, 64, 128} (selected based on VLEN)
- ggml_quantize_mat_t is refactored to ggml_repack_mat_t to provide a common interface for both quantization and floating-point repacking.
- NB_ROWS is added to select the number of rows to interleave for repacking. Previously, this was fixed at 4.
Tile Sizes
The repack operation interleaves N rows of activations with an interleave size of K, and M columns of weights with an interleave size of K. NxK is fixed at 7x1. This introduces 7 accumulators with LMUL=4 (7 x 4 = 28 registers), each accumulating M results. M varies with the available VLEN: it is the maximum number of values that can be loaded in a single register group (LMUL=2 for F16, LMUL=4 for F32).
Testing
Kernels were functionally tested on QEMU for VLENs (128-bit, 256-bit, 512-bit and 1024-bit) for a range of input sizes.
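The tile width M exercised at each of these VLENs follows from the rule in Tile Sizes: M is the number of elements that fit in one register group (LMUL=2 for 16-bit F16, LMUL=4 for 32-bit F32). The sketch below is only an illustration of that arithmetic; the helper names are hypothetical and are not part of ggml.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a ggml function): number of elements that fit in
 * one vector register group of `lmul` registers, given VLEN and the element
 * width, both in bits. */
static size_t tile_width_m(size_t vlen_bits, size_t lmul, size_t elem_bits) {
    assert(elem_bits > 0 && vlen_bits % elem_bits == 0);
    return vlen_bits * lmul / elem_bits;
}

/* Vector registers occupied by the accumulators: one register group of
 * `acc_lmul` registers per interleaved activation row. */
static size_t accumulator_registers(size_t n_rows, size_t acc_lmul) {
    return n_rows * acc_lmul;
}
```

Under these assumptions, VLEN = 128, 256, 512 and 1024 give M = 16, 32, 64 and 128 for both F16 and F32, which matches the tile widths listed under Key Changes, and 7 rows at LMUL=4 occupy 28 of the 32 vector registers.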
Benchmarking Results
End-to-end benchmarking on BananaPI-BPI-F3 (VLEN=256) with llama-bench (Threads=8).
Prefill / Prompt Processing (GEMM)
Tokens / Second
Result: ~2x-3x speedup over vec_dot.
Decode (GEMV)
Tokens / Second
Result: No noticeable improvement, as decode remains memory-bound.
Perplexity
Calculated on BananaPI-BPI-F3 with VLEN=256
Additional Notes
- arch-fallback.h: as 7xMx1 is a very RVV-specific tiling, it should not be used by other architectures.
- GEMM reaches peak performance when the prompt length is a multiple of 7 (for example, prompt=28). To handle leftovers, it falls back to GEMV, which hurts performance. Ideally, there should be leftover Nx32 kernels that handle each leftover case of 2 to 6 tokens.
Future Work
Subsequent PRs plan to add RVV kernels for quantization types, as well as extend existing quantization support to other VLENs.
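The leftover behaviour described under Additional Notes can be made concrete with a small sketch. It assumes, for illustration only, that a prompt of n tokens is split into full 7-row GEMM tiles and that every remaining row is handled by one GEMV call; the type and function names are hypothetical and do not come from the PR.

```c
#include <assert.h>
#include <stddef.h>

#define TILE_ROWS 7 /* activation rows per GEMM tile, as stated in Tile Sizes */

/* Hypothetical result type: how many rows go down each path. */
typedef struct {
    size_t gemm_tiles; /* number of full 7-row tiles handled by GEMM */
    size_t gemv_rows;  /* leftover rows, each assumed to be one GEMV call */
} row_split_t;

/* Split a prompt of n_tokens rows into full GEMM tiles plus GEMV leftovers. */
static row_split_t split_rows(size_t n_tokens) {
    row_split_t s;
    s.gemm_tiles = n_tokens / TILE_ROWS;
    s.gemv_rows  = n_tokens % TILE_ROWS;
    return s;
}
```

With prompt=28 this yields 4 full tiles and no leftover rows, while prompt=32 yields 4 tiles plus 4 rows on the slower GEMV path, which is the case that dedicated leftover kernels would target.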
References
The selection of the 7x32x1 tiling is based on the mmt4d kernel in IREE for RVV: iree-org/iree#20263
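For readers unfamiliar with this style of tiling, the following scalar sketch shows what a single 7xMx1 tile computes. The packed layouts are assumptions chosen for illustration (with K=1, each step along the reduction dimension stores 7 activation values, then M weight values, contiguously); the actual layouts and kernel in the PR may differ, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TILE_N 7 /* activation rows per tile */

/* Assumed layouts (illustrative only, interleave size K = 1):
 *   a_packed[k * TILE_N + r] = activation row r, element k
 *   b_packed[k * m + c]      = weight column c, element k
 * out is a row-major TILE_N x m tile. */
static void gemm_tile_7xMx1_ref(const float *a_packed, const float *b_packed,
                                float *out, size_t m, size_t k_len) {
    assert(a_packed && b_packed && out);
    for (size_t i = 0; i < TILE_N * m; i++) {
        out[i] = 0.0f; /* 7 accumulators, each holding m partial sums */
    }
    for (size_t k = 0; k < k_len; k++) {
        const float *b_row = b_packed + k * m; /* stands in for one vector load of m weights */
        for (size_t r = 0; r < TILE_N; r++) {
            const float a = a_packed[k * TILE_N + r]; /* scalar broadcast to all lanes */
            for (size_t c = 0; c < m; c++) {
                out[r * m + c] += a * b_row[c]; /* stands in for one vector multiply-accumulate */
            }
        }
    }
}
```

In a vectorized version, the inner loop over c would presumably become a single vector multiply-accumulate per accumulator, so each loaded weight vector is reused across all 7 rows.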