Sync master with upstream release b8712 by jan-service-account · Pull Request #480 · janhq/llama.cpp

jan-service-account · 2026-04-09T00:44:52Z

Updates dev branch with latest release (b8712) from ggml-org/llama.cpp

…g#21513) * kv-cache : support attention rotation for heterogeneous iSWA * cont : remove assert

…rg#21558) This commit adds a missing comma in the vision encoder attention qkv block. The motivation for this change is that without the comma there will be a string concatenation of the Kimi-K2.5 and the Nemotron Nano v2 VL tensor mappings which will be broken.
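The failure mode described above can be reproduced in a few lines of Python. This is a minimal sketch, not the actual mapping table: the tensor names below are hypothetical placeholders. It only shows how adjacent string literals without a separating comma are silently joined into one entry.

```python
# Hypothetical tensor-name templates, for illustration only; the real
# entries in the conversion script's mapping are not reproduced here.
with_comma = (
    "vision.blk.{bid}.attn_qkv_a",
    "vision.blk.{bid}.attn_qkv_b",
)

# Same two literals, but the comma after the first one is missing.
# Python concatenates adjacent string literals at compile time,
# so this tuple holds a single, merged (and useless) name.
without_comma = (
    "vision.blk.{bid}.attn_qkv_a"
    "vision.blk.{bid}.attn_qkv_b",
)

print(len(with_comma))     # two separate mapping entries
print(len(without_comma))  # one merged entry
print(without_comma[0])
```

Because the merged string is still valid Python, no error is raised; neither of the two intended names would match a tensor any more, which is consistent with the commit's description of both mappings being broken.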

* ds_read_b128 for q4_0 and q4_1 mmq kernels. The current for loop generates ds_read_b32 instructions with the HIP compiler; the new solution generates ds_read_b128 instructions for the same operation, saving some LDS bandwidth. Tested on MI50 and RX6800XT; it's faster on both.
* Vectorized lds load update: used ggml_cuda_get_max_cpy_bytes and ggml_cuda_memcpy_1 functions for generic implementation
* Explicit for loop in mmq, renamed vec into tmp
* Fixed max_cpy usage in the loading loop
* Fixed typo in q4_1 kernel
* Update ggml/src/ggml-cuda/mmq.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/mmq.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/mmq.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Removed trailing white line 500
* Update mmq.cuh removed other whitelines
* Remove trailing whitespaces
--------- Co-authored-by: iacopPBK <iacopPBK@users.noreply.github.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: iacopPBK <iacop@deneb.com>

* CUDA: compute fast hash instead of expensive props check * use seen node * use memcp
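The general idea of replacing a field-by-field property comparison with a hash comparison can be sketched as follows. This is an illustrative Python model only, not the CUDA backend code: the `Node` fields, the choice of BLAKE2b, and the function name `graph_fingerprint` are all assumptions made for the example, and the commit text does not say which properties or which hash function the real change uses.

```python
import hashlib
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a compute-graph node."""
    op: int            # operation code
    shape: tuple       # four dimension sizes
    data_addr: int     # address of the output buffer


def graph_fingerprint(nodes):
    """Fold the properties of every node into one 8-byte digest."""
    h = hashlib.blake2b(digest_size=8)
    for n in nodes:
        # Pack op (int32), four dims (int64) and the address (uint64).
        h.update(struct.pack("<i4qQ", n.op, *n.shape, n.data_addr))
    return h.digest()


previous = [Node(1, (4, 4, 1, 1), 0x1000), Node(2, (4, 1, 1, 1), 0x2000)]
same = [Node(1, (4, 4, 1, 1), 0x1000), Node(2, (4, 1, 1, 1), 0x2000)]
changed = [Node(1, (8, 4, 1, 1), 0x1000), Node(2, (4, 1, 1, 1), 0x2000)]

# A captured graph could be reused when the fingerprints match,
# and rebuilt when they differ.
print(graph_fingerprint(previous) == graph_fingerprint(same))
print(graph_fingerprint(previous) == graph_fingerprint(changed))
```

The trade-off in a scheme like this is that one short digest per graph is compared instead of every stored property of every node, at the cost of a small, non-zero collision probability.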

…-org#21259) * Unified macOS release setup with strategy-matrix block * Added KleidiAI arm64 macOS release definition Change-Id: I05520889ffc646488a178d06817a17f29274465a Signed-off-by: Martin Klacer <martin.klacer@arm.com>

…nguages (ggml-org#21206) * webui: fix syntax highlighting lost for non-common languages after streaming rehype-highlight uses lowlight internally, which only bundles 37 "common" languages. The streaming code path uses highlight.js directly (192 languages), so languages like Haskell highlight correctly while streaming but lose all color once the code block closes. Pass the full lowlight language set to rehype-highlight so both paths support the same languages. * webui: rebuild static files after rebase

* feat: support step3-vl-10b * use fused QKV && mapping tensor in tensor_mapping.py * guard hardcoded params and drop crop metadata * get understand_projector_stride from global config * img_u8_resize_bilinear_to_f32 move in step3vl class * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix the \r\n mess * add width and heads to MmprojModel.set_gguf_parameters --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

…`server` changes (ggml-org#21567)

This commit updates the debug example to not create the base_callback_data. The motivation is that when using `--save-logits`, which is used by the examples/model-conversion scripts, we often don't care about the tensor outputs and they just add noise to the output. With this change the example is quiet by default in that mode; we can always remove `--save-logits` to get the tensor outputs when debugging.

) * gemma : reduce graph splits by keeping per-layer ops in the input layer * gemma : put the per-layer proj in the first layer * cont : move the projection before the layer loop

* initial Q1_0 Metal backend * tuning q1_0 metal kernels * add Q1_0 to test-backend-ops * add Q1_0<->F32 copy test * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

ggerganov and others added 14 commits April 7, 2026 20:31

kv-cache : support attention rotation for heterogeneous iSWA (ggml-or…

4eb1951

…g#21513) * kv-cache : support attention rotation for heterogeneous iSWA * cont : remove assert

CUDA: make cuda graphs props check faster (ggml-org#21472)

c5ce4bc

* CUDA: compute fast hash instead of expensive props check * use seen node * use memcp

chore: Remove legacy files (ggml-org#21606)

ece522f

chore: Update labeler to have separate labels for server/webui and …

3bd9aa1

…`server` changes (ggml-org#21567)

tests : remove obsolete .mjs script (ggml-org#21615)

ae65fbd

parser: fix MiniMax handling (ggml-org#21573)

85d482e

gemma : perform per-layer projections in the first layer (ggml-org#21612

5764d7c

) * gemma : reduce graph splits by keeping per-layer ops in the input layer * gemma : put the per-layer proj in the first layer * cont : move the projection before the layer loop

jan-service-account merged commit 1877e6a into dev Apr 9, 2026
5 checks passed

jan-service-account deleted the update-dev-from-master-2026-04-09-00-44 branch April 9, 2026 01:13

jan-service-account merged 14 commits into dev from update-dev-from-master-2026-04-09-00-44

11 participants
