Sync master with upstream release b8641 by jan-service-account · Pull Request #474 · janhq/llama.cpp

jan-service-account · 2026-04-03T00:52:09Z

Updates dev branch with latest release (b8641) from ggml-org/llama.cpp

…ction with device launch bounds (ggml-org#21238) The conditions cc == GGML_CUDA_CC_VOLTA || cc >= GGML_CUDA_CC_ADA_LOVELACE and cc >= GGML_CUDA_CC_TURING match all non-nvidia devices. This causes us to attempt to launch the kernel for batch sizes with larger configurations than our launch bounds on HIP devices. This pr fixes the conditionals in get_mmvq_mmid_max_batch. Fixes ggml-org#21191

…l-org#21224) The hybrid memory paths (`llama-memory-hybrid.cpp` and `llama-memory-hybrid-iswa.cpp`) always used sequential equal split, ignoring the unified KV cache flag. This caused hellaswag, winogrande, and multiple-choice evaluations to fail on hybrid models (models with both attention and recurrent/SSM layers, such as Qwen3.5-35B-A3B) with: split_equal: sequential split is not supported when there are coupled sequences in the input batch (you may need to use the -kvu flag) PR ggml-org#19954 fixed this for `llama-kv-cache-iswa.cpp` by automatically enabling unified KV mode and setting n_parallel >= 4 for multi-choice eval tasks. However, the hybrid memory paths were not updated. This commit mirrors the iswa fix: use non-sequential split when KV cache is unified (n_stream == 1), which is automatically set by llama-perplexity for hellaswag/winogrande/multiple-choice since ggml-org#19954. Tested on Qwen3.5-35B-A3B (hybrid attention+SSM MoE model): - HellaSwag: 83.0% (400 tasks) - Winogrande: 74.5% (400 tasks) - MMLU: 41.2% - ARC-Challenge: 56.2% - TruthfulQA: 37.7% All previously failed with llama_decode() error.

* Introduced NVFP4 generic MMQ kernel * Added extra FP8 guard, hope to solve ci HIP failure * Rename tiles and use HIP_FP8_AVAILABLE * Removed remaning FP8 straggler and added const int * Const * Removed DECL_MMQ_CASE artifact * Removed newline * Removed space after else * Changed HIP FP8 NVFP4 conversion gate * Added new line to bottom of mmq.cu 270 * Removed extra spaces * Removed single space in front of else on line 814 * Added NVFP4 to generate cu script so HIP can see it, further tightened logic * Include generated mmq-instance-nvfp4.cu * Added NVFP4 mmq to HIP Check ignore list * Update ggml/src/ggml-cuda/mmq.cuh Changed to Q3_K tile to read MMQ_MMA_TILE_X_K_NVFP4 Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/mmq.cuh Changed to Q3_K tile to read MMQ_MMA_TILE_X_K_NVFP4 in tile assert Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/mmq.cuh Added function name ending for end if Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Added function names to closing endif Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* scripts: add function call test script * add reasoning_content * fix lint

* llama : rotate activations for better quantization * cont : rotate V more + refactor * cont : rotate caches separately + support non-power-of-2 head sizes * cont : simplify * cont : add reference for V rotation * cont : refactor * cont : support context shift * cont : consolidate * cont : dedup + allow different types for the rotation matrix * cont : add env variable to disable rotation * cont : simplify attn rot kv cache logic + rename env * cont : pre-compute the Hadamard matrices

* fix: tool call parsing for LFM2 and LFM2.5 models' * refactor: add test / break out lfm2 and lfm2.5 parsing logic

* hexagon-rms_norm: fix RMS_NORM for non-aligned tensor sizes Co-authored-by: Krishna Sridhar <srsr@qti.qualcomm.com> * hexagon-div: perform DIV in fp16 domain for lower dsp archs --------- Co-authored-by: Krishna Sridhar <srsr@qti.qualcomm.com>

* Pin Dawn version * Update docs with new Dawn commit hash

* kleidiai: add cpu feature detection to CI run script Signed-off-by: Martin Klacer <martin.klacer@arm.com> Change-Id: I663adc3a7691a98e7dac5488962c13cc344f034a * kleidiai: revert unrelated requirements change Signed-off-by: Martin Klacer <martin.klacer@arm.com> * kleidiai: removed cpu feature detection from CI run script * As per the maintainers' suggestion, removed cpu feature detection from CI run script as CMake handles it already Signed-off-by: Martin Klacer <martin.klacer@arm.com> --------- Signed-off-by: Martin Klacer <martin.klacer@arm.com>

…l-org#21269) * fix: Bypass API Key validation for static bundle assets * refactor: All bypassed routes in `public_endpoints` * test: Update static assets API Key test

…gml-org#21270) * contrib : rewrite AGENTS.md, make it more clear about types of permitted AI usage * permit AI for writing code

* hexagon : add cumsum op support * hexagon: enable dma for cumsum op * Fix line-ending --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>

…1283)

…ing (ggml-org#20804) * chat : add Granite 4.0 chat template with correct tool_call role mapping Introduce `LLM_CHAT_TEMPLATE_GRANITE_4_0` alongside the existing Granite 3.x template (renamed `LLM_CHAT_TEMPLATE_GRANITE_3_X`). The Granite 4.0 Jinja template uses `<tool_call>` XML tags and maps the `assistant_tool_call` role to `<|start_of_role|>assistant<|end_of_role|><|tool_call|>`. Without a matching C++ handler, the fallback path emits the literal role `assistant_tool_call` which the model does not recognize, breaking tool calling when `--jinja` is not used. Changes: - Rename `LLM_CHAT_TEMPLATE_GRANITE` to `LLM_CHAT_TEMPLATE_GRANITE_3_X` (preserves existing 3.x behavior unchanged) - Add `LLM_CHAT_TEMPLATE_GRANITE_4_0` enum, map entry, and handler - Detection: `<|start_of_role|>` + (`<tool_call>` or `<tools>`) → 4.0, otherwise → 3.x - Add production Granite 4.0 Jinja template - Add tests for both 3.x and 4.0 template paths (C++ and Jinja) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Code review: follow standard format and use common logic in test-chat-template.cpp * Rename custom_conversation variable for extra_conversation to give it a more meaningful name --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* Relax prefill parser to allow space. * Move changes from prefix() to parser generation * Only allow spaces if we're not having a pure content parser next

) * fix gguf conversion for audio/vision mmproj * fix test

…ghts (ggml-org#21182) * tests: allow exporting graph ops from HF file without downloading weights * use unique_ptr for llama_context in HF metadata case * fix missing non-required tensors falling back to type f32 * use unique pointers where possible * use no_alloc instead of fixing f32 fallback * fix missing space

* naive vectorized version * add vectorized flash attention * update vec version * remove unused path and shader * remove unused helper functions * add comments * remove pad path * ggml-webgpu: fix flash-attn vec nwg=1 path and tighten vec specialization * change back to vec4 * enable multi split * enable vec path when: - Q->ne[1] < 20 - Q->ne[0] % 32 == 0 - V->ne[0] % 4 == 0 - K->type == f16 * update flast_attn_vec_split.wgsl to reduce redundant workgroup barrier usage and use select * enable vec path for q4 and q8 * flash-attn vec nwg=1 fast path (skip tmp/reduce staging) * use packed f16 K loads in flash-attn vec split * use packed f16 K loads in flash-attn vec split on host side * tune flash-attn vec f16 VEC_NE by head dim * cleanup * cleanup * keep host side clean * cleanup host side * change back to original host wait/submit behavior * formatting * reverted param-buffer pool r ecfactor * add helper functions * ggml-webgpu: move flash-attn vec pipeline caching back into shader lib * ggml-webgpu: remove duplicate functions * ggml-webgpu: reserve flash-attn vec scratch in dst buffer allocation * ggml-webgpu: revert unrelated change * ggml-webgpu: revert deleted comment * disable uniformity check * remove unnecessary change * Update ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_split.wgsl * Update ggml/src/ggml-webgpu/ggml-webgpu.cpp --------- Co-authored-by: Reese Levine <reeselevine1@gmail.com>

) * Add unit test coverage for llama_tensor_get_type * Fix merge conflicts, add more schemas * clang formatter changes * Trailing whitespace * Update name * Start rebase * Updating files with upstream changes prior to rebase * Changes needed from rebase * Update attn_qkv schema, change throw behaviour * Fix merge conflicts * White space * Update with latest changes to state counters * Revert accidental personal CLAUDE.md changes * Change quotation mark * Reuse metadata.name since we have it * Move test-only stuff out of llama-quant.cpp * Hide the regex functionality back in llama-quant.cpp, use a unique pointer to a new struct 'compiled_tensor_type_patterns' which contains the patterns * cont : inital deslop guidelines * Cleanup based on review comments * Continue cleanup * Small cleanup * Manually set proper ordering of tensors, mostly applies to gemma * Formatting * Update tests/test-quant-type-selection.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Fix merge conflicts --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

IMbackK and others added 30 commits April 1, 2026 10:21

sycl : support nvfp4 type in mul_mat (ggml-org#21227)

6b949d1

ggml : bump version to 0.9.10 (ggml/1454)

296bc05

sync : ggml

6422036

scripts: add function call test script (ggml-org#21234)

0356e33

* scripts: add function call test script * add reasoning_content * fix lint

fix: tool call parsing for LFM2 and LFM2.5 models (ggml-org#21242)

1d6d4cf

* fix: tool call parsing for LFM2 and LFM2.5 models' * refactor: add test / break out lfm2 and lfm2.5 parsing logic

Update Dawn version in WebGPU CI (ggml-org#20784)

5a0ed51

* Pin Dawn version * Update docs with new Dawn commit hash

CUDA: fix FA kernel selection logic (ggml-org#21271)

86221cf

server: Bypass API Key validation for WebUI static bundle assets (ggm…

12dbf1d

…l-org#21269) * fix: Bypass API Key validation for static bundle assets * refactor: All bypassed routes in `public_endpoints` * test: Update static assets API Key test

opencl: fix leak in Adreno q8_0 path (ggml-org#21212)

95a6eba

contrib : rewrite AGENTS.md, make it more clear about project values (g…

c30e012

…gml-org#21270) * contrib : rewrite AGENTS.md, make it more clear about types of permitted AI usage * permit AI for writing code

hexagon : add cumsum op support (ggml-org#21246)

fbd441c

* hexagon : add cumsum op support * hexagon: enable dma for cumsum op * Fix line-ending --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>

sycl : fix llama_kv_cache hang when kv_cache is huge: 5GB (ggml-org#2…

4888137

…1283)

ggml : bump version to 0.9.11 (ggml/1456)

bc07d55

sync : ggml

dae2bf4

Ignore Transfer-Encoding header. (ggml-org#20269)

d6dac92

kv-cache : do not quantize SWA KV cache (ggml-org#21277)

17193cc

Relax prefill parser to allow space. (ggml-org#21240)

e15efe0

* Relax prefill parser to allow space. * Move changes from prefix() to parser generation * Only allow spaces if we're not having a pure content parser next

common : add commentary rules for gpt-oss-20b (ggml-org#21286)

2233737

model, mtmd: fix gguf conversion for audio/vision mmproj (ggml-org#21309

63f8fe0

) * fix gguf conversion for audio/vision mmproj * fix test

fix: gemma 4 template (ggml-org#21326)

5208e2d

jan-service-account merged commit c12ad82 into dev Apr 3, 2026
5 checks passed

jan-service-account deleted the update-dev-from-master-2026-04-03-00-52 branch April 3, 2026 01:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sync master with upstream release b8641#474

Sync master with upstream release b8641#474
jan-service-account merged 30 commits into
devfrom
update-dev-from-master-2026-04-03-00-52

jan-service-account commented Apr 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

Conversation

jan-service-account commented Apr 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants