Sync master with upstream release b8712#480
Merged
jan-service-account merged 14 commits intoApr 9, 2026
Merged
Conversation
…g#21513) * kv-cache : support attention rotation for heterogeneous iSWA * cont : remove assert
…rg#21558) This commit adds a missing comma in the vision encoder attention qkv block. The motivation for this change is that without the comma there will be a string concatenation of the Kimi-K2.5 and the Nemotron Nano v2 VL tensor mappings which will be broken.
* ds_read_b128 for q4_0 and q4_1 mmq kernels
Current for loop generates ds_read_b32 instructions with hip compiler, the new solution generates ds_read_b128 instructions for the same operation, saving some LDS bandwidth. Tested on MI50 and RX6800XT, its faster on both.
* Vectorized lds load update: used ggml_cuda_get_max_cpy_bytes and ggml_cuda_memcpy_1 functions for generic implementation
* Explicit for loop in mmq, renamed vec into tmp
* Fixed max_cpy usage in the loading loop
* Fixed typo in q4_1 kernel
* Update ggml/src/ggml-cuda/mmq.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/mmq.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/mmq.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Renoved trailing white line 500
* Update mmq.cuh removed other whitelines
* Remove trailing whitespaces
---------
Co-authored-by: iacopPBK <iacopPBK@users.noreply.github.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: iacopPBK <iacop@deneb.com>
* CUDA: compute fast hash instead of expensive props check * use seen node * use memcp
…-org#21259) * Unified macOS release setup with strategy-matrix block * Added KleidiAI arm64 macOS release definition Change-Id: I05520889ffc646488a178d06817a17f29274465a Signed-off-by: Martin Klacer <martin.klacer@arm.com>
…nguages (ggml-org#21206) * webui: fix syntax highlighting lost for non-common languages after streaming rehype-highlight uses lowlight internally, which only bundles 37 "common" languages. The streaming code path uses highlight.js directly (192 languages), so languages like Haskell highlight correctly while streaming but lose all color once the code block closes. Pass the full lowlight language set to rehype-highlight so both paths support the same languages. * webui: rebuild static files after rebase
* feat: support step3-vl-10b * use fused QKV && mapping tensor in tensor_mapping.py * guard hardcoded params and drop crop metadata * get understand_projector_stride from global config * img_u8_resize_bilinear_to_f32 move in step3vl class * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix the \r\n mess * add width and heads to MmprojModel.set_gguf_parameters --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
This commit updates the debug example to not create the base_callback_data. The motivation for this is when using `--save-logits`, which is used by examples/model-conversion scripts, we often don't care about the tensor outputs and they just add noise to the output. This changes is quiet by default we can always remove --save-logits to get the tensor outputs when debugging.
* initial Q1_0 Metal backend * tuning q1_0 metal kernels * add Q1_0 to test-backend-ops * add Q1_0<->F32 copy test * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Updates dev branch with latest release (b8712) from ggml-org/llama.cpp