Releases · ggml-org/llama.cpp
b4694
ggml : fix multi-threaded clamp_f32 (#11824)
* Bug fix for clamp_f32: when using tensors larger than 1-D, the clamp operation does not work because the kernel returns early whenever ith is not 0.
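For context: ggml compute kernels are typically parallelized by splitting rows across worker threads via a thread index (ith) and thread count (nth), so a kernel that bails out for every thread except ith == 0 silently processes only part of a multi-row tensor. Below is a standalone sketch of that partitioning pattern with simplified names; it is not the actual ggml_compute_forward_clamp code.

```cpp
// Standalone illustration of the row-partitioning pattern ggml ops use.
// NOT the real ggml_compute_forward_clamp; names and layout are simplified.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

// Clamp the rows assigned to thread `ith` out of `nth` threads.
static void clamp_rows(std::vector<float> & data, int64_t ne0, int64_t nrows,
                       float lo, float hi, int ith, int nth) {
    // Each thread takes every nth row starting at its own index,
    // instead of returning early when ith != 0.
    for (int64_t r = ith; r < nrows; r += nth) {
        float * row = data.data() + r * ne0;
        for (int64_t i = 0; i < ne0; ++i) {
            row[i] = std::min(std::max(row[i], lo), hi);
        }
    }
}

int main() {
    const int64_t ne0 = 4, nrows = 8;
    std::vector<float> x(ne0 * nrows);
    for (size_t i = 0; i < x.size(); ++i) x[i] = float(i) - 16.0f;

    const int nth = 4;
    std::vector<std::thread> pool;
    for (int ith = 0; ith < nth; ++ith) {
        pool.emplace_back(clamp_rows, std::ref(x), ne0, nrows, -5.0f, 5.0f, ith, nth);
    }
    for (auto & t : pool) t.join();

    for (float v : x) printf("%g ", v);
    printf("\n");
}
```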
b4692
CUDA: fix CUDART_VERSION checks (#11821)
b4689
Fix #11802: Compile bug - RegQueryValueExA changed to RegQueryValueEx…
b4688
server : use common_token_to_piece instead of common_detokenize (#11740)
This commit replaces the call to common_detokenize with common_token_to_piece in populate_token_probs, and uses common_token_to_piece for post_sampling_probs as well. The motivation for this change is to avoid an issue where common_detokenize would remove the word-boundary character of a token, which caused a regression in the server-generated token probabilities.
Resolves: https://github.com/ggerganov/llama.cpp/issues/11728
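The distinction matters because a detokenizer renders user-facing text and may drop a token's leading word-boundary marker, while the token's raw vocabulary piece keeps it. Below is a toy, self-contained illustration of that difference, using a hypothetical SentencePiece-style marker rather than the real llama.cpp API:

```cpp
// Toy illustration of the word-boundary issue (not the llama.cpp API).
// In SentencePiece-style vocabularies a token's piece may start with a
// boundary marker ("▁" here); a text-oriented detokenizer drops it, which
// is wrong when the exact per-token piece is what should be reported.
#include <iostream>
#include <string>

static const std::string MARKER = "\xE2\x96\x81"; // UTF-8 for "▁"

// Returns the raw vocabulary piece for a token (hypothetical vocab lookup).
std::string token_to_piece(const std::string & piece) {
    return piece;
}

// Renders readable text: strips the boundary marker of a lone token,
// so the boundary information is lost.
std::string detokenize(const std::string & piece) {
    std::string out = piece;
    if (out.compare(0, MARKER.size(), MARKER) == 0) {
        out.erase(0, MARKER.size()); // word boundary lost here
    }
    return out;
}

int main() {
    const std::string piece = MARKER + "world";
    std::cout << "piece      : '" << token_to_piece(piece) << "'\n"; // '▁world'
    std::cout << "detokenized: '" << detokenize(piece)     << "'\n"; // 'world'
}
```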
b4686
fix: typos in documentation files (#11791)
* Update ggml.c
* Update arg.cpp
* Update speculative.h
b4683
llama-mmap: fix missing include (#11796)
Technically, the fixed-width types come only from the iostream and cstdint/stdint.h headers; the memory and vector headers should not provide them. In GCC 15 the headers are cleaned up, so the proper header, cstdint, is required.
src/llama-mmap.h:26:5: error: ‘uint32_t’ does not name a type
   26 |     uint32_t read_u32() const;
      |     ^~~~~~~~
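In other words, the header relied on <memory>/<vector> pulling in the fixed-width integer types transitively, which GCC 15 no longer guarantees. A minimal sketch of the pattern (illustrative, not the actual llama-mmap.h contents):

```cpp
// my_mmap.h (illustrative header, not the actual llama-mmap.h)
#pragma once

#include <cstdint>   // required directly: uint32_t/uint64_t are declared here
#include <memory>    // GCC 15 no longer pulls <cstdint> in transitively
#include <vector>

struct my_file {
    uint32_t read_u32() const;   // fails to compile without <cstdint> on GCC 15
    uint64_t size = 0;
    std::vector<uint8_t> buf;
};
```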
b4682
server : correct signal handler (#11795)
b4681
sync: minja (https://github.com/google/minja/commit/a72057e5190de2c61…
b4679
vulkan: Make Vulkan optional at runtime (#11493). (#11494)
Co-authored-by: Jeff Bolz <[email protected]>
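Making a native backend optional at runtime generally means probing for the loader library instead of requiring it at link/start time. Below is a generic POSIX dlopen-based sketch of that idea, not necessarily how #11494 implements it (link with -ldl):

```cpp
// Generic sketch of runtime-optional native library loading (POSIX dlopen);
// not taken from the llama.cpp Vulkan backend.
#include <dlfcn.h>
#include <cstdio>

// Returns true if a Vulkan loader is present and exposes vkGetInstanceProcAddr.
static bool vulkan_available() {
    void * lib = dlopen("libvulkan.so.1", RTLD_NOW | RTLD_LOCAL);
    if (!lib) {
        return false; // no loader installed: fall back to other backends
    }
    const bool ok = dlsym(lib, "vkGetInstanceProcAddr") != nullptr;
    dlclose(lib);
    return ok;
}

int main() {
    std::printf("Vulkan backend %s\n", vulkan_available() ? "available" : "unavailable");
}
```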
b4678
vulkan: add environment variable GGML_VK_PREFER_HOST_MEMORY to avoid …
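A minimal sketch of the usual getenv pattern for such an opt-in toggle; the variable name comes from the release title, but its exact semantics in the Vulkan backend are assumed here:

```cpp
// Minimal sketch of reading an opt-in environment toggle; the actual
// Vulkan backend logic for GGML_VK_PREFER_HOST_MEMORY may differ.
#include <cstdlib>
#include <cstdio>

int main() {
    const char * v = std::getenv("GGML_VK_PREFER_HOST_MEMORY");
    const bool prefer_host_memory = (v != nullptr); // assumption: presence enables it
    std::printf("prefer host memory: %s\n", prefer_host_memory ? "yes" : "no");
}
```

Typical usage would then look like GGML_VK_PREFER_HOST_MEMORY=1 ./llama-cli ..., with the backend favoring host-visible allocations when the toggle is set.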