Description
Name and Version
.\llama-server.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 4743 (d07c621)
built with MSVC 19.29.30158.0 for
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-server
Command line
> .\llama-server.exe -m .\models\unsloth\DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf --verbose-prompt --dump-kv-cache --log-timestamps --log-prefix --verbose --alias 'DeepSeek-R1-UD-IQ1_S' --log-file DeepSeek-R1-UD-IQ1_S.log
Problem description & steps to reproduce
When using llama-server and its Web UI, parts of the KV cache are sometimes truncated when they shouldn't be. Steps to reproduce:
- Start llama-server with a command such as:
.\llama-server.exe -m .\models\unsloth\DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf --verbose-prompt --dump-kv-cache --log-timestamps --log-prefix --verbose --alias 'DeepSeek-R1-UD-IQ1_S' --log-file DeepSeek-R1-UD-IQ1_S.log
This is the 1.58-bit quantized version of the DeepSeek-R1 model by unsloth. I've been able to reproduce the issue with the 2.22-bit version too.
However, I've NOT been able to reproduce it with DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf or Qwen2.5-7B-Instruct-1M-Q4_K_M.gguf.
- Open the Web UI, and disable "Exclude thought process when sending request to API (Recommended for DeepSeek-R1)"
This way, the prompts sent within the same conversation should match the KV cache entirely, since the thinking that is already in the cache is no longer stripped from the prompt.
Side note: in my opinion, including the thought process in the prompts should be the UI's default. In my experience, the quality of long conversations suffers when the thinking is excluded from the prompts. Excluding it also means the cache has to be recomputed from the end of the previous user input every time the user sends something new, which slows down the assistant's replies.
In practice this causes long pauses before the assistant starts generating after each new user input, because the previous assistant output (minus the thinking) has to be reprocessed as a prompt. When the previous reply is long, even without the thinking, this can take minutes, or even tens of minutes in extreme cases. I understand the advantage of removing the thinking (you can fit a longer conversation into a smaller context), but I'm not sure it outweighs the disadvantages.
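To make the caching argument concrete, here is a tiny illustration (not llama.cpp code, just a sketch with made-up token strings) of why stripping the thinking hurts: as far as I understand it, the server can only reuse the cache up to the longest common prefix between what is cached and the new prompt, so removing the <think> section makes the prompts diverge right after the previous user input:
```go
package main

import "fmt"

// commonPrefixLen returns the number of leading tokens two prompts share.
// The server can only reuse the KV cache up to this point; everything after
// it has to be re-evaluated as a prompt. (Illustrative only; the real
// matching happens on tokens inside llama-server, not on strings.)
func commonPrefixLen(a, b []string) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

func main() {
	// Previous turn as it sits in the cache: user prompt, thinking, answer.
	cached := []string{"<user>", "Tell", "me", "a", "story", "<think>", "...", "</think>", "Once", "upon", "a", "time"}

	// Next request with the thinking kept (exclusion disabled): the whole cached prefix matches.
	withThinking := append(append([]string{}, cached...), "<user>", "Thank", "you")
	// Next request with the thinking stripped: the prefix diverges at "<think>".
	withoutThinking := []string{"<user>", "Tell", "me", "a", "story", "Once", "upon", "a", "time", "<user>", "Thank", "you"}

	fmt.Println("reusable with thinking kept:   ", commonPrefixLen(cached, withThinking))    // 12
	fmt.Println("reusable with thinking removed:", commonPrefixLen(cached, withoutThinking)) // 5
}
```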
- Start a new conversation from the Web UI and enter a user prompt that is likely to cause a significant amount of assistant output, including "thinking", for example:
Tell me a romantic story, please.
- Wait for the reply to be generated, and check the log to see how much information has accumulated in the cache. An example from my latest test:
8.45.064.978 D slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 4096, n_past = 1185, n_cache_tokens = 1185, truncated = 0
So, in this case, the cache contained 1185 tokens after the assistant replied to my initial prompt.
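If it helps, the slot state can also be queried over HTTP instead of reading the log. This is just a sketch assuming the default 127.0.0.1:8080 address and that the /slots endpoint is enabled on your build (e.g. via --slots):
```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// Dumps the server's slot state as returned by the /slots endpoint. Reading
// the verbose log, as above, works regardless of whether this endpoint is
// enabled.
func main() {
	resp, err := http.Get("http://127.0.0.1:8080/slots")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	io.Copy(os.Stdout, resp.Body)
	fmt.Println()
}
```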
- Add some new user input to the conversation. This time it doesn't necessarily need to generate a lot of output or cause a lot of thinking. For example:
Thank you, that's all.
- Check the log again to see how much of the cache has been truncated; you will find something like this:
12.30.856.271 I slot update_slots: id 0 | task 1170 | kv cache rm [488, end)
This means that the cache from position 488 to position 1185 has been discarded, for some reason.
In my opinion, this shouldn't happen: the entire content of the cache should be kept, since the new prompt is a continuation of the same conversation.
During my test, I tried to identify exactly what was in the cache at position 488. It was a word in a sentence towards the end of the thinking, and it doesn't seem special in any way: just the word "vivid" near the end of a sentence, and that sentence wasn't even the last one in the thinking section of the reply:
4.11.215.220 D slot process_toke: id 0 | task 0 | n_decoded = 470, n_remaining = -1, next token: 850 ' more'
...
4.11.215.232 D slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 4096, n_past = 487, n_cache_tokens = 487, truncated = 0
...
4.11.578.617 D slot process_toke: id 0 | task 0 | n_decoded = 471, n_remaining = -1, next token: 33949 ' vivid'
...
4.11.578.630 D slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 4096, n_past = 488, n_cache_tokens = 488, truncated = 0
...
4.11.934.032 D slot process_toke: id 0 | task 0 | n_decoded = 472, n_remaining = -1, next token: 16 '.'
...
4.11.934.047 D slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 4096, n_past = 489, n_cache_tokens = 489, truncated = 0
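Pairing these lines by hand is tedious, so here is the kind of throwaway helper that can do it: it matches each "next token" debug line with the n_past value reported by the update line that follows it. The regexes only target the log format shown in the excerpts above; nothing else is assumed about the log:
```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

// Maps each decoded token to the cache position (n_past) it ended up at,
// by pairing "next token: ..." lines with the following update line.
func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: mapcache <logfile>")
		os.Exit(1)
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	tokenRe := regexp.MustCompile(`next token: +(\d+) '(.*)'`)
	pastRe := regexp.MustCompile(`n_past = (\d+),`)

	var lastTok string
	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // verbose logs can have long lines
	for sc.Scan() {
		line := sc.Text()
		if m := tokenRe.FindStringSubmatch(line); m != nil {
			lastTok = fmt.Sprintf("id %s %q", m[1], m[2])
		} else if m := pastRe.FindStringSubmatch(line); m != nil && lastTok != "" {
			fmt.Printf("n_past %s -> %s\n", m[1], lastTok)
			lastTok = ""
		}
	}
}
```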
I've even coded my own command-line API client in Go and was still able to replicate the issue, so it doesn't seem to be a bug in the Web UI but an issue with the /v1/chat/completions API itself.
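For reference, a minimal sketch of the kind of client that reproduces it (this is a simplified stand-in, not my actual client; the essential part is that the full message history, thinking included, is resent on every turn):
```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

type msg struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// chat sends the full history back to the server on every turn, exactly like
// the Web UI does with the "exclude thought process" option disabled.
func chat(history []msg) (string, error) {
	body, _ := json.Marshal(map[string]any{
		"model":    "DeepSeek-R1-UD-IQ1_S",
		"messages": history,
	})
	resp, err := http.Post("http://127.0.0.1:8080/v1/chat/completions", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var out struct {
		Choices []struct {
			Message msg `json:"message"`
		} `json:"choices"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	if len(out.Choices) == 0 {
		return "", fmt.Errorf("empty response")
	}
	return out.Choices[0].Message.Content, nil
}

func main() {
	history := []msg{{Role: "user", Content: "Tell me a romantic story, please."}}
	reply, err := chat(history)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Keep the reply verbatim; with the thinking included, the next request is a strict continuation of what is cached.
	history = append(history, msg{Role: "assistant", Content: reply})

	history = append(history, msg{Role: "user", Content: "Thank you, that's all."})
	if _, err := chat(history); err != nil { // watch the server log for "kv cache rm" while this request is processed
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```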
I have NOT been able to replicate this using llama-cli.exe; it works properly, without discarding any part of the cache during such conversations.
For now I'm forced to use the CLI, because otherwise a 2-hour conversation with DeepSeek can easily turn into a 3-4-hour one due to the caching issue.
I've attached the log from my latest test.
First Bad Commit
No response