
Misc. bug: The KV cache is sometimes truncated incorrectly when making v1/chat/completions API calls #11970

@vnicolici

Description

Name and Version

.\llama-server.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 4743 (d07c621)
built with MSVC 19.29.30158.0 for

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-server

Command line

> .\llama-server.exe -m .\models\unsloth\DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf --verbose-prompt --dump-kv-cache --log-timestamps --log-prefix --verbose --alias 'DeepSeek-R1-UD-IQ1_S' --log-file DeepSeek-R1-UD-IQ1_S.log

Problem description & steps to reproduce

When using the llama-server and its Web UI, sometimes parts of the KV cache are truncated when they shouldn't be. Steps to reproduce:

  1. Start llama-server with a command such as:
.\llama-server.exe -m .\models\unsloth\DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf --verbose-prompt --dump-kv-cache --log-timestamps --log-prefix --verbose --alias 'DeepSeek-R1-UD-IQ1_S' --log-file DeepSeek-R1-UD-IQ1_S.log

This is the 1.58-bit quantized version of the DeepSeek-R1 model by unsloth. I've been able to reproduce the issue with the 2.22-bit version too.
However, I've NOT been able to reproduce it with DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf or Qwen2.5-7B-Instruct-1M-Q4_K_M.gguf.

  2. Open the Web UI and disable "Exclude thought process when sending request to API (Recommended for DeepSeek-R1)".

This way, the prompts sent by the UI should match the KV cache exactly within the same conversation, since the thinking that is already in the cache is no longer excluded from the prompt.

Side note: In my opinion, the UI should include the thought process in the prompts by default, as in my experience the quality of long conversations suffers when the thinking is excluded from the prompts. Also, excluding the thinking means the cache has to be recomputed starting from the end of the previous user input every time the user enters something new in the chat, which slows down the assistant's replies.

In practice this causes long pauses after each new user input before the assistant starts generating, because the previous assistant output (minus the thinking) has to be reprocessed as part of the prompt. When the previous reply is long, this can take minutes, or even tens of minutes in extreme cases, even with the thinking removed. I understand the advantage of removing the thinking, since it lets a longer conversation fit in a smaller context, but I'm not sure that outweighs the disadvantages.

  3. Start a new conversation from the Web UI and enter a user prompt that is likely to produce a significant amount of assistant output, including "thinking", for example:
Tell me a romantic story, please.
  4. Wait for the reply to be generated and check the log to see how many tokens have accumulated in the cache. An example from my latest test:
8.45.064.978 D slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 4096, n_past = 1185, n_cache_tokens = 1185, truncated = 0

So, in this case, the cache contained 1185 tokens after the assistant replied to my initial prompt.

  5. Add some new user input to the conversation. This time it doesn't need to generate a lot of output or cause a lot of thinking. For example:
Thank you, that's all.
  6. Check the log again to see how much of the cache has been truncated; you will find something like this:
12.30.856.271 I slot update_slots: id  0 | task 1170 | kv cache rm [488, end)

This means that the cache from position 488 to position 1185 has been discarded for some reason.

In my opinion, this shouldn't happen: the server should keep the entire content of the cache, since the new prompt is a continuation of the same conversation.
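
For context on what I'd expect: as far as I understand it, the server reuses the longest shared prefix between what is already in the KV cache and the newly tokenized prompt, and only recomputes from the first mismatch. The toy Go sketch below illustrates that expectation, and also why stripping the thinking (see the side note above) shrinks the reusable prefix; the strings and the longestCommonPrefix helper are made up for illustration and are not llama.cpp code:

```go
// Toy illustration of KV-cache prefix reuse. These "tokens" are made-up
// strings, not real model tokens, and this is not llama.cpp code.
package main

import "fmt"

// longestCommonPrefix returns how many leading elements the two slices share.
func longestCommonPrefix(a, b []string) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

func main() {
	// What is in the cache after the first assistant reply: the user prompt,
	// the thinking section, and the visible answer.
	cached := []string{"user1", "<think>", "reasoning", "</think>", "answer"}

	// New prompt with the thinking kept: a pure continuation of the cached
	// conversation, so the entire cache should be reusable.
	withThinking := append(append([]string{}, cached...), "user2")
	fmt.Println(longestCommonPrefix(cached, withThinking), "of", len(cached)) // 5 of 5

	// New prompt with the thinking stripped (the default Web UI behaviour):
	// the shared prefix ends right after the previous user input, so
	// everything after it has to be recomputed.
	withoutThinking := []string{"user1", "answer", "user2"}
	fmt.Println(longestCommonPrefix(cached, withoutThinking), "of", len(cached)) // 1 of 5
}
```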

During my test, I tried to identify exactly what was in the cache at position 488. It was a word towards the end of the thinking, and it doesn't seem special in any way: just the word "vivid" near the end of a sentence, and that sentence wasn't even the last one in the thinking section of the reply:

4.11.215.220 D slot process_toke: id  0 | task 0 | n_decoded = 470, n_remaining = -1, next token:   850 ' more'
...
4.11.215.232 D slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 4096, n_past = 487, n_cache_tokens = 487, truncated = 0
...
4.11.578.617 D slot process_toke: id  0 | task 0 | n_decoded = 471, n_remaining = -1, next token: 33949 ' vivid'
...
4.11.578.630 D slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 4096, n_past = 488, n_cache_tokens = 488, truncated = 0
...
4.11.934.032 D slot process_toke: id  0 | task 0 | n_decoded = 472, n_remaining = -1, next token:    16 '.'
...
4.11.934.047 D slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 4096, n_past = 489, n_cache_tokens = 489, truncated = 0

I've even coded my own command-line API client in Go, and I was still able to replicate the issue. So it doesn't seem to be a bug in the Web UI, but an issue with the /v1/chat/completions API itself.
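
For reference, a minimal client along those lines looks roughly like the sketch below. This is a simplified illustration, not my actual client; it assumes the server is listening on the default http://localhost:8080 and uses the alias from the command line above, so adjust both as needed:

```go
// Minimal two-turn repro sketch against llama-server's OpenAI-compatible API.
// Assumed: server at http://localhost:8080, model alias "DeepSeek-R1-UD-IQ1_S".
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string    `json:"model"`
	Messages []message `json:"messages"`
}

type chatResponse struct {
	Choices []struct {
		Message message `json:"message"`
	} `json:"choices"`
}

// chat posts one non-streaming /v1/chat/completions request and returns the
// assistant's reply text.
func chat(url string, req chatRequest) (string, error) {
	body, err := json.Marshal(req)
	if err != nil {
		return "", err
	}
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var out chatResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	if len(out.Choices) == 0 {
		return "", fmt.Errorf("empty choices in response")
	}
	return out.Choices[0].Message.Content, nil
}

func main() {
	url := "http://localhost:8080/v1/chat/completions"
	model := "DeepSeek-R1-UD-IQ1_S"

	// Turn 1: a prompt that produces a long reply with a thinking section.
	history := []message{{Role: "user", Content: "Tell me a romantic story, please."}}
	reply, err := chat(url, chatRequest{Model: model, Messages: history})
	if err != nil {
		panic(err)
	}

	// Turn 2: resend the full history, including the assistant's reply
	// (thinking and all), plus a short follow-up, then watch the server log
	// for a "kv cache rm" line well before the end of the previous turn.
	history = append(history,
		message{Role: "assistant", Content: reply},
		message{Role: "user", Content: "Thank you, that's all."})
	if _, err := chat(url, chatRequest{Model: model, Messages: history}); err != nil {
		panic(err)
	}
	fmt.Println("done; check the server log for 'kv cache rm' lines")
}
```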

I have NOT been able to replicate this using llama-cli.exe; it works properly, without discarding any part of the cache during such conversations.

For now I'm forced to use the CLI, because otherwise, due to this caching issue, a 2-hour conversation with DeepSeek can easily turn into a 3-4 hour one.

I attached the log from my latest test.

DeepSeek-R1-UD-IQ1_S.zip

First Bad Commit

No response

Relevant log output
