Replies: 3 comments
Another idea, from @aldehir in #21760 (comment), was to mask logits according to a state machine specific to the model's format. Could that be a more model-agnostic alternative to what LiteRT-LM does?
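To make the idea concrete, here is a minimal sketch of state-machine logit masking. Everything in it is hypothetical: the four-token vocabulary, the `<think>` / `</think>` markers, and the transition table are stand-ins for illustration, not Gemma 4's real tokens and not code from llama.cpp or LiteRT-LM.

```python
import math

# Hypothetical toy vocabulary; real token ids and format markers depend on the model.
VOCAB = ["<think>", "</think>", "hello", "<eos>"]

# Hypothetical format state machine: state -> {allowed token: next state}.
TRANSITIONS = {
    "start": {"<think>": "thinking", "hello": "answer"},
    "thinking": {"hello": "thinking", "</think>": "answer"},
    "answer": {"hello": "answer", "<eos>": "done"},
}


def mask_logits(logits, state):
    """Set the logit of every token that is not allowed in `state` to -inf."""
    allowed = TRANSITIONS[state]
    return [
        logit if token in allowed else -math.inf
        for token, logit in zip(VOCAB, logits)
    ]


def greedy_decode(logit_steps):
    """Greedy decoding over precomputed logits, masked by the state machine."""
    state = "start"
    output = []
    for logits in logit_steps:
        if state == "done":
            break
        masked = mask_logits(logits, state)
        # Pick the highest-scoring token among those the format allows.
        best = max(range(len(VOCAB)), key=masked.__getitem__)
        token = VOCAB[best]
        output.append(token)
        state = TRANSITIONS[state][token]
    return output
```

In this toy, a stray `</think>` with a high raw logit can never be sampled from the `start` state, because the mask removes it before the argmax. The sampler never sees a token the format forbids, regardless of how the cache is managed.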
@osanseviero ping
The KV cache "filtering" is how LiteRT-LM handles this part of the Gemma 4 documentation: Managing Thought Context Between Turns
The Gemma 4 chat template assumes the entire conversation history will be re-run on every turn. LiteRT-LM, however, keeps the KV cache from turn to turn, so the cache includes the tokens the model generated. To remove the thinking content from previous turns, we rewind the KV cache and re-prefill the model and tool turns with the thinking content stripped out.
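As a rough illustration of the rewind-and-re-prefill step described above, here is a toy sketch. It is not LiteRT-LM's actual code: `ToyKVCache`, `strip_thinking`, `finish_model_turn`, and the `<think>` / `</think>` markers are all hypothetical names, and the "cache" is just a list with one entry per processed token.

```python
class ToyKVCache:
    """Stand-in for a KV cache: one entry per token that has been processed."""

    def __init__(self):
        self.tokens = []

    def append(self, tokens):
        # "Prefill" or "decode": each processed token adds one cache entry.
        self.tokens.extend(tokens)

    def rewind(self, position):
        # Drop every entry at or after `position`.
        del self.tokens[position:]


def strip_thinking(tokens, open_tag="<think>", close_tag="</think>"):
    """Remove the thinking markers and everything between them (hypothetical markers)."""
    kept = []
    inside = False
    for token in tokens:
        if token == open_tag:
            inside = True
        elif token == close_tag:
            inside = False
        elif not inside:
            kept.append(token)
    return kept


def finish_model_turn(cache, checkpoint, generated):
    """Rewind to the start of the model turn, then re-prefill it without thinking."""
    cache.rewind(checkpoint)
    cache.append(strip_thinking(generated))
```

A turn would go roughly like this: record `checkpoint = len(cache.tokens)` before generation, let the model append its tokens (thinking included), then call `finish_model_turn(cache, checkpoint, generated)` so that the next turn starts from a cache that matches what the chat template would have produced from stripped history.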
Following up on the discussion in #21760.
I'm reading through Google's LiteRT-LM source code to see how it handles things, to make sure llama.cpp handles the Gemma 4 model correctly.
LiteRT-LM is "filtering" reasoning / thinking content out of the KV cache. Is llama.cpp doing the same?
"Filtering" here seems to mean reverting to checkpoints, afaict.
And should llama.cpp be doing that, or is this just a hack / awful workaround in LiteRT-LM?
https://github.com/google-ai-edge/LiteRT-LM/blob/176953bf882e25f67f2f7e089e9326f8ddd262f9/runtime/conversation/conversation.cc#L437-L448
https://github.com/google-ai-edge/LiteRT-LM/blob/176953bf882e25f67f2f7e089e9326f8ddd262f9/runtime/conversation/conversation.cc#L495-L506
https://github.com/google-ai-edge/LiteRT-LM/blob/176953bf882e25f67f2f7e089e9326f8ddd262f9/runtime/conversation/conversation.cc#L548-L558
https://github.com/google-ai-edge/LiteRT-LM/blob/176953bf882e25f67f2f7e089e9326f8ddd262f9/runtime/conversation/conversation.h#L169-L171
https://github.com/google-ai-edge/LiteRT-LM/blob/176953bf882e25f67f2f7e089e9326f8ddd262f9/runtime/conversation/conversation_test.cc#L312
https://github.com/google-ai-edge/LiteRT-LM/blob/176953bf882e25f67f2f7e089e9326f8ddd262f9/runtime/conversation/conversation_test.cc#L940
https://github.com/google-ai-edge/LiteRT-LM/blob/176953bf882e25f67f2f7e089e9326f8ddd262f9/runtime/conversation/conversation_test.cc#L1552
Source code comments that explain how they checkpoint, revert, and filter the thinking / reasoning content:
https://github.com/google-ai-edge/LiteRT-LM/blob/176953bf882e25f67f2f7e089e9326f8ddd262f9/runtime/conversation/conversation.cc#L495C10-L499C48
https://github.com/google-ai-edge/LiteRT-LM/blob/176953bf882e25f67f2f7e089e9326f8ddd262f9/runtime/conversation/conversation.cc#L548C17-L549C30