Description
What happened?
This is an example of what happens in ik_llama.cpp: K2 Thinking tries to run commands with a "." prefix:
.git checkout -f
Even if I ask the model to correct itself, it can't:
token 9413: 'Let'
token 1019: ' me'
token 2284: ' run'
token 276: ' the'
token 6644: ' correct'
token 5850: ' command'
token 2932: ' without'
token 276: ' the'
token 8134: ' leading'
token 21089: ' dot'
token 25: ':'
token 163595: '<|tool_calls_section_begin|>'
token 163597: '<|tool_call_begin|>'
token 41937: 'functions'
token 20994: '.execute'
token 20975: '_command'
token 25: ':'
token 920: '19'
token 163598: '<|tool_call_argument_begin|>'
token 8264: '{"'
token 9106: 'command'
token 1289: '":'
token 6082: '".'
token 14284: 'git'
token 33490: ' checkout'
token 635: ' -'
token 69: 'f'
token 3923: '","'
token 80816: 'cwd'
token 1289: '":'
token 2796: '""'
token 92: '}'
token 163599: '<|tool_call_end|>'
token 163596: '<|tool_calls_section_end|>'
token 163586: '<|im_end|>'
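For readability, joining the token texts from the dump above reconstructs the argument string the model actually produced (a quick Python sketch, using only the pieces shown above):
# Reconstruct the tool call arguments by concatenating the token texts from the dump.
pieces = ['{"', 'command', '":', '".', 'git', ' checkout', ' -', 'f', '","', 'cwd', '":', '""', '}']
print("".join(pieces))
# -> {"command":".git checkout -f","cwd":""}  (note the unwanted leading dot in the command)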
In some cases, I saw K2 Thinking in ik_llama.cpp type correct commands; for example, here it successfully ran rm -rf llmcache_v2:
token 163595: '<|tool_calls_section_begin|>'
token 163597: '<|tool_call_begin|>'
token 41937: 'functions'
token 20994: '.execute'
token 20975: '_command'
token 25: ':'
token 23: '8'
token 163598: '<|tool_call_argument_begin|>'
token 8264: '{"'
token 9106: 'command'
token 1289: '":'
token 1: '"'
token 13119: 'rm'
token 635: ' -'
token 14373: 'rf'
token 15503: ' ll'
token 13347: 'mc'
token 1960: 'ache'
token 4231: '_v'
token 17: '2'
token 3923: '","'
token 80816: 'cwd'
token 1289: '":'
token 2796: '""'
token 92: '}'
token 163599: '<|tool_call_end|>'
token 163596: '<|tool_calls_section_end|>'
token 163586: '<|im_end|>'
But afterwards, it typically goes back to adding the dot prefix. For a while, I thought this might be a model issue, but recently I was testing llama.cpp, and to my surprise the issue does not happen there; K2 Thinking seems to reliably make correct tool calls.
It is worth mentioning that not just execute_command tool calls are affected, but others too; however, like I mentioned, sometimes the model does manage to type the tool call correctly.
This is the llama.cpp command I tested with:
numactl --cpunodebind=0 --interleave=all /home/lissanro/pkgs/llama.cpp/build/bin/llama-server \
--model /mnt/neuro/models/Kimi-K2-Thinking-Q8_0-Q4_0.gguf \
--ctx-size 163840 --n-gpu-layers 62 --tensor-split 15,27,30,28 -ctk q8_0 -ctv q8_0 -b 4096 -ub 4096 \
-ot "blk\.(3)\.ffn_.*=CUDA0" \
-ot "blk\.(4)\.ffn_.*=CUDA1" \
-ot "blk\.(5)\.ffn_.*=CUDA2" \
-ot "blk\.(6)\.ffn_.*=CUDA3" \
-ot exps=CPU \
--threads 64 --host 0.0.0.0 --port 5000 \
--jinja --chat-template-file /home/lissanro/pkgs/llama.cpp/models/templates/Kimi-K2-Thinking.jinja --special
And this is my ik_llama.cpp command:
numactl --cpunodebind=0 --interleave=all /home/lissanro/pkgs/ik_llama.cpp/build/bin/llama-server \
--model /mnt/neuro/models/Kimi-K2-Thinking-Q8_0-Q4_0.gguf \
--ctx-size 163840 --n-gpu-layers 62 --tensor-split 25,22,28,25 -mla 3 -ctk q8_0 -amb 512 -b 4096 -ub 4096 \
-ot "blk\.(3)\.ffn_.*=CUDA0" \
-ot "blk\.(4)\.ffn_.*=CUDA1" \
-ot "blk\.(5)\.ffn_.*=CUDA2" \
-ot "blk\.(6)\.ffn_.*=CUDA3" \
-ot exps=CPU \
--threads 64 --host 0.0.0.0 --port 5000 \
--jinja --chat-template-file /home/lissanro/pkgs/ik_llama.cpp/models/templates/Kimi-K2-Thinking.jinja --special
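To rule out the client side, it may also help to hit the server directly. This is only a rough sketch that assumes the usual OpenAI-compatible /v1/chat/completions endpoint on port 5000 (as started above); the execute_command tool schema below is just an illustrative stand-in, not Roo Code's real definition:
# Minimal sketch: send one chat request with a single tool definition straight to the
# server started above (port 5000), bypassing Roo Code.
import json, urllib.request

payload = {
    "model": "Kimi-K2-Thinking",
    "messages": [{"role": "user", "content": "Run `git checkout -f` in the repository root."}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "execute_command",
            "description": "Run a shell command",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {"type": "string"},
                    "cwd": {"type": "string"},
                },
                "required": ["command"],
            },
        },
    }],
}

req = urllib.request.Request(
    "http://localhost:5000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.dumps(json.load(resp), indent=2))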
I tried to compare ik_llama.cpp and llama.cpp chat templates for K2 Thinking, but they are identical:
diff -u \
/home/lissanro/pkgs/llama.cpp/models/templates/Kimi-K2-Thinking.jinja \
/home/lissanro/pkgs/ik_llama.cpp/models/templates/Kimi-K2-Thinking.jinja
(empty output)
The main issue is here. The incorrect tool call:
token 163598: '<|tool_call_argument_begin|>'
token 8264: '{"'
token 9106: 'command'
token 1289: '":'
token 6082: '".'
Correct tool call:
token 163598: '<|tool_call_argument_begin|>'
token 8264: '{"'
token 9106: 'command'
token 1289: '":'
token 1: '"'
My understanding is that in this sequence token 1 ('"') should always follow token 1289 ('":'). I remember that in one of the bug reports here it was mentioned that ik_llama.cpp does not correctly force the grammar, so maybe that is still the case?
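For debugging, something along these lines could flag that suspicious transition in a dumped token stream (purely illustrative Python over (token_id, token_text) pairs copied from the dumps in this report; it is not an ik_llama.cpp API):
# Report positions where the '":' piece is directly followed by a token starting with
# '".' instead of the plain '"' piece seen in the correct calls.
def find_dot_prefixed_values(tokens):
    hits = []
    for i in range(len(tokens) - 1):
        _, text = tokens[i]
        nxt_id, nxt_text = tokens[i + 1]
        if text == '":' and nxt_text.startswith('".'):
            hits.append((i + 1, nxt_id, nxt_text))
    return hits

# Pieces from the incorrect call above:
bad = [(8264, '{"'), (9106, 'command'), (1289, '":'), (6082, '".'), (14284, 'git')]
print(find_dot_prefixed_values(bad))   # -> [(3, 6082, '".')]

# Pieces from the correct rm -rf call:
good = [(8264, '{"'), (9106, 'command'), (1289, '":'), (1, '"'), (13119, 'rm')]
print(find_dot_prefixed_values(good))  # -> []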
An example of a different incorrect tool call by ik_llama.cpp (the model wanted to check the llmcache_v2 directory, but instead mistyped it as .cache_v2):
token 9413: 'Let'
token 1019: ' me'
token 2598: ' check'
token 276: ' the'
token 7942: ' existing'
token 1268: ' `'
token 930: 'll'
token 13347: 'mc'
token 1960: 'ache'
token 4231: '_v'
token 17: '2'
token 63: '`'
token 9003: ' directory'
token 7828: ' structure'
token 25: ':'
token 163595: '<|tool_calls_section_begin|>'
token 163597: '<|tool_call_begin|>'
token 41937: 'functions'
token 14026: '.list'
token 20350: '_files'
token 25: ':'
token 2466: '23'
token 163598: '<|tool_call_argument_begin|>'
token 8264: '{"'
token 4953: 'path'
token 1289: '":'
token 6082: '".'
token 14466: 'cache'
token 4231: '_v'
token 17: '2'
token 3923: '","'
token 88997: 'recursive'
token 1289: '":'
token 4130: 'true'
token 92: '}'
token 163599: '<|tool_call_end|>'
token 163596: '<|tool_calls_section_end|>'
token 163586: '<|im_end|>'
Notice how the issue occurs in this part:
token 8264: '{"'
token 4953: 'path'
token 1289: '":'
token 6082: '".'
token 14466: 'cache'
It generated the token sequence {"path":". (token 6082, '".') instead of {"path":" (token 1, '"') followed by the path - basically, every time ':' is followed by the fused '".' token instead of the plain '"' token in a tool call, the model starts to misbehave.
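Joining those pieces the same way as before gives:
# Reconstruction of the mistyped list_files arguments from the dump above.
pieces = ['{"', 'path', '":', '".', 'cache', '_v', '2', '","', 'recursive', '":', 'true', '}']
print("".join(pieces))
# -> {"path":".cache_v2","recursive":true}  (the intended llmcache_v2 became .cache_v2)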
But many tool calls can still succeed, and some types of tool calls are less likely to trigger the issue. Here is an example of a correct tool call that ik_llama.cpp managed to make:
token 163595: '<|tool_calls_section_begin|>'
token 163597: '<|tool_call_begin|>'
token 41937: 'functions'
token 9189: '.write'
token 4585: '_to'
token 6101: '_file'
token 25: ':'
token 18: '3'
token 163598: '<|tool_call_argument_begin|>'
token 8264: '{"'
token 4953: 'path'
token 1289: '":'
token 1: '"'
token 930: 'll'
token 13347: 'mc'
token 1960: 'ache'
token 4231: '_v'
token 17: '2'
token 94246: '/__'
token 9885: 'main'
token 32394: '__.'
token 8374: 'py'
token 665: '",'
token 1: '"'
token 4204: 'content'
token 7471: '":"'
...
One more correct tool call by ik_llama.cpp, this one with a file array:
token 163595: '<|tool_calls_section_begin|>'
token 163597: '<|tool_call_begin|>'
token 41937: 'functions'
token 8827: '.read'
token 6101: '_file'
token 25: ':'
token 15: '0'
token 163598: '<|tool_call_argument_begin|>'
token 8264: '{"'
token 12481: 'files'
token 1289: '":'
token 81103: '[{"'
token 4953: 'path'
token 7471: '":"'
token 2113: 'ref'
token 692: 'act'
token 4715: 'ored'
token 30247: '_project'
token 71887: '_structure'
token 6847: '.md'
token 69622: '"},{"'
token 4953: 'path'
token 7471: '":"'
token 3879: 'function'
token 2700: '_h'
token 24822: 'ierarchy'
token 6847: '.md'
token 69622: '"},{"'
...
I would appreciate any ideas on how to make tool calls more reliable in ik_llama.cpp, or at least where to look to debug this further.
I am testing in Roo Code with the latest PR RooCodeInc/Roo-Code#10236, which enabled support for native K2 Thinking tool calls. The reason I think the issue is with ik_llama.cpp is that it does not seem to happen in mainline llama.cpp, as far as I can tell.
Name and Version
The latest git
What operating system are you seeing the problem on?
Linux