In a multi-turn conversation, the combination of llama-cpp-python and llama-cpp-agent is much slower on the second prompt than the Python bindings of gpt4all; see the timings below (taken from two screenshots). Evaluation of the first prompt is faster with llama-cpp-python, probably due to the recent speed improvements for prompt processing that have not yet been adopted in gpt4all. However, when I reply to the AI's first answer, gpt4all's second reply comes much faster than its first, whereas llama-cpp-python/llama-cpp-agent is even slower than on the first prompt. My setup is CPU-only.

Do you have any idea why this is the case? Does gpt4all handle memory in a more efficient way?
Model: Llama-3-8b-instruct Q8
Prompt processing times:

| Round | gpt4all | llama-cpp-python/agent |
|------:|--------:|-----------------------:|
| 1     | 12.03 s | 7.17 s                 |
| 2     | 3.73 s  | 8.46 s                 |
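If the difference is that gpt4all reuses the evaluated KV state of the conversation prefix while the llama-cpp-python side re-processes the whole prompt each turn, attaching a cache to the `Llama` object might help. A minimal sketch using llama-cpp-python directly (the model path, context size, and prompts are placeholders, and this assumes the slowdown really comes from re-evaluating the shared prefix):

```python
from llama_cpp import Llama, LlamaRAMCache

# Placeholder path; point this at your local GGUF file.
llm = Llama(
    model_path="./Meta-Llama-3-8B-Instruct.Q8_0.gguf",
    n_ctx=8192,
    verbose=False,
)

# Keep evaluated KV states in RAM so a follow-up prompt that shares
# a prefix with an earlier one can skip re-evaluating that prefix.
llm.set_cache(LlamaRAMCache(capacity_bytes=2 << 30))  # ~2 GiB

messages = [{"role": "user", "content": "Explain KV caching briefly."}]
first = llm.create_chat_completion(messages=messages)
messages.append(first["choices"][0]["message"])
messages.append({"role": "user", "content": "Why does it speed up the second turn?"})

# Second turn: with the cache attached, only the newly appended tokens
# should need prompt processing, not the whole conversation so far.
second = llm.create_chat_completion(messages=messages)
print(second["choices"][0]["message"]["content"])
```

Whether llama-cpp-agent passes the conversation through in a way that lets this cache hit (i.e. the second prompt is an exact byte-level extension of the first) is something I have not verified.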