
strange text duplication from llama-server to llama-cpp-agent #86

Open
@rpdrewes

Description

I am getting occasional duplicated text in responses when using llama-cpp-agent to talk to a llama-server on a remote host. This does not seem to be a token-repetition issue that repeat-penalty might solve; rather, it looks like a disagreement between client and server about when a data "chunk" is complete. For example (Q: is the query sent to the server, A: is the answer):

Q:What is the tallest mountain in Europe? Be brief.
A:Mount Elbrus, located in the Caucas Caucasus range, Russia Russia, is the tallest mountain in Europe, with a a height of 5,642 meters (18,510 feet).

Note the duplication of "Caucas Caucasus" and "Russia Russia," in the response!

Looking at the verbose output on the server side (llama-server -v), you can see that the server really is sending " Caucas" in one message followed by " Caucasus", and later " Russia" immediately followed by " Russia," (with the comma attached):

data stream, to_send: data: {"index":0,"content":"\n\n","tokens":[271],"stop":false,"id_slot":-1,"tokens_predicted":1,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":"Mount","tokens":[16683],"stop":false,"id_slot":-1,"tokens_predicted":2,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" El","tokens":[4072],"stop":false,"id_slot":-1,"tokens_predicted":3,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":"br","tokens":[1347],"stop":false,"id_slot":-1,"tokens_predicted":4,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":"us","tokens":[355],"stop":false,"id_slot":-1,"tokens_predicted":5,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":",","tokens":[11],"stop":false,"id_slot":-1,"tokens_predicted":6,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" located","tokens":[7559],"stop":false,"id_slot":-1,"tokens_predicted":7,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" in","tokens":[304],"stop":false,"id_slot":-1,"tokens_predicted":8,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" the","tokens":[279],"stop":false,"id_slot":-1,"tokens_predicted":9,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" Caucas","tokens":[60532],"stop":false,"id_slot":-1,"tokens_predicted":10,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" Caucasus","tokens":[355],"stop":false,"id_slot":-1,"tokens_predicted":11,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" range","tokens":[2134],"stop":false,"id_slot":-1,"tokens_predicted":12,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":",","tokens":[11],"stop":false,"id_slot":-1,"tokens_predicted":13,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" Russia","tokens":[8524],"stop":false,"id_slot":-1,"tokens_predicted":14,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" Russia,","tokens":[11],"stop":false,"id_slot":-1,"tokens_predicted":15,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" is","tokens":[374],"stop":false,"id_slot":-1,"tokens_predicted":16,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" the","tokens":[279],"stop":false,"id_slot":-1,"tokens_predicted":17,"tokens_evaluated":31}
...

It is as if the server expects the client to know it should not emit the first " Russia" because it is superseded by the more complete next transmission " Russia,".
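To make the failure mode concrete, here is a minimal sketch assuming the client simply appends each chunk's "content" field as it arrives (which is what the visible output suggests llama-cpp-agent does). The chunk payloads are copied from the server log above, abbreviated to the relevant fields:

```python
import json

# SSE data payloads as logged by llama-server (non-essential fields omitted)
chunks = [
    '{"index":0,"content":" in","stop":false}',
    '{"index":0,"content":" the","stop":false}',
    '{"index":0,"content":" Caucas","stop":false}',
    '{"index":0,"content":" Caucasus","stop":false}',  # overlaps the previous chunk
    '{"index":0,"content":" range","stop":false}',
]

# Naive concatenation reproduces the duplication seen in the answer text.
text = "".join(json.loads(c)["content"] for c in chunks)
print(text)  # " in the Caucas Caucasus range"
```

So a client that blindly concatenates content fields cannot avoid the duplicate; either the server should not re-send the overlapping text, or the client needs to know to replace the earlier partial chunk.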

The above test is with the agent established using:

agent = LlamaCppAgent(provider, predefined_messages_formatter_type=MessagesFormatterType.LLAMA_3)

The llama-server is indeed using a Llama 3.2 model, so I think LLAMA_3 is the correct MessagesFormatterType.

However, if I instead set up the agent without specifying any MessagesFormatterType, I do not see the duplications in the text coming from the server! (There are other problems then, as you might expect, such as <|im_end|> appearing in the response text, presumably because client and server no longer agree on the end-of-message indication.) Surprisingly, with the (incorrect) default formatter type the server does not send " Caucas" followed by " Caucasus"; it sends " Caucas" and then "us". So it is not that the client treats the response differently: the server simply does not send duplicate data with the default formatter. There must be something different in the setup of the two chats that prevents the server from sending these duplications in the second case. I have looked a bit at the chat setup in the server logs and have some ideas, but if anyone knows what is going on here or how to fix it, please save me some time!
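In both duplicated spots in the log, the superseding chunk's content begins with the previous chunk's content (" Caucasus" starts with " Caucas", " Russia," starts with " Russia"). Until the root cause is found, one hypothetical client-side workaround is to treat such a chunk as a corrected re-send and replace the earlier partial chunk. This merge_chunks helper is my own sketch, not part of llama-cpp-agent, and the heuristic could in principle collapse a legitimate repeat (e.g. "a" followed by "ab"), so it is a stopgap only:

```python
def merge_chunks(pieces):
    """Merge streamed content chunks, assuming a chunk whose text starts
    with the previous chunk's text is a superseding re-send (hypothetical
    workaround; not a llama-cpp-agent API)."""
    out = []
    for piece in pieces:
        if out and piece.startswith(out[-1]):
            out[-1] = piece  # superseding chunk replaces the partial one
        else:
            out.append(piece)
    return "".join(out)

print(merge_chunks([" the", " Caucas", " Caucasus", " range"]))
# " the Caucasus range"
```

Applied to the logged stream this would also collapse " Russia" / " Russia," into " Russia,", but a real fix presumably belongs in whatever makes the server re-send overlapping text only under the LLAMA_3 formatter.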
