Due diligence
Topic
Paper / Data design
Question
Hi, I have been experimenting with fine-tuning Moshi on different data and have some questions.
I am using the provided training config, but with custom data.
I am having two problems:
- Whenever I fine-tune the model (including with the daily-talk data example), the system's time-to-respond degrades: it takes much longer to respond than the default Moshi does. I have tried biasing pad_mul, which improves the time-to-respond, but the generation quality degrades a little, and I would rather avoid touching that parameter. I also tried including overlaps between speech turns; it seems to help a little, but it doesn't fix the problem.
Do you have any tips on how to design the fine-tuning data to prevent this behavior?
- The second problem is with the beginning of conversations. After fine-tuning, I lose Moshi's "Hello, what's going on?" or other greetings. This happens even though every example in my data begins with a similar hello message. What I observe is that the system starts speaking from an arbitrary point in a conversation, on topics similar to those in the data.
Is there anything else to try besides training with dialogues that always start with the same "hello" message? My first thought was to train on random pieces of conversation first, and then use the final batches to put in a lot of the welcome part.
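To make that last idea concrete, here is a rough sketch of the ordering I have in mind. It is only an illustration: the `starts_with_greeting` field and the `order_for_greeting_curriculum` helper are hypothetical names I made up, not part of the training code, and it assumes the data loader reads samples in the given order without reshuffling them.

```python
import random
from typing import Dict, List


def order_for_greeting_curriculum(
    samples: List[Dict],
    tail_fraction: float = 0.2,
    seed: int = 0,
) -> List[Dict]:
    """Shuffle samples, then reserve the end of the run for greeting-start samples.

    Each sample is a dict with a hypothetical boolean field
    "starts_with_greeting" marking segments cut from the start of a dialogue.
    """
    rng = random.Random(seed)
    greeting = [s for s in samples if s["starts_with_greeting"]]
    other = [s for s in samples if not s["starts_with_greeting"]]
    rng.shuffle(greeting)

    # The tail holds at most tail_fraction of all samples, greeting-only.
    tail_size = min(len(greeting), int(len(samples) * tail_fraction))
    tail = greeting[:tail_size]

    # Everything else (including leftover greeting samples) is mixed randomly.
    head = other + greeting[tail_size:]
    rng.shuffle(head)
    return head + tail
```

With 10 samples, 4 of them greeting starts, and `tail_fraction=0.2`, the last 2 samples seen in training would be greeting starts and the other 8 would be mixed randomly.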
Thanks, awesome project.