SFT degrades audio quality and introduces redundant non-text speech. #888

### Self Checks

- [x] This template is only for bug reports. For questions, please visit [Discussions](https://github.com/fishaudio/fish-speech/discussions).
- [x] I have thoroughly reviewed the project documentation (installation, training, inference) but couldn't find information to solve my problem. [English](https://speech.fish.audio/) [中文](https://speech.fish.audio/zh/) [日本語](https://speech.fish.audio/ja/) [Portuguese (Brazil)](https://speech.fish.audio/pt/)
- [x] I have searched for existing issues, including closed ones. [Search issues](https://github.com/fishaudio/fish-speech/issues)
- [x] I confirm that I am using English to submit this report (I have read and agree to the [Language Policy](https://github.com/fishaudio/fish-speech/issues/515)).
- [x] [FOR CHINESE USERS] Please be sure to submit issues in English; otherwise they will be closed. Thank you! :)
- [x] Please do not modify this template and fill in all required fields.

### Cloud or Self Hosted

Self Hosted (Source)

### Environment Details

Ubuntu 22.04, Python 3.10.15, torch==2.4.1, gradio==5.16.0

### Steps to Reproduce

1. Follow the steps at https://speech.fish.audio/zh/#linux to set up the Linux environment.
2. Follow the steps at https://speech.fish.audio/zh/finetune/ to run SFT. I used 'edgetts' to generate about 1500 audio (24000 Hz) and .lab pairs, about 1.5 hours in total, with the Chinese 'xiaoxiao' voice, and then normalized the audio with the "fap" tool.
3. Use the fine-tuned llama checkpoint to run the WebUI:
python tools/run_webui.py \
    --llama-checkpoint-path checkpoints/fish-speech-1.5-yth-lora \
    --decoder-checkpoint-path checkpoints/fish-speech-1.5/firefly-gan-vq-fsq-8x1024-21hz-generator.pth \
    --compile
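
Before comparing checkpoints, it may help to rule out dataset problems from step 2. The following is a minimal sketch, not part of fish-speech, that counts audio/.lab pairs and sums their duration. It assumes the clips are PCM WAV files stored next to same-named `.lab` files (for example `data/SPK1/0001.wav` and `data/SPK1/0001.lab`); the function name `summarize_dataset` and the `data` path are hypothetical.

```python
import wave
from pathlib import Path


def summarize_dataset(root):
    """Count WAV/.lab pairs under root and sum the paired audio duration."""
    root = Path(root)
    pairs = 0
    total_seconds = 0.0
    sample_rates = set()
    missing_lab = []  # audio files with no transcript next to them
    empty_lab = []    # transcripts that contain only whitespace

    for wav_path in sorted(root.rglob("*.wav")):
        lab_path = wav_path.with_suffix(".lab")
        if not lab_path.exists():
            missing_lab.append(wav_path)
            continue
        if not lab_path.read_text(encoding="utf-8").strip():
            empty_lab.append(lab_path)
            continue
        # The stdlib wave module reads uncompressed PCM WAV only.
        with wave.open(str(wav_path), "rb") as wav_file:
            rate = wav_file.getframerate()
            total_seconds += wav_file.getnframes() / rate
            sample_rates.add(rate)
        pairs += 1

    return {
        "pairs": pairs,
        "hours": total_seconds / 3600,
        "sample_rates": sorted(sample_rates),
        "missing_lab": missing_lab,
        "empty_lab": empty_lab,
    }


# Example usage (hypothetical path):
# report = summarize_dataset("data")
# print(report["pairs"], round(report["hours"], 2), report["sample_rates"])
```

For the dataset described above, the report would be expected to show roughly 1500 pairs, about 1.5 hours, and a single sample rate of 24000. Clips stored as MP3 would need a different decoder than the standard-library `wave` module.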

### ✔️ Expected Behavior

1. Better TTS audio quality than with the base llama checkpoint.

### ❌ Actual Behavior

1. The pronunciation is not as clear as before SFT.
2. Some redundant non-text speech is generated in between the normal sentences.
