Description
Self Checks
- This template is only for bug reports. For questions, please visit Discussions.
- I have thoroughly reviewed the project documentation (installation, training, inference) but couldn't find information to solve my problem.
- I have searched for existing issues, including closed ones.
- I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
- Please do not modify this template and fill in all required fields.
Cloud or Self Hosted
Self Hosted (Source)
Environment Details
WSL2 on Windows 11
Python 3.10
64 GB RAM
RTX 4000 SFF Ada, 20 GB VRAM
Steps to Reproduce
python -m tools.run_webui \
    --llama-checkpoint-path "checkpoints/fish-speech-1.5" \
    --decoder-checkpoint-path "checkpoints/fish-speech-1.5/firefly-gan-vq-fsq-8x1024-21hz-generator.pth" \
    --decoder-config-name firefly_gan_vq --compile
Then enter the following text as the input:
Fish Agent V0.1 3B is a groundbreaking Voice-to-Voice model capable of capturing and generating environmental audio information with unprecedented accuracy. What sets it apart is its semantic-token-free architecture, eliminating the need for traditional semantic encoders/decoders like Whisper and CosyVoice.
Additionally, it stands as a state-of-the-art text-to-speech (TTS) model, trained on an extensive dataset of 700,000 hours of multilingual audio content.
This model is a continue-pretrained version of Qwen-2.5-3B-Instruct for 200B voice & text tokens.
Supported Languages
The model supports the following languages with their respective training data sizes:
English (en): ~300,000 hours
Chinese (zh): ~300,000 hours
German (de): ~20,000 hours
Japanese (ja): ~20,000 hours
French (fr): ~20,000 hours
Spanish (es): ~20,000 hours
Korean (ko): ~20,000 hours
Arabic (ar): ~20,000 hours
For detailed information and implementation guidelines, please visit our Fish Speech GitHub repository.
✔️ Expected Behavior
The model generates smooth speech for the entire input text.
❌ Actual Behavior
When the model reads this part of the text:
English (en): ~300,000 hours
Chinese (zh): ~300,000 hours
German (de): ~20,000 hours
Japanese (ja): ~20,000 hours
French (fr): ~20,000 hours
Spanish (es): ~20,000 hours
Korean (ko): ~20,000 hours
Arabic (ar): ~20,000 hours
the output turns into noisy, irregular sounds. How can this be resolved?