Output becomes noisy or irregular sounds, how to resolve it? #944

Open
@jumpfox3049

Description

Self Checks

  • This template is only for bug reports. For questions, please visit Discussions.
  • I have thoroughly reviewed the project documentation (installation, training, inference) but couldn't find information to solve my problem.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
  • Please do not modify this template and fill in all required fields.

Cloud or Self Hosted

Self Hosted (Source)

Environment Details

WSL2 on Windows 11
Python 3.10
64 GB RAM
RTX 4000 SFF Ada, 20 GB VRAM

Steps to Reproduce

python -m tools.run_webui \
    --llama-checkpoint-path "checkpoints/fish-speech-1.5" \
    --decoder-checkpoint-path "checkpoints/fish-speech-1.5/firefly-gan-vq-fsq-8x1024-21hz-generator.pth" \
    --decoder-config-name firefly_gan_vq --compile

Use this text:

Fish Agent V0.1 3B is a groundbreaking Voice-to-Voice model capable of capturing and generating environmental audio information with unprecedented accuracy. What sets it apart is its semantic-token-free architecture, eliminating the need for traditional semantic encoders/decoders like Whisper and CosyVoice.

Additionally, it stands as a state-of-the-art text-to-speech (TTS) model, trained on an extensive dataset of 700,000 hours of multilingual audio content.

This model is a continue-pretrained version of Qwen-2.5-3B-Instruct for 200B voice & text tokens.

Supported Languages
The model supports the following languages with their respective training data sizes:

English (en): ~300,000 hours
Chinese (zh): ~300,000 hours
German (de): ~20,000 hours
Japanese (ja): ~20,000 hours
French (fr): ~20,000 hours
Spanish (es): ~20,000 hours
Korean (ko): ~20,000 hours
Arabic (ar): ~20,000 hours
For detailed information and implementation guidelines, please visit our Fish Speech GitHub repository.

✔️ Expected Behavior

Generate smooth speech

❌ Actual Behavior

When reading this part:
English (en): ~300,000 hours
Chinese (zh): ~300,000 hours
German (de): ~20,000 hours
Japanese (ja): ~20,000 hours
French (fr): ~20,000 hours
Spanish (es): ~20,000 hours
Korean (ko): ~20,000 hours
Arabic (ar): ~20,000 hours

the output turns into noisy, irregular sounds. How can this be resolved?
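
A possible workaround to try, offered as an untested sketch rather than a confirmed fix: assuming the noise is triggered by the symbol-heavy list formatting (the `~`, the parenthesized language codes, and the bare line breaks), the text could be rewritten into plain sentences before it is pasted into the WebUI. That cause is only a guess. `normalize_for_tts` is a hypothetical helper name and is not part of fish-speech.

```python
import re


def normalize_for_tts(text: str) -> str:
    """Rewrite symbol-heavy list lines into plain sentences.

    Hypothetical helper for experimentation; not part of fish-speech.
    """
    sentences = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Drop two-letter language codes such as "(en)" or "(zh)".
        line = re.sub(r"\s*\([a-z]{2}\)", "", line)
        # Spell out the approximation sign before a number: "~300" -> "about 300".
        line = re.sub(r"~\s*(?=\d)", "about ", line)
        # Turn "Name: value" into "Name, value" so it reads as a phrase.
        line = line.replace(": ", ", ")
        # End every line with sentence punctuation.
        if line[-1] not in ".!?":
            line += "."
        sentences.append(line)
    return " ".join(sentences)


if __name__ == "__main__":
    sample = "English (en): ~300,000 hours\nChinese (zh): ~300,000 hours"
    print(normalize_for_tts(sample))
```

If the normalized text reads cleanly while the original list does not, that would at least narrow the problem down to the input formatting rather than the checkpoint or the `--compile` flag.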

Labels: bug
