Commit ae0af4f

LazyBusyYang, fatchord, manmay-nakhashi, ZohaibAhmed, and TediPapajorgji authored
chore(sync): merge upstream resemble-ai/chatterbox (Turbo 350M, v3 multilingual, transformers 5.x) (#3)
* Make russian-text-stresser optional (resemble-ai#376)
* Update pyproject.toml
* add automated install check
* udpate to check master - not main
* remove russian-stresser from dependency (resemble-ai#377)
* Chatterbox Turbo 350M Model (resemble-ai#380)
* add gpt2-medium config
* setup turbo-specific hp
* whitespace
* t3 constructor logic
* update gpt2 config values
* fix t3 constructor
* t3 safetensor loading correctly
* everything loading ok
* update from_pretrained method
* inference working
* new inference_turbo method
* avoid tokenizer missing pad error
* connect sampling params to .generate
* add min_p to unused warning
* inference with s3gen distilled
* change to distilled model in from_local
* norm loudness - relax punc norm
* add gradio app
* add silence to end of output wav
* add events to gradio app
* change default text
* new dependencies
* move advanced options to the righthand side of gradio app
* remove russian-text-stresser and bump version
* update readme
* update readme
* resize headings
* links to sections
* typo
* update space link
* chessy emojiis (might remove these later)
* change to single emojii
* remove most emojiis
* Update README.md with new image
* Update README.md remove extra char
* add dedicated turbo example script

---------

Co-authored-by: Zohaib Ahmed <zohaib@resemble.ai>

* Fix ReadMe Banner Image (resemble-ai#381)
* Add Podonos evaluation section and acknowledgement to README (resemble-ai#456)
* Add Podonos evaluation links and acknowledgement
* move evaluaton secion above acknowledgements
* Update README.md (resemble-ai#441)

fix bug in multilingual example

* fix typo in README (resemble-ai#475)
* fix: add map_location to torch.load for CPU/MPS support in multilingual model (resemble-ai#410)

Add map_location parameter to torch.load() calls in ChatterboxMultilingualTTS to properly support CPU and MPS devices when loading CUDA-saved models.

Changes:
- Add map_location logic in from_local() for non-CUDA devices
- Pass map_location to ve.pt and s3gen.pt loading
- Pass map_location to Conditionals.load() for conds.pt
- Add MPS availability check in from_pretrained()
- Add str to torch.device conversion in Conditionals.load()

Fixes resemble-ai#351, resemble-ai#357

Signed-off-by: majiayu000 <1835304752@qq.com>

* Update Transformers + Gradio (resemble-ai#491)
* update versions
* fix token missing crash + wav path crash
* fix crash in alignment stream analyzer
* use latest perth watermark
* Broaden dependency version ranges for Python 3.13+ compatibility (resemble-ai#495)
* Add opt-in v3 multilingual checkpoint, skip analyzer for v3 (resemble-ai#516)
* Add opt-in v3 multilingual checkpoint, skip analyzer for v3
* Remove alignment analyzer; lower rep_penalty default to 1.2; trim final speech token artifact
* Refactor import statements in mtl_tts.py to remove duplicate import of SPEECH_VOCAB_SIZE

---------

Signed-off-by: majiayu000 <1835304752@qq.com>
Co-authored-by: Ollie McCarthy <fatchord@tutanota.com>
Co-authored-by: manmay nakhashi <manmay.nakhashi@gmail.com>
Co-authored-by: Zohaib Ahmed <zohaib@resemble.ai>
Co-authored-by: Tedi Papajorgji <tedi.papajorgji@hotmail.com>
Co-authored-by: Nicholas Loo <149620965+ResembleNick@users.noreply.github.com>
Co-authored-by: Nicolas Müller <nicolas@resemble.ai>
Co-authored-by: Matt <33789207+Kettukaa@users.noreply.github.com>
Co-authored-by: lif <1835304752@qq.com>
Co-authored-by: zihanjin428 <zihan@resemble.ai>
1 parent 24fcba4 commit ae0af4f
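
The `map_location` fix described in the commit message (resemble-ai#410) comes down to remapping CUDA-saved tensors onto CPU when the target device is not CUDA. A minimal sketch of that decision in plain Python follows. The helper name `pick_map_location` is hypothetical and not taken from the repository, and the real change passes a `torch.device` rather than a string:

```python
from typing import Optional


def pick_map_location(device: str) -> Optional[str]:
    """Return a map_location value for torch.load, or None to keep the saved device.

    Hypothetical helper for illustration only; not part of the chatterbox code.
    """
    # A checkpoint saved from a CUDA device cannot be deserialized as-is on a
    # machine without CUDA, so CPU and MPS targets load onto CPU first.
    if device in ("cpu", "mps"):
        return "cpu"
    # For CUDA targets, leave the tensors where the checkpoint put them.
    return None
```

The result could then be passed as `torch.load(path, map_location=pick_map_location(device))`, with the model moved to the target device afterwards.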

23 files changed

Lines changed: 1171 additions & 544 deletions
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+name: Test Installation
+
+on:
+  push:
+    branches: [ "master" ]
+  pull_request:
+    branches: [ "master" ]
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+
+    steps:
+    - uses: actions/checkout@v3
+
+    - name: Set up Python 3.10
+      uses: actions/setup-python@v4
+      with:
+        python-version: "3.10"
+
+    - name: Test Standard Install
+      run: |
+        pip install -e .
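
The workflow above only verifies that `pip install -e .` exits successfully. A slightly stronger check, sketched here as an assumption rather than something this commit adds, would also confirm that the import system can locate the installed package. `is_importable` is a hypothetical helper name:

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Report whether the import system can locate a top-level module."""
    # find_spec returns None for a missing top-level module instead of raising.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "chatterbox" is the package name imported by the example scripts in this commit.
    print("chatterbox importable:", is_importable("chatterbox"))
```

Locating the module does not execute it, so this stays cheap and does not download model weights.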

Chatterbox-Turbo.jpg

763 KB

README.md

Lines changed: 82 additions & 22 deletions
@@ -1,18 +1,20 @@
+![Chatterbox Turbo Image](./Chatterbox-Turbo.jpg)

-<img width="1200" height="600" alt="Chatterbox-Multilingual" src="https://www.resemble.ai/wp-content/uploads/2025/09/Chatterbox-Multilingual-1.png" />

 # Chatterbox TTS

-[![Alt Text](https://img.shields.io/badge/listen-demo_samples-blue)](https://resemble-ai.github.io/chatterbox_demopage/)
-[![Alt Text](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm.svg)](https://huggingface.co/spaces/ResembleAI/Chatterbox)
+[![Alt Text](https://img.shields.io/badge/listen-demo_samples-blue)](https://resemble-ai.github.io/chatterbox_turbo_demopage/)
+[![Alt Text](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm.svg)](https://huggingface.co/spaces/ResembleAI/chatterbox-turbo-demo)
 [![Alt Text](https://static-public.podonos.com/badges/insight-on-pdns-sm-dark.svg)](https://podonos.com/resembleai/chatterbox)
 [![Discord](https://img.shields.io/discord/1377773249798344776?label=join%20discord&logo=discord&style=flat)](https://discord.gg/rJq9cRJBJ6)

-_Made with ♥️ by <a href="https://resemble.ai" target="_blank"><img width="100" alt="resemble-logo-horizontal" src="https://github.com/user-attachments/assets/35cf756b-3506-4943-9c72-c05ddfa4e525" /></a>
+*Made with ♥️ by* <a href="https://resemble.ai" target="_blank"><img width="100" alt="resemble-logo-horizontal" src="https://github.com/user-attachments/assets/35cf756b-3506-4943-9c72-c05ddfa4e525" /></a>

-We're excited to introduce **Chatterbox Multilingual**, [Resemble AI's](https://resemble.ai) first production-grade open source TTS model supporting **23 languages** out of the box. Licensed under MIT, Chatterbox has been benchmarked against leading closed-source systems like ElevenLabs, and is consistently preferred in side-by-side evaluations.
+**Chatterbox** is a family of three state-of-the-art, open-source text-to-speech models by Resemble AI.

-Whether you're working on memes, videos, games, or AI agents, Chatterbox brings your content to life across languages. It's also the first open source TTS model to support **emotion exaggeration control** with robust **multilingual zero-shot voice cloning**. Try the english only version now on our [English Hugging Face Gradio app.](https://huggingface.co/spaces/ResembleAI/Chatterbox). Or try the multilingual version on our [Multilingual Hugging Face Gradio app.](https://huggingface.co/spaces/ResembleAI/Chatterbox-Multilingual-TTS).
+We are excited to introduce **Chatterbox-Turbo**, our most efficient model yet. Built on a streamlined 350M parameter architecture, **Turbo** delivers high-quality speech with less compute and VRAM than our previous models. We have also distilled the speech-token-to-mel decoder, previously a bottleneck, reducing generation from 10 steps to just **one**, while retaining high-fidelity audio output.
+
+**Paralinguistic tags** are now native to the Turbo model, allowing you to use `[cough]`, `[laugh]`, `[chuckle]`, and more to add distinct realism. While Turbo was built primarily for low-latency voice agents, it excels at narration and creative workflows.

 If you like the model but need to scale or tune it for higher accuracy, check out our competitively priced TTS service (<a href="https://resemble.ai">link</a>). It delivers reliable performance with ultra-low latency of sub 200ms—ideal for production use in agents, applications, or interactive media.

@@ -40,13 +42,19 @@ Arabic (ar) • Danish (da) • German (de) • Greek (el) • English (en) •
   - Ensure that the reference clip matches the specified language tag. Otherwise, language transfer outputs may inherit the accent of the reference clip’s language. To mitigate this, set `cfg_weight` to `0`.
   - The default settings (`exaggeration=0.5`, `cfg_weight=0.5`) work well for most prompts across all languages.
   - If the reference speaker has a fast speaking style, lowering `cfg_weight` to around `0.3` can improve pacing.
+<img width="1200" height="600" alt="Podonos Turbo Eval" src="https://storage.googleapis.com/chatterbox-demo-samples/turbo/podonos_turbo.png" />

-- **Expressive or Dramatic Speech:**
-  - Try lower `cfg_weight` values (e.g. `~0.3`) and increase `exaggeration` to around `0.7` or higher.
-  - Higher `exaggeration` tends to speed up speech; reducing `cfg_weight` helps compensate with slower, more deliberate pacing.
+### ⚡ Model Zoo
+
+Choose the right model for your application.

+| Model | Size | Languages | Key Features | Best For | 🤗 | Examples |
+|:----------------------------------------------------------------------------------------------------------------| :--- | :--- |:--------------------------------------------------------|:---------------------------------------------|:--------------------------------------------------------------------------| :--- |
+| **Chatterbox-Turbo** | **350M** | **English** | Paralinguistic Tags (`[laugh]`), Lower Compute and VRAM | Zero-shot voice agents, Production | [Demo](https://huggingface.co/spaces/ResembleAI/chatterbox-turbo-demo) | [Listen](https://resemble-ai.github.io/chatterbox_turbo_demopage/) |
+| Chatterbox-Multilingual [(Language list)](#supported-languages) | 500M | 23+ | Zero-shot cloning, Multiple Languages | Global applications, Localization | [Demo](https://huggingface.co/spaces/ResembleAI/Chatterbox-Multilingual-TTS) | [Listen](https://resemble-ai.github.io/chatterbox_demopage/) |
+| Chatterbox [(Tips and Tricks)](#original-chatterbox-tips) | 500M | English | CFG & Exaggeration tuning | General zero-shot TTS with creative controls | [Demo](https://huggingface.co/spaces/ResembleAI/Chatterbox) | [Listen](https://resemble-ai.github.io/chatterbox_demopage/) |

-# Installation
+## Installation
 ```shell
 pip install chatterbox-tts
 ```
@@ -62,8 +70,31 @@ pip install -e .
 ```
 We developed and tested Chatterbox on Python 3.11 on Debian 11 OS; the versions of the dependencies are pinned in `pyproject.toml` to ensure consistency. You can modify the code or dependencies in this installation mode.

-# Usage
+## Usage
+
+##### Chatterbox-Turbo
+
+```python
+import torchaudio as ta
+import torch
+from chatterbox.tts_turbo import ChatterboxTurboTTS
+
+# Load the Turbo model
+model = ChatterboxTurboTTS.from_pretrained(device="cuda")
+
+# Generate with Paralinguistic Tags
+text = "Hi there, Sarah here from MochaFone calling you back [chuckle], have you got one minute to chat about the billing issue?"
+
+# Generate audio (requires a reference clip for voice cloning)
+wav = model.generate(text, audio_prompt_path="your_10s_ref_clip.wav")
+
+ta.save("test-turbo.wav", wav, model.sr)
+```
+
+##### Chatterbox and Chatterbox-Multilingual
+
 ```python
+
 import torchaudio as ta
 from chatterbox.tts import ChatterboxTTS
 from chatterbox.mtl_tts import ChatterboxMultilingualTTS
@@ -77,9 +108,11 @@ ta.save("test-english.wav", wav, model.sr)

 # Multilingual examples
 multilingual_model = ChatterboxMultilingualTTS.from_pretrained(device=device)
+# v2 remains the default. To use the v3 multilingual checkpoint:
+# multilingual_model = ChatterboxMultilingualTTS.from_pretrained(device=device, t3_model="v3")

 french_text = "Bonjour, comment ça va? Ceci est le modèle de synthèse vocale multilingue Chatterbox, il prend en charge 23 langues."
-wav_french = multilingual_model.generate(spanish_text, language_id="fr")
+wav_french = multilingual_model.generate(french_text, language_id="fr")
 ta.save("test-french.wav", wav_french, model.sr)

 chinese_text = "你好,今天天气真不错,希望你有一个愉快的周末。"
@@ -93,14 +126,21 @@ ta.save("test-2.wav", wav, model.sr)
 ```
 See `example_tts.py` and `example_vc.py` for more examples.

-# Acknowledgements
-- [Cosyvoice](https://github.com/FunAudioLLM/CosyVoice)
-- [Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning)
-- [HiFT-GAN](https://github.com/yl4579/HiFTNet)
-- [Llama 3](https://github.com/meta-llama/llama3)
-- [S3Tokenizer](https://github.com/xingchensong/S3Tokenizer)
+## Supported Languages
+Arabic (ar) • Danish (da) • German (de) • Greek (el) • English (en) • Spanish (es) • Finnish (fi) • French (fr) • Hebrew (he) • Hindi (hi) • Italian (it) • Japanese (ja) • Korean (ko) • Malay (ms) • Dutch (nl) • Norwegian (no) • Polish (pl) • Portuguese (pt) • Russian (ru) • Swedish (sv) • Swahili (sw) • Turkish (tr) • Chinese (zh)
+
+## Original Chatterbox Tips
+- **General Use (TTS and Voice Agents):**
+  - Ensure that the reference clip matches the specified language tag. Otherwise, language transfer outputs may inherit the accent of the reference clip’s language. To mitigate this, set `cfg_weight` to `0`.
+  - The default settings (`exaggeration=0.5`, `cfg_weight=0.5`) work well for most prompts across all languages.
+  - If the reference speaker has a fast speaking style, lowering `cfg_weight` to around `0.3` can improve pacing.
+
+- **Expressive or Dramatic Speech:**
+  - Try lower `cfg_weight` values (e.g. `~0.3`) and increase `exaggeration` to around `0.7` or higher.
+  - Higher `exaggeration` tends to speed up speech; reducing `cfg_weight` helps compensate with slower, more deliberate pacing.
+

-# Built-in PerTh Watermarking for Responsible AI
+## Built-in PerTh Watermarking for Responsible AI

 Every audio file generated by Chatterbox includes [Resemble AI's Perth (Perceptual Threshold) Watermarker](https://github.com/resemble-ai/perth) - imperceptible neural watermarks that survive MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.

@@ -128,11 +168,31 @@ print(f"Extracted watermark: {watermark}")
 ```


-# Official Discord
+## Official Discord

 👋 Join us on [Discord](https://discord.gg/rJq9cRJBJ6) and let's build something awesome together!

-# Citation
+## Evaluation
+Chatterbox Turbo was evaluated using Podonos, a platform for reproducible subjective speech evaluation.
+
+We compared Chatterbox Turbo to competitive TTS systems using Podonos' standardized evaluation suite, focusing on overall preference, naturalness, and expressiveness.
+
+Evaluation reports:
+- [Chatterbox Turbo vs ElevenLabs Turbo v2.5](https://podonos.com/resembleai/chatterbox-turbo-vs-elevenlabs-turbo)
+- [Chatterbox Turbo vs Cartesia Sonic 3](https://podonos.com/resembleai/chatterbox-turbo-vs-cartesia-sonic3)
+- [Chatterbox Turbo vs VibeVoice 7B](https://podonos.com/resembleai/chatterbox-turbo-vs-vibevoice7b)
+
+These evaluations were conducted under identical conditions and are publicly accessible via Podonos.
+
+## Acknowledgements
+- [Podonos](https://podonos.com) — for supporting reproducible subjective speech evaluation
+- [Cosyvoice](https://github.com/FunAudioLLM/CosyVoice)
+- [Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning)
+- [HiFT-GAN](https://github.com/yl4579/HiFTNet)
+- [Llama 3](https://github.com/meta-llama/llama3)
+- [S3Tokenizer](https://github.com/xingchensong/S3Tokenizer)
+
+## Citation
 If you find this model useful, please consider citing.
 ```
 @misc{chatterboxtts2025,
@@ -143,5 +203,5 @@ If you find this model useful, please consider citing.
   note = {GitHub repository}
 }
 ```
-# Disclaimer
+## Disclaimer
 Don't use this model to do bad things. Prompts are sourced from freely available data on the internet.
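
The tips in the README's "Original Chatterbox Tips" section come down to two keyword arguments, `exaggeration` and `cfg_weight`. The sketch below collects the suggested values in one place. The preset names and the `settings_for` helper are hypothetical, and the numbers are the approximate ones the README quotes; where a tip names only `cfg_weight`, `exaggeration` is assumed to stay at its default of 0.5:

```python
# Preset names are illustrative; the numbers come from the README tips.
PRESETS = {
    # README: defaults that work well for most prompts.
    "default": {"exaggeration": 0.5, "cfg_weight": 0.5},
    # README: fast-speaking reference clip, lower cfg_weight to improve pacing.
    "fast_speaker": {"exaggeration": 0.5, "cfg_weight": 0.3},
    # README: expressive speech, raise exaggeration and lower cfg_weight.
    "expressive": {"exaggeration": 0.7, "cfg_weight": 0.3},
    # README: reference clip in another language, cfg_weight 0 to limit accent carry-over.
    "language_transfer": {"exaggeration": 0.5, "cfg_weight": 0.0},
}


def settings_for(style: str) -> dict:
    """Return a copy of the named preset so callers can adjust it safely."""
    return dict(PRESETS[style])
```

With a loaded model these could be passed as `model.generate(text, **settings_for("expressive"))`, assuming `generate` accepts both keyword arguments as the tips imply.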

example_tts.py

Lines changed: 7 additions & 2 deletions
@@ -1,5 +1,6 @@
 import torchaudio as ta
 import torch
+from pathlib import Path
 from chatterbox.tts import ChatterboxTTS
 from chatterbox.mtl_tts import ChatterboxMultilingualTTS

@@ -20,12 +21,16 @@
 ta.save("test-1.wav", wav, model.sr)

 multilingual_model = ChatterboxMultilingualTTS.from_pretrained(device=device)
+# v2 is the default. Pass t3_model="v3" to use the v3 multilingual checkpoint.
 text = "Bonjour, comment ça va? Ceci est le modèle de synthèse vocale multilingue Chatterbox, il prend en charge 23 langues."
 wav = multilingual_model.generate(text, language_id="fr")
 ta.save("test-2.wav", wav, multilingual_model.sr)


 # If you want to synthesize with a different voice, specify the audio prompt
 AUDIO_PROMPT_PATH = "YOUR_FILE.wav"
-wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH)
-ta.save("test-3.wav", wav, model.sr)
+if Path(AUDIO_PROMPT_PATH).exists():
+    wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH)
+    ta.save("test-3.wav", wav, model.sr)
+else:
+    print(f"Warning: audio prompt file '{AUDIO_PROMPT_PATH}' not found, skipping voice cloning example.")
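
The script passes a `device` variable that is defined in lines this diff does not show. Below is a minimal sketch of one way such a value could be chosen, written without importing torch so that it runs anywhere. `pick_device` is a hypothetical helper, and the preference order (CUDA, then MPS, then CPU) is an assumption rather than something the diff confirms:

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Choose a device string, preferring CUDA, then Apple MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"
```

In a real script the two flags would come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.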

example_tts_turbo.py

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+import torchaudio as ta
+import torch
+from chatterbox.tts_turbo import ChatterboxTurboTTS
+
+# Load the Turbo model
+model = ChatterboxTurboTTS.from_pretrained(device="cuda")
+
+# Generate with Paralinguistic Tags
+text = "Oh, that's hilarious! [chuckle] Um anyway, we do have a new model in store. It's the SkyNet T-800 series and it's got basically everything. Including AI integration with ChatGPT and all that jazz. Would you like me to get some prices for you?"
+
+# Generate audio (requires a reference clip for voice cloning)
+# wav = model.generate(text, audio_prompt_path="your_10s_ref_clip.wav")
+wav = model.generate(text)
+ta.save("test-turbo.wav", wav, model.sr)
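
Paralinguistic tags such as `[chuckle]` are written inline in the prompt text. Here is a small sketch for spotting them before synthesis, for example to log which tags a prompt uses. `find_tags` is a hypothetical helper, and the pattern (a lowercase word in square brackets) is an assumption about tag syntax based on the three tags the README names:

```python
import re
from typing import List

# Assumed tag shape: lowercase letters or underscores inside square brackets.
TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


def find_tags(text: str) -> List[str]:
    """Return the bracketed tag names in the order they appear in the text."""
    return TAG_PATTERN.findall(text)
```

This only inspects the string. Whether a given tag is actually supported depends on the Turbo model, which the sketch does not check.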
