Merged
38 commits
c601bb3
add gpt2-medium config
fatchord Dec 3, 2025
4b9f63a
setup turbo-specific hp
fatchord Dec 3, 2025
295f46a
whitespace
fatchord Dec 3, 2025
6b9e2de
t3 constructor logic
fatchord Dec 3, 2025
ca80ea3
update gpt2 config values
fatchord Dec 3, 2025
079554e
fix t3 constructor
fatchord Dec 3, 2025
1697285
t3 safetensor loading correctly
fatchord Dec 3, 2025
57aeeed
everything loading ok
fatchord Dec 3, 2025
01f8206
update from_pretrained method
fatchord Dec 3, 2025
7155a1c
inference working
fatchord Dec 4, 2025
0ec2ae0
new inference_turbo method
fatchord Dec 4, 2025
e9734b9
avoid tokenizer missing pad error
fatchord Dec 4, 2025
448d658
connect sampling params to .generate
fatchord Dec 4, 2025
ada10bf
add min_p to unused warning
fatchord Dec 4, 2025
68f531f
inference with s3gen distilled
fatchord Dec 4, 2025
bebdc7d
change to distilled model in from_local
fatchord Dec 4, 2025
96318cc
norm loudness - relax punc norm
fatchord Dec 5, 2025
9c06565
add gradio app
fatchord Dec 5, 2025
8e85b40
add silence to end of output wav
fatchord Dec 5, 2025
2706ab3
add events to gradio app
fatchord Dec 5, 2025
ba7138b
change default text
fatchord Dec 5, 2025
7c676a9
new dependencies
fatchord Dec 5, 2025
5cc4ab2
move advanced options to the righthand side of gradio app
fatchord Dec 5, 2025
eca0801
remove russian-text-stresser and bump version
fatchord Dec 14, 2025
7e9530f
update readme
fatchord Dec 14, 2025
36f097a
update readme
fatchord Dec 14, 2025
f15fc6a
resize headings
fatchord Dec 14, 2025
5f4e893
links to sections
fatchord Dec 14, 2025
c12d7f4
typo
fatchord Dec 14, 2025
44ed94b
update space link
fatchord Dec 14, 2025
e21bf50
chessy emojiis (might remove these later)
fatchord Dec 14, 2025
b35a652
change to single emojii
fatchord Dec 14, 2025
b7f863b
remove most emojiis
fatchord Dec 14, 2025
673820b
Update README.md with new image
ZohaibAhmed Dec 14, 2025
9ccd7f9
Update README.md remove extra char
ZohaibAhmed Dec 14, 2025
d4ef654
Resolve merge conflict in pyproject.toml
fatchord Dec 15, 2025
7041a07
Merge branch 'turbo-350m' of https://github.com/resemble-ai/chatterbo…
fatchord Dec 15, 2025
2e39886
add dedicated turbo example script
fatchord Dec 15, 2025
103 changes: 65 additions & 38 deletions README.md
@@ -1,46 +1,36 @@
![resemble+cbturbo-1600x900](https://github.com/user-attachments/assets/09395613-2532-455d-a26f-373f1cc160f1)

<img width="1200" height="600" alt="Chatterbox-Multilingual" src="https://www.resemble.ai/wp-content/uploads/2025/09/Chatterbox-Multilingual-1.png" />

# Chatterbox TTS

[![Alt Text](https://img.shields.io/badge/listen-demo_samples-blue)](https://resemble-ai.github.io/chatterbox_demopage/)
[![Alt Text](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm.svg)](https://huggingface.co/spaces/ResembleAI/Chatterbox)
[![Alt Text](https://img.shields.io/badge/listen-demo_samples-blue)](https://resemble-ai.github.io/chatterbox_turbo_demopage/)
[![Alt Text](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm.svg)](https://huggingface.co/spaces/ResembleAI/chatterbox-turbo-demo)
[![Alt Text](https://static-public.podonos.com/badges/insight-on-pdns-sm-dark.svg)](https://podonos.com/resembleai/chatterbox)
[![Discord](https://img.shields.io/discord/1377773249798344776?label=join%20discord&logo=discord&style=flat)](https://discord.gg/rJq9cRJBJ6)

_Made with ♥️ by <a href="https://resemble.ai" target="_blank"><img width="100" alt="resemble-logo-horizontal" src="https://github.com/user-attachments/assets/35cf756b-3506-4943-9c72-c05ddfa4e525" /></a>

We're excited to introduce **Chatterbox Multilingual**, [Resemble AI's](https://resemble.ai) first production-grade open source TTS model supporting **23 languages** out of the box. Licensed under MIT, Chatterbox has been benchmarked against leading closed-source systems like ElevenLabs, and is consistently preferred in side-by-side evaluations.
**Chatterbox** is a family of three state-of-the-art, open-source text-to-speech models by Resemble AI.

Whether you're working on memes, videos, games, or AI agents, Chatterbox brings your content to life across languages. It's also the first open source TTS model to support **emotion exaggeration control** with robust **multilingual zero-shot voice cloning**. Try the english only version now on our [English Hugging Face Gradio app.](https://huggingface.co/spaces/ResembleAI/Chatterbox). Or try the multilingual version on our [Multilingual Hugging Face Gradio app.](https://huggingface.co/spaces/ResembleAI/Chatterbox-Multilingual-TTS).
We are excited to introduce **Chatterbox-Turbo**, our most efficient model yet. Built on a streamlined 350M parameter architecture, **Turbo** delivers high-quality speech with less compute and VRAM than our previous models. We have also distilled the speech-token-to-mel decoder, previously a bottleneck, reducing generation from 10 steps to just **one**, while retaining high-fidelity audio output.

**Paralinguistic tags** are now native to the Turbo model, allowing you to use `[cough]`, `[laugh]`, `[chuckle]`, and more to add distinct realism. While Turbo was built primarily for low-latency voice agents, it excels at narration and creative workflows.

If you like the model but need to scale or tune it for higher accuracy, check out our competitively priced TTS service (<a href="https://resemble.ai">link</a>). It delivers reliable performance with latency under 200 ms, which makes it well suited to production use in agents, applications, or interactive media.

# Key Details
- Multilingual, zero-shot TTS supporting 23 languages
- SoTA zeroshot English TTS
- 0.5B Llama backbone
- Unique exaggeration/intensity control
- Ultra-stable with alignment-informed inference
- Trained on 0.5M hours of cleaned data
- Watermarked outputs
- Easy voice conversion script
- [Outperforms ElevenLabs](https://podonos.com/resembleai/chatterbox)

# Supported Languages
Arabic (ar) • Danish (da) • German (de) • Greek (el) • English (en) • Spanish (es) • Finnish (fi) • French (fr) • Hebrew (he) • Hindi (hi) • Italian (it) • Japanese (ja) • Korean (ko) • Malay (ms) • Dutch (nl) • Norwegian (no) • Polish (pl) • Portuguese (pt) • Russian (ru) • Swedish (sv) • Swahili (sw) • Turkish (tr) • Chinese (zh)
# Tips
- **General Use (TTS and Voice Agents):**
- Ensure that the reference clip matches the specified language tag. Otherwise, language transfer outputs may inherit the accent of the reference clip’s language. To mitigate this, set `cfg_weight` to `0`.
- The default settings (`exaggeration=0.5`, `cfg_weight=0.5`) work well for most prompts across all languages.
- If the reference speaker has a fast speaking style, lowering `cfg_weight` to around `0.3` can improve pacing.
<img width="1200" height="600" alt="Podonos Turbo Eval" src="https://storage.googleapis.com/chatterbox-demo-samples/turbo/podonos_turbo.png" />

- **Expressive or Dramatic Speech:**
- Try lower `cfg_weight` values (e.g. `~0.3`) and increase `exaggeration` to around `0.7` or higher.
- Higher `exaggeration` tends to speed up speech; reducing `cfg_weight` helps compensate with slower, more deliberate pacing.
### ⚡ Model Zoo

Choose the right model for your application.

# Installation
| Model | Size | Languages | Key Features | Best For | 🤗 | Examples |
|:----------------------------------------------------------------------------------------------------------------| :--- | :--- |:--------------------------------------------------------|:---------------------------------------------|:--------------------------------------------------------------------------| :--- |
| **Chatterbox-Turbo** | **350M** | **English** | Paralinguistic Tags (`[laugh]`), Lower Compute and VRAM | Zero-shot voice agents, Production | [Demo](https://huggingface.co/spaces/ResembleAI/chatterbox-turbo-demo) | [Listen](https://resemble-ai.github.io/chatterbox_turbo_demopage/) |
| Chatterbox-Multilingual [(Language list)](#supported-languages) | 500M | 23+ | Zero-shot cloning, Multiple Languages | Global applications, Localization | [Demo](https://huggingface.co/spaces/ResembleAI/Chatterbox-Multilingual-TTS) | [Listen](https://resemble-ai.github.io/chatterbox_demopage/) |
| Chatterbox [(Tips and Tricks)](#original-chatterbox-tips) | 500M | English | CFG & Exaggeration tuning | General zero-shot TTS with creative controls | [Demo](https://huggingface.co/spaces/ResembleAI/Chatterbox) | [Listen](https://resemble-ai.github.io/chatterbox_demopage/) |

## Installation
```shell
pip install chatterbox-tts
```
@@ -56,8 +46,31 @@ pip install -e .
```
We developed and tested Chatterbox on Python 3.11 on Debian 11 OS; the versions of the dependencies are pinned in `pyproject.toml` to ensure consistency. You can modify the code or dependencies in this installation mode.

# Usage
## Usage

##### Chatterbox-Turbo

```python
import torchaudio as ta
import torch
from chatterbox.tts_turbo import ChatterboxTurboTTS

# Load the Turbo model
model = ChatterboxTurboTTS.from_pretrained(device="cuda")

# Generate with Paralinguistic Tags
text = "Hi there, Sarah here from MochaFone calling you back [chuckle], have you got one minute to chat about the billing issue?"

# Generate audio (requires a reference clip for voice cloning)
wav = model.generate(text, audio_prompt_path="your_10s_ref_clip.wav")

ta.save("test-turbo.wav", wav, model.sr)
```

##### Chatterbox and Chatterbox-Multilingual

```python

import torchaudio as ta
from chatterbox.tts import ChatterboxTTS
from chatterbox.mtl_tts import ChatterboxMultilingualTTS
@@ -87,14 +100,21 @@ ta.save("test-2.wav", wav, model.sr)
```
See `example_tts.py` and `example_vc.py` for more examples.

# Acknowledgements
- [Cosyvoice](https://github.com/FunAudioLLM/CosyVoice)
- [Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning)
- [HiFT-GAN](https://github.com/yl4579/HiFTNet)
- [Llama 3](https://github.com/meta-llama/llama3)
- [S3Tokenizer](https://github.com/xingchensong/S3Tokenizer)
## Supported Languages
Arabic (ar) • Danish (da) • German (de) • Greek (el) • English (en) • Spanish (es) • Finnish (fi) • French (fr) • Hebrew (he) • Hindi (hi) • Italian (it) • Japanese (ja) • Korean (ko) • Malay (ms) • Dutch (nl) • Norwegian (no) • Polish (pl) • Portuguese (pt) • Russian (ru) • Swedish (sv) • Swahili (sw) • Turkish (tr) • Chinese (zh)

## Original Chatterbox Tips
- **General Use (TTS and Voice Agents):**
- Ensure that the reference clip matches the specified language tag. Otherwise, language transfer outputs may inherit the accent of the reference clip’s language. To mitigate this, set `cfg_weight` to `0`.
- The default settings (`exaggeration=0.5`, `cfg_weight=0.5`) work well for most prompts across all languages.
- If the reference speaker has a fast speaking style, lowering `cfg_weight` to around `0.3` can improve pacing.

# Built-in PerTh Watermarking for Responsible AI
- **Expressive or Dramatic Speech:**
- Try lower `cfg_weight` values (e.g. `~0.3`) and increase `exaggeration` to around `0.7` or higher.
- Higher `exaggeration` tends to speed up speech; reducing `cfg_weight` helps compensate with slower, more deliberate pacing.
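
The tips above can be collected into a small lookup so the values do not have to be remembered per call. This is only a sketch: the preset names, the `StylePreset` class, and the `generation_kwargs` helper are hypothetical, and it assumes `generate` accepts `exaggeration` and `cfg_weight` as keyword arguments, as the tips suggest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StylePreset:
    """One (exaggeration, cfg_weight) pair taken from the tips above."""
    exaggeration: float
    cfg_weight: float


# Values come from the tips; the preset names are illustrative assumptions.
PRESETS = {
    "default": StylePreset(exaggeration=0.5, cfg_weight=0.5),
    "fast_reference_speaker": StylePreset(exaggeration=0.5, cfg_weight=0.3),
    "expressive": StylePreset(exaggeration=0.7, cfg_weight=0.3),
    "cross_language": StylePreset(exaggeration=0.5, cfg_weight=0.0),
}


def generation_kwargs(style: str = "default") -> dict:
    """Return keyword arguments for the chosen preset."""
    preset = PRESETS[style]
    return {"exaggeration": preset.exaggeration, "cfg_weight": preset.cfg_weight}


# Hypothetical usage with an already loaded model:
# wav = model.generate(text, audio_prompt_path="ref.wav", **generation_kwargs("expressive"))
```

Treat the presets as starting points; the tips describe ranges ("around `0.3`", "`0.7` or higher"), so some tuning per reference clip is likely still needed.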


## Built-in PerTh Watermarking for Responsible AI

Every audio file generated by Chatterbox includes [Resemble AI's Perth (Perceptual Threshold) Watermarker](https://github.com/resemble-ai/perth) - imperceptible neural watermarks that survive MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.

@@ -122,11 +142,18 @@ print(f"Extracted watermark: {watermark}")
```


# Official Discord
## Official Discord

👋 Join us on [Discord](https://discord.gg/rJq9cRJBJ6) and let's build something awesome together!

# Citation
## Acknowledgements
- [Cosyvoice](https://github.com/FunAudioLLM/CosyVoice)
- [Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning)
- [HiFT-GAN](https://github.com/yl4579/HiFTNet)
- [Llama 3](https://github.com/meta-llama/llama3)
- [S3Tokenizer](https://github.com/xingchensong/S3Tokenizer)

## Citation
If you find this model useful, please consider citing.
```
@misc{chatterboxtts2025,
@@ -137,5 +164,5 @@ If you find this model useful, please consider citing.
note = {GitHub repository}
}
```
# Disclaimer
## Disclaimer
Don't use this model to do bad things. Prompts are sourced from freely available data on the internet.
14 changes: 14 additions & 0 deletions example_tts_turbo.py
@@ -0,0 +1,14 @@
import torchaudio as ta
import torch
from chatterbox.tts_turbo import ChatterboxTurboTTS

# Load the Turbo model
model = ChatterboxTurboTTS.from_pretrained(device="cuda")

# Generate with Paralinguistic Tags
text = "Oh, that's hilarious! [chuckle] Um anyway, we do have a new model in store. It's the SkyNet T-800 series and it's got basically everything. Including AI integration with ChatGPT and all that jazz. Would you like me to get some prices for you?"

# Generate audio (to clone a voice, use the commented-out call with a reference clip instead)
# wav = model.generate(text, audio_prompt_path="your_10s_ref_clip.wav")
wav = model.generate(text)
ta.save("test-turbo.wav", wav, model.sr)
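
Because a mistyped tag would likely be treated as plain text, it can help to check the input before calling `generate`. The sketch below is not part of this PR: `find_unknown_tags` is a hypothetical helper, and the tag set is copied from `EVENT_TAGS` in `gradio_tts_turbo_app.py`. The README says "and more", so this list may not be exhaustive.

```python
import re

# Tags copied from EVENT_TAGS in gradio_tts_turbo_app.py (may not be exhaustive).
KNOWN_TAGS = {
    "[clear throat]", "[sigh]", "[shush]", "[cough]", "[groan]",
    "[sniff]", "[gasp]", "[chuckle]", "[laugh]",
}

# Matches any bracketed span such as "[laugh]" or "[clear throat]".
TAG_PATTERN = re.compile(r"\[[^\[\]]+\]")


def find_unknown_tags(text: str) -> list[str]:
    """Return bracketed tags in `text` that are not in KNOWN_TAGS, in order."""
    return [tag for tag in TAG_PATTERN.findall(text) if tag not in KNOWN_TAGS]


# Example: "[giggle]" is not in the known set, so it is reported.
print(find_unknown_tags("Oh, that's hilarious! [chuckle] Anyway [giggle]"))
```

A caller could warn on a non-empty result rather than fail, since the model may support tags beyond the ones listed here.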
186 changes: 186 additions & 0 deletions gradio_tts_turbo_app.py
@@ -0,0 +1,186 @@
import random
import numpy as np
import torch
import gradio as gr
from chatterbox.tts_turbo import ChatterboxTurboTTS

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

EVENT_TAGS = [
    "[clear throat]", "[sigh]", "[shush]", "[cough]", "[groan]",
    "[sniff]", "[gasp]", "[chuckle]", "[laugh]"
]

# --- REFINED CSS ---
# 1. tag-container: Forces the row to wrap items instead of scrolling. Removes borders/backgrounds.
# 2. tag-btn: Sets the specific look (indigo theme) and stops them from stretching.
CUSTOM_CSS = """
.tag-container {
    display: flex !important;
    flex-wrap: wrap !important; /* This fixes the one-per-line issue */
    gap: 8px !important;
    margin-top: 5px !important;
    margin-bottom: 10px !important;
    border: none !important;
    background: transparent !important;
}

.tag-btn {
    min-width: fit-content !important;
    width: auto !important;
    height: 32px !important;
    font-size: 13px !important;
    background: #eef2ff !important;
    border: 1px solid #c7d2fe !important;
    color: #3730a3 !important;
    border-radius: 6px !important;
    padding: 0 10px !important;
    margin: 0 !important;
    box-shadow: none !important;
}

.tag-btn:hover {
    background: #c7d2fe !important;
    transform: translateY(-1px);
}
"""

INSERT_TAG_JS = """
(tag_val, current_text) => {
    const textarea = document.querySelector('#main_textbox textarea');
    if (!textarea) return current_text + " " + tag_val;

    const start = textarea.selectionStart;
    const end = textarea.selectionEnd;

    let prefix = " ";
    let suffix = " ";

    if (start === 0) prefix = "";
    else if (current_text[start - 1] === ' ') prefix = "";

    if (end < current_text.length && current_text[end] === ' ') suffix = "";

    return current_text.slice(0, start) + prefix + tag_val + suffix + current_text.slice(end);
}
"""


def set_seed(seed: int):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    random.seed(seed)
    np.random.seed(seed)


def load_model():
    print(f"Loading Chatterbox-Turbo on {DEVICE}...")
    model = ChatterboxTurboTTS.from_pretrained(DEVICE)
    return model


def generate(
        model,
        text,
        audio_prompt_path,
        temperature,
        seed_num,
        min_p,
        top_p,
        top_k,
        repetition_penalty,
        norm_loudness
):
    if model is None:
        model = ChatterboxTurboTTS.from_pretrained(DEVICE)

    if seed_num != 0:
        set_seed(int(seed_num))

    wav = model.generate(
        text,
        audio_prompt_path=audio_prompt_path,
        temperature=temperature,
        min_p=min_p,
        top_p=top_p,
        top_k=int(top_k),
        repetition_penalty=repetition_penalty,
        norm_loudness=norm_loudness,
    )
    return (model.sr, wav.squeeze(0).numpy())


with gr.Blocks(title="Chatterbox Turbo", css=CUSTOM_CSS) as demo:
    gr.Markdown("# ⚡ Chatterbox Turbo")

    model_state = gr.State(None)

    with gr.Row():
        with gr.Column():
            text = gr.Textbox(
                value="Oh, that's hilarious! [chuckle] Um anyway, we do have a new model in store. It's the SkyNet T-800 series and it's got basically everything. Including AI integration with ChatGPT and um all that jazz. Would you like me to get some prices for you?",
                label="Text to synthesize (max chars 300)",
                max_lines=5,
                elem_id="main_textbox"
            )

            # --- Event Tags ---
            # Switched back to Row, but applied specific CSS to force wrapping
            with gr.Row(elem_classes=["tag-container"]):
                for tag in EVENT_TAGS:
                    # elem_classes targets the button specifically
                    btn = gr.Button(tag, elem_classes=["tag-btn"])

                    btn.click(
                        fn=None,
                        inputs=[btn, text],
                        outputs=text,
                        js=INSERT_TAG_JS
                    )

            ref_wav = gr.Audio(
                sources=["upload", "microphone"],
                type="filepath",
                label="Reference Audio File",
                value="https://storage.googleapis.com/chatterbox-demo-samples/prompts/female_random_podcast.wav"
            )

            run_btn = gr.Button("Generate ⚡", variant="primary")

        with gr.Column():
            audio_output = gr.Audio(label="Output Audio")

            with gr.Accordion("Advanced Options", open=False):
                seed_num = gr.Number(value=0, label="Random seed (0 for random)")
                temp = gr.Slider(0.05, 2.0, step=.05, label="Temperature", value=0.8)
                top_p = gr.Slider(0.00, 1.00, step=0.01, label="Top P", value=0.95)
                top_k = gr.Slider(0, 1000, step=10, label="Top K", value=1000)
                repetition_penalty = gr.Slider(1.00, 2.00, step=0.05, label="Repetition Penalty", value=1.2)
                min_p = gr.Slider(0.00, 1.00, step=0.01, label="Min P (Set to 0 to disable)", value=0.00)
                norm_loudness = gr.Checkbox(value=True, label="Normalize Loudness (-27 LUFS)")

    demo.load(fn=load_model, inputs=[], outputs=model_state)

    run_btn.click(
        fn=generate,
        inputs=[
            model_state,
            text,
            ref_wav,
            temp,
            seed_num,
            min_p,
            top_p,
            top_k,
            repetition_penalty,
            norm_loudness,
        ],
        outputs=audio_output,
    )

if __name__ == "__main__":
    demo.queue(
        max_size=50,
        default_concurrency_limit=1,
    ).launch(share=True)
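
The spacing rules in `INSERT_TAG_JS` are easier to check outside the browser. The following is a rough Python port of that logic for illustration only. `insert_tag` is a hypothetical name, and `start` and `end` stand in for the textarea's `selectionStart` and `selectionEnd`. The fallback branch for a missing textarea is not reproduced.

```python
def insert_tag(text: str, tag: str, start: int, end: int) -> str:
    """Replace text[start:end] with `tag`, padding with spaces where needed.

    Mirrors the spacing rules of INSERT_TAG_JS: no leading space at the start
    of the text or after an existing space, and no trailing space when the
    text after the selection already begins with one.
    """
    prefix = " "
    suffix = " "

    if start == 0 or text[start - 1] == " ":
        prefix = ""

    if end < len(text) and text[end] == " ":
        suffix = ""

    return text[:start] + prefix + tag + suffix + text[end:]


# Cursor placed right after "Hello" (no selection).
print(insert_tag("Hello world", "[laugh]", 5, 5))  # Hello [laugh] world
```

Note that inserting at the very end of the text leaves a trailing space after the tag, which matches the JavaScript behavior.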
4 changes: 3 additions & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "chatterbox-tts"
version = "0.1.5"
version = "0.1.6"
description = "Chatterbox: Open Source TTS and Voice Conversion by Resemble AI"
readme = "README.md"
requires-python = ">=3.10"
@@ -22,6 +22,8 @@ dependencies = [
"spacy-pkuseg",
"pykakasi==2.3.0",
"gradio==5.44.1",
"pyloudnorm",
"omegaconf"
]

[project.urls]
1 change: 1 addition & 0 deletions src/chatterbox/models/s3gen/const.py
@@ -1 +1,2 @@
S3GEN_SR = 24000
S3GEN_SIL = 4299