
Chatterbox-Multilingual


Made with ♥️ by Resemble AI

We're excited to introduce Chatterbox Multilingual, Resemble AI's first production-grade open source TTS model supporting 23 languages out of the box. Licensed under MIT, Chatterbox has been benchmarked against leading closed-source systems like ElevenLabs, and is consistently preferred in side-by-side evaluations.

Whether you're working on memes, videos, games, or AI agents, Chatterbox brings your content to life across languages. It's also the first open source TTS model to support emotion exaggeration control with robust multilingual zero-shot voice cloning. Try the English-only version on our English Hugging Face Gradio app, or the multilingual version on our Multilingual Hugging Face Gradio app.

If you like the model but need to scale or tune it for higher accuracy, check out our competitively priced TTS service (link). It delivers reliable performance with ultra-low latency (under 200 ms), ideal for production use in agents, applications, or interactive media.

Key Details

  • Multilingual, zero-shot TTS supporting 23 languages
  • SoTA zero-shot English TTS
  • 0.5B Llama backbone
  • Unique exaggeration/intensity control
  • Ultra-stable with alignment-informed inference
  • Trained on 0.5M hours of cleaned data
  • Watermarked outputs
  • Easy voice conversion script
  • Outperforms ElevenLabs

Supported Languages

Arabic (ar) • Danish (da) • German (de) • Greek (el) • English (en) • Spanish (es) • Finnish (fi) • French (fr) • Hebrew (he) • Hindi (hi) • Italian (it) • Japanese (ja) • Korean (ko) • Malay (ms) • Dutch (nl) • Norwegian (no) • Polish (pl) • Portuguese (pt) • Russian (ru) • Swedish (sv) • Swahili (sw) • Turkish (tr) • Chinese (zh)

Tips

  • General Use (TTS and Voice Agents):

    • Ensure that the reference clip matches the specified language tag. Otherwise, language transfer outputs may inherit the accent of the reference clip’s language. To mitigate this, set cfg_weight to 0.
    • The default settings (exaggeration=0.5, cfg_weight=0.5) work well for most prompts across all languages.
    • If the reference speaker has a fast speaking style, lowering cfg_weight to around 0.3 can improve pacing.
  • Expressive or Dramatic Speech:

    • Try lower cfg_weight values (e.g. ~0.3) and increase exaggeration to around 0.7 or higher.
    • Higher exaggeration tends to speed up speech; reducing cfg_weight helps compensate with slower, more deliberate pacing (see the sketch after this list).
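
Putting these tips together, here is a minimal sketch of an expressive generation call. The reference clip path is a placeholder, and the exaggeration/cfg_weight keyword names follow the settings quoted above; check your installed version's generate() signature.

import torchaudio as ta
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")

# Expressive delivery: raise exaggeration, lower cfg_weight to keep pacing deliberate.
# "I can't believe it... this is incredible!"
wav = model.generate(
    "Je n'y crois pas... c'est incroyable !",
    language_id="fr",
    audio_prompt_path="reference_fr.wav",  # placeholder: a clip of the target speaker, in French
    exaggeration=0.7,
    cfg_weight=0.3,
)
ta.save("expressive-french.wav", wav, model.sr)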

Installation

pip install chatterbox-tts

Alternatively, you can install from source:

# conda create -yn chatterbox python=3.11
# conda activate chatterbox

git clone https://github.com/resemble-ai/chatterbox.git
cd chatterbox
pip install -e .

We developed and tested Chatterbox on Python 3.11 on Debian 11; dependency versions are pinned in pyproject.toml to ensure consistency. Installing from source in editable mode lets you modify the code or dependencies.

Usage

import torchaudio as ta
from chatterbox.tts import ChatterboxTTS
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

# English example
model = ChatterboxTTS.from_pretrained(device="cuda")

text = "Ezreal and Jinx teamed up with Ahri, Yasuo, and Teemo to take down the enemy's Nexus in an epic late-game pentakill."
wav = model.generate(text)
ta.save("test-english.wav", wav, model.sr)

# Multilingual examples
multilingual_model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")

# "Hello, how are you? This is the Chatterbox multilingual text-to-speech model; it supports 23 languages."
french_text = "Bonjour, comment ça va? Ceci est le modèle de synthèse vocale multilingue Chatterbox, il prend en charge 23 langues."
wav_french = multilingual_model.generate(french_text, language_id="fr")
ta.save("test-french.wav", wav_french, multilingual_model.sr)

# "Hello, the weather is lovely today. I hope you have a pleasant weekend."
chinese_text = "你好,今天天气真不错,希望你有一个愉快的周末。"
wav_chinese = multilingual_model.generate(chinese_text, language_id="zh")
ta.save("test-chinese.wav", wav_chinese, multilingual_model.sr)

# If you want to synthesize with a different voice, specify the audio prompt
AUDIO_PROMPT_PATH = "YOUR_FILE.wav"
wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH)
ta.save("test-2.wav", wav, model.sr)

See example_tts.py and example_vc.py for more examples.

Fine-tuning Guide

This repository includes tools for fine-tuning Chatterbox Multilingual TTS using LoRA (Low-Rank Adaptation). Follow this guide to train the model on your custom dataset.

Overview

The fine-tuning process involves two main scripts:

  • lora.py - Main training script with LoRA fine-tuning
  • fix_merged_model.py - Converts the trained model to the correct format
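
If LoRA is new to you, the idea is to freeze the base weights and learn a small low-rank update alongside them. The sketch below is illustrative only; the classes and injection mechanics in lora.py may differ, though the rank/alpha/dropout values mirror the defaults in Step 2.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update.

    Output = W x + (alpha / rank) * B(A(x)); only A and B are trained,
    so roughly 2 * d * rank parameters are updated instead of d * d.
    """
    def __init__(self, base: nn.Linear, rank: int = 32, alpha: int = 64, dropout: float = 0.05):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # base weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)       # update starts as a no-op
        self.scaling = alpha / rank
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(self.dropout(x)))

# Example: wrapping a 1024x1024 projection trains ~65k parameters instead of ~1M
layer = LoRALinear(nn.Linear(1024, 1024), rank=32, alpha=64)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))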

Step 1: Prepare Your Dataset

Create the following directory structure:

audio_data/
├── metadata.csv
└── audio/
    ├── utterance_0001.wav
    ├── utterance_0002.wav
    └── ...

metadata.csv format:

file_name,transcription,duration_seconds
audio/utterance_0001.wav,Your transcription text here,3.45
audio/utterance_0002.wav,Another transcription,2.87

Required columns:

  • file_name - Relative path to audio file (e.g., audio/utterance_0001.wav)
  • transcription - Text transcription of the audio
  • duration_seconds - (Optional) Duration in seconds for faster loading

Audio requirements:

  • Format: WAV files
  • Duration: 1-400 seconds (configurable)
  • Sample rate: Any (will be resampled automatically)
  • Quality: Clean speech with accurate transcriptions
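
If you are assembling the dataset from a folder of WAV files, a small helper along these lines can generate the skeleton of metadata.csv. This is a hypothetical convenience script, not part of the repository; it assumes the soundfile package is installed, and transcriptions still need to be filled in from your own source.

import csv
from pathlib import Path

import soundfile as sf  # assumed to be installed; any library that reads WAV headers works

data_dir = Path("audio_data")
rows = []
for wav_path in sorted((data_dir / "audio").glob("*.wav")):
    duration = sf.info(str(wav_path)).duration
    if not 1.0 <= duration <= 400.0:               # mirrors the MIN_/MAX_AUDIO_LENGTH defaults
        print(f"Skipping {wav_path.name}: {duration:.2f}s is outside the 1-400s range")
        continue
    rows.append({
        "file_name": f"audio/{wav_path.name}",
        "transcription": "",                       # fill in from your transcript source
        "duration_seconds": round(duration, 2),
    })

with open(data_dir / "metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["file_name", "transcription", "duration_seconds"])
    writer.writeheader()
    writer.writerows(rows)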

Step 2: Configure Training Parameters

Edit the configuration section at the top of lora.py:

# Data and paths
AUDIO_DATA_DIR = "./audio_data"           # Path to your dataset
CHECKPOINT_DIR = "checkpoints_lora"       # Where to save checkpoints

# Training hyperparameters
BATCH_SIZE = 1                            # Batch size (1 for most GPUs)
EPOCHS = 50                               # Number of training epochs
LEARNING_RATE = 2e-5                      # Learning rate
GRADIENT_ACCUMULATION_STEPS = 8           # Accumulate gradients over N steps

# LoRA parameters
LORA_RANK = 32                            # LoRA rank (lower = fewer parameters)
LORA_ALPHA = 64                           # LoRA alpha (scaling factor)
LORA_DROPOUT = 0.05                       # Dropout rate

# Audio constraints
MAX_AUDIO_LENGTH = 400.0                  # Max audio length in seconds
MIN_AUDIO_LENGTH = 1.0                    # Min audio length in seconds
MAX_TEXT_LENGTH = 1000                    # Max text length in characters

# Checkpointing
SAVE_EVERY_N_STEPS = 200                  # Save checkpoint every N steps
VALIDATION_SPLIT = 0.1                    # 10% of data for validation

Language Configuration: By default, the script trains on Arabic (language_id='ar'). To change the language, edit line 1079 in lora.py:

language_id='ar'  # Change to: 'en', 'fr', 'zh', 'es', etc.

Step 3: Run Training

Start the training process:

python lora.py

What happens during training:

  1. Loads the Chatterbox Multilingual TTS model
  2. Injects LoRA adapters into transformer layers
  3. Trains only the LoRA parameters (efficient fine-tuning)
  4. Saves checkpoints every 200 steps
  5. Generates real-time training metrics visualization (training_metrics.png)
  6. Creates a merged model at the end
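
The accumulation-and-checkpoint cadence from points 3-4 looks roughly like the toy sketch below. It is a self-contained illustration with a dummy model, not the actual loop in lora.py.

import torch
import torch.nn as nn

# Toy stand-ins so the pattern runs end to end; lora.py trains the real model.
model = nn.Linear(16, 16)
batches = [(torch.randn(4, 16), torch.randn(4, 16)) for _ in range(32)]

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)   # LEARNING_RATE
accum_steps = 8                                              # GRADIENT_ACCUMULATION_STEPS
save_every = 200                                             # SAVE_EVERY_N_STEPS
global_step = 0

for epoch in range(2):
    for i, (x, y) in enumerate(batches):
        loss = nn.functional.mse_loss(model(x), y) / accum_steps
        loss.backward()                                      # gradients accumulate across micro-batches
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            global_step += 1
            if global_step % save_every == 0:                # periodic checkpoint, as in point 4 above
                torch.save(model.state_dict(),
                           f"checkpoint_epoch{epoch}_step{global_step}.pt")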

Training outputs:

  • checkpoints_lora/checkpoint_epochX_stepY.pt - Training checkpoints
  • checkpoints_lora/final_lora_adapter.pt - Final LoRA weights
  • checkpoints_lora/merged_model/ - Merged model (base + LoRA)
  • training_metrics.png - Real-time training visualization

Training metrics: The script generates a live dashboard showing:

  • Training and validation loss
  • Learning rate schedule
  • Gradient norms
  • Recent batch losses
  • Loss variance
  • Time per training step

GPU requirements:

  • Minimum: 16GB VRAM (NVIDIA GPU)
  • Recommended: 24GB+ VRAM for faster training
  • CPU training is supported but significantly slower

Step 4: Convert the Merged Model

After training completes, convert the model to the correct format:

python fix_merged_model.py

This converts the PyTorch .pt files to .safetensors format required by ChatterboxMultilingualTTS.from_local().

Output:

checkpoints_lora/merged_model/
├── ve.pt
├── t3_mtl23ls_v2.pt
├── t3_mtl23ls_v2.safetensors  ← Created by fix_merged_model.py
├── s3gen.pt
├── grapheme_mtl_merged_expanded_v1.json
└── conds.pt
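
For reference, the core of such a conversion typically looks like the sketch below. It assumes the .pt file holds a flat state dict and that the safetensors package is installed; fix_merged_model.py is the authoritative implementation and may handle key remapping or nested checkpoints differently.

import torch
from safetensors.torch import save_file

ckpt_path = "checkpoints_lora/merged_model/t3_mtl23ls_v2.pt"
state_dict = torch.load(ckpt_path, map_location="cpu")

# safetensors needs a flat {str: Tensor} mapping of contiguous tensors
tensors = {k: v.contiguous() for k, v in state_dict.items() if isinstance(v, torch.Tensor)}
save_file(tensors, ckpt_path.replace(".pt", ".safetensors"))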

Step 5: Test Your Fine-tuned Model

Option A: Load the Merged Model

import torchaudio as ta
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

# Load your fine-tuned model
model = ChatterboxMultilingualTTS.from_local(
    "./checkpoints_lora/merged_model",
    device="cuda"
)

# Generate speech with your fine-tuned voice
text = "مرحبا، هذا نموذج الصوت المخصص الخاص بي"  # Arabic example
wav = model.generate(text, language_id="ar")
ta.save("finetuned_output.wav", wav, model.sr)

Option B: Load Base Model + LoRA Adapter

from chatterbox.mtl_tts import ChatterboxMultilingualTTS
from lora import load_lora_adapter

# Load base model
model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")

# Load your LoRA adapter
lora_layers = load_lora_adapter(
    model,
    "./checkpoints_lora/final_lora_adapter.pt",
    device="cuda"
)

# Generate speech
text = "Your text here"
wav = model.generate(text, language_id="ar")

Troubleshooting

Common Issues

"No valid audio samples found"

  • Check that AUDIO_DATA_DIR in lora.py matches your dataset location
  • Verify metadata.csv exists and has the correct format
  • Ensure audio files are in the audio/ subdirectory
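
A quick way to spot the mismatch is to check that every file_name in metadata.csv resolves to an existing file. This is a small ad-hoc check assuming the default ./audio_data layout from Step 1.

import csv
from pathlib import Path

data_dir = Path("./audio_data")                 # should match AUDIO_DATA_DIR
with open(data_dir / "metadata.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        if not (data_dir / row["file_name"]).is_file():
            print("Missing:", row["file_name"])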

"CUDA out of memory"

  • Reduce BATCH_SIZE to 1
  • Reduce MAX_AUDIO_LENGTH to 200 or less
  • Reduce LORA_RANK to 16 or 8
  • Use gradient checkpointing (advanced)

"Loss is NaN or not decreasing"

  • Lower LEARNING_RATE (try 1e-5)
  • Check that transcriptions match audio content
  • Ensure audio quality is good (no noise/corruption)
  • Increase WARMUP_STEPS to 1000

"Training is very slow"

  • Reduce MAX_AUDIO_LENGTH to filter long samples
  • Use a GPU instead of CPU
  • Increase GRADIENT_ACCUMULATION_STEPS and BATCH_SIZE

Dataset Quality Tips

For best results:

  • Accurate transcriptions - Ensure text exactly matches spoken audio
  • Clean audio - Remove background noise, music, and overlapping speech
  • Consistent speaker - Use recordings from the same speaker
  • Sufficient data - Aim for at least 30-60 minutes of audio
  • Diverse content - Include varied vocabulary and sentence structures
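
To check your dataset against the 30-60 minute guideline above, you can total the duration column. This assumes the optional duration_seconds column from Step 1 is present.

import csv

with open("audio_data/metadata.csv", newline="", encoding="utf-8") as f:
    total_seconds = sum(float(row["duration_seconds"]) for row in csv.DictReader(f))
print(f"Total audio: {total_seconds / 60:.1f} minutes (guideline: at least 30-60)")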

Advanced Configuration

Target Modules

LoRA is applied to these transformer layers (line 745 in lora.py):

target_modules = ["q_proj", "v_proj", "k_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]

To fine-tune fewer layers (faster, less overfitting):

target_modules = ["q_proj", "v_proj"]  # Only query and value projections

Resume Training from Checkpoint

To resume training from a checkpoint, modify line 741 in lora.py:

# Replace:
model = ChatterboxMultilingualTTS.from_pretrained(device=DEVICE)

# With:
model = ChatterboxMultilingualTTS.from_local(
    "./checkpoints_lora/merged_model",
    device=DEVICE
)

Multi-Language Fine-tuning

To train on multiple languages, modify the load_audio_samples() function to read language IDs from metadata.csv:

  1. Add a language_id column to metadata.csv:
file_name,transcription,duration_seconds,language_id
audio/file1.wav,Hello world,2.5,en
audio/file2.wav,Bonjour monde,2.3,fr
  2. Update line 1079 in lora.py:
# Replace:
language_id='ar'

# With:
language_id=row.get('language_id', 'ar')  # Read from metadata

Acknowledgements

Built-in PerTh Watermarking for Responsible AI

Every audio file generated by Chatterbox includes Resemble AI's Perth (Perceptual Threshold) Watermarker - imperceptible neural watermarks that survive MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.

Watermark extraction

You can look for the watermark using the following script.

import perth
import librosa

AUDIO_PATH = "YOUR_FILE.wav"

# Load the watermarked audio
watermarked_audio, sr = librosa.load(AUDIO_PATH, sr=None)

# Initialize watermarker (same as used for embedding)
watermarker = perth.PerthImplicitWatermarker()

# Extract watermark
watermark = watermarker.get_watermark(watermarked_audio, sample_rate=sr)
print(f"Extracted watermark: {watermark}")
# Output: 0.0 (no watermark) or 1.0 (watermarked)

Official Discord

👋 Join us on Discord and let's build something awesome together!

Citation

If you find this model useful, please consider citing.

@misc{chatterboxtts2025,
  author       = {{Resemble AI}},
  title        = {{Chatterbox-TTS}},
  year         = {2025},
  howpublished = {\url{https://github.com/resemble-ai/chatterbox}},
  note         = {GitHub repository}
}

Disclaimer

Don't use this model to do bad things. Prompts are sourced from freely available data on the internet.
